Diffstat (limited to 'doc/source')
-rw-r--r-- doc/source/index.rst 52
1 files changed, 37 insertions, 15 deletions
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 69976fe..44d16d3 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -2494,21 +2494,7 @@ on distributing your script to other people, or running it on multiple
machines, you should specify a parser in the ``BeautifulSoup``
constructor. That will reduce the chances that your users parse a
document differently from the way you parse it.
-
-Line numbers
-------------
-
-The html.parser parser will keep track of where in the original
-document it found each Tag. You can access this information as
-``Tag.lineno`` (line number) and ``Tag.offset`` (position of the start
-tag within a line)::
-
- soup = BeautifulSoup("<p>Paragraph 1</p>\n <p>Paragraph 2</p>", 'html.parser')
- for tag in soup.find_all('p'):
-     print(tag.lineno, tag.offset, tag.string)
- # (1, 0, u'Paragraph 1')
- # (2, 3, u'Paragraph 2')
-
+
Encodings
=========
@@ -2758,6 +2744,42 @@ document is Windows-1252, and the document will come out looking like
``UnicodeDammit.detwingle()`` is new in Beautiful Soup 4.1.0.
+Line numbers
+============
+
+The ``html.parser`` and ``html5lib`` parsers can keep track of where in
+the original document each Tag was found. You can access this
+information as ``Tag.sourceline`` (line number) and ``Tag.sourcepos``
+(position of the start tag within a line)::
+
+ markup = "<p\n>Paragraph 1</p>\n <p>Paragraph 2</p>"
+ soup = BeautifulSoup(markup, 'html.parser')
+ for tag in soup.find_all('p'):
+     print(tag.sourceline, tag.sourcepos, tag.string)
+ # (1, 0, u'Paragraph 1')
+ # (2, 3, u'Paragraph 2')
+
+Note that the two parsers mean slightly different things by
+``sourceline`` and ``sourcepos``. For html.parser, these numbers
+represent the position of the initial less-than sign. For html5lib,
+these numbers represent the position of the final greater-than sign::
+
+ soup = BeautifulSoup(markup, 'html5lib')
+ for tag in soup.find_all('p'):
+     print(tag.sourceline, tag.sourcepos, tag.string)
+ # (2, 1, u'Paragraph 1')
+ # (3, 7, u'Paragraph 2')
+
+You can shut off this feature by passing ``store_line_numbers=False``
+into the ``BeautifulSoup`` constructor::
+
+ markup = "<p\n>Paragraph 1</p>\n <p>Paragraph 2</p>"
+ soup = BeautifulSoup(markup, 'html.parser', store_line_numbers=False)
+ soup.p.sourceline
+ # None
+
+This feature is new in 4.8.1, and the parsers based on lxml don't
+support it.
Comparing objects for equality
==============================