From 880c5c4fe3668fe505d9c03c8c80f35be997574b Mon Sep 17 00:00:00 2001
From: Leonard Richardson The Dormouse's story Once upon a time there were three little sisters; and their names were
+ Elsie,
+ Lacie and
+ Tillie;
+ and they lived at the bottom of a well. ...
+ #
+ # The Dormouse's story
+ #
+ #
+ # Once upon a time there were three little sisters; and their names were
+ #
+ # Elsie
+ #
+ # ,
+ #
+ # Lacie
+ #
+ # and
+ #
+ # Tillie
+ #
+ # ; and they lived at the bottom of a well.
+ #
+ # ...
+ # The Dormouse's story The Dormouse's story Once upon a time there were three little sisters; and their names were
+ Elsie,
+ Lacie and
+ Tillie;
+ and they lived at the bottom of a well. ... The Dormouse's story
+tag", and so on. Beautiful Soup offers tools for reconstructing the
+initial parse of the document.
+
+.. _element-generators:
+
+``.next_element`` and ``.previous_element``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``.next_element`` attribute of a string or tag points to whatever
+was parsed immediately afterwards. It might be the same as
+``.next_sibling``, but it's usually drastically different.
+
+Here's the final <a> tag in the "three sisters" document. Its
+``.next_sibling`` is a string: the conclusion of the sentence that was
+interrupted by the start of the <a> tag::
+
+ last_a_tag = soup.find("a", id="link3")
+ last_a_tag
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
+
+ last_a_tag.next_sibling
+ # '; and they lived at the bottom of a well.'
+
+But the ``.next_element`` of that tag, the thing that was parsed
+immediately after the tag, is `not` the rest of that sentence:
+it's the word "Tillie"::
+
+ last_a_tag.next_element
+ # u'Tillie'
+
+That's because in the original markup, the word "Tillie" appeared
+before that semicolon. The parser encountered an <a> tag, then the
+word "Tillie", then the closing </a> tag, then the semicolon and rest of
+the sentence. The semicolon is on the same level as the <a> tag, but the
+word "Tillie" was encountered first.
+
+The ``.previous_element`` attribute is the exact opposite of
+``.next_element``. It points to whatever element was parsed
+immediately before this one::
+
+ last_a_tag.previous_element
+ # u' and\n'
+ last_a_tag.previous_element.next_element
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
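+
+Here's a small self-contained sketch of the same idea. The one-line
+markup is a hypothetical stand-in for the end of the "three sisters"
+paragraph, and I'm assuming Python's built-in ``html.parser`` so the
+result doesn't depend on which parsers you have installed:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the end of the "three sisters" paragraph.
markup = '<p>Lacie and <a id="link3">Tillie</a>; and they lived at the bottom of a well.</p>'
soup = BeautifulSoup(markup, "html.parser")

last_a_tag = soup.find("a", id="link3")

# Parse order descends into the tag first, so the next element is its own text.
inside = last_a_tag.next_element

# The next sibling is the text that follows the closing </a> tag.
after = last_a_tag.next_sibling

# Stepping forward once more in parse order reaches that same sibling.
assert inside.next_element is after

# .previous_element undoes .next_element.
assert inside.previous_element is last_a_tag

print(repr(str(inside)))
print(repr(str(after)))
```

+
+The two ``assert`` lines are the point: ``.next_element`` follows the
+order in which things were parsed, not the shape of the tree.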
+
+``.next_elements`` and ``.previous_elements``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You should get the idea by now. You can use these iterators to move
+forward or backward in the document as it was parsed::
+
+ for element in last_a_tag.next_elements:
+     print(repr(element))
+ # u'Tillie'
+ # u';\nand they lived at the bottom of a well.'
+ # u'\n\n'
+ # ... The Dormouse's story Once upon a time there were three little sisters; and their names were
+ Elsie,
+ Lacie and
+ Tillie;
+ and they lived at the bottom of a well. ...
+tags::
+
+ soup.find_all(has_class_but_no_id)
+ # [ The Dormouse's story Once upon a time there were... ... tags. It doesn't pick up the
+<a> tags, because those tags define both "class" and "id". It doesn't pick
+up tags like and The Dormouse's story tag with the CSS class "title"?
+Let's look at the arguments to ``find_all()``.
+
+.. _name:
+
+The ``name`` argument
+^^^^^^^^^^^^^^^^^^^^^
+
+Pass in a value for ``name`` and you'll tell Beautiful Soup to only
+consider tags with certain names. Text strings will be ignored, as
+will tags whose names don't match.
+
+This is the simplest usage::
+
+ soup.find_all("title")
+ # [ Once upon a time there were three little sisters; and their names were
+ # Elsie,
+ # Lacie and
+ # Tillie;
+ # and they lived at the bottom of a well. tags is an
+indirect parent of the string, and our search finds that as
+well. There's a <p> tag with the CSS class "title" `somewhere` in the
+document, but it's not one of this string's parents, so we can't find
+it with ``find_parents()``.
+
+You may have made the connection between ``find_parent()`` and
+``find_parents()``, and the `.parent`_ and `.parents`_ attributes
+mentioned earlier. The connection is very strong. These search methods
+actually use ``.parents`` to iterate over all the parents, and check
+each one against the provided filter to see if it matches.
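+
+To make that connection concrete, here's a simplified sketch. It is
+not the library's actual code: ``parents_named`` is a hypothetical
+helper that only matches on tag name, and the markup is a cut-down
+stand-in for the "three sisters" document. I'm also assuming a
+reasonably recent Beautiful Soup 4 release, where ``find()`` accepts
+the ``string`` argument:

```python
from bs4 import BeautifulSoup

# Cut-down, hypothetical stand-in for the "three sisters" document.
markup = '<html><body><p class="story">and <a id="link2">Lacie</a></p></body></html>'
soup = BeautifulSoup(markup, "html.parser")
a_string = soup.find(string="Lacie")


def parents_named(element, name):
    """Hypothetical helper: walk .parents and keep the tags whose name matches."""
    return [parent for parent in element.parents if parent.name == name]


by_hand = parents_named(a_string, "p")
built_in = a_string.find_parents("p")

# Both approaches find the same single <p> tag.
assert len(by_hand) == len(built_in) == 1
assert by_hand[0] is built_in[0]

# find_parent() returns the nearest match.
assert a_string.find_parent("p") is by_hand[0]
```

+
+The real methods accept every kind of filter described in this
+section. The sketch handles only the simplest one.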
+
+``find_next_siblings()`` and ``find_next_sibling()``
+----------------------------------------------------
+
+Signature: find_next_siblings(:ref:`name ... The Dormouse's story ... tag in the document showed up, even though it's not in
+the same part of the tree as the tag we started from. For these
+methods, all that matters is that an element match the filter, and
+show up later in the document than the starting element.
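+
+Here's a rough illustration of the difference between "next sibling"
+and "next in the document". The markup and the ``<div>`` wrapper are
+made up for the example, and I'm assuming Python's built-in
+``html.parser``:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links inside a <div>, then a <p> outside it.
markup = (
    '<div><p class="story"><a id="link1">Elsie</a> and '
    '<a id="link2">Lacie</a></p></div>'
    '<p class="story">...</p>'
)
soup = BeautifulSoup(markup, "html.parser")
first_link = soup.a

# Siblings must share a parent, so only the second <a> qualifies.
sibling_ids = [tag.get("id") for tag in first_link.find_next_siblings("a")]

# "Next" only requires showing up later in the document, so the final <p>
# is found even though it lives outside the <div>.
later_paragraphs = first_link.find_all_next("p")

print(sibling_ids)
print([p.get_text() for p in later_paragraphs])
```

+
+The ``<p>`` tag that contains the starting link is not in the second
+result, because it was opened before the link was parsed.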
+
+``find_all_previous()`` and ``find_previous()``
+-----------------------------------------------
+
+Signature: find_all_previous(:ref:`name Once upon a time there were three little sisters; ... The Dormouse's story tag that contains the tag we started
+with. This shouldn't be too surprising: we're looking at all the tags
+that show up earlier in the document than the one we started with. A
+<p> tag that contains an <a> tag must have shown up earlier in the
+document.
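+
+A short sketch of that behavior, using a trimmed-down and hypothetical
+version of the "three sisters" markup and Python's built-in
+``html.parser``:

```python
from bs4 import BeautifulSoup

# Trimmed-down, hypothetical version of the "three sisters" markup.
markup = (
    '<p class="title">The Dormouse\'s story</p>'
    '<p class="story">Once upon a time <a id="link1">Elsie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")
first_link = soup.a

# Results come back nearest-first: the enclosing <p>, then the earlier one.
previous_paragraphs = first_link.find_all_previous("p")
texts = [p.get_text() for p in previous_paragraphs]

# The closest match is the tag that contains the starting link.
assert first_link.find_previous("p") is first_link.parent

print(texts)
```

+
+The enclosing ``<p>`` comes first in the list because the search
+walks backward from the starting tag.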
+
+Modifying the tree
+==================
+
+Beautiful Soup's main strength is in searching the parse tree, but you
+can also modify the tree and write your changes as a new HTML or XML
+document.
+
+Changing tag names and attributes
+---------------------------------
+
+I covered this earlier, in `Attributes`_, but it bears repeating. You
+can rename a tag, change the values of its attributes, add new
+attributes, and delete attributes::
+
+ soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
+ tag = soup.b
+
+ tag.name = "blockquote"
+ tag['class'] = 'verybold'
+ tag['id'] = 1
+ tag
+ # <blockquote class="verybold" id="1">Extremely bold</blockquote>
+
+Attributes
+^^^^^^^^^^
+
+A tag may have any number of attributes. The tag
+``<b class="boldest">`` has an attribute "class" whose value is
+"boldest". You can access a tag's attributes by treating the tag like
+a dictionary::
+
+ tag['class']
+ # u'boldest'
+
+You can access that dictionary directly as ``.attrs``::
+
+ tag.attrs
+ # {u'class': u'boldest'}
+
+You can add, remove, and modify a tag's attributes. Again, this is
+done by treating the tag as a dictionary::
+
+ tag['class'] = 'verybold'
+ tag['id'] = 1
+ tag
+ # Extremely bold
+
+ del tag['class']
+ del tag['id']
+ tag
+ # Extremely bold
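+
+Here's a minimal sketch of the dictionary-style interface, including
+what happens when you ask for an attribute that isn't there. The
+markup is made up for the example. I use "id" and "title" rather than
+"class" on purpose: recent Beautiful Soup releases treat "class" as a
+multi-valued attribute and hand it back as a list:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the example.
soup = BeautifulSoup('<b id="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag["id"] = "verybold"      # modify an existing attribute
tag["title"] = "emphasis"   # add a new attribute
assert tag.attrs == {"id": "verybold", "title": "emphasis"}

del tag["title"]            # remove an attribute
assert "title" not in tag.attrs

# After deletion, indexing raises KeyError, while .get() returns None.
try:
    tag["title"]
    missing_raises = False
except KeyError:
    missing_raises = True

print(tag)
print(missing_raises, tag.get("title"))
```

+
+If you aren't sure an attribute is defined, ``tag.get()`` is the
+safer way to look it up, just as with a Python dictionary.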
+
+``NavigableString``
+-------------------
+
+A string corresponds to a bit of text within a tag. Beautiful Soup
+defines the ``NavigableString`` class to contain these bits of text::
+
+ tag.string
+ # u'Extremely bold'
+ type(tag.string)
+ # No longer bold
+
+``NavigableString`` supports most of the features described in
+`Navigating the tree`_ and `Searching the tree`_, but not all of
+them. In particular, since a string can't contain anything (the way a
+tag may contain a string or another tag), strings don't support the
+``.contents`` or ``.string`` attributes, or the ``find()`` method.
+
+``BeautifulSoup``
+-----------------
+
+The ``BeautifulSoup`` object itself represents the document as a
+whole. For most purposes, you can treat it as a :ref:`Tag`
+object. This means it supports most of the methods described in
+`Navigating the tree`_ and `Searching the tree`_.
+
+Since the ``BeautifulSoup`` object doesn't correspond to an actual
+HTML or XML tag, it has no name and no attributes. But sometimes it's
+useful to look at its ``.name``, so it's been given the special
+``.name`` "[document]"::
+
+ soup.name
+ # u'[document]'
+
+Comments and other special strings
+----------------------------------
+
+``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
+everything you'll see in an HTML or XML file, but there are a few
+leftover bits. The only one you'll probably ever need to worry about
+is the comment::
+
+ markup = ""
+ soup = BeautifulSoup(markup)
+ comment = soup.b.string
+ type(comment)
+ # Extremely bold
+
+ del tag['class']
+ del tag['id']
+ tag
+ # <blockquote>Extremely bold</blockquote>
+
+
+Modifying ``.string``
+---------------------
+
+If you set a tag's ``.string`` attribute, the tag's contents are
+replaced with the string you give::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+
+ tag = soup.a
+ tag.string = "New link text."
+ tag
+ # <a href="http://example.com/">New link text.</a>
+
+Be careful: if the tag contained other tags, they and all their
+contents will be destroyed.
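+
+A quick sketch of that warning, assuming Python's built-in
+``html.parser``. The nested ``<i>`` tag is there only so there's
+something to lose:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Before the assignment, the <a> tag contains an <i> tag.
had_child = tag.i is not None

tag.string = "New link text."

# Afterwards the <i> tag is gone, along with the text it contained.
has_child = tag.i is not None

print(had_child, has_child)
print(tag)
```

+
+If you need to keep the child tags, change the individual strings
+inside the tag instead of assigning to ``.string``.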
+
+``append()``
+------------
+
+You can add to a tag's contents with ``Tag.append()``. It works just
+like calling ``.append()`` on a Python list::
+
+ soup = BeautifulSoup("<a>Foo</a>")
+ soup.a.append("Bar")
+
+ soup
+ #
+tag. This parser also adds an empty <head> tag to the
+document.
+
+Here's the same document parsed with Python's built-in HTML
+parser::
+
+ BeautifulSoup("<a></p>", "html.parser")
+ # <a></a>
+
+Like html5lib, this parser ignores the closing </p> tag. Unlike
+html5lib, this parser makes no attempt to create a well-formed HTML
+document by adding a <body> tag. Unlike lxml, it doesn't even bother
+to add an <html> tag.
+
+Since the document "<a></p>" is invalid, none of these techniques is
+the "correct" way to handle it. The html5lib parser uses techniques
+that are part of the HTML5 standard, so it has the best claim on being
+the "correct" way, but all three techniques are legitimate.
+
+Differences between parsers can affect your script. If you're planning
+on distributing your script to other people, you might want to specify
+in the ``BeautifulSoup`` constructor which parser you used during
+development. That will reduce the chances that your users parse a
+document differently from the way you parse it.
+
+
+Encodings
+=========
+
+Any HTML or XML document is written in a specific encoding like ASCII
+or UTF-8. But when you load that document into Beautiful Soup, you'll
+discover it's been converted to Unicode::
+
+ markup = "<h1>Sacr\xe9 bleu!</h1>"
+
+ markup = b'''
+  <html>
+   <head>
+    <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
+   </head>
+   <body>
+    <p>Sacr\xe9 bleu!</p>
+   </body>
+  </html>
+ '''
+
+ soup = BeautifulSoup(markup)
+ print(soup.prettify())
+ # <html>
+ #  <head>
+ #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
+ #  </head>
+ #  <body>
+ #   <p>
+ #    Sacré bleu!
+ #   </p>
+ #  </body>
+ # </html>
+
+Note that the <meta> tag has been rewritten to reflect the fact that
+the document is now in UTF-8.
+
+If you don't want UTF-8, you can pass an encoding into ``prettify()``::
+
+ print(soup.prettify("latin-1"))
+ # <html>
+ #  <head>
+ #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
+ # ...
+
+You can also call ``encode()`` on the ``BeautifulSoup`` object, or any
+element in the soup, just as if it were a Python string::
+
+ soup.p.encode("latin-1")
+ # '<p>Sacr\xe9 bleu!</p>'
+
+ soup.p.encode("utf-8")
+ # '<p>Sacr\xc3\xa9 bleu!</p>'
+
+Unicode, Dammit
+---------------
+
+You can use Unicode, Dammit without using Beautiful Soup. It's useful
+whenever you have data in an unknown encoding and you just want it to
+become Unicode::
+
+ from bs4 import UnicodeDammit
+ dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
+ print(dammit.unicode_markup)
+ # Sacré bleu!
+ dammit.original_encoding
+ # 'utf-8'
+
+The more data you give Unicode, Dammit, the more accurately it will
+guess. If you have your own suspicions as to what the encoding might
+be, you can pass them in as a list::
+
+ dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
+ print(dammit.unicode_markup)
+ # Sacré bleu!
+ dammit.original_encoding
+ # 'latin-1'
+
+Unicode, Dammit has one special feature that Beautiful Soup doesn't
+use. You can use it to convert Microsoft smart quotes to HTML or XML
+entities::
+
+ markup = b"<p>I just \x93love\x94 Microsoft Word</p>"
+
+ UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
+ # u'<p>I just &ldquo;love&rdquo; Microsoft Word</p>'
+
+ UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
+ # u'<p>I just &#x201C;love&#x201D; Microsoft Word</p>'
+
+You might find this feature useful, but Beautiful Soup doesn't use
+it. Beautiful Soup prefers the default behavior, which is to convert
+Microsoft smart quotes to Unicode characters along with everything
+else::
+
+ UnicodeDammit(markup, ["windows-1252"]).unicode_markup
+ # u'<p>I just \u201clove\u201d Microsoft Word</p>'
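+
+Parsing only part of a document
+===============================
+
+Let's say you want to use Beautiful Soup to look at a document's <a>
+tags. It's a waste of time and memory to parse the entire document and
+then go over it again looking for <a> tags. It would be much faster to
+ignore everything that wasn't an <a> tag in the first place. The
+``SoupStrainer`` class allows you to choose which parts of an incoming
+document are parsed. You just create a ``SoupStrainer`` and pass it in
+to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.
+
+(Note that *this feature won't work if you're using the html5lib
+parser*. If you use html5lib, the whole document will be parsed, no
+matter what. In the examples below, I'll be forcing Beautiful Soup to
+use Python's built-in parser.)
+
+``SoupStrainer``
+----------------
+
+The ``SoupStrainer`` class takes the same arguments as a typical
+method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs
+<attrs>`, :ref:`text <text>`, and :ref:`**kwargs <kwargs>`. Here are
+three ``SoupStrainer`` objects::
+
+ from bs4 import SoupStrainer
+
+ only_a_tags = SoupStrainer("a")
+
+ only_tags_with_id_link2 = SoupStrainer(id="link2")
+
+ def is_short_string(string):
+     return len(string) < 10
+
+ only_short_strings = SoupStrainer(text=is_short_string)
+
+Here's the "three sisters" document again, parsed with each of these
+three ``SoupStrainer`` objects::
+
+ html_doc = """
+ <html><head><title>The Dormouse's story</title></head>
+
+ <p class="title"><b>The Dormouse's story</b></p>
+
+ <p class="story">Once upon a time there were three little sisters; and their names were
+ <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
+ <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
+ <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
+ and they lived at the bottom of a well.</p>
+
+ <p class="story">...</p>
+ """
+
+ print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
+ # <a class="sister" href="http://example.com/elsie" id="link1">
+ #  Elsie
+ # </a>
+ # <a class="sister" href="http://example.com/lacie" id="link2">
+ #  Lacie
+ # </a>
+ # <a class="sister" href="http://example.com/tillie" id="link3">
+ #  Tillie
+ # </a>
+
+ print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
+ # <a class="sister" href="http://example.com/lacie" id="link2">
+ #  Lacie
+ # </a>
+
+ print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
+ # Elsie
+ # ,
+ # Lacie
+ # and
+ # Tillie
+ # ...
+ #
+
+You can also pass a ``SoupStrainer`` into any of the methods covered
+in `Searching the tree`_. This probably isn't terribly useful, but I
+thought I'd mention it::
+
+ soup = BeautifulSoup(html_doc)
+ soup.find_all(only_short_strings)
+ # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
+ #  u'\n\n', u'...', u'\n']
+
+Troubleshooting
+===============
+
+Parsing XML
+-----------
+
+By default, Beautiful Soup parses documents as HTML. To parse a
+document as XML, pass in "xml" as the second argument to the
+``BeautifulSoup`` constructor::
+
+ soup = BeautifulSoup(markup, "xml")
+
+You'll need to :ref:`have lxml installed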