From b577f965d28031628406091cc36466353795ced3 Mon Sep 17 00:00:00 2001
From: Leonard Richardson
+
+Going back and forth
+--------------------
+
+An HTML parser takes a string of characters and turns it into a series
+of events: "open an <html> tag", "open a <head> tag", "open a <title>
+tag", "add a string", "close the <title> tag", "open a <p>
+tag", and so on. Beautiful Soup offers tools for reconstructing the
+initial parse of the document.
+
+.. _element-generators:
+
+``.next_element`` and ``.previous_element``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``.next_element`` attribute of a string or tag points to whatever
+was parsed immediately afterwards. It might be the same as
+``.next_sibling``, but it's usually drastically different.
+
+Here's the final <a> tag in the "three sisters" document. Its
+``.next_sibling`` is a string: the conclusion of the sentence that was
+interrupted by the start of the <a> tag::
+
+ last_a_tag = soup.find("a", id="link3")
+ last_a_tag
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
+
+ last_a_tag.next_sibling
+ # '; and they lived at the bottom of a well.'
+
+But the ``.next_element`` of that <a> tag, the thing that was parsed
+immediately after the <a> tag, is `not` the rest of that sentence:
+it's the word "Tillie"::
+
+ last_a_tag.next_element
+ # u'Tillie'
+
+That's because in the original markup, the word "Tillie" appeared
+before that semicolon. The parser encountered an <a> tag, then the
+word "Tillie", then the closing </a> tag, then the semicolon and rest of
+the sentence. The semicolon is on the same level as the <a> tag, but the
+word "Tillie" was encountered first.
+
+The ``.previous_element`` attribute is the exact opposite of
+``.next_element``. It points to whatever element was parsed
+immediately before this one::
+
+ last_a_tag.previous_element
+ # u' and\n'
+ last_a_tag.previous_element.next_element
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
+
+``.next_elements`` and ``.previous_elements``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You should get the idea by now. You can use these iterators to move
+forward or backward in the document as it was parsed::
+
+ for element in last_a_tag.next_elements:
+ print(repr(element))
+ # u'Tillie'
+ # u';\nand they lived at the bottom of a well.'
+ # u'\n\n'
+ # <p class="story">...</p>
+ # u'...'
+ # u'\n'
+
+Here's a function that returns ``True`` if a tag defines the "class"
+attribute but doesn't define the "id" attribute::
+
+ def has_class_but_no_id(tag):
+     return tag.has_attr('class') and not tag.has_attr('id')
+
+Pass this function into ``find_all()`` and you'll pick up all the
+<p> tags::
+
+ soup.find_all(has_class_but_no_id)
+ # [<p class="title"><b>The Dormouse's story</b></p>,
+ #  <p class="story">Once upon a time there were...</p>,
+ #  <p class="story">...</p>]
+
+This function only picks up the <p> tags. It doesn't pick up the <a>
+tags, because those tags define both "class" and "id". It doesn't pick
+up tags like <html> and <title>, because those tags don't define
+"class".
+
+How would you find the tag whose string is "The Dormouse's story"? Or
+the <p> tag with the CSS class "title"?
+Let's look at the arguments to ``find_all()``.
+
+.. _name:
+
+The ``name`` argument
+^^^^^^^^^^^^^^^^^^^^^
+
+Pass in a value for ``name`` and you'll tell Beautiful Soup to only
+consider tags with certain names. Text strings will be ignored, as
+will tags whose names don't match.
+
+This is the simplest usage::
+
+ soup.find_all("title")
+ # [<title>The Dormouse's story</title>]
+
+One of the <p> tags is an indirect parent of the string, and our
+search finds that as well. There's a <p> tag with the CSS class
+"title" `somewhere` in the
+document, but it's not one of this string's parents, so we can't find
+it with ``find_parents()``.
+
+You may have made the connection between ``find_parent()`` and
+``find_parents()``, and the `.parent`_ and `.parents`_ attributes
+mentioned earlier. The connection is very strong. These search methods
+actually use ``.parents`` to iterate over all the parents, and check
+each one against the provided filter to see if it matches.
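
As a rough sketch of that equivalence (using Python's built-in ``html.parser`` and a pared-down version of the "three sisters" markup, so the details here are illustrative rather than copied from the examples above):

```python
from bs4 import BeautifulSoup

# find_parents("p") behaves like filtering .parents by hand.
soup = BeautifulSoup(
    '<p class="story"><a class="sister" id="link1">Elsie</a></p>',
    "html.parser")
a_string = soup.find(string="Elsie")

by_method = a_string.find_parents("p")
by_hand = [parent for parent in a_string.parents if parent.name == "p"]
print(by_method == by_hand)
# True
```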
+
+``find_next_siblings()`` and ``find_next_sibling()``
+----------------------------------------------------
+
+Signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+
+The last <p> tag in the document showed up, even though it's not in
+the same part of the tree as the tag we started from. For these
+methods, all that matters is that an element match the filter, and
+show up later in the document than the starting element.
+
+``find_all_previous()`` and ``find_previous()``
+-----------------------------------------------
+
+Signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+
+The call to ``find_all_previous("p")`` found the <p> tag that contains
+the <a> tag we started
+with. This shouldn't be too surprising: we're looking at all the tags
+that show up earlier in the document than the one we started with. A
+<p> tag that contains an <a> tag must have shown up earlier in the
+document.
+
+CSS selectors
+-------------
+
+Beautiful Soup supports a subset of the `CSS selector standard
+<http://www.w3.org/TR/CSS2/selector.html>`_.
+
+``Tag``
+-------
+
+A ``Tag`` object corresponds to an XML or HTML tag in the original
+document::
+
+ soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
+ tag = soup.b
+ type(tag)
+ # <class 'bs4.element.Tag'>
+
+Attributes
+^^^^^^^^^^
+
+A tag may have any number of attributes. The tag ``<b
+class="boldest">`` has an attribute "class" whose value is
+"boldest". You can access a tag's attributes by treating the tag like
+a dictionary::
+
+ tag['class']
+ # u'boldest'
+
+You can access that dictionary directly as ``.attrs``::
+
+ tag.attrs
+ # {u'class': u'boldest'}
+
+You can add, remove, and modify a tag's attributes. Again, this is
+done by treating the tag as a dictionary::
+
+ tag['class'] = 'verybold'
+ tag['id'] = 1
+ tag
+ # <b class="verybold" id="1">Extremely bold</b>
+
+ del tag['class']
+ del tag['id']
+ tag
+ # <b>Extremely bold</b>
+
+.. _multivalue:
+
+Multi-valued attributes
+&&&&&&&&&&&&&&&&&&&&&&&
+
+HTML 4 defines a few attributes that can have multiple values. HTML 5
+removes a couple of them, but defines a few more. The most common
+multi-valued attribute is ``class`` (that is, a tag can have more than
+one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
+``headers``, and ``accesskey``. Beautiful Soup presents the value(s)
+of a multi-valued attribute as a list::
+
+ css_soup = BeautifulSoup('<p class="body strikeout"></p>')
+ css_soup.p['class']
+ # ["body", "strikeout"]
+
+ css_soup = BeautifulSoup('<p class="body"></p>')
+ css_soup.p['class']
+ # ["body"]
+
+If an attribute `looks` like it has more than one value, but it's not
+a multi-valued attribute as defined by any version of the HTML
+standard, Beautiful Soup will leave the attribute alone::
+
+ id_soup = BeautifulSoup('<p id="my id"></p>')
+ id_soup.p['id']
+ # 'my id'
+
+When you turn a tag back into a string, multiple attribute values are
+consolidated::
+
+ rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
+ rel_soup.a['rel']
+ # ['index']
+ rel_soup.a['rel'] = ['index', 'contents']
+ print(rel_soup.p)
+ # <p>Back to the <a rel="index contents">homepage</a></p>
+
+``NavigableString`` supports most of the features described in
+`Navigating the tree`_ and `Searching the tree`_, but not all of
+them. In particular, since a string can't contain anything (the way a
+tag may contain a string or another tag), strings don't support the
+``.contents`` or ``.string`` attributes, or the ``find()`` method.
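
For instance, a small sketch of what still works on a string (built-in ``html.parser``; output shown in Python 3 syntax, unlike the Python 2 output elsewhere in this document):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>bold text</b>", "html.parser")
tag_string = soup.b.string

# Navigation attributes like .parent still work on a string...
print(tag_string.parent.name)
# b

# ...and str() recovers a plain Python string for use outside
# Beautiful Soup.
print(str(tag_string))
# bold text
```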
+
+``BeautifulSoup``
+-----------------
+
+The ``BeautifulSoup`` object itself represents the document as a
+whole. For most purposes, you can treat it as a :ref:`Tag`
+object. This means it supports most of the methods described in
+`Navigating the tree`_ and `Searching the tree`_.
+
+Since the ``BeautifulSoup`` object doesn't correspond to an actual
+HTML or XML tag, it has no name and no attributes. But sometimes it's
+useful to look at its ``.name``, so it's been given the special
+``.name`` "[document]"::
+
+ soup.name
+ # u'[document]'
+
+Comments and other special strings
+----------------------------------
+
+``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
+everything you'll see in an HTML or XML file, but there are a few
+leftover bits. The only one you'll probably ever need to worry about
+is the comment::
+
+ markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
+ soup = BeautifulSoup(markup)
+ comment = soup.b.string
+ type(comment)
+ # <class 'bs4.element.Comment'>
+
+Changing tag names and attributes
+---------------------------------
+
+You can rename a tag, change the values of its attributes, add new
+attributes, and delete attributes::
+
+ soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
+ tag = soup.b
+
+ tag.name = "blockquote"
+ tag['class'] = 'verybold'
+ tag['id'] = 1
+ tag
+ # <blockquote class="verybold" id="1">Extremely bold</blockquote>
+
+ del tag['class']
+ del tag['id']
+ tag
+ # <blockquote>Extremely bold</blockquote>
+
+
+Modifying ``.string``
+---------------------
+
+If you set a tag's ``.string`` attribute, the tag's contents are
+replaced with the string you give::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+
+ tag = soup.a
+ tag.string = "New link text."
+ tag
+ # <a href="http://example.com/">New link text.</a>
+
+Be careful: if the tag contained other tags, they and all their
+contents will be destroyed.
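
A sketch of that caveat (built-in ``html.parser``): assigning to ``.string`` replaces every child of the tag, including child tags.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

soup.a.string = "New link text."
print(soup.a)
# <a href="http://example.com/">New link text.</a>

# The old <i> tag, and everything inside it, is gone from the tree:
print(soup.find("i"))
# None
```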
+
+``append()``
+------------
+
+You can add to a tag's contents with ``Tag.append()``. It works just
+like calling ``.append()`` on a Python list::
+
+ soup = BeautifulSoup("<a>Foo</a>")
+ soup.a.append("Bar")
+
+ soup
+ # <html><head></head><body><a>FooBar</a></body></html>
+ soup.a.contents
+ # [u'Foo', u'Bar']
+
+``BeautifulSoup.new_string()`` and ``.new_tag()``
+-------------------------------------------------
+
+If you need to add a string to a document, no problem--you can pass a
+Python string in to ``append()``, or you can call the factory method
+``BeautifulSoup.new_string()``::
+
+ soup = BeautifulSoup("<b></b>")
+ tag = soup.b
+ tag.append("Hello")
+ new_string = soup.new_string(" there")
+ tag.append(new_string)
+ tag
+ # <b>Hello there</b>
+ tag.contents
+ # [u'Hello', u' there']
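
``new_string()`` can also create subclasses of ``NavigableString``, such as comments, if you pass the class in as a second argument (a sketch using the built-in ``html.parser``):

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<b></b>", "html.parser")

# Passing Comment as the second argument makes a comment node
# instead of an ordinary string.
comment = soup.new_string("Nice to see you.", Comment)
soup.b.append(comment)
print(soup.b)
# <b><!--Nice to see you.--></b>
```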
+
+What if you need to create a whole new tag? The best solution is to
+call the factory method ``BeautifulSoup.new_tag()``::
+
+ soup = BeautifulSoup("<b></b>")
+ original_tag = soup.b
+
+ new_tag = soup.new_tag("a", href="http://www.example.com")
+ original_tag.append(new_tag)
+ original_tag
+ # <b><a href="http://www.example.com"></a></b>
+
+ new_tag.string = "Link text."
+ original_tag
+ # <b><a href="http://www.example.com">Link text.</a></b>
+
+Only the first argument, the tag name, is required.
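
Attributes can also be set after the tag is created, by treating the new tag like a dictionary, just as with any other tag (a sketch, using the built-in ``html.parser``):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")

new_tag = soup.new_tag("a")          # only the name is required
new_tag["href"] = "http://www.example.com"
new_tag.string = "Link text."
soup.b.append(new_tag)
print(soup.b)
# <b><a href="http://www.example.com">Link text.</a></b>
```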
+
+``insert()``
+------------
+
+``Tag.insert()`` is just like ``Tag.append()``, except the new element
+doesn't necessarily go at the end of its parent's
+``.contents``. It'll be inserted at whatever numeric position you
+say. It works just like ``.insert()`` on a Python list::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ tag = soup.a
+
+ tag.insert(1, "but did not endorse ")
+ tag
+ # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
+ tag.contents
+ # [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
+
+``insert_before()`` and ``insert_after()``
+------------------------------------------
+
+The ``insert_before()`` method inserts a tag or string immediately
+before something else in the parse tree::
+
+ soup = BeautifulSoup("<b>stop</b>")
+ tag = soup.new_tag("i")
+ tag.string = "Don't"
+ soup.b.string.insert_before(tag)
+ soup.b
+ # <b><i>Don't</i>stop</b>
+
+The ``insert_after()`` method inserts a tag or string so that it
+immediately follows something else in the parse tree::
+
+ soup.b.i.insert_after(soup.new_string(" ever "))
+ soup.b
+ # <b><i>Don't</i> ever stop</b>
+ soup.b.contents
+ # [<i>Don't</i>, u' ever ', u'stop']
+
+``clear()``
+-----------
+
+``Tag.clear()`` removes the contents of a tag::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ tag = soup.a
+
+ tag.clear()
+ tag
+ # <a href="http://example.com/"></a>
+
+``extract()``
+-------------
+
+``PageElement.extract()`` removes a tag or string from the tree. It
+returns the tag or string that was extracted::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ a_tag = soup.a
+
+ i_tag = soup.i.extract()
+
+ a_tag
+ # <a href="http://example.com/">I linked to</a>
+
+ i_tag
+ # <i>example.com</i>
+
+ print(i_tag.parent)
+ # None
+
+At this point you effectively have two parse trees: one rooted at the
+``BeautifulSoup`` object you used to parse the document, and one rooted
+at the tag that was extracted. You can go on to call ``extract`` on
+a child of the element you extracted::
+
+ my_string = i_tag.string.extract()
+ my_string
+ # u'example.com'
+
+ print(my_string.parent)
+ # None
+ i_tag
+ # <i></i>
+
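Because ``extract()`` returns the removed element, it's also a handy way to move a node within the document (a sketch, using the built-in ``html.parser``):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")

first_p = soup.p.extract()   # removes <p>one</p> from the tree
soup.div.append(first_p)     # ...and re-attaches it at the end
print(soup.div)
# <div><p>two</p><p>one</p></div>
```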
+
+``decompose()``
+---------------
+
+``Tag.decompose()`` removes a tag from the tree, then `completely
+destroys it and its contents`::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ a_tag = soup.a
+
+ soup.i.decompose()
+
+ a_tag
+ # <a href="http://example.com/">I linked to</a>
+
+
+.. _replace_with:
+
+``replace_with()``
+------------------
+
+``PageElement.replace_with()`` removes a tag or string from the tree,
+and replaces it with the tag or string of your choice::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ a_tag = soup.a
+
+ new_tag = soup.new_tag("b")
+ new_tag.string = "example.net"
+ a_tag.i.replace_with(new_tag)
+
+ a_tag
+ # <a href="http://example.com/">I linked to <b>example.net</b></a>
+
+``replace_with()`` returns the tag or string that was replaced, so
+that you can examine it or add it back to another part of the tree.
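
A sketch of using that return value (built-in ``html.parser``; the markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I want <b>cake</b></p>", "html.parser")
new_tag = soup.new_tag("i")
new_tag.string = "pie"

# replace_with() hands back the element it removed.
old_tag = soup.b.replace_with(new_tag)
print(soup.p)
# <p>I want <i>pie</i></p>
print(old_tag)
# <b>cake</b>
```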
+
+``replace_with_children()``
+---------------------------
+
+``Tag.replace_with_children()`` replaces a tag with whatever's inside
+that tag. It's good for stripping out markup::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ a_tag = soup.a
+
+ a_tag.i.replace_with_children()
+ a_tag
+ # <a href="http://example.com/">I linked to example.com</a>
+
+Like ``replace_with()``, ``replace_with_children()`` returns the tag
+that was replaced.
+
+Output
+======
+
+Pretty-printing
+---------------
+
+The ``prettify()`` method will turn a Beautiful Soup parse tree into a
+nicely formatted Unicode string, with each HTML/XML tag on its own line::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ soup.prettify()
+ # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
+
+ print(soup.prettify())
+ # <html>
+ #  <head>
+ #  </head>
+ #  <body>
+ #   <a href="http://example.com/">
+ #    I linked to
+ #    <i>
+ #     example.com
+ #    </i>
+ #   </a>
+ #  </body>
+ # </html>
+
+You can call ``prettify()`` on the top-level ``BeautifulSoup`` object,
+or on any of its ``Tag`` objects::
+
+ print(soup.a.prettify())
+ # <a href="http://example.com/">
+ #  I linked to
+ #  <i>
+ #   example.com
+ #  </i>
+ # </a>
+
+Non-pretty printing
+-------------------
+
+If you just want a string, with no fancy formatting, you can call
+``unicode()`` or ``str()`` on a ``BeautifulSoup`` object, or a ``Tag``
+within it::
+
+ str(soup)
+ # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
+
+ unicode(soup.a)
+ # u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
+
+The ``str()`` function returns a string encoded in UTF-8. See
+`Encodings`_ for other options.
+
+You can also call ``encode()`` to get a bytestring, and ``decode()``
+to get Unicode.
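
A sketch of those two calls (built-in ``html.parser``; shown in Python 3, where ``str()`` already returns Unicode and ``encode()`` returns bytes):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacr\xe9 bleu!</p>", "html.parser")

# encode() gives a bytestring in the encoding you ask for...
print(soup.p.encode("utf-8"))
# b'<p>Sacr\xc3\xa9 bleu!</p>'

# ...and decode() gives Unicode.
print(soup.p.decode())
# <p>Sacré bleu!</p>
```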
+
+Output formatters
+-----------------
+
+If you give Beautiful Soup a document that contains HTML entities like
+"&ldquo;", they'll be converted to Unicode characters::
+
+ soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
+ unicode(soup)
+ # u'\u201cDammit!\u201d he said.'
+
+If you then convert the document to a string, the Unicode characters
+will be encoded as UTF-8. You won't get the HTML entities back::
+
+ str(soup)
+ # '\xe2\x80\x9cDammit!\xe2\x80\x9d he said.'
+
+By default, the only characters that are escaped upon output are bare
+ampersands and angle brackets. These get turned into "&amp;", "&lt;",
+and "&gt;", so that Beautiful Soup doesn't inadvertently generate
+invalid HTML or XML::
+
+ soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
+ soup.p
+ # <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
+
+Output encoding
+---------------
+
+When you write out a document from Beautiful Soup, you get a UTF-8
+document, even if the document wasn't in UTF-8 to begin with. Here's a
+document written in the Latin-1 encoding::
+
+ markup = b'''
+  <html>
+   <head>
+    <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
+   </head>
+   <body>
+    <p>Sacr\xe9 bleu!</p>
+   </body>
+  </html>
+ '''
+
+ soup = BeautifulSoup(markup)
+ print(soup.prettify())
+ # <html>
+ #  <head>
+ #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
+ #  </head>
+ #  <body>
+ #   <p>
+ #    Sacré bleu!
+ #   </p>
+ #  </body>
+ # </html>
+
+Note that the <meta> tag has been rewritten to reflect the fact that
+the document is now in UTF-8.
+
+If you don't want UTF-8, you can pass an encoding into ``prettify()``::
+
+ print(soup.prettify("latin-1"))
+ # <html>
+ #  <head>
+ #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
+ # ...
+
+You can also call encode() on the ``BeautifulSoup`` object, or any
+element in the soup, just as if it were a Python string::
+
+ soup.p.encode("latin-1")
+ # '<p>Sacr\xe9 bleu!</p>'
+
+ soup.p.encode("utf-8")
+ # '<p>Sacr\xc3\xa9 bleu!</p>'
+
+Any characters that can't be represented in your chosen encoding will
+be converted into numeric XML entity references. For instance, here's
+a document that includes the Unicode character SNOWMAN::
+
+ markup = u"<b>\N{SNOWMAN}</b>"
+ snowman_soup = BeautifulSoup(markup)
+ tag = snowman_soup.b
+
+The SNOWMAN character can be part of a UTF-8 document (it looks like
+☃), but there's no representation for that character in ISO-Latin-1 or
+ASCII, so it's converted into "&#9731;" for those encodings::
+
+ print(tag.encode("utf-8"))
+ # <b>☃</b>
+
+ print(tag.encode("latin-1"))
+ # <b>&#9731;</b>
+
+ print(tag.encode("ascii"))
+ # <b>&#9731;</b>
+
+Unicode, Dammit
+---------------
+
+You can use Unicode, Dammit without using Beautiful Soup. It's useful
+whenever you have data in an unknown encoding and you just want it to
+become Unicode::
+
+ from bs4 import UnicodeDammit
+ dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
+ print(dammit.unicode_markup)
+ # Sacré bleu!
+ dammit.original_encoding
+ # 'utf-8'
+
+The more data you give Unicode, Dammit, the more accurately it will
+guess. If you have your own suspicions as to what the encoding might
+be, you can pass them in as a list::
+
+ dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
+ print(dammit.unicode_markup)
+ # Sacré bleu!
+ dammit.original_encoding
+ # 'latin-1'
+
+Unicode, Dammit has one special feature that Beautiful Soup doesn't
+use. You can use it to convert Microsoft smart quotes to HTML or XML
+entities::
+
+ markup = b"<blockquote>I just \x93love\x94 Microsoft Word</blockquote>"
" + + UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup + # u'I just “love” Microsoft Word
' + + UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup + # u'I just “love” Microsoft Word
' + +You might find this feature useful, but Beautiful Soup doesn't use +it. Beautiful Soup prefers the default behavior, which is to convert +Microsoft smart quotes to Unicode characters along with everything +else:: + + UnicodeDammit(markup, ["windows-1252"]).unicode_markup + # u'I just \u201clove\u201d Microsoft Word
+Parsing only part of a document
+===============================
+
+Let's say you want to use Beautiful Soup to look at a document's <a>
+tags. It's a waste of time and memory to parse the entire document and
+then go over it again looking for <a> tags. It would be much faster to
+ignore everything that wasn't an <a> tag in the first place. The
+``SoupStrainer`` class allows you to choose which parts of an incoming
+document are parsed. You just create a ``SoupStrainer`` and pass it in
+to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.
+
+(Note that *this feature won't work if you're using the html5lib
+parser*. If you use html5lib, the whole document will be parsed, no
+matter what. In the examples below, I'll be forcing Beautiful Soup to
+use Python's built-in parser.)
+
+``SoupStrainer``
+----------------
+
+The ``SoupStrainer`` class takes the same arguments as a typical
+method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs
+<attrs>`, :ref:`text <text>`, and :ref:`**kwargs <kwargs>`. Here are
+three ``SoupStrainer`` objects::
+
+ from bs4 import SoupStrainer
+
+ only_a_tags = SoupStrainer("a")
+
+ only_tags_with_id_link2 = SoupStrainer(id="link2")
+
+ def is_short_string(string):
+     return len(string) < 10
+
+ only_short_strings = SoupStrainer(text=is_short_string)
+
+I'm going to bring back the "three sisters" document one more time,
+and we'll see what the document looks like when it's parsed with these
+three ``SoupStrainer`` objects::
+
+ html_doc = """
+ <html><head><title>The Dormouse's story</title></head>
+ <body>
+ <p class="title"><b>The Dormouse's story</b></p>
+
+ <p class="story">Once upon a time there were three little sisters; and their names were
+ <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
+ <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
+ <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
+ and they lived at the bottom of a well.</p>
+
+ <p class="story">...</p>
+ """
+
+ print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
+ # <a class="sister" href="http://example.com/elsie" id="link1">
+ #  Elsie
+ # </a>
+ # <a class="sister" href="http://example.com/lacie" id="link2">
+ #  Lacie
+ # </a>
+ # <a class="sister" href="http://example.com/tillie" id="link3">
+ #  Tillie
+ # </a>
+
+ print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
+ # <a class="sister" href="http://example.com/lacie" id="link2">
+ #  Lacie
+ # </a>
+
+ print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
+ # Elsie
+ # ,
+ # Lacie
+ # and
+ # Tillie
+ # ...
+ #
+
+You can also pass a ``SoupStrainer`` into any of the methods covered
+in `Searching the tree`_. This probably isn't terribly useful, but I
+thought I'd mention it::
+
+ soup = BeautifulSoup(html_doc)
+ soup.find_all(only_short_strings)
+ # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
+ #  u'\n\n', u'...', u'\n']
+
+Troubleshooting
+===============
+
+Parsing XML
+-----------
+
+By default, Beautiful Soup parses documents as HTML. To parse a
+document as XML, pass in "xml" as the second argument to the
+``BeautifulSoup`` constructor::
+
+ soup = BeautifulSoup(markup, "xml")
+
+You'll need to :ref:`have lxml installed <parser-installation>`.