From 43aeaf51780466e023418f7dfd1f456614c061e2 Mon Sep 17 00:00:00 2001
From: Leonard Richardson The Dormouse's story Once upon a time there were three little sisters; and their names were
- Elsie,
- Lacie and
- Tillie;
- and they lived at the bottom of a well. ...
- #
- # The Dormouse's story
- #
- #
- # Once upon a time there were three little sisters; and their names were
- #
- # Elsie
- #
- # ,
- #
- # Lacie
- #
- # and
- #
- # Tillie
- #
- # ; and they lived at the bottom of a well.
- #
- # ...
- # The Dormouse's story The Dormouse's story Once upon a time there were three little sisters; and their names were
- Elsie,
- Lacie and
- Tillie;
- and they lived at the bottom of a well. ... The Dormouse's story
-tag", and so on. Beautiful Soup offers tools for reconstructing the
-initial parse of the document.
-
-.. _element-generators:
-
-``.next_element`` and ``.previous_element``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The ``.next_element`` attribute of a string or tag points to whatever
-was parsed immediately afterwards. It might be the same as
-``.next_sibling``, but it's usually drastically different.
-
-Here's the final <a> tag in the "three sisters" document. Its
-``.next_sibling`` is a string: the conclusion of the sentence that was
-interrupted by the start of the <a> tag::
-
- last_a_tag = soup.find("a", id="link3")
- last_a_tag
- # Tillie
-
- last_a_tag.next_sibling
- # '; and they lived at the bottom of a well.'
-
-But the ``.next_element`` of that tag, the thing that was parsed
-immediately after the tag, is `not` the rest of that sentence:
-it's the word "Tillie"::
-
- last_a_tag.next_element
- # u'Tillie'
-
-That's because in the original markup, the word "Tillie" appeared
-before that semicolon. The parser encountered an <a> tag, then the
-word "Tillie", then the closing </a> tag, then the semicolon and rest of
-the sentence. The semicolon is on the same level as the <a> tag, but the
-word "Tillie" was encountered first.
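The parse order described here can be observed with Python's built-in ``html.parser`` module. This is a minimal sketch, not Beautiful Soup's own code: the ``EventLogger`` class, the ``parse_events`` helper, and the simplified ``fragment`` markup are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start tags, text, and end tags in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("text", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def parse_events(markup):
    """Feed markup to the logger and return the recorded events (hypothetical helper)."""
    logger = EventLogger()
    logger.feed(markup)
    logger.close()  # flush any text still buffered by the parser
    return logger.events


# Hypothetical, simplified stand-in for the end of the "three sisters" sentence.
fragment = '<a id="link3">Tillie</a>; and they lived at the bottom of a well.'
events = parse_events(fragment)
for event in events:
    print(event)
```

The word "Tillie" is reported before the text that starts with the semicolon, which matches the order that ``.next_element`` follows.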
-
-The ``.previous_element`` attribute is the exact opposite of
-``.next_element``. It points to whatever element was parsed
-immediately before this one::
-
- last_a_tag.previous_element
- # u' and\n'
- last_a_tag.previous_element.next_element
- # Tillie
-
-``.next_elements`` and ``.previous_elements``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-You should get the idea by now. You can use these iterators to move
-forward or backward in the document as it was parsed::
-
- for element in last_a_tag.next_elements:
-     print(repr(element))
- # u'Tillie'
- # u';\nand they lived at the bottom of a well.'
- # u'\n\n'
- # ... The Dormouse's story Once upon a time there were three little sisters; and their names were
- Elsie,
- Lacie and
- Tillie;
- and they lived at the bottom of a well. ...
-tags::
-
- soup.find_all(has_class_but_no_id)
- # [ The Dormouse's story Once upon a time there were... ... tags. It doesn't pick up the
-tags, because those tags define both "class" and "id". It doesn't pick
-up tags like and The Dormouse's story tag with the CSS class "title"?
-Let's look at the arguments to ``find_all()``.
-
-.. _name:
-
-The ``name`` argument
-^^^^^^^^^^^^^^^^^^^^^
-
-Pass in a value for ``name`` and you'll tell Beautiful Soup to only
-consider tags with certain names. Text strings will be ignored, as
-will tags whose names don't match.
-
-This is the simplest usage::
-
- soup.find_all("title")
- # [ Once upon a time there were three little sisters; and their names were
- # Elsie,
- # Lacie and
- # Tillie;
- # and they lived at the bottom of a well. tags is an
-indirect parent of the string, and our search finds that as
-well. There's a tag with the CSS class "title" `somewhere` in the
-document, but it's not one of this string's parents, so we can't find
-it with ``find_parents()``.
-
-You may have made the connection between ``find_parent()`` and
-``find_parents()``, and the `.parent`_ and `.parents`_ attributes
-mentioned earlier. The connection is very strong. These search methods
-actually use ``.parents`` to iterate over all the parents, and check
-each one against the provided filter to see if it matches.
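The relationship between ``.parents`` and ``find_parents()`` can be sketched with a toy tree. This is a rough model under stated assumptions, not Beautiful Soup's source: the ``Node`` class is hypothetical, and its filter only compares tag names, whereas the real methods accept richer filters.

```python
class Node:
    """Hypothetical tree node with just a name and a parent link."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    @property
    def parents(self):
        """Yield each ancestor in turn, nearest first."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def find_parents(self, name=None, limit=None):
        """Walk .parents and keep the ancestors whose name matches."""
        results = []
        for ancestor in self.parents:
            if name is None or ancestor.name == name:
                results.append(ancestor)
                if limit is not None and len(results) >= limit:
                    break
        return results

    def find_parent(self, name=None):
        """Return the nearest matching ancestor, or None if there is none."""
        matches = self.find_parents(name, limit=1)
        return matches[0] if matches else None


# Build document > html > body > p > a > string by hand.
document = Node("[document]")
html = Node("html", document)
body = Node("body", html)
p = Node("p", body)
a = Node("a", p)
text = Node("#text", a)

all_parent_names = [n.name for n in text.find_parents()]
print(all_parent_names)
print(text.find_parent("p").name)
print(text.find_parent("title"))
```

A tag that is elsewhere in the tree never appears in ``.parents``, so a search built on that iterator cannot find it.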
-
-``find_next_siblings()`` and ``find_next_sibling()``
-----------------------------------------------------
-
-Signature: find_next_siblings(:ref:`name ... The Dormouse's story ... tag in the document showed up, even though it's not in
-the same part of the tree as the tag we started from. For these
-methods, all that matters is that an element match the filter, and
-show up later in the document than the starting element.
-
-``find_all_previous()`` and ``find_previous()``
------------------------------------------------
-
-Signature: find_all_previous(:ref:`name Once upon a time there were three little sisters; ... The Dormouse's story tag that contains the tag we started
-with. This shouldn't be too surprising: we're looking at all the tags
-that show up earlier in the document than the one we started with. A
-<p> tag that contains an <a> tag must have shown up earlier in the
-document.
-
-Modifying the tree
-==================
-
-Beautiful Soup's main strength is in searching the parse tree, but you
-can also modify the tree and write your changes as a new HTML or XML
-document.
-
-Changing tag names and attributes
----------------------------------
-
-I covered this earlier, in `Attributes`_, but it bears repeating. You
-can rename a tag, change the values of its attributes, add new
-attributes, and delete attributes::
-
- soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
- tag = soup.b
-
- tag.name = "blockquote"
- tag['class'] = 'verybold'
- tag['id'] = 1
- tag
- # The law firm of Dewey, Cheatem, & Howe The law firm of Dewey, Cheatem, & Howe Il a dit <<Sacré bleu!>>
- # Il a dit <<Sacré bleu!>>
- #
- # Il a dit <<Sacré bleu!>>
- #
- # Il a dit <
- # IL A DIT <
- # IL A DIT <<SACRÉ BLEU!>>
- # Extremely bold
-
-Attributes
-^^^^^^^^^^
-
-A tag may have any number of attributes. The tag ``<b class="boldest">`` has an attribute "class" whose value is
-"boldest". You can access a tag's attributes by treating the tag like
-a dictionary::
-
- tag['class']
- # u'boldest'
-
-You can access that dictionary directly as ``.attrs``::
-
- tag.attrs
- # {u'class': u'boldest'}
-
-You can add, remove, and modify a tag's attributes. Again, this is
-done by treating the tag as a dictionary::
-
- tag['class'] = 'verybold'
- tag['id'] = 1
- tag
- # Extremely bold
-
- del tag['class']
- del tag['id']
- tag
- # Extremely bold
-
-``NavigableString``
--------------------
-
-A string corresponds to a bit of text within a tag. Beautiful Soup
-defines the ``NavigableString`` class to contain these bits of text::
-
- tag.string
- # u'Extremely bold'
- type(tag.string)
- # No longer bold
-
-``NavigableString`` supports most of the features described in
-`Navigating the tree`_ and `Searching the tree`_, but not all of
-them. In particular, since a string can't contain anything (the way a
-tag may contain a string or another tag), strings don't support the
-``.contents`` or ``.string`` attributes, or the ``find()`` method.
-
-``BeautifulSoup``
------------------
-
-The ``BeautifulSoup`` object itself represents the document as a
-whole. For most purposes, you can treat it as a :ref:`Tag`
-object. This means it supports most of the methods described in
-`Navigating the tree`_ and `Searching the tree`_.
-
-Since the ``BeautifulSoup`` object doesn't correspond to an actual
-HTML or XML tag, it has no name and no attributes. But sometimes it's
-useful to look at its ``.name``, so it's been given the special
-``.name`` "[document]"::
-
- soup.name
- # u'[document]'
-
-Comments and other special strings
-----------------------------------
-
-``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
-everything you'll see in an HTML or XML file, but there are a few
-leftover bits. The only one you'll probably ever need to worry about
-is the comment::
-
- markup = ""
- soup = BeautifulSoup(markup)
- comment = soup.b.string
- type(comment)
- # Extremely bold
-
- del tag['class']
- del tag['id']
- tag
- # Extremely bold
-
-
-Modifying ``.string``
----------------------
-
-If you set a tag's ``.string`` attribute, the tag's contents are
-replaced with the string you give::
-
- markup = 'I linked to example.com'
- soup = BeautifulSoup(markup)
-
- tag = soup.a
- tag.string = "New link text."
- tag
- # New link text.
-
-Be careful: if the tag contained other tags, they and all their
-contents will be destroyed.
-
-``append()``
-------------
-
-You can add to a tag's contents with ``Tag.append()``. It works just
-like calling ``.append()`` on a Python list::
-
- soup = BeautifulSoup("<a>Foo</a>")
- soup.a.append("Bar")
-
- soup
- #
tag. This parser also adds an empty
tag to the
-document.
-
-Here's the same document parsed with Python's built-in HTML
-parser::
-
- BeautifulSoup("", "html.parser")
- #
-
-Like html5lib, this parser ignores the closing tag. Unlike
-html5lib, this parser makes no attempt to create a well-formed HTML
-document by adding a tag. Unlike lxml, it doesn't even bother
-to add an tag.
-
-Since the document "" is invalid, none of these techniques is
-the "correct" way to handle it. The html5lib parser uses techniques
-that are part of the HTML5 standard, so it has the best claim on being
-the "correct" way, but all three techniques are legitimate.
-
-Differences between parsers can affect your script. If you're planning
-on distributing your script to other people, you might want to specify
-in the ``BeautifulSoup`` constructor which parser you used during
-development. That will reduce the chances that your users parse a
-document differently from the way you parse it.
-
-
-Encodings
-=========
-
-Any HTML or XML document is written in a specific encoding like ASCII
-or UTF-8. But when you load that document into Beautiful Soup, you'll
-discover it's been converted to Unicode::
-
- markup = "Sacr\xe9 bleu!
-
-
- '''
-
- soup = BeautifulSoup(markup)
- print(soup.prettify())
- #
- #
- #
- #
- #
- #
- # Sacré bleu!
- #
- #
- #
-
-Note that the tag has been rewritten to reflect the fact that
-the document is now in UTF-8.
-
-If you don't want UTF-8, you can pass an encoding into ``prettify()``::
-
- print(soup.prettify("latin-1"))
- #
- #
- #
- # ...
-
-You can also call ``encode()`` on the ``BeautifulSoup`` object, or any
-element in the soup, just as if it were a Python string::
-
- soup.p.encode("latin-1")
- # 'Sacr\xe9 bleu!
'
-
- soup.p.encode("utf-8")
- # 'Sacr\xc3\xa9 bleu!
'
-
-Unicode, Dammit
----------------
-
-You can use Unicode, Dammit without using Beautiful Soup. It's useful
-whenever you have data in an unknown encoding and you just want it to
-become Unicode::
-
- from bs4 import UnicodeDammit
- dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
- print(dammit.unicode_markup)
- # Sacré bleu!
- dammit.original_encoding
- # 'utf-8'
-
-The more data you give Unicode, Dammit, the more accurately it will
-guess. If you have your own suspicions as to what the encoding might
-be, you can pass them in as a list::
-
- dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
- print(dammit.unicode_markup)
- # Sacré bleu!
- dammit.original_encoding
- # 'latin-1'
-
-Unicode, Dammit has one special feature that Beautiful Soup doesn't
-use. You can use it to convert Microsoft smart quotes to HTML or XML
-entities::
-
- markup = b"I just \x93love\x94 Microsoft Word"
-
- UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
- # u'I just &ldquo;love&rdquo; Microsoft Word'
-
- UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
- # u'I just &#x201C;love&#x201D; Microsoft Word'
-
-You might find this feature useful, but Beautiful Soup doesn't use
-it. Beautiful Soup prefers the default behavior, which is to convert
-Microsoft smart quotes to Unicode characters along with everything
-else::
-
- UnicodeDammit(markup, ["windows-1252"]).unicode_markup
- # u'I just \u201clove\u201d Microsoft Word
'
-
-Parsing only part of a document
-===============================
-
-Let's say you want to use Beautiful Soup to look at a document's <a>
-tags. It's a waste of time and memory to parse the entire document and
-then go over it again looking for <a> tags. It would be much faster to
-ignore everything that wasn't an <a> tag in the first place. The
-``SoupStrainer`` class allows you to choose which parts of an incoming
-document are parsed. You just create a ``SoupStrainer`` and pass it in
-to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.
-
-(Note that *this feature won't work if you're using the html5lib
-parser*. If you use html5lib, the whole document will be parsed, no
-matter what. In the examples below, I'll be forcing Beautiful Soup to
-use Python's built-in parser.)
-
-``SoupStrainer``
-----------------
-
-The ``SoupStrainer`` class takes the same arguments as a typical
-method from `Searching the tree`_: :ref:`nameThe Dormouse's story
-
-Once upon a time there were three little sisters; and their names were
- Elsie,
- Lacie and
- Tillie;
- and they lived at the bottom of a well.
- -...
- """
-
- print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
- #
- # Elsie
- #
- #
- # Lacie
- #
- #
- # Tillie
- #
-
- print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
- #
- # Lacie
- #
-
- print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
- # Elsie
- # ,
- # Lacie
- # and
- # Tillie
- # ...
- #
-
-You can also pass a ``SoupStrainer`` into any of the methods covered
-in `Searching the tree`_. This probably isn't terribly useful, but I
-thought I'd mention it::
-
- soup = BeautifulSoup(html_doc)
- soup.find_all(only_short_strings)
- # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
- # u'\n\n', u'...', u'\n']
-
-Troubleshooting
-===============
-
-Parsing XML
------------
-
-By default, Beautiful Soup parses documents as HTML. To parse a
-document as XML, pass in "xml" as the second argument to the
-``BeautifulSoup`` constructor::
-
- soup = BeautifulSoup(markup, "xml")
-
-You'll need to :ref:`have lxml installed