Beautiful Soup Documentation
============================
.. image:: 6.1.jpg
   :align: right
   :alt: "The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."
Quick Start
===========

Here's an HTML document I'll be using as an example throughout this
document. It's part of a story from *Alice in Wonderland*::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

Running the "three sisters" document through Beautiful Soup gives us a
``BeautifulSoup`` object, which represents the document as a nested
data structure::

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc)

Navigating the tree
===================

Going back and forth
--------------------

Take a look at the beginning of the "three sisters" document::

 <html><head><title>The Dormouse's story</title></head>
 <p class="title"><b>The Dormouse's story</b></p>

An HTML parser takes this string of characters and turns it into a
series of events: "open an <html> tag", "open a <head> tag", "open a
<title> tag", "add a string", "close the <title> tag", "open a <p>
tag", and so on. Beautiful Soup offers tools for reconstructing the
initial parse of the document.
.. _element-generators:
``.next_element`` and ``.previous_element``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``.next_element`` attribute of a string or tag points to whatever
was parsed immediately afterwards. It might be the same as
``.next_sibling``, but it's usually drastically different.
Here's the final <a> tag in the "three sisters" document. Its
``.next_sibling`` is a string: the conclusion of the sentence that was
interrupted by the start of the <a> tag::
 last_a_tag = soup.find("a", id="link3")
 last_a_tag
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 last_a_tag.next_sibling
 # '; and they lived at the bottom of a well.'
But the ``.next_element`` of that <a> tag, the thing that was parsed
immediately after the <a> tag, is `not` the rest of that sentence:
it's the word "Tillie"::
 last_a_tag.next_element
 # u'Tillie'
That's because in the original markup, the word "Tillie" appeared
before that semicolon. The parser encountered an <a> tag, then the
word "Tillie", then the closing </a> tag, then the semicolon and rest of
the sentence. The semicolon is on the same level as the <a> tag, but the
word "Tillie" was encountered first.
The ``.previous_element`` attribute is the exact opposite of
``.next_element``. It points to whatever element was parsed
immediately before this one::
 last_a_tag.previous_element
 # u' and\n'
 last_a_tag.previous_element.next_element
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
``.next_elements`` and ``.previous_elements``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You should get the idea by now. You can use these iterators to move
forward or backward in the document as it was parsed::
 for element in last_a_tag.next_elements:
     print(repr(element))
 # u'Tillie'
 # u';\nand they lived at the bottom of a well.'
 # u'\n\n'
 # <p class="story">...</p>
 # u'...'
 # u'\n'
Searching the tree
==================

Beautiful Soup defines a lot of methods for searching the parse tree,
but they're all very similar. You can filter a search by a tag's name,
by its attributes, by the text of a string, or by some combination of
these. Filters can be strings, regular expressions, lists, the value
``True``, or functions. A function filter takes a tag as its only
argument and returns ``True`` if the tag matches. Here's a function
that returns ``True`` if a tag defines the "class" attribute but
doesn't define the "id" attribute::

 def has_class_but_no_id(tag):
     return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into ``find_all()`` and you'll pick up all the <p>
tags::

 soup.find_all(has_class_but_no_id)
 # [<p class="title"><b>The Dormouse's story</b></p>,
 #  <p class="story">Once upon a time there were...</p>,
 #  <p class="story">...</p>]

This function only picks up the <p> tags. It doesn't pick up the <a>
tags, because those tags define both "class" and "id". It doesn't pick
up tags like <html> and <title>, because those tags don't define
"class".

``find_all()``
--------------

The ``find_all()`` method looks through a tag's descendants and
retrieves `all` descendants that match your filters. How does a search
find, say, the <p> tag with the CSS class "title"?
Let's look at the arguments to ``find_all()``.
.. _name:
The ``name`` argument
^^^^^^^^^^^^^^^^^^^^^
Pass in a value for ``name`` and you'll tell Beautiful Soup to only
consider tags with certain names. Text strings will be ignored, as
will tags whose names don't match.
This is the simplest usage::
 soup.find_all("title")
 # [<title>The Dormouse's story</title>]

``find_all()`` takes several other arguments (``attrs``, ``string``,
``limit``, ``recursive``, and arbitrary keyword arguments) that filter
on attribute values, string content, and the number of results. The
other search methods described below take the same arguments, but
search different parts of the tree.

``find_parents()`` and ``find_parent()``
----------------------------------------

While ``find_all()`` and ``find()`` work their way down the tree,
looking at a tag's descendants, these methods do the opposite: they
work their way `up` the tree, looking at a tag's (or a string's)
parents. Let's try them out, starting from a string buried deep in the
"three sisters" document::

 a_string = soup.find(string="Lacie")
 a_string
 # u'Lacie'

 a_string.find_parents("a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

 a_string.find_parent("p")
 # <p class="story">Once upon a time there were three little sisters; and their names were
 #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 #  and they lived at the bottom of a well.</p>

 a_string.find_parents("p", class_="title")
 # []

One of the three <a> tags is the direct parent of the string in
question, so our search finds it. One of the three <p> tags is an
indirect parent of the string, and our search finds that as
well. There's a <p> tag with the CSS class "title" `somewhere` in the
document, but it's not one of this string's parents, so we can't find
it with ``find_parents()``.

You may have made the connection between ``find_parent()`` and
``find_parents()``, and the `.parent`_ and `.parents`_ attributes
mentioned earlier. The connection is very strong. These search methods
actually use ``.parents`` to iterate over all the parents, and check
each one against the provided filter to see if it matches.
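As a rough sketch (not the library's actual code), a ``find_parents()``
call with a plain tag-name filter behaves like filtering the
``.parents`` generator yourself::

 def find_parents_sketch(element, tag_name):
     # Walk up the tree with .parents and keep the tags whose
     # name matches -- roughly what find_parents(tag_name) does.
     return [parent for parent in element.parents
             if parent.name == tag_name]

 a_string = soup.find(string="Lacie")
 find_parents_sketch(a_string, "p")
 # [<p class="story">Once upon a time there were three little sisters; ...</p>]
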
``find_next_siblings()`` and ``find_next_sibling()``
----------------------------------------------------
Signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)

These methods use ``.next_siblings`` to iterate over the rest of an
element's siblings in the tree. The ``find_next_siblings()`` method
returns all the siblings that match, and ``find_next_sibling()`` only
returns the first one::

 first_link = soup.a
 first_link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 first_link.find_next_siblings("a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

``find_all_next()`` and ``find_next()``
---------------------------------------

These methods use :ref:`.next_elements <element-generators>` to
iterate over whatever tags and strings come after an element in the
document. The ``find_all_next()`` method returns all matches, and
``find_next()`` only returns the first match::

 first_link.find_next("p")
 # <p class="story">...</p>

Here the last <p> tag in the document showed up, even though it's not in
the same part of the tree as the <a> tag we started from. For these
methods, all that matters is that an element match the filter, and
show up later in the document than the starting element.

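Beautiful Soup also defines the mirror-image methods
``find_previous_siblings()`` and ``find_previous_sibling()``, which use
``.previous_siblings`` to look at the siblings that come `before` an
element. A quick sketch (output shown for the "three sisters" document;
exact formatting may vary between versions)::

 last_link = soup.find("a", id="link3")
 last_link.find_previous_siblings("a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
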
``find_all_previous()`` and ``find_previous()``
-----------------------------------------------
Signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

Signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)

These methods use :ref:`.previous_elements <element-generators>` to
iterate over the tags and strings that came before an element in the
document. The ``find_all_previous()`` method returns all matches, and
``find_previous()`` only returns the first match::

 first_link = soup.a

 first_link.find_all_previous("p")
 # [<p class="story">Once upon a time there were three little sisters; ...</p>,
 #  <p class="title"><b>The Dormouse's story</b></p>]

 first_link.find_previous("title")
 # <title>The Dormouse's story</title>

The call to ``find_all_previous("p")`` found the first paragraph in
the document (the one with class="title"), but it also finds the
second paragraph, the <p> tag that contains the <a> tag we started
with. This shouldn't be too surprising: we're looking at all the tags
that show up earlier in the document than the one we started with. A
<p> tag that contains an <a> tag must have shown up earlier in the
document than the <a> tag it contains.

Modifying the tree
==================
Beautiful Soup's main strength is in searching the parse tree, but you
can also modify the tree and write your changes as a new HTML or XML
document.
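As a minimal sketch of that workflow (using Python's built-in parser so
the output below is predictable): make a change, then serialize the
tree back to markup with ``str()`` or ``prettify()``::

 from bs4 import BeautifulSoup

 soup = BeautifulSoup('<a href="http://example.com/">I linked to example.com</a>',
                      "html.parser")
 soup.a['href'] = "http://example.org/"   # change an attribute in place
 str(soup)                                # the whole modified tree, as markup
 # '<a href="http://example.org/">I linked to example.com</a>'
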
Changing tag names and attributes
---------------------------------
This is covered in more detail in `Attributes`_ below, but it bears repeating. You
can rename a tag, change the values of its attributes, add new
attributes, and delete attributes::
 soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
 tag = soup.b

 tag.name = "blockquote"
 tag['class'] = 'verybold'
 tag['id'] = 1
 tag
 # <blockquote class="verybold" id="1">Extremely bold</blockquote>
Attributes
^^^^^^^^^^
A tag may have any number of attributes. The tag ``<b class="boldest">`` has an attribute "class" whose value is
"boldest". You can access a tag's attributes by treating the tag like
a dictionary::
 tag['class']
 # u'boldest'
You can access that dictionary directly as ``.attrs``::
 tag.attrs
 # {u'class': u'boldest'}
You can add, remove, and modify a tag's attributes. Again, this is
done by treating the tag as a dictionary::
 tag['class'] = 'verybold'
 tag['id'] = 1
 tag
 # <blockquote class="verybold" id="1">Extremely bold</blockquote>

 del tag['class']
 del tag['id']
 tag
 # <blockquote>Extremely bold</blockquote>
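Asking for an attribute the tag doesn't have raises a ``KeyError``,
just like a dictionary lookup; if you're not sure the attribute is
there, ``.get()`` is the safer way to look (a short sketch, continuing
the example above)::

 tag['class']
 # KeyError: 'class'
 print(tag.get('class'))
 # None
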
``NavigableString``
-------------------
A string corresponds to a bit of text within a tag. Beautiful Soup
defines the ``NavigableString`` class to contain these bits of text::
 tag.string
 # u'Extremely bold'
 type(tag.string)
 # <class 'bs4.element.NavigableString'>

You can't edit a string in place, but you can replace one string with
another, using ``replace_with()``::

 tag.string.replace_with("No longer bold")
 tag
 # <blockquote>No longer bold</blockquote>

``NavigableString`` supports most of the features described in
`Navigating the tree`_ and `Searching the tree`_, but not all of
them. In particular, since a string can't contain anything (the way a
tag may contain a string or another tag), strings don't support the
``.contents`` or ``.string`` attributes, or the ``find()`` method.
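If you want to use a ``NavigableString`` outside of Beautiful Soup,
it's a good idea to convert it to a plain Python string first, so you
don't keep a reference to the whole parse tree alive. A short sketch
(using the ``tag`` from the example above)::

 plain = str(tag.string)   # on Python 2, use unicode() instead
 plain
 # 'No longer bold'
 type(plain)
 # <class 'str'>
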
``BeautifulSoup``
-----------------
The ``BeautifulSoup`` object itself represents the document as a
whole. For most purposes, you can treat it as a :ref:`Tag`
object. This means it supports most of the methods described in
`Navigating the tree`_ and `Searching the tree`_.
Since the ``BeautifulSoup`` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::
 soup.name
 # u'[document]'
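That means you can start a search from the ``BeautifulSoup`` object
exactly as you would from a tag. A quick sketch, using the "three
sisters" ``soup`` from the beginning of this document::

 soup.find_all("a", limit=2)
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
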
Comments and other special strings
----------------------------------
``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
everything you'll see in an HTML or XML file, but there are a few
leftover bits. The only one you'll probably ever need to worry about
is the comment::
markup = ""
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# Extremely bold
del tag['class']
del tag['id']
tag
# Extremely bold
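The ``Comment`` object is just a special type of ``NavigableString``,
but when the document is rendered, the comment delimiters come back. A
short sketch (output formatting may differ slightly between versions)::

 comment
 # u'Hey, buddy. Want to buy a used parser?'

 print(soup.b.prettify())
 # <b>
 #  <!--Hey, buddy. Want to buy a used parser?-->
 # </b>
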
Modifying ``.string``
---------------------
If you set a tag's ``.string`` attribute, the tag's contents are
replaced with the string you give::
 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)

 tag = soup.a
 tag.string = "New link text."
 tag
 # <a href="http://example.com/">New link text.</a>
Be careful: if the tag contained other tags, they and all their
contents will be destroyed.
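To see what that means for the example above: the <i> tag that used to
be inside the link is gone, and the new string is now the tag's only
child (a small illustrative check)::

 tag.contents
 # [u'New link text.']
 print(soup.find("i"))
 # None
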
``append()``
------------
You can add to a tag's contents with ``Tag.append()``. It works just
like calling ``.append()`` on a Python list::
soup = BeautifulSoup("Foo")
soup.a.append("Bar")
soup
#
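If you want to append a whole new tag rather than a string, create it
first with the ``new_tag()`` factory method (a brief sketch, using
Python's built-in parser so no <html> or <body> tags get added)::

 soup = BeautifulSoup("<b></b>", "html.parser")
 original_tag = soup.b

 new_tag = soup.new_tag("a", href="http://www.example.com")
 new_tag.string = "Link text."
 original_tag.append(new_tag)
 original_tag
 # <b><a href="http://www.example.com">Link text.</a></b>
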
Differences between parsers
===========================

Beautiful Soup presents the same interface to a number of different
parsers, but each parser is different, and different parsers will
create different parse trees from the same document. Here's the
invalid document "<a></p>" parsed with html5lib::

 BeautifulSoup("<a></p>", "html5lib")
 # <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling </p> tag, html5lib turns it into an
empty <p> tag. This parser also adds an empty <head> tag to the
document.

Here's the same document parsed with Python's built-in HTML parser::

 BeautifulSoup("<a></p>", "html.parser")
 # <a></a>

Like html5lib, this parser ignores the closing </p> tag. Unlike
html5lib, this parser makes no attempt to create a well-formed HTML
document by adding a <body> tag. Unlike lxml, it doesn't even bother
to add an <html> tag.

Since the document "<a></p>" is invalid, none of these techniques is
the "correct" way to handle it. The html5lib parser uses techniques
that are part of the HTML5 standard, so it has the best claim on being
the "correct" way, but all three techniques are legitimate.

Differences between parsers can affect your script. If you're planning
on distributing your script to other people, you might want to specify
in the ``BeautifulSoup`` constructor which parser you used during
development. That will reduce the chances that your users parse a
document differently from the way you parse it.

Encodings
=========

Any HTML or XML document is written in a specific encoding like ASCII
or UTF-8. But when you load that document into Beautiful Soup, you'll
discover it's been converted to Unicode. When you write a document
back out, you get a UTF-8 document, even if the document wasn't in
UTF-8 to begin with. Here's a document written in the Latin-1
encoding::

 markup = b'''
  <html>
   <head>
    <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
   </head>
   <body>
    <p>Sacr\xe9 bleu!</p>
   </body>
  </html>
 '''

 soup = BeautifulSoup(markup)
 print(soup.prettify())
 # <html>
 #  <head>
 #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
 #  </head>
 #  <body>
 #   <p>
 #    Sacré bleu!
 #   </p>
 #  </body>
 # </html>

Note that the <meta> tag has been rewritten to reflect the fact that
the document is now in UTF-8. If you don't want UTF-8, you can pass an
encoding into ``prettify()``::

 print(soup.prettify("latin-1"))
 # <html>
 #  <head>
 #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
 # ...

You can also call encode() on the ``BeautifulSoup`` object, or any
element in the soup, just as if it were a Python string::

 soup.p.encode("latin-1")
 # '<p>Sacr\xe9 bleu!</p>'

 soup.p.encode("utf-8")
 # '<p>Sacr\xc3\xa9 bleu!</p>'

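If Beautiful Soup guesses the original encoding wrong, you can override
the guess by passing the right one in as ``from_encoding``. A brief
sketch, with a few bytes of ISO-8859-8 (Hebrew) text::

 markup = b"<h1>\xed\xe5\xf0\xe9\xf7</h1>"
 soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
 soup.original_encoding
 # 'iso-8859-8'
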
Unicode, Dammit
---------------

You can use Unicode, Dammit without using Beautiful Soup. It's useful
whenever you have data in an unknown encoding and you just want it to
become Unicode::

 from bs4 import UnicodeDammit
 dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
 print(dammit.unicode_markup)
 # Sacré bleu!
 dammit.original_encoding
 # 'utf-8'

The more data you give Unicode, Dammit, the more accurately it will
guess. If you have your own suspicions as to what the encoding might
be, you can pass them in as a list::

 dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
 print(dammit.unicode_markup)
 # Sacré bleu!
 dammit.original_encoding
 # 'latin-1'

Unicode, Dammit has one special feature that Beautiful Soup doesn't
use. You can use it to convert Microsoft smart quotes to HTML or XML
entities::

 markup = b"<p>I just \x93love\x94 Microsoft Word</p>"

 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
 # u'<p>I just &ldquo;love&rdquo; Microsoft Word</p>'

 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
 # u'<p>I just &#x201C;love&#x201D; Microsoft Word</p>'

You might find this feature useful, but Beautiful Soup doesn't use
it. Beautiful Soup prefers the default behavior, which is to convert
Microsoft smart quotes to Unicode characters along with everything
else::

 UnicodeDammit(markup, ["windows-1252"]).unicode_markup
 # u'<p>I just \u201clove\u201d Microsoft Word</p>'

Parsing only part of a document
===============================

Let's say you want to use Beautiful Soup to look at a document's <a>
tags. It's a waste of time and memory to parse the entire document and
then go over it again looking for <a> tags. It would be much faster to
ignore everything that wasn't an <a> tag in the first place. The
``SoupStrainer`` class allows you to choose which parts of an incoming
document are parsed. You just create a ``SoupStrainer`` and pass it in
to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.

(Note that *this feature won't work if you're using the html5lib
parser*. If you use html5lib, the whole document will be parsed, no
matter what. In the examples below, I'll be forcing Beautiful Soup to
use Python's built-in parser.)

``SoupStrainer``
----------------

The ``SoupStrainer`` class takes the same arguments as a typical
method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs
<attrs>`, :ref:`string <string>`, and :ref:`**kwargs <kwargs>`. Here
are three ``SoupStrainer`` objects::

 from bs4 import SoupStrainer

 only_a_tags = SoupStrainer("a")

 only_tags_with_id_link2 = SoupStrainer(id="link2")

 def is_short_string(string):
     return len(string) < 10

 only_short_strings = SoupStrainer(string=is_short_string)

Here's the "three sisters" document one more time, and what it looks
like when it's parsed with these three ``SoupStrainer`` objects::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
 # <a class="sister" href="http://example.com/elsie" id="link1">
 #  Elsie
 # </a>
 # <a class="sister" href="http://example.com/lacie" id="link2">
 #  Lacie
 # </a>
 # <a class="sister" href="http://example.com/tillie" id="link3">
 #  Tillie
 # </a>

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
 # <a class="sister" href="http://example.com/lacie" id="link2">
 #  Lacie
 # </a>

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
 # Elsie
 # ,
 # Lacie
 # and
 # Tillie
 # ...
 #

You can also pass a ``SoupStrainer`` into any of the methods covered
in `Searching the tree`_. This probably isn't terribly useful, but I
thought I'd mention it::

 soup = BeautifulSoup(html_doc)
 soup.find_all(only_short_strings)
 # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
 #  u'\n\n', u'...', u'\n']

Troubleshooting
===============

Parsing XML
-----------

By default, Beautiful Soup parses documents as HTML. To parse a
document as XML, pass in "xml" as the second argument to the
``BeautifulSoup`` constructor::

 soup = BeautifulSoup(markup, "xml")

You'll need to :ref:`have lxml installed <parser-installation>`.