Beautiful Soup Documentation
============================
.. image:: 6.1.jpg
   :align: right
   :alt: "The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."
Beautiful Soup is a Python library for pulling data out of HTML and
XML files.

Quick Start
===========

Here's an HTML document I'll be using as an example throughout this
document. It's part of a story from `Alice in Wonderland`::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

Running the "three sisters" document through Beautiful Soup gives us a
``BeautifulSoup`` object, which represents the document as a nested
data structure::

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc)

 print(soup.prettify())
 # <html>
 #  <head>
 #   <title>
 #    The Dormouse's story
 #   </title>
 #  </head>
 #  <body>
 #   <p class="title">
 #    <b>
 #     The Dormouse's story
 #    </b>
 #   </p>
 #   <p class="story">
 #    Once upon a time there were three little sisters; and their names were
 #    <a class="sister" href="http://example.com/elsie" id="link1">
 #     Elsie
 #    </a>
 #    ,
 #    <a class="sister" href="http://example.com/lacie" id="link2">
 #     Lacie
 #    </a>
 #    and
 #    <a class="sister" href="http://example.com/tillie" id="link3">
 #     Tillie
 #    </a>
 #    ; and they lived at the bottom of a well.
 #   </p>
 #   <p class="story">
 #    ...
 #   </p>
 #  </body>
 # </html>

Here are some simple ways to navigate that data structure::

 soup.title
 # <title>The Dormouse's story</title>

 soup.p['class']
 # u'title'

 soup.a
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 soup.find_all('a')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.find(id="link3")
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's <a>
tags::

 for link in soup.find_all('a'):
     print(link.get('href'))
 # http://example.com/elsie
 # http://example.com/lacie
 # http://example.com/tillie

Another common task is extracting all the text from a page::

 print(soup.get_text())
 # The Dormouse's story
 #
 # The Dormouse's story
 #
 # Once upon a time there were three little sisters; and their names were
 # Elsie,
 # Lacie and
 # Tillie;
 # and they lived at the bottom of a well.
 #
 # ...

Does this look like what you need? If so, read on.

Installing Beautiful Soup
=========================

Beautiful Soup 4 is published through PyPI, so you can install it with
``easy_install``. If you're on the Python 2 series, the package name
is ``beautifulsoup4``:

:kbd:`$ easy_install beautifulsoup4`

If you're using the Python 3 series, the package name is
``beautifulsoup4py3k``:

:kbd:`$ easy_install-3.2 install beautifulsoup4py3k`

(The ``beautifulsoup`` package is probably `not` what you want. That's
the previous major release, `Beautiful Soup 3`_. Lots of software uses
BS3, so it's still available, but if you're writing new code you
should install ``beautifulsoup4``.)

You can also download the Beautiful Soup 4 source tarball and install
it with ``setup.py install``.

Kinds of objects
================

Beautiful Soup transforms a complex HTML document into a complex tree
of Python objects. But you'll only have to deal with about four
`kinds` of objects.

``Tag``
-------

A ``Tag`` object corresponds to an XML or HTML tag in the original
document::

 soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
 tag = soup.b
 type(tag)
 # <class 'bs4.element.Tag'>

The two most important features of a tag are its name and its
attributes.

Name
^^^^

Every tag has a name, accessible as ``.name``::

 tag.name
 # u'b'

If you change a tag's name, the change will be reflected in any HTML
markup generated by Beautiful Soup::

 tag.name = "blockquote"
 tag
 # <blockquote class="boldest">Extremely bold</blockquote>

Attributes
^^^^^^^^^^

A tag may have any number of attributes. The tag ``<b
class="boldest">`` has an attribute "class" whose value is
"boldest". You can access a tag's attributes by treating the tag like
a dictionary::

 tag['class']
 # u'boldest'

You can access that dictionary directly as ``.attrs``::

 tag.attrs
 # {u'class': u'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is
done by treating the tag as a dictionary::

 tag['class'] = 'verybold'
 tag['id'] = 1
 tag
 # <blockquote class="verybold" id="1">Extremely bold</blockquote>

 del tag['class']
 del tag['id']
 tag
 # <blockquote>Extremely bold</blockquote>

``NavigableString``
-------------------

A string corresponds to a bit of text within a tag. Beautiful Soup
defines the ``NavigableString`` class to contain these bits of text::

 tag.string
 # u'Extremely bold'
 type(tag.string)
 # <class 'bs4.element.NavigableString'>

You can't edit a string in place, but you can replace one string with
another, using ``replace_with()``::

 tag.string.replace_with("No longer bold")
 tag
 # <blockquote>No longer bold</blockquote>

``NavigableString`` supports most of the features described in
`Navigating the tree`_ and `Searching the tree`_, but not all of
them. In particular, since a string can't contain anything (the way a
tag may contain a string or another tag), strings don't support the
``.contents`` or ``.string`` attributes, or the ``find()`` method.

``BeautifulSoup``
-----------------

The ``BeautifulSoup`` object itself represents the document as a
whole. For most purposes, you can treat it as a :ref:`Tag`
object. This means it supports most of the methods described in
`Navigating the tree`_ and `Searching the tree`_.

Since the ``BeautifulSoup`` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::

 soup.name
 # u'[document]'

Comments and other special strings
----------------------------------

``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
everything you'll see in an HTML or XML file, but there are a few
leftover bits. The only one you'll probably ever need to worry about
is the comment::

 markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
 soup = BeautifulSoup(markup)
 comment = soup.b.string
 type(comment)
 # <class 'bs4.element.Comment'>
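To keep the four object types straight, here's a minimal, runnable sketch (the comment markup re-creates the example above; the ``html.parser`` backend is assumed so no third-party parser is needed) showing how ``isinstance()`` checks distinguish them:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

soup = BeautifulSoup("<b><!--Hey, buddy. Want to buy a used parser?--></b>",
                     "html.parser")
tag = soup.b
comment = tag.string

# A Comment is a special kind of NavigableString, so both checks hold
print(isinstance(tag, Tag))                  # True
print(isinstance(comment, Comment))          # True
print(isinstance(comment, NavigableString))  # True
```

Because ``Comment`` subclasses ``NavigableString``, code that walks over strings will also see comments unless it filters them out explicitly.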
Navigating the tree
===================

Here's the "three sisters" HTML document again::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc)

I'll use this as an example to show you how to move from one part of a
document to another.

Going down
----------

Tags may contain strings and other tags. These elements are the tag's
`children`. Beautiful Soup provides a lot of different attributes for
navigating and iterating over a tag's children.

Note that Beautiful Soup strings don't support any of these
attributes, because a string can't have children.

Navigating using tag names
^^^^^^^^^^^^^^^^^^^^^^^^^^

The simplest way to navigate the parse tree is to say the name of the
tag you want. If you want the <head> tag, just say ``soup.head``::

 soup.head
 # <head><title>The Dormouse's story</title></head>

 soup.title
 # <title>The Dormouse's story</title>

Going back and forth
--------------------

Take the beginning of the "three sisters" document::

 <html><head><title>The Dormouse's story</title></head>
 <p class="title"><b>The Dormouse's story</b></p>

An HTML parser takes this string of characters and turns it into a
series of events: "open an <html> tag", "open a <head> tag", "open a
<title> tag", and so on. Beautiful Soup offers tools for
reconstructing the initial parse of the document.

.. _element-generators:

``.next_element`` and ``.previous_element``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``.next_element`` attribute of a string or tag points to whatever
was parsed immediately afterwards. It might be the same as
``.next_sibling``, but it's usually drastically different.

Here's the final <a> tag in the "three sisters" document. Its
``.next_sibling`` is a string: the conclusion of the sentence that was
interrupted by the start of the <a> tag::

 last_a_tag = soup.find("a", id="link3")
 last_a_tag
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 last_a_tag.next_sibling
 # '; and they lived at the bottom of a well.'

But the ``.next_element`` of that <a> tag, the thing that was parsed
immediately after the <a> tag, is `not` the rest of that sentence:
it's the word "Tillie"::

 last_a_tag.next_element
 # u'Tillie'

That's because in the original markup, the word "Tillie" appeared
before that semicolon. The parser encountered an <a> tag, then the
word "Tillie", then the closing </a> tag, then the semicolon and rest
of the sentence. The semicolon is on the same level as the <a> tag,
but the word "Tillie" was encountered first.

The ``.previous_element`` attribute is the exact opposite of
``.next_element``. It points to whatever element was parsed
immediately before this one::

 last_a_tag.previous_element
 # u' and\n'

 last_a_tag.previous_element.next_element
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

``.next_elements`` and ``.previous_elements``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You should get the idea by now. You can use these iterators to move
forward or backward in the document as it was parsed::

 for element in last_a_tag.next_elements:
     print(repr(element))
 # u'Tillie'
 # u';\nand they lived at the bottom of a well.'
 # u'\n\n'
 # <p class="story">...</p>
 # u'...'
 # u'\n'
 # None

Searching the tree
==================

Beautiful Soup defines a lot of methods for searching the parse tree,
but they're all very similar. I'm going to spend a lot of time
explaining the two most popular methods: ``find()`` and
``find_all()``. The other methods take almost exactly the same
arguments, so I'll just cover them briefly.

Once again, I'll be using the "three sisters" document as an example::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc)

By passing in a filter to an argument like ``find_all()``, you can
isolate whatever parts of the document you're interested in.

Kinds of filters
----------------

Before talking in detail about ``find_all()`` and similar methods, I
want to show examples of different filters you can pass into these
methods. These filters show up again and again, throughout the search
API. You can use them to filter based on a tag's name, on its
attributes, on the text of a string, or on some combination of these.

.. _a string:

A string
^^^^^^^^

The simplest filter is a string. Pass a string to a search method and
Beautiful Soup will perform a match against that exact string. This
code finds all the <b> tags in the document::

 soup.find_all('b')
 # [<b>The Dormouse's story</b>]

.. _a regular expression:

A regular expression
^^^^^^^^^^^^^^^^^^^^

If you pass in a regular expression object, Beautiful Soup will filter
against that regular expression. This code finds all the tags whose
names start with the letter "b"; in this case, the <body> tag and the
<b> tag::

 import re
 for tag in soup.find_all(re.compile("b.*")):
     print(tag.name)
 # body
 # b

.. _a list:

A list
^^^^^^

If you pass in a list, Beautiful Soup will allow a string match
against `any` item in that list. This code finds all the <a> tags
`and` all the <b> tags::

 soup.find_all(["a", "b"])
 # [<b>The Dormouse's story</b>,
 #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.. _the value True:

``True``
^^^^^^^^

The value ``True`` matches everything it can. This code finds `all`
the tags in the document, but none of the text strings::

 for tag in soup.find_all(True):
     print(tag.name)
 # html
 # head
 # title
 # body
 # p
 # b
 # p
 # a
 # a
 # a
 # p

.. _a function:

A function
^^^^^^^^^^

If none of the other matches work for you, define a function that
takes an element as its only argument. The function should return
``True`` if the argument matches, and ``False`` otherwise.
Here's a function that returns ``True`` if a tag defines the "class"
attribute but doesn't define the "id" attribute::

 def has_class_but_no_id(tag):
     return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into ``find_all()`` and you'll pick up all the
<p> tags::

 soup.find_all(has_class_but_no_id)
 # [<p class="title"><b>The Dormouse's story</b></p>,
 #  <p class="story">Once upon a time there were...</p>,
 #  <p class="story">...</p>]

This function only picks up the <p> tags. It doesn't pick up the <a>
tags, because those tags define both "class" and "id". It doesn't pick
up tags like <html> and <title>, because those tags don't define
"class".

Now we're ready to look at the search methods in detail.

``find_all()``
--------------

Signature: find_all(name, attrs, recursive, text, limit, **kwargs)

The ``find_all()`` method looks through a tag's descendants and
retrieves `all` descendants that match your filters. I gave several
examples in `Kinds of filters`_, but here are a few more::

 soup.find_all("p", "title")
 # [<p class="title"><b>The Dormouse's story</b></p>]

 soup.find_all(id="link2")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Some of these should look familiar, but others are new. What does it
mean to pass in a value for ``id``? Why does ``find_all("p",
"title")`` find a <p> tag with the CSS class "title"?
Let's look at the arguments to ``find_all()``.
.. _name:
The ``name`` argument
^^^^^^^^^^^^^^^^^^^^^
Pass in a value for ``name`` and you'll tell Beautiful Soup to only
consider tags with certain names. Text strings will be ignored, as
will tags whose names don't match.
This is the simplest usage::
 soup.find_all("title")
 # [<title>The Dormouse's story</title>]

``find_parents()`` and ``find_parent()``
----------------------------------------

Signature: find_parents(name, attrs, text, limit, **kwargs)

Signature: find_parent(name, attrs, text, **kwargs)

These methods work their way `up` the tree, looking at a tag's (or a
string's) parents. Let's start with a string buried deep in the
"three sisters" document::

 a_string = soup.find(text="Lacie")
 a_string
 # u'Lacie'

 a_string.find_parents("a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

 a_string.find_parent("p")
 # <p class="story">Once upon a time there were three little sisters; and their names were
 #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 #  and they lived at the bottom of a well.</p>

 a_string.find_parents("p", class_="title")
 # []

One of the three <a> tags is the direct parent of the string in
question, so our search finds it. One of the three <p> tags is an
indirect parent of the string, and our search finds that as
well. There's a <p> tag with the CSS class "title" `somewhere` in the
document, but it's not one of this string's parents, so we can't find
it with ``find_parents()``.
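As a runnable sketch of the same idea (the markup below is a cut-down, invented version of the "three sisters" document, and the ``html.parser`` backend is assumed):

```python
from bs4 import BeautifulSoup

doc = ('<p class="story">Once upon a time there were three little sisters; '
       '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>')
soup = BeautifulSoup(doc, "html.parser")

a_string = soup.find(text="Lacie")
# The <a> tag is a direct parent of the string...
print(a_string.find_parent("a")["id"])             # link2
# ...but no parent <p> has the class "title", so this comes back empty
print(a_string.find_parents("p", class_="title"))  # []
```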
You may have made the connection between ``find_parent()`` and
``find_parents()``, and the `.parent`_ and `.parents`_ attributes
mentioned earlier. The connection is very strong. These search methods
actually use ``.parents`` to iterate over all the parents, and check
each one against the provided filter to see if it matches.
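That relationship can be checked directly. Here's a minimal sketch (invented markup, ``html.parser`` assumed) comparing ``find_parents()`` with a hand-rolled walk over ``.parents``:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>text</b></p></div>", "html.parser")
b = soup.b

via_method = b.find_parents("p")
# Filtering .parents by hand should select the same tags
via_generator = [parent for parent in b.parents if parent.name == "p"]
print(via_method == via_generator)  # True
```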
``find_next_siblings()`` and ``find_next_sibling()``
----------------------------------------------------

Signature: find_next_siblings(name, attrs, text, limit, **kwargs)

Signature: find_next_sibling(name, attrs, text, **kwargs)

These methods use ``.next_siblings`` to iterate over the rest of an
element's siblings in the tree. The ``find_next_siblings()`` method
returns all the siblings that match, and ``find_next_sibling()`` only
returns the first one::

 first_link = soup.a
 first_link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 first_link.find_next_siblings("a")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

``find_all_next()`` and ``find_next()``
---------------------------------------

Signature: find_all_next(name, attrs, text, limit, **kwargs)

Signature: find_next(name, attrs, text, **kwargs)

These methods use ``.next_elements`` to iterate over whatever tags and
strings come after an element in the document::

 first_link.find_next("p")
 # <p class="story">...</p>

The last <p> tag in the document showed up, even though it's not in
the same part of the tree as the <a> tag we started from. For these
methods, all that matters is that an element match the filter, and
show up later in the document than the starting element.
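A small sketch of that "later in the document" rule (markup invented for illustration; ``html.parser`` assumed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="link1">Elsie</a></p><p class="story">...</p>',
                     "html.parser")

first_link = soup.a
# find_next() reaches the second <p> even though it sits outside
# the subtree that contains the <a> tag
print(first_link.find_next("p")["class"])  # ['story']
```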
``find_all_previous()`` and ``find_previous()``
-----------------------------------------------

Signature: find_all_previous(name, attrs, text, limit, **kwargs)

Signature: find_previous(name, attrs, text, **kwargs)

These methods use ``.previous_elements`` to iterate over the tags and
strings that came before an element in the document::

 first_link = soup.a

 first_link.find_all_previous("p")
 # [<p class="story">Once upon a time there were three little sisters; ...</p>,
 #  <p class="title"><b>The Dormouse's story</b></p>]

 first_link.find_previous("title")
 # <title>The Dormouse's story</title>

The call to ``find_all_previous("p")`` found the first paragraph in
the document, but it also found the second paragraph, the <p> tag
that contains the <a> tag we started with. This shouldn't be too
surprising: we're looking at all the tags that show up earlier in the
document than the one we started with. A <p> tag that contains an <a>
tag must have shown up earlier in the document.
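And a matching sketch for the backward direction (again with invented markup and ``html.parser``): the <title> tag lives in a different part of the tree, but it was parsed earlier, so ``find_previous()`` can reach it:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>"
                     '<p><a id="link1">Elsie</a></p>', "html.parser")

first_link = soup.a
# Anything parsed before the <a> tag is a candidate match
print(first_link.find_previous("title").string)  # The Dormouse's story
```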
Modifying the tree
==================
Beautiful Soup's main strength is in searching the parse tree, but you
can also modify the tree and write your changes as a new HTML or XML
document.
Changing tag names and attributes
---------------------------------
I covered this earlier, in `Attributes`_, but it bears repeating. You
can rename a tag, change the values of its attributes, add new
attributes, and delete attributes::
 soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
 tag = soup.b

 tag.name = "blockquote"
 tag['class'] = 'verybold'
 tag['id'] = 1
 tag
 # <blockquote class="verybold" id="1">Extremely bold</blockquote>

 del tag['class']
 del tag['id']
 tag
 # <blockquote>Extremely bold</blockquote>
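A quick check of what deleting attributes leaves behind (a minimal sketch using the ``html.parser`` backend): ``.attrs`` ends up empty, and ``.get()`` is the safe way to probe for an attribute that may be gone:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="1">Extremely bold</b>', "html.parser")
tag = soup.b
del tag["class"]
del tag["id"]

print(tag.attrs)         # {}
print(tag.get("class"))  # None; tag["class"] would raise KeyError
```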
Modifying ``.string``
---------------------
If you set a tag's ``.string`` attribute, the tag's contents are
replaced with the string you give::
 markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 soup = BeautifulSoup(markup)
 tag = soup.a

 tag.string = "New link text."
 tag
 # <a href="http://example.com/">New link text.</a>
Be careful: if the tag contained other tags, they and all their
contents will be destroyed.
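Here's a minimal sketch of that destruction (``html.parser`` assumed): after assigning ``.string``, the <i> tag is gone from the tree:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://example.com/">I linked to <i>example.com</i></a>',
                     "html.parser")
tag = soup.a
tag.string = "New link text."

print(soup.i)             # None: the <i> tag and its contents were destroyed
print(len(tag.contents))  # 1: only the new string remains
```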
``append()``
------------
You can add to a tag's contents with ``Tag.append()``. It works just
like calling ``.append()`` on a Python list::
 soup = BeautifulSoup("<a>Foo</a>")
 soup.a.append("Bar")

 soup
 # <html><head></head><body><a>FooBar</a></body></html>
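One detail worth knowing, shown here as a small sketch with the ``html.parser`` backend: ``append()`` adds a new element rather than merging it into the existing string, so the two strings stay separate entries in ``.contents``:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>Foo</a>", "html.parser")
soup.a.append("Bar")

print(soup.a.text)      # FooBar
print(soup.a.contents)  # ['Foo', 'Bar']: two separate strings
```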
Differences between parsers
===========================

Here's a short, invalid document parsed with html5lib::

 BeautifulSoup("<a></p>", "html5lib")
 # <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling </p> tag, html5lib pairs it with an
opening <p> tag. This parser also adds an empty <head> tag to the
document.

Here's the same document parsed with Python's built-in HTML parser::

 BeautifulSoup("<a></p>", "html.parser")
 # <a></a>

Like html5lib, this parser ignores the closing </p> tag. Unlike
html5lib, this parser makes no attempt to create a well-formed HTML
document by adding a <body> tag.

Output encoding
===============

When you write out a document from Beautiful Soup, you get a UTF-8
document, even if the document wasn't in UTF-8 to begin with. Here's a
document written in the Latin-1 encoding::

 markup = '''
  <html>
   <head>
    <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
   </head>
   <body>
    <p>Sacr\xe9 bleu!</p>
   </body>
  </html>
 '''

 soup = BeautifulSoup(markup)
 print(soup.prettify())
 # <html>
 #  <head>
 #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
 #  </head>
 #  <body>
 #   <p>
 #    Sacré bleu!
 #   </p>
 #  </body>
 # </html>

Note that the <meta> tag has been rewritten to reflect the fact that
the document is now in UTF-8. If you don't want UTF-8, you can pass an
encoding into ``prettify()``::

 print(soup.prettify("latin-1"))
 # <html>
 #  <head>
 #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
 # ...

You can also call encode() on the ``BeautifulSoup`` object, or any
element in the soup, just as if it were a Python string::

 soup.p.encode("latin-1")
 # '<p>Sacr\xe9 bleu!</p>'

 soup.p.encode("utf-8")
 # '<p>Sacr\xc3\xa9 bleu!</p>'

Unicode, Dammit
---------------

You can use Unicode, Dammit without using Beautiful Soup. It's useful
whenever you have data in an unknown encoding and you just want it to
become Unicode::

 from bs4 import UnicodeDammit
 dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
 print(dammit.unicode_markup)
 # Sacré bleu!
 dammit.original_encoding
 # 'utf-8'

The more data you give Unicode, Dammit, the more accurately it will
guess. If you have your own suspicions as to what the encoding might
be, you can pass them in as a list::

 dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
 print(dammit.unicode_markup)
 # Sacré bleu!
 dammit.original_encoding
 # 'latin-1'

Unicode, Dammit has one special feature that Beautiful Soup doesn't
use. You can use it to convert Microsoft smart quotes to HTML or XML
entities::

 markup = b"<p>I just \x93love\x94 Microsoft Word</p>"

 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
 # u'<p>I just &ldquo;love&rdquo; Microsoft Word</p>'

 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
 # u'<p>I just &#x201C;love&#x201D; Microsoft Word</p>'

You might find this feature useful, but Beautiful Soup doesn't use
it. Beautiful Soup prefers the default behavior, which is to convert
Microsoft smart quotes to Unicode characters along with everything
else::

 UnicodeDammit(markup, ["windows-1252"]).unicode_markup
 # u'<p>I just \u201clove\u201d Microsoft Word</p>'

Parsing only part of a document
===============================

Let's say you want to use Beautiful Soup to look at a document's <a>
tags. It's a waste of time and memory to parse the entire document and
then go over it again looking for <a> tags. It would be much faster to
ignore everything that wasn't an <a> tag in the first place. The
``SoupStrainer`` class allows you to choose which parts of an incoming
document are parsed. You just create a ``SoupStrainer`` and pass it in
to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.

(Note that *this feature won't work if you're using the html5lib
parser*. If you use html5lib, the whole document will be parsed, no
matter what. In the examples below, I'll be forcing Beautiful Soup to
use Python's built-in parser.)

``SoupStrainer``
----------------

The ``SoupStrainer`` class takes the same arguments as a typical
method from `Searching the tree`_. Here are three ``SoupStrainer``
objects::

 from bs4 import SoupStrainer

 only_a_tags = SoupStrainer("a")

 only_tags_with_id_link2 = SoupStrainer(id="link2")

 def is_short_string(string):
     return len(string) < 10

 only_short_strings = SoupStrainer(text=is_short_string)

I'm going to bring back the "three sisters" document one more time,
and we'll see what the document looks like when it's parsed with these
three ``SoupStrainer`` objects::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
 # <a class="sister" href="http://example.com/elsie" id="link1">
 #  Elsie
 # </a>
 # <a class="sister" href="http://example.com/lacie" id="link2">
 #  Lacie
 # </a>
 # <a class="sister" href="http://example.com/tillie" id="link3">
 #  Tillie
 # </a>

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
 # <a class="sister" href="http://example.com/lacie" id="link2">
 #  Lacie
 # </a>

 print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
 # Elsie
 # ,
 # Lacie
 # and
 # Tillie
 # ...

You can also pass a ``SoupStrainer`` into any of the methods covered
in `Searching the tree`_. This probably isn't terribly useful, but I
thought I'd mention it::

 soup = BeautifulSoup(html_doc)
 soup.find_all(only_short_strings)
 # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
 #  u'\n\n', u'...', u'\n']

Troubleshooting
===============

Parsing XML
-----------

By default, Beautiful Soup parses documents as HTML. To parse a
document as XML, pass in "xml" as the second argument to the
``BeautifulSoup`` constructor::

 soup = BeautifulSoup(markup, "xml")

You'll need to have lxml installed.