Beautiful Soup Documentation
============================

.. image:: 6.1.jpg
   :align: right
   :alt: "The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."

`Beautiful Soup <http://www.crummy.com/software/BeautifulSoup/>`_ is a
Python library for pulling data out of HTML and XML files. It works
with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It commonly saves programmers
hours or days of work.

These instructions illustrate all major features of Beautiful Soup 4,
with examples. I show you what the library is good for, how it works,
how to use it, how to make it do what you want, and what to do when it
violates your expectations.

The examples in this documentation should work the same way in Python
2.7 and Python 3.2.

You might be looking for the documentation for `Beautiful Soup 3
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_. If
you want to learn about the differences between Beautiful Soup 3 and
Beautiful Soup 4, see `Porting code to BS4`_.

Getting help
------------

If you have questions about Beautiful Soup, or run into problems,
`send mail to the discussion group
<http://groups.google.com/group/beautifulsoup/>`_.

Quick Start
===========

Here's an HTML document I'll be using as an example throughout this
document.
It's part of a story from `Alice in Wonderland`::

  html_doc = """
  <html><head><title>The Dormouse's story</title></head>

  <p class="title"><b>The Dormouse's story</b></p>

  <p class="story">Once upon a time there were three little sisters; and their names were
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
  and they lived at the bottom of a well.</p>

  <p class="story">...</p>
  """

Running the "three sisters" document through Beautiful Soup gives us a
``BeautifulSoup`` object, which represents the document as a nested
data structure::

  from bs4 import BeautifulSoup
  soup = BeautifulSoup(html_doc)

  print(soup.prettify())
  # <html>
  #  <head>
  #   <title>
  #    The Dormouse's story
  #   </title>
  #  </head>
  #  <body>
  #   <p class="title">
  #    <b>
  #     The Dormouse's story
  #    </b>
  #   </p>
  #   <p class="story">
  #    Once upon a time there were three little sisters; and their names were
  #    <a class="sister" href="http://example.com/elsie" id="link1">
  #     Elsie
  #    </a>
  #    ,
  #    <a class="sister" href="http://example.com/lacie" id="link2">
  #     Lacie
  #    </a>
  #    and
  #    <a class="sister" href="http://example.com/tillie" id="link3">
  #     Tillie
  #    </a>
  #    ; and they lived at the bottom of a well.
  #   </p>
  #   <p class="story">
  #    ...
  #   </p>
  #  </body>
  # </html>

Here are some simple ways to navigate that data structure::

  soup.title
  # <title>The Dormouse's story</title>

  soup.title.name
  # u'title'

  soup.title.string
  # u'The Dormouse's story'

  soup.title.parent.name
  # u'head'

  soup.p
  # <p class="title"><b>The Dormouse's story</b></p>

  soup.p['class']
  # u'title'

  soup.a
  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

  soup.find_all('a')
  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

  soup.find(id="link3")
  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's <a> tags::

  for link in soup.find_all('a'):
      print(link.get('href'))
  # http://example.com/elsie
  # http://example.com/lacie
  # http://example.com/tillie

Another common task is extracting all the text from a page::

  print(soup.get_text())
  # The Dormouse's story
  #
  # The Dormouse's story
  #
  # Once upon a time there were three little sisters; and their names were
  # Elsie,
  # Lacie and
  # Tillie;
  # and they lived at the bottom of a well.
  #
  # ...

Does this look like what you need? If so, read on.

Installing Beautiful Soup
=========================

Beautiful Soup 4 is published through PyPI, so you can install it with
``easy_install`` or ``pip``. The package name is ``beautifulsoup4``,
and the same package works on Python 2 and Python 3.

:kbd:`$ easy_install beautifulsoup4`

:kbd:`$ pip install beautifulsoup4`

(The ``BeautifulSoup`` package is probably `not` what you want. That's
the previous major release, `Beautiful Soup 3`_. Lots of software uses
BS3, so it's still available, but if you're writing new code you
should install ``beautifulsoup4``.)
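As a quick, hedged sanity check (assuming you've just run one of the install commands above), you can confirm that the ``beautifulsoup4`` package, imported as ``bs4``, is the one on your path, rather than the old ``BeautifulSoup`` (BS3) package:

```python
# Import the BS4 package and check which version is installed.
# Beautiful Soup 4 reports a version string starting with "4".
import bs4

print(bs4.__version__)
```

If this import fails but ``import BeautifulSoup`` succeeds, you have BS3 installed instead.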
You can also `download the Beautiful Soup 4 source tarball
<http://www.crummy.com/software/BeautifulSoup/download/4.x/>`_ and
install it with ``setup.py``. The license for Beautiful Soup allows
you to package the entire library with your application, so you can
copy the ``bs4`` directory into your application's codebase.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it
should work with other recent versions.

.. _parser-installation:

Be sure to install a good parser!
---------------------------------

Beautiful Soup uses a plugin system that supports a number of popular
Python parsers. If no third-party parsers are installed, Beautiful
Soup uses the HTML parser that comes with Python. In recent releases
of Python (2.7.3 and 3.2.2), this parser is excellent at handling bad
HTML. Unfortunately, in older releases, it's not very good at all.

Even if you're using a recent release of Python, I recommend you
install the `lxml parser <http://lxml.de/>`_ if you can. It's
reliable on both HTML and XML, and it's much faster than
Python's built-in parser. Beautiful Soup will detect that you have
lxml installed, and use it instead of Python's built-in parser.

Depending on your setup, you might install lxml with one of these commands:

:kbd:`$ apt-get install python-lxml`

:kbd:`$ easy_install lxml`

:kbd:`$ pip install lxml`

If you're using Python 2, another alternative is the pure-Python
`html5lib parser <http://code.google.com/p/html5lib/>`_, which parses
HTML the way a web browser does. Depending on your setup, you might
install html5lib with one of these commands:

:kbd:`$ apt-get install python-html5lib`

:kbd:`$ easy_install html5lib`

:kbd:`$ pip install html5lib`

Making the soup
===============

To parse a document, pass it into the ``BeautifulSoup``
constructor.
You can pass in a string or an open filehandle::

  from bs4 import BeautifulSoup

  soup = BeautifulSoup(open("index.html"))

  soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are
converted to Unicode characters::

  BeautifulSoup("Sacr&eacute; bleu!")
  # <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available
parser. It will use an HTML parser unless you specifically tell it to
use an XML parser. (See `Choosing a parser`_.)

Kinds of objects
================

Beautiful Soup transforms a complex HTML document into a complex tree
of Python objects. But you'll only ever have to deal with about four
`kinds` of objects.

.. _Tag:

``Tag``
-------

A ``Tag`` object corresponds to an XML or HTML tag in the original document::

  soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
  tag = soup.b
  type(tag)
  # <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I'll cover most of them
in `Navigating the tree`_ and `Searching the tree`_. For now, the most
important features of a tag are its name and attributes.

Name
^^^^

Every tag has a name, accessible as ``.name``::

  tag.name
  # u'b'

If you change a tag's name, the change will be reflected in any HTML
markup generated by Beautiful Soup::

  tag.name = "blockquote"
  tag
  # <blockquote class="boldest">Extremely bold</blockquote>

Attributes
^^^^^^^^^^

A tag may have any number of attributes. The tag ``<b
class="boldest">`` has an attribute "class" whose value is
"boldest". You can access a tag's attributes by treating the tag like
a dictionary::

  tag['class']
  # u'boldest'

You can access that dictionary directly as ``.attrs``::

  tag.attrs
  # {u'class': u'boldest'}

You can add, remove, and modify a tag's attributes.
Again, this is
done by treating the tag as a dictionary::

  tag['class'] = 'verybold'
  tag['id'] = 1
  tag
  # <blockquote class="verybold" id="1">Extremely bold</blockquote>

  del tag['class']
  del tag['id']
  tag
  # <blockquote>Extremely bold</blockquote>

.. _multivalue:

Multi-valued attributes
&&&&&&&&&&&&&&&&&&&&&&&

HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is ``class`` (that is, a tag can have more than
one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
``headers``, and ``accesskey``. Beautiful Soup presents the value(s)
of a multi-valued attribute as a list::

  css_soup = BeautifulSoup('<p class="body strikeout"></p>')
  css_soup.p['class']
  # ["body", "strikeout"]

  css_soup = BeautifulSoup('<p class="body"></p>')
  css_soup.p['class']
  # ["body"]

If an attribute `looks` like it has more than one value, but it's not
a multi-valued attribute as defined by any version of the HTML
standard, Beautiful Soup will leave the attribute alone::

  id_soup = BeautifulSoup('<p id="my id"></p>')
  id_soup.p['id']
  # 'my id'

When you turn a tag back into a string, multiple attribute values are
consolidated::

  rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
  rel_soup.a['rel']
  # ['index']
  rel_soup.a['rel'] = ['index', 'contents']
  print(rel_soup.p)
  # <p>Back to the <a rel="index contents">homepage</a></p>

If you parse a document as XML, there are no multi-valued attributes::

  xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
  xml_soup.p['class']
  # u'body strikeout'

``NavigableString``
-------------------

A string corresponds to a bit of text within a tag.
Beautiful Soup
defines the ``NavigableString`` class to contain these bits of text::

  tag.string
  # u'Extremely bold'
  type(tag.string)
  # <class 'bs4.element.NavigableString'>

A ``NavigableString`` is just like a Python Unicode string, except
that it also supports some of the features described in `Navigating
the tree`_ and `Searching the tree`_. You can convert a
``NavigableString`` to a Unicode string with ``unicode()``::

  unicode_string = unicode(tag.string)
  unicode_string
  # u'Extremely bold'
  type(unicode_string)
  # <type 'unicode'>

You can't edit a string in place, but you can replace one string with
another, using :ref:`replace_with`::

  tag.string.replace_with("No longer bold")
  tag
  # <blockquote>No longer bold</blockquote>

``NavigableString`` supports most of the features described in
`Navigating the tree`_ and `Searching the tree`_, but not all of
them. In particular, since a string can't contain anything (the way a
tag may contain a string or another tag), strings don't support the
``.contents`` or ``.string`` attributes, or the ``find()`` method.

``BeautifulSoup``
-----------------

The ``BeautifulSoup`` object itself represents the document as a
whole. For most purposes, you can treat it as a :ref:`Tag`
object. This means it supports most of the methods described in
`Navigating the tree`_ and `Searching the tree`_.

Since the ``BeautifulSoup`` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::

  soup.name
  # u'[document]'

Comments and other special strings
----------------------------------

``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
everything you'll see in an HTML or XML file, but there are a few
leftover bits. The only one you'll probably ever need to worry about
is the comment::

  markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
  soup = BeautifulSoup(markup)
  comment = soup.b.string
  type(comment)
  # <class 'bs4.element.Comment'>

The ``Comment`` object is just a special type of ``NavigableString``::

  comment
  # u'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a ``Comment`` is
displayed with special formatting::

  print(soup.b.prettify())
  # <b>
  #  <!--Hey, buddy. Want to buy a used parser?-->
  # </b>

Beautiful Soup defines classes for anything else that might show up in
an XML document: ``CData``, ``ProcessingInstruction``,
``Declaration``, and ``Doctype``. Just like ``Comment``, these classes
are subclasses of ``NavigableString`` that add something extra to the
string. Here's an example that replaces the comment with a CDATA
block::

  from bs4 import CData
  cdata = CData("A CDATA block")
  comment.replace_with(cdata)

  print(soup.b.prettify())
  # <b>
  #  <![CDATA[A CDATA block]]>
  # </b>


Navigating the tree
===================

Here's the "Three sisters" HTML document again::

  html_doc = """
  <html><head><title>The Dormouse's story</title></head>

  <p class="title"><b>The Dormouse's story</b></p>

  <p class="story">Once upon a time there were three little sisters; and their names were
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
  and they lived at the bottom of a well.</p>

  <p class="story">...</p>
  """

  from bs4 import BeautifulSoup
  soup = BeautifulSoup(html_doc)

I'll use this as an example to show you how to move from one part of
a document to another.

Going down
----------

Tags may contain strings and other tags. These elements are the tag's
`children`. Beautiful Soup provides a lot of different attributes for
navigating and iterating over a tag's children.
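As a minimal sketch of what "children" can be (tiny document invented here; Python's built-in ``html.parser`` assumed), a tag's children are a mix of strings and tags:

```python
from bs4 import BeautifulSoup

# A small illustrative document: the <p> tag contains a string,
# a <b> tag, and another string.
soup = BeautifulSoup("<p>One <b>two</b> three</p>", "html.parser")

# Each child is either a NavigableString or a Tag:
print([type(child).__name__ for child in soup.p.children])
# ['NavigableString', 'Tag', 'NavigableString']
```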
Note that Beautiful Soup strings don't support any of these
attributes, because a string can't have children.

Navigating using tag names
^^^^^^^^^^^^^^^^^^^^^^^^^^

The simplest way to navigate the parse tree is to say the name of the
tag you want. If you want the <head> tag, just say ``soup.head``::

  soup.head
  # <head><title>The Dormouse's story</title></head>

  soup.title
  # <title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part
of the parse tree. This code gets the first <b> tag beneath the <body> tag::

  soup.body.b
  # <b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the `first` tag by that
name::

  soup.a
  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get `all` the <a> tags, or anything more complicated
than the first tag with a certain name, you'll need to use one of the
methods described in `Searching the tree`_, such as ``find_all()``::

  soup.find_all('a')
  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

``.contents`` and ``.children``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A tag's children are available in a list called ``.contents``::

  head_tag = soup.head
  head_tag
  # <head><title>The Dormouse's story</title></head>

  head_tag.contents
  # [<title>The Dormouse's story</title>]

  title_tag = head_tag.contents[0]
  title_tag
  # <title>The Dormouse's story</title>
  title_tag.contents
  # [u'The Dormouse's story']

The ``BeautifulSoup`` object itself has children.
In this case, the
<html> tag is the child of the ``BeautifulSoup`` object::

  len(soup.contents)
  # 1
  soup.contents[0].name
  # u'html'

A string does not have ``.contents``, because it can't contain
anything::

  text = title_tag.contents[0]
  text.contents
  # AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag's
children using the ``.children`` generator::

  for child in title_tag.children:
      print(child)
  # The Dormouse's story

``.descendants``
^^^^^^^^^^^^^^^^

The ``.contents`` and ``.children`` attributes only consider a tag's
`direct` children. For instance, the <head> tag has a single direct
child--the <title> tag::

  head_tag.contents
  # [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string "The Dormouse's
story". There's a sense in which that string is also a child of the
<head> tag. The ``.descendants`` attribute lets you iterate over `all`
of a tag's children, recursively: its direct children, the children of
its direct children, and so on::

  for child in head_tag.descendants:
      print(child)
  # <title>The Dormouse's story</title>
  # The Dormouse's story

The <head> tag has only one child, but it has two descendants: the
<title> tag and the <title> tag's child. The ``BeautifulSoup`` object
only has one direct child (the <html> tag), but it has a whole lot of
descendants::

  len(list(soup.children))
  # 1
  len(list(soup.descendants))
  # 25

.. _.string:

``.string``
^^^^^^^^^^^

If a tag has only one child, and that child is a string, the string is
made available as ``.string``::

  title_tag.string
  # u'The Dormouse's story'

If a tag's only child is another tag, and `that` tag has a
``.string``, then the parent tag is considered to have the same
``.string`` as its child::

  head_tag.contents
  # [<title>The Dormouse's story</title>]

  head_tag.string
  # u'The Dormouse's story'

If a tag contains more than one thing, then it's not clear what
``.string`` should refer to, so ``.string`` is defined to be
``None``::

  print(soup.html.string)
  # None

.. _string-generators:

``.strings`` and ``.stripped_strings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If there's more than one thing inside a tag, you can still look at
just the strings. Use the ``.strings`` generator::

  for string in soup.strings:
      print(repr(string))
  # u"The Dormouse's story"
  # u'\n\n'
  # u"The Dormouse's story"
  # u'\n\n'
  # u'Once upon a time there were three little sisters; and their names were\n'
  # u'Elsie'
  # u',\n'
  # u'Lacie'
  # u' and\n'
  # u'Tillie'
  # u';\nand they lived at the bottom of a well.'
  # u'\n\n'
  # u'...'
  # u'\n'

These strings tend to have a lot of extra whitespace, which you can
remove by using the ``.stripped_strings`` generator instead::

  for string in soup.stripped_strings:
      print(repr(string))
  # u"The Dormouse's story"
  # u"The Dormouse's story"
  # u'Once upon a time there were three little sisters; and their names were'
  # u'Elsie'
  # u','
  # u'Lacie'
  # u'and'
  # u'Tillie'
  # u';\nand they lived at the bottom of a well.'
  # u'...'

Here, strings consisting entirely of whitespace are ignored, and
whitespace at the beginning and end of strings is removed.

Going up
--------

Continuing the "family tree" analogy, every tag and every string has a
`parent`: the tag that contains it.

.. _.parent:

``.parent``
^^^^^^^^^^^

You can access an element's parent with the ``.parent`` attribute. In
the example "three sisters" document, the <head> tag is the parent
of the <title> tag::

  title_tag = soup.title
  title_tag
  # <title>The Dormouse's story</title>
  title_tag.parent
  # <head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains
it::

  title_tag.string.parent
  # <title>The Dormouse's story</title>

The parent of a top-level tag like <html> is the ``BeautifulSoup`` object
itself::

  html_tag = soup.html
  type(html_tag.parent)
  # <class 'bs4.BeautifulSoup'>

And the ``.parent`` of a ``BeautifulSoup`` object is defined as ``None``::

  print(soup.parent)
  # None

.. _.parents:

``.parents``
^^^^^^^^^^^^

You can iterate over all of an element's parents with
``.parents``. This example uses ``.parents`` to travel from an <a> tag
buried deep within the document, to the very top of the document::

  link = soup.a
  link
  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  for parent in link.parents:
      if parent is None:
          print(parent)
      else:
          print(parent.name)
  # p
  # body
  # html
  # [document]
  # None

Going sideways
--------------

Consider a simple document like this::

  sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
  print(sibling_soup.prettify())
  # <html>
  #  <body>
  #   <a>
  #    <b>
  #     text1
  #    </b>
  #    <c>
  #     text2
  #    </c>
  #   </a>
  #  </body>
  # </html>

The <b> tag and the <c> tag are at the same level: they're both direct
children of the same tag. We call them `siblings`. When a document is
pretty-printed, siblings show up at the same indentation level. You
can also use this relationship in the code you write.
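The sibling relationship can be checked directly in code. A minimal, self-contained sketch (built-in ``html.parser`` assumed) showing that siblings are exactly the elements that share a parent:

```python
from bs4 import BeautifulSoup

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

# The <b> and <c> tags are siblings: they have the very same parent object.
assert sibling_soup.b.parent is sibling_soup.c.parent
print(sibling_soup.b.parent.name)
# a
```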
``.next_sibling`` and ``.previous_sibling``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can use ``.next_sibling`` and ``.previous_sibling`` to navigate
between page elements that are on the same level of the parse tree::

  sibling_soup.b.next_sibling
  # <c>text2</c>

  sibling_soup.c.previous_sibling
  # <b>text1</b>

The <b> tag has a ``.next_sibling``, but no ``.previous_sibling``,
because there's nothing before the <b> tag `on the same level of the
tree`. For the same reason, the <c> tag has a ``.previous_sibling``
but no ``.next_sibling``::

  print(sibling_soup.b.previous_sibling)
  # None
  print(sibling_soup.c.next_sibling)
  # None

The strings "text1" and "text2" are `not` siblings, because they don't
have the same parent::

  sibling_soup.b.string
  # u'text1'

  print(sibling_soup.b.string.next_sibling)
  # None

In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a
tag will usually be a string containing whitespace. Going back to the
"three sisters" document::

  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

You might think that the ``.next_sibling`` of the first <a> tag would
be the second <a> tag. But actually, it's a string: the comma and
newline that separate the first <a> tag from the second::

  link = soup.a
  link
  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

  link.next_sibling
  # u',\n'

The second <a> tag is actually the ``.next_sibling`` of the comma::

  link.next_sibling.next_sibling
  # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

.. _sibling-generators:

``.next_siblings`` and ``.previous_siblings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can iterate over a tag's siblings with ``.next_siblings`` or
``.previous_siblings``::

  for sibling in soup.a.next_siblings:
      print(repr(sibling))
  # u',\n'
  # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
  # u' and\n'
  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
  # u'; and they lived at the bottom of a well.'
  # None

  for sibling in soup.find(id="link3").previous_siblings:
      print(repr(sibling))
  # u' and\n'
  # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
  # u',\n'
  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  # u'Once upon a time there were three little sisters; and their names were\n'
  # None

Going back and forth
--------------------

Take a look at the beginning of the "three sisters" document::

  <html><head><title>The Dormouse's story</title></head>
  <p class="title"><b>The Dormouse's story</b></p>

An HTML parser takes this string of characters and turns it into a
series of events: "open an <html> tag", "open a <head> tag", "open a
<title> tag", "add a string", "close the <title> tag", "open a <p>
tag", and so on. Beautiful Soup offers tools for reconstructing the
initial parse of the document.

.. _element-generators:

``.next_element`` and ``.previous_element``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``.next_element`` attribute of a string or tag points to whatever
was parsed immediately afterwards. It might be the same as
``.next_sibling``, but it's usually drastically different.

Here's the final <a> tag in the "three sisters" document.
Its
``.next_sibling`` is a string: the conclusion of the sentence that was
interrupted by the start of the <a> tag::

  last_a_tag = soup.find("a", id="link3")
  last_a_tag
  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

  last_a_tag.next_sibling
  # u'; and they lived at the bottom of a well.'

But the ``.next_element`` of that <a> tag, the thing that was parsed
immediately after the <a> tag, is `not` the rest of that sentence:
it's the word "Tillie"::

  last_a_tag.next_element
  # u'Tillie'

That's because in the original markup, the word "Tillie" appeared
before that semicolon. The parser encountered an <a> tag, then the
word "Tillie", then the closing </a> tag, then the semicolon and rest of
the sentence. The semicolon is on the same level as the <a> tag, but the
word "Tillie" was encountered first.

The ``.previous_element`` attribute is the exact opposite of
``.next_element``. It points to whatever element was parsed
immediately before this one::

  last_a_tag.previous_element
  # u' and\n'
  last_a_tag.previous_element.next_element
  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

``.next_elements`` and ``.previous_elements``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You should get the idea by now. You can use these iterators to move
forward or backward in the document as it was parsed::

  for element in last_a_tag.next_elements:
      print(repr(element))
  # u'Tillie'
  # u';\nand they lived at the bottom of a well.'
  # u'\n\n'
  # <p class="story">...</p>
  # u'...'
  # u'\n'
  # None

Searching the tree
==================

Beautiful Soup defines a lot of methods for searching the parse tree,
but they're all very similar. I'm going to spend a lot of time explaining
the two most popular methods: ``find()`` and ``find_all()``. The other
methods take almost exactly the same arguments, so I'll just cover
them briefly.
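The essential difference between the two methods can be sketched in a couple of lines (tiny document invented here; built-in ``html.parser`` assumed): ``find()`` returns the first match, while ``find_all()`` returns every match in a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>first</p><p>second</p>", "html.parser")

print(soup.find("p"))           # the first matching tag only
# <p>first</p>
print(len(soup.find_all("p")))  # a list of every matching tag
# 2
```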
Once again, I'll be using the "three sisters" document as an example::

  html_doc = """
  <html><head><title>The Dormouse's story</title></head>

  <p class="title"><b>The Dormouse's story</b></p>

  <p class="story">Once upon a time there were three little sisters; and their names were
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
  and they lived at the bottom of a well.</p>

  <p class="story">...</p>
  """

  from bs4 import BeautifulSoup
  soup = BeautifulSoup(html_doc)

By passing a filter into a method like ``find_all()``, you can
isolate whatever parts of the document you're interested in.

Kinds of filters
----------------

Before talking in detail about ``find_all()`` and similar methods, I
want to show examples of different filters you can pass into these
methods. These filters show up again and again, throughout the
search API. You can use them to filter based on a tag's name,
on its attributes, on the text of a string, or on some combination of
these.

.. _a string:

A string
^^^^^^^^

The simplest filter is a string. Pass a string to a search method and
Beautiful Soup will perform a match against that exact string. This
code finds all the <b> tags in the document::

  soup.find_all('b')
  # [<b>The Dormouse's story</b>]

.. _a regular expression:

A regular expression
^^^^^^^^^^^^^^^^^^^^

If you pass in a regular expression object, Beautiful Soup will filter
against that regular expression. This code finds all the tags whose
names start with the letter "b"; in this case, the <body> tag and the
<b> tag::

  import re
  for tag in soup.find_all(re.compile("b.*")):
      print(tag.name)
  # body
  # b

.. _a list:

A list
^^^^^^

If you pass in a list, Beautiful Soup will allow a string match
against `any` item in that list.
This code finds all the <a> tags
`and` all the <b> tags::

  soup.find_all(["a", "b"])
  # [<b>The Dormouse's story</b>,
  #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.. _the value True:

``True``
^^^^^^^^

The value ``True`` matches everything it can. This code finds `all`
the tags in the document, but none of the text strings::

  for tag in soup.find_all(True):
      print(tag.name)
  # html
  # head
  # title
  # body
  # p
  # b
  # p
  # a
  # a
  # a
  # p

.. _a function:

A function
^^^^^^^^^^

If none of the other matches work for you, define a function that
takes an element as its only argument. The function should return
``True`` if the argument matches, and ``False`` otherwise.

Here's a function that returns ``True`` if a tag defines the "class"
attribute but doesn't define the "id" attribute::

  def has_class_but_no_id(tag):
      return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into ``find_all()`` and you'll pick up all the <p>
tags::

  soup.find_all(has_class_but_no_id)
  # [<p class="title"><b>The Dormouse's story</b></p>,
  #  <p class="story">Once upon a time there were...</p>,
  #  <p class="story">...</p>]

This function only picks up the <p> tags. It doesn't pick up the <a>
tags, because those tags define both "class" and "id". It doesn't pick
up tags like <html> and <title>, because those tags don't define
"class".
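A point worth making explicit: the function filter is only ever called on tags, never on text strings. A minimal, self-contained sketch (tiny document and the ``log_and_match`` name invented here for illustration; built-in ``html.parser`` assumed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="x">text</p><b id="y">bold</b>', "html.parser")

seen = []
def log_and_match(tag):
    seen.append(tag.name)          # only Tag objects arrive here
    return tag.has_attr("class")

print([t.name for t in soup.find_all(log_and_match)])
# ['p']
print(seen)
# ['p', 'b'] -- the strings "text" and "bold" were never passed in
```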
Here's a function that returns ``True`` if a tag is surrounded by
string objects::

  from bs4 import NavigableString
  def surrounded_by_strings(tag):
      return (isinstance(tag.next_element, NavigableString)
              and isinstance(tag.previous_element, NavigableString))

  for tag in soup.find_all(surrounded_by_strings):
      print(tag.name)
  # p
  # a
  # a
  # a
  # p

Now we're ready to look at the search methods in detail.

``find_all()``
--------------

Signature: find_all(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
<recursive>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)

The ``find_all()`` method looks through a tag's descendants and
retrieves `all` descendants that match your filters. I gave several
examples in `Kinds of filters`_, but here are a few more::

  soup.find_all("title")
  # [<title>The Dormouse's story</title>]

  soup.find_all("p", "title")
  # [<p class="title"><b>The Dormouse's story</b></p>]

  soup.find_all("a")
  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
  #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

  soup.find_all(id="link2")
  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

  import re
  soup.find(text=re.compile("sisters"))
  # u'Once upon a time there were three little sisters; and their names were\n'

Some of these should look familiar, but others are new. What does it
mean to pass in a value for ``text``, or ``id``? Why does
``find_all("p", "title")`` find a <p> tag with the CSS class "title"?
Let's look at the arguments to ``find_all()``.

.. _name:

The ``name`` argument
^^^^^^^^^^^^^^^^^^^^^

Pass in a value for ``name`` and you'll tell Beautiful Soup to only
consider tags with certain names. Text strings will be ignored, as
will tags whose names don't match.
-
-This is the simplest usage::
-
-    soup.find_all("title")
-    # [<title>The Dormouse's story</title>]
-
-Recall from `Kinds of filters`_ that the value to ``name`` can be `a
-string`_, `a regular expression`_, `a list`_, `a function`_, or `the value
-True`_.
-
-.. _kwargs:
-
-The keyword arguments
-^^^^^^^^^^^^^^^^^^^^^
-
-Any argument that's not recognized will be turned into a filter on tag
-attributes. If you pass in a value for an argument called ``id``,
-Beautiful Soup will filter against the tag's 'id' attribute::
-
-    soup.find_all(id='link2')
-    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
-
-If you pass in a value for ``href``, Beautiful Soup will filter
-against the tag's 'href' attribute::
-
-    soup.find_all(href=re.compile("elsie"))
-    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
-
-You can filter an attribute based on `a string`_, `a regular
-expression`_, `a list`_, `a function`_, or `the value True`_.
-
-This code finds all tags that have an ``id`` attribute, regardless of
-what the value is::
-
-    soup.find_all(id=True)
-    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
-    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
-    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
-
-You can filter multiple attributes at once by passing in more than one
-keyword argument::
-
-    soup.find_all(href=re.compile("elsie"), id='link1')
-    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
-
-.. _attrs:
-
-Searching by CSS class
-^^^^^^^^^^^^^^^^^^^^^^
-
-Instead of using keyword arguments, you can filter tags based on their
-attributes by passing a dictionary in for ``attrs``.
These two lines of
-code are equivalent::
-
-    soup.find_all(href=re.compile("elsie"), id='link1')
-    soup.find_all(attrs={'href' : re.compile("elsie"), 'id': 'link1'})
-
-The ``attrs`` argument would be a pretty obscure feature were it not for
-one thing: CSS. It's very useful to search for a tag that has a
-certain CSS class, but the name of the CSS attribute, "class", is also a
-Python reserved word.
-
-You can use ``attrs`` to search by CSS class::
-
-    soup.find_all("a", { "class" : "sister" })
-    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
-    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
-    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
-
-But that's a lot of code for such a common operation. Instead, you can
-pass in a string for ``attrs`` instead of a dictionary. The string will
-be used to restrict the CSS class::
-
-    soup.find_all("a", "sister")
-    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
-    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
-    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
-
-You can also pass in a regular expression, a function, or
-``True``. Anything you pass in for ``attrs`` that's not a dictionary will
-be used to search against the CSS class::
-
-    soup.find_all(attrs=re.compile("itl"))
-    # [<p class="title"><b>The Dormouse's story</b></p>]
-
-    def has_six_characters(css_class):
-        return css_class is not None and len(css_class) == 6
-
-    soup.find_all(attrs=has_six_characters)
-    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
-    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
-    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
-
-:ref:`Remember <multivalue>` that a single tag can have multiple
-values for its "class" attribute.
When you search for a tag that
-matches a certain CSS class, you're matching against `any` of its CSS
-classes::
-
-    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
-    css_soup.find_all("p", "strikeout")
-    # [<p class="body strikeout"></p>]
-
-    css_soup.find_all("p", "body")
-    # [<p class="body strikeout"></p>]
-
-Searching for the string value of the ``class`` attribute won't work::
-
-    css_soup.find_all("p", "body strikeout")
-    # []
-
-.. _text:
-
-The ``text`` argument
-^^^^^^^^^^^^^^^^^^^^^
-
-With ``text`` you can search for strings instead of tags. As with
-``name`` and the keyword arguments, you can pass in `a string`_, `a
-regular expression`_, `a list`_, `a function`_, or `the value True`_.
-Here are some examples::
-
-    soup.find_all(text="Elsie")
-    # [u'Elsie']
-
-    soup.find_all(text=["Tillie", "Elsie", "Lacie"])
-    # [u'Elsie', u'Lacie', u'Tillie']
-
-    soup.find_all(text=re.compile("Dormouse"))
-    # [u"The Dormouse's story", u"The Dormouse's story"]
-
-    def is_the_only_string_within_a_tag(s):
-        """Return True if this string is the only child of its parent tag."""
-        return (s == s.parent.string)
-
-    soup.find_all(text=is_the_only_string_within_a_tag)
-    # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
-
-Although ``text`` is for finding strings, you can combine it with
-arguments that find tags: Beautiful Soup will find all tags whose
-``.string`` matches your value for ``text``. This code finds the <a>
-tags whose ``.string`` is "Elsie"::
-
-    soup.find_all("a", text="Elsie")
-    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
-
-.. _limit:
-
-The ``limit`` argument
-^^^^^^^^^^^^^^^^^^^^^^
-
-``find_all()`` returns all the tags and strings that match your
-filters. This can take a while if the document is large. If you don't
-need `all` the results, you can pass in a number for ``limit``. This
-works just like the LIMIT keyword in SQL.
It tells Beautiful Soup to -stop gathering results after it's found a certain number. - -There are three links in the "three sisters" document, but this code -only finds the first two:: - - soup.find_all("a", limit=2) - # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, - # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>] - -.. _recursive: - -The ``recursive`` argument -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -If you call ``mytag.find_all()``, Beautiful Soup will examine all the -descendants of ``mytag``: its children, its children's children, and -so on. If you only want Beautiful Soup to consider direct children, -you can pass in ``recursive=False``. See the difference here:: - - soup.html.find_all("title") - # [<title>The Dormouse's story</title>] - - soup.html.find_all("title", recursive=False) - # [] - -Here's that part of the document:: - - <html> - <head> - <title> - The Dormouse's story - </title> - </head> - ... - -The <title> tag is beneath the <html> tag, but it's not `directly` -beneath the <html> tag: the <head> tag is in the way. Beautiful Soup -finds the <title> tag when it's allowed to look at all descendants of -the <html> tag, but when ``recursive=False`` restricts it to the -<html> tag's immediate children, it finds nothing. - -Beautiful Soup offers a lot of tree-searching methods (covered below), -and they mostly take the same arguments as ``find_all()``: ``name``, -``attrs``, ``text``, ``limit``, and the keyword arguments. But the -``recursive`` argument is different: ``find_all()`` and ``find()`` are -the only methods that support it. Passing ``recursive=False`` into a -method like ``find_parents()`` wouldn't be very useful. - -Calling a tag is like calling ``find_all()`` --------------------------------------------- - -Because ``find_all()`` is the most popular method in the Beautiful -Soup search API, you can use a shortcut for it. 
If you treat the -``BeautifulSoup`` object or a ``Tag`` object as though it were a -function, then it's the same as calling ``find_all()`` on that -object. These two lines of code are equivalent:: - - soup.find_all("a") - soup("a") - -These two lines are also equivalent:: - - soup.title.find_all(text=True) - soup.title(text=True) - -``find()`` ----------- - -Signature: find(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive -<recursive>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`) - -The ``find_all()`` method scans the entire document looking for -results, but sometimes you only want to find one result. If you know a -document only has one <body> tag, it's a waste of time to scan the -entire document looking for more. Rather than passing in ``limit=1`` -every time you call ``find_all``, you can use the ``find()`` -method. These two lines of code are `nearly` equivalent:: - - soup.find_all('title', limit=1) - # [<title>The Dormouse's story</title>] - - soup.find('title') - # <title>The Dormouse's story</title> - -The only difference is that ``find_all()`` returns a list containing -the single result, and ``find()`` just returns the result. - -If ``find_all()`` can't find anything, it returns an empty list. If -``find()`` can't find anything, it returns ``None``:: - - print(soup.find("nosuchtag")) - # None - -Remember the ``soup.head.title`` trick from `Navigating using tag -names`_? 
That trick works by repeatedly calling ``find()``::
-
-    soup.head.title
-    # <title>The Dormouse's story</title>
-
-    soup.find("head").find("title")
-    # <title>The Dormouse's story</title>
-
-``find_parents()`` and ``find_parent()``
-----------------------------------------
-
-Signature: find_parents(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
-
-Signature: find_parent(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
-
-I spent a lot of time above covering ``find_all()`` and
-``find()``. The Beautiful Soup API defines ten other methods for
-searching the tree, but don't be afraid. Five of these methods are
-basically the same as ``find_all()``, and the other five are basically
-the same as ``find()``. The only differences are in what parts of the
-tree they search.
-
-First let's consider ``find_parents()`` and
-``find_parent()``. Remember that ``find_all()`` and ``find()`` work
-their way down the tree, looking at a tag's descendants. These methods
-do the opposite: they work their way `up` the tree, looking at a tag's
-(or a string's) parents. Let's try them out, starting from a string
-buried deep in the "three sisters" document::
-
-    a_string = soup.find(text="Lacie")
-    a_string
-    # u'Lacie'
-
-    a_string.find_parents("a")
-    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
-
-    a_string.find_parent("p")
-    # <p class="story">Once upon a time there were three little sisters; and their names were
-    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
-    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
-    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
-    # and they lived at the bottom of a well.</p>
-
-    a_string.find_parents("p", "title")
-    # []
-
-One of the three <a> tags is the direct parent of the string in
-question, so our search finds it.
One of the three <p> tags is an -indirect parent of the string, and our search finds that as -well. There's a <p> tag with the CSS class "title" `somewhere` in the -document, but it's not one of this string's parents, so we can't find -it with ``find_parents()``. - -You may have made the connection between ``find_parent()`` and -``find_parents()``, and the `.parent`_ and `.parents`_ attributes -mentioned earlier. The connection is very strong. These search methods -actually use ``.parents`` to iterate over all the parents, and check -each one against the provided filter to see if it matches. - -``find_next_siblings()`` and ``find_next_sibling()`` ----------------------------------------------------- - -Signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`) - -Signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`) - -These methods use :ref:`.next_siblings <sibling-generators>` to -iterate over the rest of an element's siblings in the tree. 
The -``find_next_siblings()`` method returns all the siblings that match, -and ``find_next_sibling()`` only returns the first one:: - - first_link = soup.a - first_link - # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> - - first_link.find_next_siblings("a") - # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, - # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] - - first_story_paragraph = soup.find("p", "story") - first_story_paragraph.find_next_sibling("p") - # <p class="story">...</p> - -``find_previous_siblings()`` and ``find_previous_sibling()`` ------------------------------------------------------------- - -Signature: find_previous_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`) - -Signature: find_previous_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`) - -These methods use :ref:`.previous_siblings <sibling-generators>` to iterate over an element's -siblings that precede it in the tree. 
The ``find_previous_siblings()``
-method returns all the siblings that match, and
-``find_previous_sibling()`` only returns the first one::
-
-    last_link = soup.find("a", id="link3")
-    last_link
-    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
-
-    last_link.find_previous_siblings("a")
-    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
-    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
-
-    first_story_paragraph = soup.find("p", "story")
-    first_story_paragraph.find_previous_sibling("p")
-    # <p class="title"><b>The Dormouse's story</b></p>
-
-
-``find_all_next()`` and ``find_next()``
----------------------------------------
-
-Signature: find_all_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
-
-Signature: find_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
-
-These methods use :ref:`.next_elements <element-generators>` to
-iterate over the tags and strings that come after an element in the
-document. The ``find_all_next()`` method returns all matches, and
-``find_next()`` only returns the first match::
-
-    first_link = soup.a
-    first_link
-    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
-
-    first_link.find_all_next(text=True)
-    # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
-    #  u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']
-
-    first_link.find_next("p")
-    # <p class="story">...</p>
-
-In the first example, the string "Elsie" showed up, even though it was
-contained within the <a> tag we started from. In the second example,
-the last <p> tag in the document showed up, even though it's not in
-the same part of the tree as the <a> tag we started from. For these
-methods, all that matters is that an element match the filter and
-show up later in the document than the starting element.
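Because ``find_next()`` always moves forward in document order, you can chain calls to it to walk the document element by element. A short sketch against the "three sisters" markup (the document and parser choice are restated here so the example stands alone):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a                        # the "Elsie" link
second_link = first_link.find_next("a")    # next <a> in document order
third_link = second_link.find_next("a")
print(second_link["id"], third_link["id"])
# link2 link3
```

Each call starts from where the previous one stopped, so this is an easy way to visit matching elements one at a time.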
-
-``find_all_previous()`` and ``find_previous()``
------------------------------------------------
-
-Signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
-
-Signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
-
-These methods use :ref:`.previous_elements <element-generators>` to
-iterate over the tags and strings that come before an element in the
-document. The ``find_all_previous()`` method returns all matches, and
-``find_previous()`` only returns the first match::
-
-    first_link = soup.a
-    first_link
-    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
-
-    first_link.find_all_previous("p")
-    # [<p class="story">Once upon a time there were three little sisters; ...</p>,
-    #  <p class="title"><b>The Dormouse's story</b></p>]
-
-    first_link.find_previous("title")
-    # <title>The Dormouse's story</title>
-
-The call to ``find_all_previous("p")`` finds the first paragraph in
-the document (the one with class="title"), but it also finds the
-second paragraph, the <p> tag that contains the <a> tag we started
-with. This shouldn't be too surprising: we're looking at all the tags
-that show up earlier in the document than the one we started with. A
-<p> tag that contains an <a> tag must have shown up earlier in the
-document.
-
-Modifying the tree
-==================
-
-Beautiful Soup's main strength is in searching the parse tree, but you
-can also modify the tree and write your changes as a new HTML or XML
-document.
-
-Changing tag names and attributes
----------------------------------
-
-I covered this earlier, in `Attributes`_, but it bears repeating.
You
-can rename a tag, change the values of its attributes, add new
-attributes, and delete attributes::
-
-    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
-    tag = soup.b
-
-    tag.name = "blockquote"
-    tag['class'] = 'verybold'
-    tag['id'] = 1
-    tag
-    # <blockquote class="verybold" id="1">Extremely bold</blockquote>
-
-    del tag['class']
-    del tag['id']
-    tag
-    # <blockquote>Extremely bold</blockquote>
-
-
-Modifying ``.string``
----------------------
-
-If you set a tag's ``.string`` attribute, the tag's contents are
-replaced with the string you give::
-
-    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
-    soup = BeautifulSoup(markup)
-
-    tag = soup.a
-    tag.string = "New link text."
-    tag
-    # <a href="http://example.com/">New link text.</a>
-
-Be careful: if the tag contained other tags, they and all their
-contents will be destroyed.
-
-``append()``
-------------
-
-You can add to a tag's contents with ``Tag.append()``. It works just
-like calling ``.append()`` on a Python list::
-
-    soup = BeautifulSoup("<a>Foo</a>")
-    soup.a.append("Bar")
-
-    soup
-    # <html><head></head><body><a>FooBar</a></body></html>
-    soup.a.contents
-    # [u'Foo', u'Bar']
-
-``BeautifulSoup.new_string()`` and ``.new_tag()``
--------------------------------------------------
-
-If you need to add a string to a document, no problem--you can pass a
-Python string in to ``append()``, or you can call the factory method
-``BeautifulSoup.new_string()``::
-
-    soup = BeautifulSoup("<b></b>")
-    tag = soup.b
-    tag.append("Hello")
-    new_string = soup.new_string(" there")
-    tag.append(new_string)
-    tag
-    # <b>Hello there</b>
-    tag.contents
-    # [u'Hello', u' there']
-
-What if you need to create a whole new tag?
The best solution is to
-call the factory method ``BeautifulSoup.new_tag()``::
-
-    soup = BeautifulSoup("<b></b>")
-    original_tag = soup.b
-
-    new_tag = soup.new_tag("a", href="http://www.example.com")
-    original_tag.append(new_tag)
-    original_tag
-    # <b><a href="http://www.example.com"></a></b>
-
-    new_tag.string = "Link text."
-    original_tag
-    # <b><a href="http://www.example.com">Link text.</a></b>
-
-Only the first argument, the tag name, is required.
-
-``insert()``
-------------
-
-``Tag.insert()`` is just like ``Tag.append()``, except the new element
-doesn't necessarily go at the end of its parent's
-``.contents``. It'll be inserted at whatever numeric position you
-say. It works just like ``.insert()`` on a Python list::
-
-    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
-    soup = BeautifulSoup(markup)
-    tag = soup.a
-
-    tag.insert(1, "but did not endorse ")
-    tag
-    # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
-    tag.contents
-    # [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
-
-``insert_before()`` and ``insert_after()``
-------------------------------------------
-
-The ``insert_before()`` method inserts a tag or string immediately
-before something else in the parse tree::
-
-    soup = BeautifulSoup("<b>stop</b>")
-    tag = soup.new_tag("i")
-    tag.string = "Don't"
-    soup.b.string.insert_before(tag)
-    soup.b
-    # <b><i>Don't</i>stop</b>
-
-The ``insert_after()`` method inserts a tag or string so that it
-immediately follows something else in the parse tree::
-
-    soup.b.i.insert_after(soup.new_string(" ever "))
-    soup.b
-    # <b><i>Don't</i> ever stop</b>
-    soup.b.contents
-    # [<i>Don't</i>, u' ever ', u'stop']
-
-``clear()``
------------
-
-``Tag.clear()`` removes the contents of a tag::
-
-    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
-    soup = BeautifulSoup(markup)
-    tag = soup.a
-
-    tag.clear()
-    tag
-    # <a href="http://example.com/"></a>
-
-``extract()``
--------------
-
-``PageElement.extract()`` removes a tag or string from the tree. It
-returns the tag or string that was extracted::
-
-    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
-    soup = BeautifulSoup(markup)
-    a_tag = soup.a
-
-    i_tag = soup.i.extract()
-
-    a_tag
-    # <a href="http://example.com/">I linked to</a>
-
-    i_tag
-    # <i>example.com</i>
-
-    print(i_tag.parent)
-    # None
-
-At this point you effectively have two parse trees: one rooted at the
-``BeautifulSoup`` object you used to parse the document, and one rooted
-at the tag that was extracted. You can go on to call ``extract`` on
-a child of the element you extracted::
-
-    my_string = i_tag.string.extract()
-    my_string
-    # u'example.com'
-
-    print(my_string.parent)
-    # None
-    i_tag
-    # <i></i>
-
-
-``decompose()``
----------------
-
-``Tag.decompose()`` removes a tag from the tree, then `completely
-destroys it and its contents`::
-
-    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
-    soup = BeautifulSoup(markup)
-    a_tag = soup.a
-
-    soup.i.decompose()
-
-    a_tag
-    # <a href="http://example.com/">I linked to</a>
-
-
-.. _replace_with:
-
-``replace_with()``
-------------------
-
-``PageElement.replace_with()`` removes a tag or string from the tree,
-and replaces it with the tag or string of your choice::
-
-    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
-    soup = BeautifulSoup(markup)
-    a_tag = soup.a
-
-    new_tag = soup.new_tag("b")
-    new_tag.string = "example.net"
-    a_tag.i.replace_with(new_tag)
-
-    a_tag
-    # <a href="http://example.com/">I linked to <b>example.net</b></a>
-
-``replace_with()`` returns the tag or string that was replaced, so
-that you can examine it or add it back to another part of the tree.
-
-``replace_with_children()``
----------------------------
-
-``Tag.replace_with_children()`` replaces a tag with whatever's inside
-that tag.
It's good for stripping out markup:: - - markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>' - soup = BeautifulSoup(markup) - a_tag = soup.a - - a_tag.i.replace_with_children() - a_tag - # <a href="http://example.com/">I linked to example.com</a> - -Like ``replace_with()``, ``replace_with_children()`` returns the tag -that was replaced. - -Output -====== - -Pretty-printing ---------------- - -The ``prettify()`` method will turn a Beautiful Soup parse tree into a -nicely formatted bytestring, with each HTML/XML tag on its own line:: - - markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>' - soup = BeautifulSoup(markup) - soup.prettify() - # '<html>\n <head>\n </head>\n <body>\n <a href="http://example.com/">\n...' - - print(soup.prettify()) - # <html> - # <head> - # </head> - # <body> - # <a href="http://example.com/"> - # I linked to - # <i> - # example.com - # </i> - # </a> - # </body> - # </html> - -You can call ``prettify()`` on the top-level ``BeautifulSoup`` object, -or on any of its ``Tag`` objects:: - - print(soup.a.prettify()) - # <a href="http://example.com/"> - # I linked to - # <i> - # example.com - # </i> - # </a> - -Non-pretty printing -------------------- - -If you just want a string, with no fancy formatting, you can call -``unicode()`` or ``str()`` on a ``BeautifulSoup`` object, or a ``Tag`` -within it:: - - str(soup) - # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>' - - unicode(soup.a) - # u'<a href="http://example.com/">I linked to <i>example.com</i></a>' - -The ``str()`` function returns a string encoded in UTF-8. See -`Encodings`_ for other options. - -You can also call ``encode()`` to get a bytestring, and ``decode()`` -to get Unicode. 
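For example, ``encode()`` and ``decode()`` work on any tag in the soup. A minimal sketch (the ``html.parser`` argument is named explicitly only so the example stands alone):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

soup.a.encode("utf-8")   # a bytestring
# b'<a href="http://example.com/">I linked to <i>example.com</i></a>'

soup.a.decode()          # a Unicode string
# '<a href="http://example.com/">I linked to <i>example.com</i></a>'
```

``encode()`` also accepts other encodings, as described under `Encodings`_.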
-
-Output formatters
------------------
-
-If you give Beautiful Soup a document that contains HTML entities like
-"&ldquo;", they'll be converted to Unicode characters::
-
-    soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
-    unicode(soup)
-    # u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'
-
-If you then convert the document to a string, the Unicode characters
-will be encoded as UTF-8. You won't get the HTML entities back::
-
-    str(soup)
-    # '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'
-
-By default, the only characters that are escaped upon output are bare
-ampersands and angle brackets. These get turned into "&amp;", "&lt;",
-and "&gt;", so that Beautiful Soup doesn't inadvertently generate
-invalid HTML or XML::
-
-    soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
-    soup.p
-    # <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
-
-You can change this behavior by providing a value for the
-``formatter`` argument to ``prettify()``, ``encode()``, or
-``decode()``. Beautiful Soup recognizes four possible values for
-``formatter``.
-
-The default is ``formatter="minimal"``. Strings will only be processed
-enough to ensure that Beautiful Soup generates valid HTML/XML::
-
-    french = "<p>Il a dit <<Sacré bleu!>></p>"
-    soup = BeautifulSoup(french)
-    print(soup.prettify(formatter="minimal"))
-    # <html>
-    #  <body>
-    #   <p>
-    #    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
-    #   </p>
-    #  </body>
-    # </html>
-
-If you pass in ``formatter="html"``, Beautiful Soup will convert
-Unicode characters to HTML entities whenever possible::
-
-    print(soup.prettify(formatter="html"))
-    # <html>
-    #  <body>
-    #   <p>
-    #    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
-    #   </p>
-    #  </body>
-    # </html>
-
-If you pass in ``formatter=None``, Beautiful Soup will not modify
-strings at all on output.
This is the fastest option, but it may lead
-to Beautiful Soup generating invalid HTML/XML, as in this example::
-
-    print(soup.prettify(formatter=None))
-    # <html>
-    #  <body>
-    #   <p>
-    #    Il a dit <<Sacré bleu!>>
-    #   </p>
-    #  </body>
-    # </html>
-
-
-Finally, if you pass in a function for ``formatter``, Beautiful Soup
-will call that function once for every string in the document. You can
-do whatever you want in this function. Here's a formatter that
-converts strings to uppercase and does absolutely nothing else::
-
-    def uppercase(str):
-        return str.upper()
-
-    print(soup.prettify(formatter=uppercase))
-    # <html>
-    #  <body>
-    #   <p>
-    #    IL A DIT <<SACRÉ BLEU!>>
-    #   </p>
-    #  </body>
-    # </html>
-
-If you're writing your own function, you should know about the
-``EntitySubstitution`` class in the ``bs4.dammit`` module. This class
-implements Beautiful Soup's standard formatters as class methods: the
-"html" formatter is ``EntitySubstitution.substitute_html``, and the
-"minimal" formatter is ``EntitySubstitution.substitute_xml``. You can
-use these functions to simulate ``formatter="html"`` or
-``formatter="minimal"``, and then do something in addition.
-
-Here's an example that converts strings to uppercase, `and` replaces
-Unicode characters with HTML entities whenever possible::
-
-    from bs4.dammit import EntitySubstitution
-    def uppercase_and_substitute_html_entities(str):
-        return EntitySubstitution.substitute_html(str.upper())
-
-    print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
-    # <html>
-    #  <body>
-    #   <p>
-    #    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt;
-    #   </p>
-    #  </body>
-    # </html>
-
-``get_text()``
---------------
-
-If you only want the text part of a document or tag, you can use the
-``get_text()`` method.
It returns all the text in a document or
-beneath a tag, as a single Unicode string::
-
-    markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
-    soup = BeautifulSoup(markup)
-
-    soup.get_text()
-    u'\nI linked to example.com\n'
-    soup.i.get_text()
-    u'example.com'
-
-You can specify a string to be used to join the bits of text
-together::
-
-    soup.get_text("|")
-    u'\nI linked to |example.com|\n'
-
-You can tell Beautiful Soup to strip whitespace from the beginning and
-end of each bit of text::
-
-    soup.get_text("|", strip=True)
-    u'I linked to|example.com'
-
-But at that point you might want to use the :ref:`.stripped_strings <string-generators>`
-generator instead, and process the text yourself::
-
-    [text for text in soup.stripped_strings]
-    # [u'I linked to', u'example.com']
-
-Choosing a parser
-=================
-
-If you just need to parse some HTML, you can dump the markup into the
-``BeautifulSoup`` constructor, and it'll probably be fine. Beautiful
-Soup will pick a parser for you and parse the data. But there are a
-few additional arguments you can pass in to the constructor to change
-which parser is used.
-
-The first argument to the ``BeautifulSoup`` constructor is a string or
-an open filehandle--the markup you want parsed. The second argument is
-`how` you'd like the markup parsed.
-
-If you don't specify anything, you'll get the best HTML parser that's
-installed. Beautiful Soup ranks lxml's parser as being the best, then
-html5lib's, then Python's built-in parser. You can override this by
-specifying one of the following:
-
-* What type of markup you want to parse. Currently supported are
-  "html", "xml", and "html5".
-
-* The name of the parser library you want to use. Currently supported
-  options are "lxml", "html5lib", and "html.parser" (Python's
-  built-in HTML parser).
-
-Some examples::
-
-    BeautifulSoup(markup, "lxml")
-    BeautifulSoup(markup, "xml")
-    BeautifulSoup(markup, "html5")
-
-You can specify a list of the parser features you want, instead of
-just one. Right now this is mostly useful for distinguishing between
-lxml's HTML parser and its XML parser::
-
-    BeautifulSoup(markup, ["html", "lxml"])
-    BeautifulSoup(markup, ["xml", "lxml"])
-
-If you don't have an appropriate parser installed, Beautiful Soup will
-ignore your request and pick a different parser. For instance, right
-now the only supported XML parser is lxml, so if you don't have lxml
-installed, asking for an XML parser won't give you one, and asking for
-"lxml" won't work either.
-
-Why would you use one parser over another? Because different parsers
-will create different parse trees from the same document. The biggest
-differences are between HTML parsers and XML parsers. Here's a short
-document, parsed as HTML::
-
-    BeautifulSoup("<a><b /></a>")
-    # <html><head></head><body><a><b></b></a></body></html>
-
-Since an empty <b /> tag is not valid HTML, the parser turns it into a
-<b></b> tag pair.
-
-Here's the same document parsed as XML (running this requires that you
-have lxml installed). Note that the empty <b /> tag is left alone, and
-that the document is given an XML declaration instead of being put
-into an <html> tag::
-
-    BeautifulSoup("<a><b /></a>", "xml")
-    # <?xml version="1.0" encoding="utf-8"?>
-    # <a><b/></a>
-
-There are also differences between HTML parsers. If you give Beautiful
-Soup a perfectly-formed HTML document, these differences won't
-matter. One parser may be faster than another, but they'll all give
-you a data structure that looks exactly like the original HTML
-document.
-
-But if the document is not perfectly-formed, different parsers will
-give different results. Here's a short, invalid document parsed using
-lxml's HTML parser.
Note that the dangling </p> tag is simply
-ignored::
-
-    BeautifulSoup("<a></p>", "lxml")
-    # <html><body><a></a></body></html>
-
-Here's the same document parsed using html5lib::
-
-    BeautifulSoup("<a></p>", "html5lib")
-    # <html><head></head><body><a><p></p></a></body></html>
-
-Instead of ignoring the dangling </p> tag, html5lib pairs it with an
-opening <p> tag. This parser also adds an empty <head> tag to the
-document.
-
-Here's the same document parsed with Python's built-in HTML
-parser::
-
-    BeautifulSoup("<a></p>", "html.parser")
-    # <a></a>
-
-Like lxml, this parser ignores the closing </p> tag. Unlike
-html5lib, this parser makes no attempt to create a well-formed HTML
-document by adding a <body> tag. Unlike lxml, it doesn't even bother
-to add an <html> tag.
-
-Since the document "<a></p>" is invalid, none of these techniques is
-the "correct" way to handle it. The html5lib parser uses techniques
-that are part of the HTML5 standard, so it has the best claim on being
-the "correct" way, but all three techniques are legitimate.
-
-Differences between parsers can affect your script. If you're planning
-on distributing your script to other people, you might want to specify
-in the ``BeautifulSoup`` constructor which parser you used during
-development. That will reduce the chances that your users parse a
-document differently from the way you parse it.
-
-
-Encodings
-=========
-
-Any HTML or XML document is written in a specific encoding like ASCII
-or UTF-8. But when you load that document into Beautiful Soup, you'll
-discover it's been converted to Unicode::
-
-    markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
-    soup = BeautifulSoup(markup)
-    soup.h1
-    # <h1>Sacré bleu!</h1>
-    soup.h1.string
-    # u'Sacr\xe9 bleu!'
-
-It's not magic. (That sure would be nice.) Beautiful Soup uses a
-sub-library called `Unicode, Dammit`_ to detect a document's encoding
-and convert it to Unicode.
The autodetected encoding is available as -the ``.original_encoding`` attribute of the ``BeautifulSoup`` object:: - - soup.original_encoding - 'utf-8' - -Unicode, Dammit guesses correctly most of the time, but sometimes it -makes mistakes. Sometimes it guesses correctly, but only after a -byte-by-byte search of the document that takes a very long time. If -you happen to know a document's encoding ahead of time, you can avoid -mistakes and delays by passing it to the ``BeautifulSoup`` constructor -as ``from_encoding``. - -Here's a document written in ISO-8859-8. The document is so short that -Unicode, Dammit can't get a good lock on it, and misidentifies it as -ISO-8859-7:: - - markup = b"<h1>\xed\xe5\xec\xf9</h1>" - soup = BeautifulSoup(markup) - soup.h1 - <h1>νεμω</h1> - soup.original_encoding - 'ISO-8859-7' - -We can fix this by passing in the correct ``from_encoding``:: - - soup = BeautifulSoup(markup, from_encoding="iso-8859-8") - soup.h1 - <h1>םולש</h1> - soup.original_encoding - 'iso8859-8' - -In rare cases (usually when a UTF-8 document contains text written in -a completely different encoding), the only way to get Unicode may be -to replace some characters with the special Unicode character -"REPLACEMENT CHARACTER" (U+FFFD, �). If Unicode, Dammit needs to do -this, it will set the ``.contains_replacement_characters`` attribute -to ``True`` on the ``UnicodeDammit`` or ``BeautifulSoup`` object. This -lets you know that the Unicode representation is not an exact -representation of the original--some data was lost. If a document -contains �, but ``.contains_replacement_characters`` is ``False``, -you'll know that the � was there originally (as it is in this -paragraph) and doesn't stand in for missing data. - -Output encoding ---------------- - -When you write out a document from Beautiful Soup, you get a UTF-8 -document, even if the document wasn't in UTF-8 to begin with.
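The round trip can be sketched with Python's own codec machinery. This illustrates the behaviour, not Beautiful Soup's internal code path:

```python
# Bytes in the source encoding are decoded to Unicode on the way in,
# then re-encoded as UTF-8 on the way out.
original = b"Sacr\xe9 bleu!"        # Latin-1 bytes
text = original.decode("latin-1")   # 'Sacré bleu!'
utf8_output = text.encode("utf-8")

# A character with no representation in the target encoding can be
# written out as a numeric character reference instead of raising
# an error.
entity_output = "\N{SNOWMAN}".encode("ascii", "xmlcharrefreplace")

print(utf8_output)    # b'Sacr\xc3\xa9 bleu!'
print(entity_output)  # b'&#9731;'
```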
Here's a -document written in the Latin-1 encoding:: - - markup = b''' - <html> - <head> - <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" /> - </head> - <body> - <p>Sacr\xe9 bleu!</p> - </body> - </html> - ''' - - soup = BeautifulSoup(markup) - print(soup.prettify()) - # <html> - # <head> - # <meta content="text/html; charset=utf-8" http-equiv="Content-type" /> - # </head> - # <body> - # <p> - # Sacré bleu! - # </p> - # </body> - # </html> - -Note that the <meta> tag has been rewritten to reflect the fact that -the document is now in UTF-8. - -If you don't want UTF-8, you can pass an encoding into ``prettify()``:: - - print(soup.prettify("latin-1")) - # <html> - # <head> - # <meta content="text/html; charset=latin-1" http-equiv="Content-type" /> - # ... - -You can also call encode() on the ``BeautifulSoup`` object, or any -element in the soup, just as if it were a Python string:: - - soup.p.encode("latin-1") - # '<p>Sacr\xe9 bleu!</p>' - - soup.p.encode("utf-8") - # '<p>Sacr\xc3\xa9 bleu!</p>' - -Any characters that can't be represented in your chosen encoding will -be converted into numeric XML entity references. For instance, here's -a document that includes the Unicode character SNOWMAN:: - - markup = u"<b>\N{SNOWMAN}</b>" - snowman_soup = BeautifulSoup(markup) - tag = snowman_soup.b - -The SNOWMAN character can be part of a UTF-8 document (it looks like -☃), but there's no representation for that character in ISO-Latin-1 or -ASCII, so it's converted into "&#9731;" for those encodings:: - - print(tag.encode("utf-8")) - # <b>☃</b> - - print(tag.encode("latin-1")) - # <b>&#9731;</b> - - print(tag.encode("ascii")) - # <b>&#9731;</b>
- - dammit.original_encoding - # 'utf-8' - -The more data you give Unicode, Dammit, the more accurately it will -guess. If you have your own suspicions as to what the encoding might -be, you can pass them in as a list:: - - dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"]) - print(dammit.unicode_markup) - # Sacré bleu! - dammit.original_encoding - # 'latin-1' - -Unicode, Dammit has one special feature that Beautiful Soup doesn't -use. You can use it to convert Microsoft smart quotes to HTML or XML -entities:: - - markup = b"<p>I just \x93love\x94 Microsoft Word</p>" - - UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup - # u'<p>I just &ldquo;love&rdquo; Microsoft Word</p>' - - UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup - # u'<p>I just &#x201C;love&#x201D; Microsoft Word</p>' - -You might find this feature useful, but Beautiful Soup doesn't use -it. Beautiful Soup prefers the default behavior, which is to convert -Microsoft smart quotes to Unicode characters along with everything -else:: - - UnicodeDammit(markup, ["windows-1252"]).unicode_markup - # u'<p>I just \u201clove\u201d Microsoft Word</p>' - -Parsing only part of a document -=============================== - -Let's say you want to use Beautiful Soup to look at a document's <a> -tags. It's a waste of time and memory to parse the entire document and -then go over it again looking for <a> tags. It would be much faster to -ignore everything that wasn't an <a> tag in the first place. The -``SoupStrainer`` class allows you to choose which parts of an incoming -document are parsed. You just create a ``SoupStrainer`` and pass it in -to the ``BeautifulSoup`` constructor as the ``parse_only`` argument. - -(Note that *this feature won't work if you're using the html5lib -parser*. If you use html5lib, the whole document will be parsed, no -matter what. In the examples below, I'll be forcing Beautiful Soup to -use Python's built-in parser.)
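The underlying idea, reacting only to the markup you care about as it streams past, can be sketched with the standard library's event-driven parser. This is a rough analogy for illustration, not how ``SoupStrainer`` is implemented:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the attributes of <a> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs))

collector = LinkCollector()
collector.feed('<p><a href="http://example.com/elsie" id="link1">Elsie</a></p>')
print(collector.links)  # [{'href': 'http://example.com/elsie', 'id': 'link1'}]
```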
- -``SoupStrainer`` ----------------- - -The ``SoupStrainer`` class takes the same arguments as a typical -method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs -<attrs>`, :ref:`text <text>`, and :ref:`**kwargs <kwargs>`. Here are -three ``SoupStrainer`` objects:: - - from bs4 import SoupStrainer - - only_a_tags = SoupStrainer("a") - - only_tags_with_id_link2 = SoupStrainer(id="link2") - - def is_short_string(string): - return len(string) < 10 - - only_short_strings = SoupStrainer(text=is_short_string) - -I'm going to bring back the "three sisters" document one more time, -and we'll see what the document looks like when it's parsed with these -three ``SoupStrainer`` objects:: - - html_doc = """ - <html><head><title>The Dormouse's story</title></head> - - <p class="title"><b>The Dormouse's story</b></p> - - <p class="story">Once upon a time there were three little sisters; and their names were - <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, - <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and - <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; - and they lived at the bottom of a well.</p> - - <p class="story">...</p> - """ - - print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify()) - # <a class="sister" href="http://example.com/elsie" id="link1"> - # Elsie - # </a> - # <a class="sister" href="http://example.com/lacie" id="link2"> - # Lacie - # </a> - # <a class="sister" href="http://example.com/tillie" id="link3"> - # Tillie - # </a> - - print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify()) - # <a class="sister" href="http://example.com/lacie" id="link2"> - # Lacie - # </a> - - print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify()) - # Elsie - # , - # Lacie - # and - # Tillie - # ... - # - -You can also pass a ``SoupStrainer`` into any of the methods covered -in `Searching the tree`_. 
This probably isn't terribly useful, but I -thought I'd mention it:: - - soup = BeautifulSoup(html_doc) - soup.find_all(only_short_strings) - # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie', - # u'\n\n', u'...', u'\n'] - -Troubleshooting -=============== - -Parsing XML ------------ - -By default, Beautiful Soup parses documents as HTML. To parse a -document as XML, pass in "xml" as the second argument to the -``BeautifulSoup`` constructor:: - - soup = BeautifulSoup(markup, "xml") - -You'll need to :ref:`have lxml installed <parser-installation>`. - -Improving Performance ---------------------- - -Beautiful Soup will never be as fast as the parsers it sits on top -of. If response time is critical, if you're paying for computer time -by the hour, or if there's any other reason why computer time is more -valuable than programmer time, you should forget about Beautiful Soup -and work directly atop `lxml <http://lxml.de/>`_. - -That said, there are things you can do to speed up Beautiful Soup. If -you're not using lxml as the underlying parser, my advice is to -:ref:`start <parser-installation>`. Beautiful Soup parses documents -significantly faster using lxml than using html.parser or html5lib. - -Sometimes `Unicode, Dammit`_ can only detect the encoding of a file by -doing a byte-by-byte examination of the file. This slows Beautiful -Soup to a crawl. My tests indicate that this only happened on 2.x -versions of Python, and that it happened most often with documents -using Russian or Chinese encodings. If this is happening to you, you -can fix it by using Python 3 for your script. Or, if you happen to -know a document's encoding, you can pass it into the -``BeautifulSoup`` constructor as ``from_encoding``. - -`Parsing only part of a document`_ won't save you much time parsing -the document, but it can save a lot of memory, and it'll make -searching the document much faster.
- -Beautiful Soup 3 -================ - -Beautiful Soup 3.2.0 is the last release of the -Beautiful Soup 3 series. It's currently the version packaged with all -major Linux distributions: - -:kbd:`$ apt-get install python-beautifulsoup` - -It's also published through PyPI as ``BeautifulSoup``: - -:kbd:`$ easy_install BeautifulSoup` - -:kbd:`$ pip install BeautifulSoup` - -You can also `download a tarball of Beautiful Soup 3.2.0 -<http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz>`_. - -If you ran ``easy_install beautifulsoup`` or ``easy_install -BeautifulSoup``, but your code doesn't work, you installed Beautiful -Soup 3 by mistake. You need to run ``easy_install beautifulsoup4``. - -`The documentation for Beautiful Soup 3 is archived online -<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_. If -your first language is Chinese, it might be easier for you to read -`the Chinese translation of the Beautiful Soup 3 documentation -<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html>`_, -then read this document to find out about the changes made in -Beautiful Soup 4. - -Porting code to BS4 -------------------- - -Most code written against Beautiful Soup 3 will work against Beautiful -Soup 4 with one simple change. All you should have to do is change the -package name from ``BeautifulSoup`` to ``bs4``. So this:: - - from BeautifulSoup import BeautifulSoup - -becomes this:: - - from bs4 import BeautifulSoup - -* If you get the ``ImportError`` "No module named BeautifulSoup", your - problem is that you're trying to run Beautiful Soup 3 code, but you - only have Beautiful Soup 4 installed. - -* If you get the ``ImportError`` "No module named bs4", your problem - is that you're trying to run Beautiful Soup 4 code, but you only - have Beautiful Soup 3 installed.
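If you maintain code that has to run while a migration is in progress, one defensive option is to try both package names at import time. A sketch, in which the ``BS_MAJOR_VERSION`` name is a hypothetical convention of this example rather than anything either library defines:

```python
try:
    from bs4 import BeautifulSoup              # Beautiful Soup 4
    BS_MAJOR_VERSION = 4
except ImportError:
    try:
        from BeautifulSoup import BeautifulSoup  # legacy Beautiful Soup 3
        BS_MAJOR_VERSION = 3
    except ImportError:
        # Neither package is installed.
        BeautifulSoup = None
        BS_MAJOR_VERSION = None

print("Using Beautiful Soup major version:", BS_MAJOR_VERSION)
```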
- -Although BS4 is mostly backwards-compatible with BS3, most of its -methods have been deprecated and given new names for `PEP 8 compliance -<http://www.python.org/dev/peps/pep-0008/>`_. There are numerous other -renames and changes, and a few of them break backwards compatibility. - -Here's what you'll need to know to convert your BS3 code and habits to BS4: - -You need a parser -^^^^^^^^^^^^^^^^^ - -Beautiful Soup 3 used Python's ``SGMLParser``, a module that was -deprecated and removed in Python 3.0. Beautiful Soup 4 uses -``html.parser`` by default, but you can plug in lxml or html5lib and -use that instead. Until ``html.parser`` is improved to handle -real-world HTML better, that's what I recommend you do. See `Be sure -to install a good parser!`_ - -Method names -^^^^^^^^^^^^ - -* ``replaceWith`` -> ``replace_with`` -* ``replaceWithChildren`` -> ``replace_with_children`` -* ``findAll`` -> ``find_all`` -* ``findAllNext`` -> ``find_all_next`` -* ``findAllPrevious`` -> ``find_all_previous`` -* ``findNext`` -> ``find_next`` -* ``findNextSibling`` -> ``find_next_sibling`` -* ``findNextSiblings`` -> ``find_next_siblings`` -* ``findParent`` -> ``find_parent`` -* ``findParents`` -> ``find_parents`` -* ``findPrevious`` -> ``find_previous`` -* ``findPreviousSibling`` -> ``find_previous_sibling`` -* ``findPreviousSiblings`` -> ``find_previous_siblings`` -* ``nextSibling`` -> ``next_sibling`` -* ``previousSibling`` -> ``previous_sibling`` - -Some arguments to the Beautiful Soup constructor were renamed for the -same reasons: - -* ``BeautifulSoup(parseOnlyThese=...)`` -> ``BeautifulSoup(parse_only=...)`` -* ``BeautifulSoup(fromEncoding=...)`` -> ``BeautifulSoup(from_encoding=...)`` - -I renamed one method for compatibility with Python 3: - -* ``Tag.has_key()`` -> ``Tag.has_attr()`` - -I renamed one attribute to use more accurate terminology: - -* ``Tag.isSelfClosing`` -> ``Tag.is_empty_element`` - -I renamed three attributes to avoid using words that have special 
-meaning to Python. Unlike the others, these changes are *not backwards -compatible.* If you used these attributes in BS3, your code will break -on BS4 until you change them. - -* ``UnicodeDammit.unicode`` -> ``UnicodeDammit.unicode_markup`` -* ``Tag.next`` -> ``Tag.next_element`` -* ``Tag.previous`` -> ``Tag.previous_element`` - -Generators -^^^^^^^^^^ - -I gave the generators PEP 8-compliant names, and transformed them into -properties: - -* ``childGenerator()`` -> ``children`` -* ``nextGenerator()`` -> ``next_elements`` -* ``nextSiblingGenerator()`` -> ``next_siblings`` -* ``previousGenerator()`` -> ``previous_elements`` -* ``previousSiblingGenerator()`` -> ``previous_siblings`` -* ``recursiveChildGenerator()`` -> ``descendants`` -* ``parentGenerator()`` -> ``parents`` - -So instead of this:: - - for parent in tag.parentGenerator(): - ... - -You can write this:: - - for parent in tag.parents: - ... - -(But the old code will still work.) - -Some of the generators used to yield ``None`` after they were done, and -then stop. That was a bug. Now the generators just stop. - -There are two new generators, :ref:`.strings and -.stripped_strings <string-generators>`. ``.strings`` yields -NavigableString objects, and ``.stripped_strings`` yields Python -strings that have had whitespace stripped. - -XML -^^^ - -There is no longer a ``BeautifulStoneSoup`` class for parsing XML. To -parse XML you pass in "xml" as the second argument to the -``BeautifulSoup`` constructor. For the same reason, the -``BeautifulSoup`` constructor no longer recognizes the ``isHTML`` -argument. - -Beautiful Soup's handling of empty-element XML tags has been -improved. Previously when you parsed XML you had to explicitly say -which tags were considered empty-element tags. The ``selfClosingTags`` -argument to the constructor is no longer recognized. Instead, -Beautiful Soup considers any empty tag to be an empty-element tag. 
If -you add a child to an empty-element tag, it stops being an -empty-element tag. - -Entities -^^^^^^^^ - -An incoming HTML or XML entity is always converted into the -corresponding Unicode character. Beautiful Soup 3 had a number of -overlapping ways of dealing with entities, which have been -removed. The ``BeautifulSoup`` constructor no longer recognizes the -``smartQuotesTo`` or ``convertEntities`` arguments. (`Unicode, -Dammit`_ still has ``smart_quotes_to``, but its default is now to turn -smart quotes into Unicode.) - -If you want to turn those Unicode characters back into HTML entities -on output, rather than turning them into UTF-8 characters, you need to -use ``.encode``, as described in `Substituting HTML entities`_. This -may change before the final release. - -Miscellaneous -^^^^^^^^^^^^^ - -:ref:`Tag.string <.string>` now operates recursively. If tag A -contains a single tag B and nothing else, then A.string is the same as -B.string. (Previously, it was None.) - -`Multi-valued attributes`_ like ``class`` have lists of strings as -their values, not strings. This may affect the way you search by CSS -class. - -If you pass one of the ``find*`` methods both :ref:`text <text>` `and` -a tag-specific argument like :ref:`name <name>`, Beautiful Soup will -search for tags that match your tag-specific criteria and whose -:ref:`Tag.string <.string>` matches your value for :ref:`text -<text>`. It will `not` find the strings themselves. Previously, -Beautiful Soup ignored the tag-specific arguments and looked for -strings. - -The ``BeautifulSoup`` constructor no longer recognizes the -``markupMassage`` argument. It's now the parser's responsibility to -handle markup correctly. - -The rarely-used alternate parser classes like -``ICantBelieveItsBeautifulSoup`` and ``BeautifulSOAP`` have been -removed. It's now the parser's decision how to handle ambiguous -markup.
diff --git a/doc/source/index.rst b/doc/source/index.rst index 4855840..7d53b2c 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -886,9 +886,9 @@ In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a tag will usually be a string containing whitespace. Going back to the "three sisters" document:: - # <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> - # <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> - # <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a> + # <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, + # <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and + # <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; You might think that the ``.next_sibling`` of the first <a> tag would be the second <a> tag. But actually, it's a string: the comma and |