diff --git a/doc/source/index.rst b/doc/source/index.rst
new file mode 100644
index 0000000..c9f45ea
--- /dev/null
+++ b/doc/source/index.rst
@@ -0,0 +1,2259 @@
+Beautiful Soup Documentation
+============================
+
+.. image:: 6.1.jpg
+ :align: right
+ :alt: "The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."
+
+`Beautiful Soup <http://www.crummy.com/software/BeautifulSoup/>`_ is a
+Python library for pulling data out of HTML and XML files. It works
+with your favorite parser to provide idiomatic ways of navigating,
+searching, and modifying the parse tree. It commonly saves programmers
+hours or days of work.
+
+These instructions illustrate all major features of Beautiful Soup 4,
+with examples. I show you what the library is good for, how it works,
+how to use it, how to make it do what you want, and what to do when it
+violates your expectations.
+
+The examples in this documentation should work the same way in Python
+2.7 and Python 3.2.
+
+Quick Start
+===========
+
+Here's an HTML document I'll be using as an example throughout this
+document. It's part of a story from `Alice in Wonderland`::
+
+ html_doc = """
+ <html><head><title>The Dormouse's story</title></head>
+
+ <p class="title"><b>The Dormouse's story</b></p>
+
+ <p class="story">Once upon a time there were three little sisters; and their names were
+ <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
+ <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
+ <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
+ and they lived at the bottom of a well.</p>
+
+ <p class="story">...</p>
+ """
+
+Running the "three sisters" document through Beautiful Soup gives us a
+``BeautifulSoup`` object, which represents the document as a nested
+data structure::
+
+ from bs4 import BeautifulSoup
+ soup = BeautifulSoup(html_doc)
+
+ print(soup.prettify())
+ # <html>
+ #  <head>
+ #   <title>
+ #    The Dormouse's story
+ #   </title>
+ #  </head>
+ #  <body>
+ #   <p class="title">
+ #    <b>
+ #     The Dormouse's story
+ #    </b>
+ #   </p>
+ #   <p class="story">
+ #    Once upon a time there were three little sisters; and their names were
+ #    <a class="sister" href="http://example.com/elsie" id="link1">
+ #     Elsie
+ #    </a>
+ #    ,
+ #    <a class="sister" href="http://example.com/lacie" id="link2">
+ #     Lacie
+ #    </a>
+ #    and
+ #    <a class="sister" href="http://example.com/tillie" id="link3">
+ #     Tillie
+ #    </a>
+ #    ; and they lived at the bottom of a well.
+ #   </p>
+ #   <p class="story">
+ #    ...
+ #   </p>
+ #  </body>
+ # </html>
+
+Here are some simple ways to navigate that data structure::
+
+ soup.title
+ # <title>The Dormouse's story</title>
+
+ soup.title.name
+ # u'title'
+
+ soup.title.string
+ # u"The Dormouse's story"
+
+ soup.title.parent.name
+ # u'head'
+
+ soup.p
+ # <p class="title"><b>The Dormouse's story</b></p>
+
+ soup.p['class']
+ # u'title'
+
+ soup.a
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+
+ soup.find_all('a')
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+ soup.find(id="link3")
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
+
+One common task is extracting all the URLs found within a page's <a> tags::
+
+ for link in soup.find_all('a'):
+     print(link.get('href'))
+ # http://example.com/elsie
+ # http://example.com/lacie
+ # http://example.com/tillie
+
+Another common task is extracting all the text from a page::
+
+ print(soup.get_text())
+ # The Dormouse's story
+ #
+ # The Dormouse's story
+ #
+ # Once upon a time there were three little sisters; and their names were
+ # Elsie,
+ # Lacie and
+ # Tillie;
+ # and they lived at the bottom of a well.
+ #
+ # ...
+
+Does this look like what you need? If so, read on.
+
+Installing Beautiful Soup
+=========================
+
+Beautiful Soup 4 is published through PyPI, so you can install it with
+`easy_install`. If you're on the Python 2 series, the package name
+is `beautifulsoup4`.
+
+:kbd:`$ easy_install beautifulsoup4`
+
+If you're using the Python 3 series, the package name is `beautifulsoup4py3k`.
+
+:kbd:`$ easy_install-3.2 beautifulsoup4py3k`
+
+(The `beautifulsoup` package is probably `not` what you want. That's
+the previous major release, `Beautiful Soup 3`_. Lots of software uses
+BS3, so it's still available, but if you're writing new code you
+should install `beautifulsoup4`.)
+
+You can also `download the Beautiful Soup 4 source tarball
+<http://www.crummy.com/software/BeautifulSoup/download/BeautifulSoup.tar.gz>`_,
+which includes both the Python 2 and Python 3 versions. The license
+for Beautiful Soup allows you to package the entire library with your
+application, and that might be the easiest thing for you to do.
+
+I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it
+should work with other recent versions.
+
+.. _parser-installation:
+
+Be sure to install a good parser!
+---------------------------------
+
+By default, Beautiful Soup uses the HTML parser that comes with
+Python. Unfortunately, that parser is not very good at handling bad
+HTML. I recommend you install the `lxml parser
+<http://lxml.de/>`_. It's very fast, it works with both Python 2 and
+Python 3, and it parses HTML and XML very well. Beautiful Soup will
+detect that you have lxml installed, and use it instead of Python's
+built-in parser.
+
+Depending on your setup, you might install lxml with one of these commands:
+
+:kbd:`$ sudo apt-get install python-lxml`
+
+:kbd:`$ easy_install lxml`
+
+:kbd:`$ easy_install-3.2 lxml`
+
+If you're using Python 2, another alternative is the pure-Python
+`html5lib parser <http://code.google.com/p/html5lib/>`_, which parses
+HTML the way a web browser does. Depending on your setup, you might
+install html5lib with one of these commands:
+
+:kbd:`$ sudo apt-get install python-html5lib`
+
+:kbd:`$ easy_install html5lib`
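If you want to be explicit rather than rely on autodetection, you can name a parser as the second argument to the ``BeautifulSoup`` constructor. This is a minimal sketch: ``"html.parser"`` selects Python's built-in parser, and the other names only work once the corresponding library is installed.

```python
from bs4 import BeautifulSoup

markup = "<a><p>Some unclosed markup"

# "html.parser" explicitly names Python's built-in parser.
soup = BeautifulSoup(markup, "html.parser")
print(soup.p.name)
# p

# If you've installed lxml or html5lib, you can name them instead:
# soup = BeautifulSoup(markup, "lxml")
# soup = BeautifulSoup(markup, "html5lib")
```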
+
+Making the soup
+===============
+
+To parse a document, pass it into the ``BeautifulSoup``
+constructor. You can pass in a string or an open filehandle::
+
+ from bs4 import BeautifulSoup
+
+ soup = BeautifulSoup(open("index.html"))
+
+ soup = BeautifulSoup("<html>data</html>")
+
+First, the document is converted to Unicode, and HTML entities are
+converted to Unicode characters::
+
+ BeautifulSoup("Sacr&eacute; bleu!")
+ # <html><head></head><body>Sacré bleu!</body></html>
+
+Beautiful Soup then parses the document using the best available
+parser. It will use an HTML parser unless you specifically tell it to
+use an XML parser. (See `Choosing a parser`_.)
+
+Kinds of objects
+================
+
+Beautiful Soup transforms a complex HTML document into a complex tree
+of Python objects. But you'll only ever have to deal with about four
+`kinds` of objects.
+
+.. _Tag:
+
+``Tag``
+-------
+
+A ``Tag`` object corresponds to an XML or HTML tag in the original document::
+
+ soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
+ tag = soup.b
+ type(tag)
+ # <class 'bs4.element.Tag'>
+
+Tags have a lot of attributes and methods, and I'll cover most of them
+in `Navigating the tree`_ and `Searching the tree`_. For now, the most
+important features of a tag are its name and attributes.
+
+Name
+^^^^
+
+Every tag has a name, accessible as ``.name``::
+
+ tag.name
+ # u'b'
+
+If you change a tag's name, the change will be reflected in any HTML
+markup generated by Beautiful Soup::
+
+ tag.name = "blockquote"
+ tag
+ # <blockquote class="boldest">Extremely bold</blockquote>
+
+Attributes
+^^^^^^^^^^
+
+A tag may have any number of attributes. The tag ``<b
+class="boldest">`` has an attribute "class" whose value is
+"boldest". You can access a tag's attributes by treating the tag like
+a dictionary::
+
+ tag['class']
+ # u'boldest'
+
+You can access that dictionary directly as ``.attrs``::
+
+ tag.attrs
+ # {u'class': u'boldest'}
+
+You can add, remove, and modify a tag's attributes. Again, this is
+done by treating the tag as a dictionary::
+
+ tag['class'] = 'verybold'
+ tag['id'] = 1
+ tag
+ # <blockquote class="verybold" id="1">Extremely bold</blockquote>
+
+ del tag['class']
+ del tag['id']
+ tag
+ # <blockquote>Extremely bold</blockquote>
+
+``NavigableString``
+-------------------
+
+A string corresponds to a bit of text within a tag. Beautiful Soup
+defines the ``NavigableString`` class to contain these bits of text::
+
+ tag.string
+ # u'Extremely bold'
+ type(tag.string)
+ # <class 'bs4.element.NavigableString'>
+
+A ``NavigableString`` is just like a Python Unicode string, except
+that it also supports some of the features described in `Navigating
+the tree`_ and `Searching the tree`_. You can convert a
+``NavigableString`` to a Unicode string with ``unicode()``::
+
+ unicode_string = unicode(tag.string)
+ unicode_string
+ # u'Extremely bold'
+ type(unicode_string)
+ # <type 'unicode'>
+
+You can't edit a string in place, but you can replace one string with
+another, using :ref:`replace_with`::
+
+ tag.string.replace_with("No longer bold")
+ tag
+ # <blockquote>No longer bold</blockquote>
+
+``NavigableString`` supports most of the features described in
+`Navigating the tree`_ and `Searching the tree`_, but not all of
+them. In particular, since a string can't contain anything (the way a
+tag may contain a string or another tag), strings don't support the
+``.contents`` or ``.string`` attributes, or the `find()` method.
+
+``BeautifulSoup``
+-----------------
+
+The ``BeautifulSoup`` object itself represents the document as a
+whole. For most purposes, you can treat it as a :ref:`Tag`
+object. This means it supports most of the methods described in
+`Navigating the tree`_ and `Searching the tree`_.
+
+Since the ``BeautifulSoup`` object doesn't correspond to an actual
+HTML or XML tag, it has no name and no attributes. But sometimes it's
+useful to look at its ``.name``, so it's been given the special
+``.name`` "[document]"::
+
+ soup.name
+ # u'[document]'
+
+Comments and other special strings
+----------------------------------
+
+``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
+everything you'll see in an HTML or XML file, but there are a few
+leftover bits. The only one you'll probably ever need to worry about
+is the comment::
+
+ markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
+ soup = BeautifulSoup(markup)
+ comment = soup.b.string
+ type(comment)
+ # <class 'bs4.element.Comment'>
+
+The ``Comment`` object is just a special type of ``NavigableString``::
+
+ comment
+ # u'Hey, buddy. Want to buy a used parser?'
+
+But when it appears as part of an HTML document, a ``Comment`` is
+displayed with special formatting::
+
+ print(soup.b.prettify())
+ # <b>
+ #  <!--Hey, buddy. Want to buy a used parser?-->
+ # </b>
+
+Beautiful Soup defines classes for anything else that might show up in
+an XML document: ``CData``, ``ProcessingInstruction``,
+``Declaration``, and ``Doctype``. Just like ``Comment``, these classes
+are subclasses of ``NavigableString`` that add something extra to the
+string. Here's an example that replaces the comment with a CDATA
+block::
+
+ from bs4 import CData
+ cdata = CData("A CDATA block")
+ comment.replace_with(cdata)
+
+ print(soup.b.prettify())
+ # <b>
+ #  <![CDATA[A CDATA block]]>
+ # </b>
+
+
+Navigating the tree
+===================
+
+Here's the "Three sisters" HTML document again::
+
+ html_doc = """
+ <html><head><title>The Dormouse's story</title></head>
+
+ <p class="title"><b>The Dormouse's story</b></p>
+
+ <p class="story">Once upon a time there were three little sisters; and their names were
+ <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
+ <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
+ <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
+ and they lived at the bottom of a well.</p>
+
+ <p class="story">...</p>
+ """
+
+ from bs4 import BeautifulSoup
+ soup = BeautifulSoup(html_doc)
+
+I'll use this as an example to show you how to move from one part of
+a document to another.
+
+Going down
+----------
+
+Tags may contain strings and other tags. These elements are the tag's
+`children`. Beautiful Soup provides a lot of different attributes for
+navigating and iterating over a tag's children.
+
+Note that Beautiful Soup strings don't support any of these
+attributes, because a string can't have children.
+
+Navigating using tag names
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The simplest way to navigate the parse tree is to say the name of the
+tag you want. If you want the <head> tag, just say ``soup.head``::
+
+ soup.head
+ # <head><title>The Dormouse's story</title></head>
+
+ soup.title
+ # <title>The Dormouse's story</title>
+
+You can use this trick again and again to zoom in on a certain part
+of the parse tree. This code gets the first <b> tag beneath the <body> tag::
+
+ soup.body.b
+ # <b>The Dormouse's story</b>
+
+Using a tag name as an attribute will give you only the `first` tag by that
+name::
+
+ soup.a
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+
+If you need to get `all` the <a> tags, or anything more complicated
+than the first tag with a certain name, you'll need to use one of the
+methods described in `Searching the tree`_, such as `find_all()`::
+
+ soup.find_all('a')
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+``.contents`` and ``.children``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A tag's children are available in a list called ``.contents``::
+
+ head_tag = soup.head
+ head_tag
+ # <head><title>The Dormouse's story</title></head>
+
+ head_tag.contents
+ # [<title>The Dormouse's story</title>]
+
+ title_tag = head_tag.contents[0]
+ title_tag
+ # <title>The Dormouse's story</title>
+ title_tag.contents
+ # [u"The Dormouse's story"]
+
+The ``BeautifulSoup`` object itself has children. In this case, the
+<html> tag is the child of the ``BeautifulSoup`` object::
+
+ len(soup.contents)
+ # 1
+ soup.contents[0].name
+ # u'html'
+
+A string does not have ``.contents``, because it can't contain
+anything::
+
+ text = title_tag.contents[0]
+ text.contents
+ # AttributeError: 'NavigableString' object has no attribute 'contents'
+
+Instead of getting them as a list, you can iterate over a tag's
+children using the ``.children`` generator::
+
+ for child in title_tag.children:
+     print(child)
+ # The Dormouse's story
+
+``.descendants``
+^^^^^^^^^^^^^^^^
+
+The ``.contents`` and ``.children`` attributes only consider a tag's
+`direct` children. For instance, the <head> tag has a single direct
+child--the <title> tag::
+
+ head_tag.contents
+ # [<title>The Dormouse's story</title>]
+
+But the <title> tag itself has a child: the string "The Dormouse's
+story". There's a sense in which that string is also a child of the
+<head> tag. The ``.descendants`` attribute lets you iterate over `all`
+of a tag's children, recursively: its direct children, the children of
+its direct children, and so on::
+
+ for child in head_tag.descendants:
+     print(child)
+ # <title>The Dormouse's story</title>
+ # The Dormouse's story
+
+The <head> tag has only one child, but it has two descendants: the
+<title> tag and the <title> tag's child. The ``BeautifulSoup`` object
+only has one direct child (the <html> tag), but it has a whole lot of
+descendants::
+
+ len(list(soup.children))
+ # 1
+ len(list(soup.descendants))
+ # 25
+
+.. _.string:
+
+``.string``
+^^^^^^^^^^^
+
+If a tag has only one child, and that child is a string, the string is
+made available as ``.string``::
+
+ title_tag.string
+ # u"The Dormouse's story"
+
+If a tag's only child is another tag, and `that` tag has a
+``.string``, then the parent tag is considered to have the same
+``.string`` as its child::
+
+ head_tag.contents
+ # [<title>The Dormouse's story</title>]
+
+ head_tag.string
+ # u"The Dormouse's story"
+
+If a tag contains more than one thing, then it's not clear what
+``.string`` should refer to, so ``.string`` is defined to be
+``None``::
+
+ print(soup.html.string)
+ # None
+
+.. _string-generators:
+
+``.strings`` and ``.stripped_strings``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If there's more than one thing inside a tag, you can still look at
+just the strings. Use the ``.strings`` generator::
+
+ for string in soup.strings:
+     print(repr(string))
+ # u"The Dormouse's story"
+ # u'\n\n'
+ # u"The Dormouse's story"
+ # u'\n\n'
+ # u'Once upon a time there were three little sisters; and their names were\n'
+ # u'Elsie'
+ # u',\n'
+ # u'Lacie'
+ # u' and\n'
+ # u'Tillie'
+ # u';\nand they lived at the bottom of a well.'
+ # u'\n\n'
+ # u'...'
+ # u'\n'
+
+These strings tend to have a lot of extra whitespace, which you can
+remove by using the ``.stripped_strings`` generator instead::
+
+ for string in soup.stripped_strings:
+     print(repr(string))
+ # u"The Dormouse's story"
+ # u"The Dormouse's story"
+ # u'Once upon a time there were three little sisters; and their names were'
+ # u'Elsie'
+ # u','
+ # u'Lacie'
+ # u'and'
+ # u'Tillie'
+ # u';\nand they lived at the bottom of a well.'
+ # u'...'
+
+Here, strings consisting entirely of whitespace are ignored, and
+whitespace at the beginning and end of strings is removed.
+
+Going up
+--------
+
+Continuing the "family tree" analogy, every tag and every string has a
+`parent`: the tag that contains it.
+
+.. _.parent:
+
+``.parent``
+^^^^^^^^^^^
+
+You can access an element's parent with the ``.parent`` attribute. In
+the example "three sisters" document, the <head> tag is the parent
+of the <title> tag::
+
+ title_tag = soup.title
+ title_tag
+ # <title>The Dormouse's story</title>
+ title_tag.parent
+ # <head><title>The Dormouse's story</title></head>
+
+The title string itself has a parent: the <title> tag that contains
+it::
+
+ title_tag.string.parent
+ # <title>The Dormouse's story</title>
+
+The parent of a top-level tag like <html> is the ``BeautifulSoup`` object
+itself::
+
+ html_tag = soup.html
+ type(html_tag.parent)
+ # <class 'bs4.BeautifulSoup'>
+
+And the ``.parent`` of a ``BeautifulSoup`` object is defined as None::
+
+ print(soup.parent)
+ # None
+
+.. _.parents:
+
+``.parents``
+^^^^^^^^^^^^
+
+You can iterate over all of an element's parents with
+``.parents``. This example uses ``.parents`` to travel from an <a> tag
+buried deep within the document, to the very top of the document::
+
+ link = soup.a
+ link
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+ for parent in link.parents:
+     if parent is None:
+         print(parent)
+     else:
+         print(parent.name)
+ # p
+ # body
+ # html
+ # [document]
+ # None
+
+Going sideways
+--------------
+
+Consider a simple document like this::
+
+ sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
+ print(sibling_soup.prettify())
+ # <html>
+ #  <body>
+ #   <a>
+ #    <b>
+ #     text1
+ #    </b>
+ #    <c>
+ #     text2
+ #    </c>
+ #   </a>
+ #  </body>
+ # </html>
+
+The <b> tag and the <c> tag are at the same level: they're both direct
+children of the same tag. We call them `siblings`. When a document is
+pretty-printed, siblings show up at the same indentation level. You
+can also use this relationship in the code you write.
+
+``.next_sibling`` and ``.previous_sibling``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You can use ``.next_sibling`` and ``.previous_sibling`` to navigate
+between page elements that are on the same level of the parse tree::
+
+ sibling_soup.b.next_sibling
+ # <c>text2</c>
+
+ sibling_soup.c.previous_sibling
+ # <b>text1</b>
+
+The <b> tag has a ``.next_sibling``, but no ``.previous_sibling``,
+because there's nothing before the <b> tag `on the same level of the
+tree`. For the same reason, the <c> tag has a ``.previous_sibling``
+but no ``.next_sibling``::
+
+ print(sibling_soup.b.previous_sibling)
+ # None
+ print(sibling_soup.c.next_sibling)
+ # None
+
+The strings "text1" and "text2" are `not` siblings, because they don't
+have the same parent::
+
+ sibling_soup.b.string
+ # u'text1'
+
+ print(sibling_soup.b.string.next_sibling)
+ # None
+
+In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a
+tag will usually be a string containing whitespace. Going back to the
+"three sisters" document::
+
+ <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
+ <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
+ <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
+
+You might think that the ``.next_sibling`` of the first <a> tag would
+be the second <a> tag. But actually, it's a string: the comma and
+newline that separate the first <a> tag from the second::
+
+ link = soup.a
+ link
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+
+ link.next_sibling
+ # u',\n'
+
+The second <a> tag is actually the ``.next_sibling`` of the comma::
+
+ link.next_sibling.next_sibling
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
+
+.. _sibling-generators:
+
+``.next_siblings`` and ``.previous_siblings``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You can iterate over a tag's siblings with ``.next_siblings`` or
+``.previous_siblings``::
+
+ for sibling in soup.a.next_siblings:
+     print(repr(sibling))
+ # u',\n'
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
+ # u' and\n'
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
+ # u'; and they lived at the bottom of a well.'
+ # None
+
+ for sibling in soup.find(id="link3").previous_siblings:
+     print(repr(sibling))
+ # u' and\n'
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
+ # u',\n'
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+ # u'Once upon a time there were three little sisters; and their names were\n'
+ # None
+
+Going back and forth
+--------------------
+
+Take a look at the beginning of the "three sisters" document::
+
+ <html><head><title>The Dormouse's story</title></head>
+ <p class="title"><b>The Dormouse's story</b></p>
+
+An HTML parser takes this string of characters and turns it into a
+series of events: "open an <html> tag", "open a <head> tag", "open a
+<title> tag", "add a string", "close the <title> tag", "open a <p>
+tag", and so on. Beautiful Soup offers tools for reconstructing the
+initial parse of the document.
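You can watch that stream of events yourself with the ``HTMLParser`` class from the Python standard library. This is just an illustration of what any HTML parser sees, not part of Beautiful Soup's API; the module is called ``html.parser`` on Python 3 and ``HTMLParser`` on Python 2.

```python
from html.parser import HTMLParser  # the module is "HTMLParser" on Python 2

class EventLogger(HTMLParser):
    """Records each parse event in the order it occurs."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append('open a <%s> tag' % tag)
    def handle_data(self, data):
        self.events.append('add a string')
    def handle_endtag(self, tag):
        self.events.append('close the </%s> tag' % tag)

logger = EventLogger()
logger.feed("<html><head><title>The Dormouse's story</title></head>")
for event in logger.events:
    print(event)
# open a <html> tag
# open a <head> tag
# open a <title> tag
# add a string
# close the </title> tag
# close the </head> tag
```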
+
+.. _element-generators:
+
+``.next_element`` and ``.previous_element``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``.next_element`` attribute of a string or tag points to whatever
+was parsed immediately afterwards. It might be the same as
+``.next_sibling``, but it's usually drastically different.
+
+Here's the final <a> tag in the "three sisters" document. Its
+``.next_sibling`` is a string: the conclusion of the sentence that was
+interrupted by the start of the <a> tag.::
+
+ last_a_tag = soup.find("a", id="link3")
+ last_a_tag
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
+
+ last_a_tag.next_sibling
+ # u'; and they lived at the bottom of a well.'
+
+But the ``.next_element`` of that <a> tag, the thing that was parsed
+immediately after the <a> tag, is `not` the rest of that sentence:
+it's the word "Tillie"::
+
+ last_a_tag.next_element
+ # u'Tillie'
+
+That's because in the original markup, the word "Tillie" appeared
+before that semicolon. The parser encountered an <a> tag, then the
+word "Tillie", then the closing </a> tag, then the semicolon and rest of
+the sentence. The semicolon is on the same level as the <a> tag, but the
+word "Tillie" was encountered first.
+
+The ``.previous_element`` attribute is the exact opposite of
+``.next_element``. It points to whatever element was parsed
+immediately before this one::
+
+ last_a_tag.previous_element
+ # u' and\n'
+ last_a_tag.previous_element.next_element
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
+
+``.next_elements`` and ``.previous_elements``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You should get the idea by now. You can use these iterators to move
+forward or backward in the document as it was parsed::
+
+ for element in last_a_tag.next_elements:
+     print(repr(element))
+ # u'Tillie'
+ # u';\nand they lived at the bottom of a well.'
+ # u'\n\n'
+ # <p class="story">...</p>
+ # u'...'
+ # u'\n'
+ # None
+
+Searching the tree
+==================
+
+Beautiful Soup defines a lot of methods for searching the parse tree,
+but they're all very similar. I'm going to spend a lot of time explaining
+the two most popular methods: ``find()`` and ``find_all()``. The other
+methods take almost exactly the same arguments, so I'll just cover
+them briefly.
+
+Once again, I'll be using the "three sisters" document as an example::
+
+ html_doc = """
+ <html><head><title>The Dormouse's story</title></head>
+
+ <p class="title"><b>The Dormouse's story</b></p>
+
+ <p class="story">Once upon a time there were three little sisters; and their names were
+ <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
+ <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
+ <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
+ and they lived at the bottom of a well.</p>
+
+ <p class="story">...</p>
+ """
+
+ from bs4 import BeautifulSoup
+ soup = BeautifulSoup(html_doc)
+
+By passing in a filter to a method like ``find_all()``, you can
+isolate whatever parts of the document you're interested in.
+
+Kinds of filters
+----------------
+
+Before talking in detail about ``find_all()`` and similar methods, I
+want to show examples of different filters you can pass into these
+methods. These filters show up again and again, throughout the
+search API. You can use them to filter based on a tag's name,
+on its attributes, on the text of a string, or on some combination of
+these.
+
+.. _a string:
+
+A string
+^^^^^^^^
+
+The simplest filter is a string. Pass a string to a search method and
+Beautiful Soup will perform a match against that exact string. This
+code finds all the <b> tags in the document::
+
+ soup.find_all('b')
+ # [<b>The Dormouse's story</b>]
+
+.. _a regular expression:
+
+A regular expression
+^^^^^^^^^^^^^^^^^^^^
+
+If you pass in a regular expression object, Beautiful Soup will filter
+against that regular expression. This code finds all the tags whose
+names start with the letter "b"; in this case, the <body> tag and the
+<b> tag::
+
+ import re
+ for tag in soup.find_all(re.compile("b.*")):
+     print(tag.name)
+ # body
+ # b
+
+.. _a list:
+
+A list
+^^^^^^
+
+If you pass in a list, Beautiful Soup will allow a string match
+against `any` item in that list. This code finds all the <a> tags
+`and` all the <b> tags::
+
+ soup.find_all(["a", "b"])
+ # [<b>The Dormouse's story</b>,
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+.. _the value True:
+
+``True``
+^^^^^^^^
+
+The value ``True`` matches everything it can. This code finds `all`
+the tags in the document, but none of the text strings::
+
+ for tag in soup.find_all(True):
+     print(tag.name)
+ # html
+ # head
+ # title
+ # body
+ # p
+ # b
+ # p
+ # a
+ # a
+ # a
+ # p
+
+.. _a function:
+
+A function
+^^^^^^^^^^
+
+If none of the other matches work for you, define a function that
+takes an element as its only argument. The function should return
+``True`` if the argument matches, and ``False`` otherwise.
+
+Here's a function that returns ``True`` if a tag defines the "class"
+attribute but doesn't define the "id" attribute::
+
+ def has_class_but_no_id(tag):
+     return tag.has_key('class') and not tag.has_key('id')
+
+Pass this function into ``find_all()`` and you'll pick up all the <p>
+tags::
+
+ soup.find_all(has_class_but_no_id)
+ # [<p class="title"><b>The Dormouse's story</b></p>,
+ # <p class="story">Once upon a time there were...</p>,
+ # <p class="story">...</p>]
+
+This function only picks up the <p> tags. It doesn't pick up the <a>
+tags, because those tags define both "class" and "id". It doesn't pick
+up tags like <html> and <title>, because those tags don't define
+"class".
+
+Here's a function that returns ``True`` if a tag is surrounded by
+string objects::
+
+ from bs4 import NavigableString
+ def surrounded_by_strings(tag):
+     return (isinstance(tag.next_element, NavigableString)
+             and isinstance(tag.previous_element, NavigableString))
+
+ for tag in soup.find_all(surrounded_by_strings):
+     print(tag.name)
+ # p
+ # a
+ # a
+ # a
+ # p
+
+Now we're ready to look at the search methods in detail.
+
+``find_all()``
+--------------
+
+Signature: find_all(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
+<recursive>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+
+The ``find_all()`` method looks through a tag's descendants and
+retrieves `all` descendants that match your filters. I gave several
+examples in `Kinds of filters`_, but here are a few more::
+
+ soup.find_all("title")
+ # [<title>The Dormouse's story</title>]
+
+ soup.find_all("p", "title")
+ # [<p class="title"><b>The Dormouse's story</b></p>]
+
+ soup.find_all("a")
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+ soup.find_all(id="link2")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+
+ import re
+ soup.find(text=re.compile("sisters"))
+ # u'Once upon a time there were three little sisters; and their names were\n'
+
+Some of these should look familiar, but others are new. What does it
+mean to pass in a value for ``text``, or ``id``? Why does
+``find_all("p", "title")`` find a <p> tag with the CSS class "title"?
+Let's look at the arguments to ``find_all()``.
+
+.. _name:
+
+The ``name`` argument
+^^^^^^^^^^^^^^^^^^^^^
+
+Pass in a value for ``name`` and you'll tell Beautiful Soup to only
+consider tags with certain names. Text strings will be ignored, as
+will tags whose names don't match.
+
+This is the simplest usage::
+
+ soup.find_all("title")
+ # [<title>The Dormouse's story</title>]
+
+Recall from `Kinds of filters`_ that the value to ``name`` can be `a
+string`_, `a regular expression`_, `a list`_, `a function`_, or `the value
+True`_.
+
+.. _kwargs:
+
+The keyword arguments
+^^^^^^^^^^^^^^^^^^^^^
+
+Any argument that's not recognized will be turned into a filter on tag
+attributes. If you pass in a value for an argument called ``id``,
+Beautiful Soup will filter against the tag's 'id' attribute::
+
+ soup.find_all(id='link2')
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+
+If you pass in a value for ``href``, Beautiful Soup will filter
+against the tag's 'href' attribute::
+
+ soup.find_all(href=re.compile("elsie"))
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
+
+You can filter an attribute based on `a string`_, `a regular
+expression`_, `a list`_, `a function`_, or `the value True`_.
+
+This code finds all tags that have an ``id`` attribute, regardless of
+what the value is::
+
+ soup.find_all(id=True)
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+You can filter multiple attributes at once by passing in more than one
+keyword argument::
+
+ soup.find_all(href=re.compile("elsie"), id='link1')
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
+
+.. _attrs:
+
+``attrs`` and searching by CSS class
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Instead of using keyword arguments, you can filter tags based on their
+attributes by passing a dictionary in for ``attrs``. These two lines of
+code are equivalent::
+
+ soup.find_all(href=re.compile("elsie"), id='link1')
+ soup.find_all(attrs={'href' : re.compile("elsie"), 'id': 'link1'})
+
+The ``attrs`` argument would be a pretty obscure feature were it not for
+one thing: CSS. It's very useful to search for a tag that has a
+certain CSS class, but the name of the corresponding attribute,
+"class", is also a Python reserved word.
+
+You can use ``attrs`` to search by CSS class::
+
+ soup.find_all("a", { "class" : "sister" })
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+But that's a lot of code for such a common operation. Instead, you can
+pass a string for ``attrs`` rather than a dictionary. The string will be
+used to restrict the CSS class::
+
+ soup.find_all("a", "sister")
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+.. _text:
+
+The ``text`` argument
+^^^^^^^^^^^^^^^^^^^^^
+
+With ``text`` you can search for strings instead of tags. As with
+``name`` and the keyword arguments, you can pass in `a string`_, `a
+regular expression`_, `a list`_, `a function`_, or `the value True`_.
+Here are some examples::
+
+ soup.find_all(text="Elsie")
+ # [u'Elsie']
+
+ soup.find_all(text=["Tillie", "Elsie", "Lacie"])
+ # [u'Elsie', u'Lacie', u'Tillie']
+
+ soup.find_all(text=re.compile("Dormouse"))
+ # [u"The Dormouse's story", u"The Dormouse's story"]
+
+ def is_the_only_string_within_a_tag(s):
+ """Return True if this string is the only child of its parent tag."""
+ return (s == s.parent.string)
+
+ soup.find_all(text=is_the_only_string_within_a_tag)
+ # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
+
+.. _limit:
+
+The ``limit`` argument
+^^^^^^^^^^^^^^^^^^^^^^
+
+``find_all()`` returns all the tags and strings that match your
+filters. This can take a while if the document is large. If you don't
+need `all` the results, you can pass in a number for ``limit``. This
+works just like the LIMIT keyword in SQL. It tells Beautiful Soup to
+stop gathering results after it's found a certain number.
+
+There are three links in the "three sisters" document, but this code
+only finds the first two::
+
+ soup.find_all("a", limit=2)
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+
+.. _recursive:
+
+The ``recursive`` argument
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If you call ``mytag.find_all()``, Beautiful Soup will examine all the
+descendants of ``mytag``: its children, its children's children, and
+so on. If you only want Beautiful Soup to consider direct children,
+you can pass in ``recursive=False``. See the difference here::
+
+ soup.html.find_all("title")
+ # [<title>The Dormouse's story</title>]
+
+ soup.html.find_all("title", recursive=False)
+ # []
+
+Here's that part of the document::
+
+ <html>
+ <head>
+ <title>
+ The Dormouse's story
+ </title>
+ </head>
+ ...
+
+The <title> tag is beneath the <html> tag, but it's not `directly`
+beneath the <html> tag: the <head> tag is in the way. Beautiful Soup
+finds the <title> tag when it's allowed to look at all descendants of
+the <html> tag, but when ``recursive=False`` restricts it to the
+<html> tag's immediate children, it finds nothing.
+
+Beautiful Soup offers a lot of tree-searching methods (covered below),
+and they mostly take the same arguments as ``find_all()``: ``name``,
+``attrs``, ``text``, ``limit``, and the keyword arguments. But the
+``recursive`` argument is different: ``find_all()`` and ``find()`` are
+the only methods that support it. Passing ``recursive=False`` into a
+method like ``find_parents()`` wouldn't be very useful.
+
+Calling a tag is like calling ``find_all()``
+--------------------------------------------
+
+Because ``find_all()`` is the most popular method in the Beautiful
+Soup search API, you can use a shortcut for it. If you treat the
+``BeautifulSoup`` object or a ``Tag`` object as though it were a
+function, then it's the same as calling ``find_all()`` on that
+object. These two lines of code are equivalent::
+
+ soup.find_all("a")
+ soup("a")
+
+These two lines are also equivalent::
+
+ soup.title.find_all(text=True)
+ soup.title(text=True)
+
+``find()``
+----------
+
+Signature: find(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
+<recursive>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
+
+The ``find_all()`` method scans the entire document looking for
+results, but sometimes you only want to find one result. If you know a
+document only has one <body> tag, it's a waste of time to scan the
+entire document looking for more. Rather than passing in ``limit=1``
+every time you call ``find_all``, you can use the ``find()``
+method. These two lines of code are `nearly` equivalent::
+
+ soup.find_all('title', limit=1)
+ # [<title>The Dormouse's story</title>]
+
+ soup.find('title')
+ # <title>The Dormouse's story</title>
+
+The only difference is that ``find_all()`` returns a list containing
+the single result, and ``find()`` just returns the result.
+
+If ``find_all()`` can't find anything, it returns an empty list. If
+``find()`` can't find anything, it returns ``None``::
+
+ print(soup.find("nosuchtag"))
+ # None
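+
+Since ``find()`` can return ``None``, it's a good idea to check the
+result before using it. Here's a minimal defensive sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Some text</p>", "html.parser")

# find() returns None on a miss, so check the result before using it,
# or the next attribute access will raise an AttributeError.
tag = soup.find("nosuchtag")
if tag is not None:
    print(tag.name)
else:
    print("no such tag")
# no such tag
```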
+
+Remember the ``soup.head.title`` trick from `Navigating using tag
+names`_? That trick works by repeatedly calling ``find()``::
+
+ soup.head.title
+ # <title>The Dormouse's story</title>
+
+ soup.find("head").find("title")
+ # <title>The Dormouse's story</title>
+
+``find_parents()`` and ``find_parent()``
+----------------------------------------
+
+Signature: find_parents(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+
+Signature: find_parent(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
+
+I spent a lot of time above covering ``find_all()`` and
+``find()``. The Beautiful Soup API defines ten other methods for
+searching the tree, but don't be afraid. Five of these methods are
+basically the same as ``find_all()``, and the other five are basically
+the same as ``find()``. The only differences are in what parts of the
+tree they search.
+
+First let's consider ``find_parents()`` and
+``find_parent()``. Remember that ``find_all()`` and ``find()`` work
+their way down the tree, looking at a tag's descendants. These methods
+do the opposite: they work their way `up` the tree, looking at a tag's
+(or a string's) parents. Let's try them out, starting from a string
+buried deep in the "three sisters" document::
+
+ a_string = soup.find(text="Lacie")
+ a_string
+ # u'Lacie'
+
+ a_string.find_parents("a")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+
+ a_string.find_parent("p")
+ # <p class="story">Once upon a time there were three little sisters; and their names were
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
+ # and they lived at the bottom of a well.</p>
+
+ a_string.find_parents("p", "title")
+ # []
+
+One of the three <a> tags is the direct parent of the string in
+question, so our search finds it. One of the three <p> tags is an
+indirect parent of the string, and our search finds that as
+well. There's a <p> tag with the CSS class "title" `somewhere` in the
+document, but it's not one of this string's parents, so we can't find
+it with ``find_parents()``.
+
+You may have made the connection between ``find_parent()`` and
+``find_parents()``, and the `.parent`_ and `.parents`_ attributes
+mentioned earlier. The connection is very strong. These search methods
+actually use ``.parents`` to iterate over all the parents, and check
+each one against the provided filter to see if it matches.
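+To make that connection concrete, here's a small sketch (using a
+cut-down document, not the full "three sisters" example) showing that
+filtering ``.parents`` by hand gives the same result as
+``find_parents()``:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story"><a id="link2">Lacie</a></p>', "html.parser")
a_string = soup.find(text="Lacie")

# Filtering .parents by hand yields the same tags find_parents("a") returns.
by_hand = [parent for parent in a_string.parents if parent.name == "a"]
print(by_hand == a_string.find_parents("a"))
# True
```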
+
+``find_next_siblings()`` and ``find_next_sibling()``
+----------------------------------------------------
+
+Signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+
+Signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
+
+These methods use :ref:`.next_siblings <sibling-generators>` to
+iterate over the rest of an element's siblings in the tree. The
+``find_next_siblings()`` method returns all the siblings that match,
+and ``find_next_sibling()`` only returns the first one::
+
+ first_link = soup.a
+ first_link
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+
+ first_link.find_next_siblings("a")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+ first_story_paragraph = soup.find("p", "story")
+ first_story_paragraph.find_next_sibling("p")
+ # <p class="story">...</p>
+
+``find_previous_siblings()`` and ``find_previous_sibling()``
+------------------------------------------------------------
+
+Signature: find_previous_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+
+Signature: find_previous_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
+
+These methods use :ref:`.previous_siblings <sibling-generators>` to iterate over an element's
+siblings that precede it in the tree. The ``find_previous_siblings()``
+method returns all the siblings that match, and
+``find_previous_sibling()`` only returns the first one::
+
+ last_link = soup.find("a", id="link3")
+ last_link
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
+
+ last_link.find_previous_siblings("a")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
+
+ first_story_paragraph = soup.find("p", "story")
+ first_story_paragraph.find_previous_sibling("p")
+ # <p class="title"><b>The Dormouse's story</b></p>
+
+
+``find_all_next()`` and ``find_next()``
+---------------------------------------
+
+Signature: find_all_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+
+Signature: find_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
+
+These methods use :ref:`.next_elements <element-generators>` to
+iterate over the tags and strings that come after an element in the
+document. The ``find_all_next()`` method returns all matches, and
+``find_next()`` only returns the first match::
+
+ first_link = soup.a
+ first_link
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+
+ first_link.find_all_next(text=True)
+ # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
+ # u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']
+
+ first_link.find_next("p")
+ # <p class="story">...</p>
+
+In the first example, the string "Elsie" showed up, even though it was
+contained within the <a> tag we started from. In the second example,
+the last <p> tag in the document showed up, even though it's not in
+the same part of the tree as the <a> tag we started from. For these
+methods, all that matters is that an element match the filter, and
+show up later in the document than the starting element.
+
+``find_all_previous()`` and ``find_previous()``
+-----------------------------------------------
+
+Signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+
+Signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`)
+
+These methods use :ref:`.previous_elements <element-generators>` to
+iterate over the tags and strings that come before an element in the
+document. The ``find_all_previous()`` method returns all matches, and
+``find_previous()`` only returns the first match::
+
+ first_link = soup.a
+ first_link
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+
+ first_link.find_all_previous("p")
+ # [<p class="story">Once upon a time there were three little sisters; ...</p>,
+ # <p class="title"><b>The Dormouse's story</b></p>]
+
+ first_link.find_previous("title")
+ # <title>The Dormouse's story</title>
+
+The call to ``find_all_previous("p")`` found the first paragraph in
+the document (the one with class="title"), but it also found the
+second paragraph, the <p> tag that contains the <a> tag we started
+with. This shouldn't be too surprising: we're looking at all the tags
+that show up earlier in the document than the one we started with. A
+<p> tag that contains an <a> tag must have shown up earlier in the
+document.
+
+Modifying the tree
+==================
+
+Beautiful Soup's main strength is in searching the parse tree, but you
+can also modify the tree and write your changes as a new HTML or XML
+document.
+
+Changing tag names and attributes
+---------------------------------
+
+I covered this earlier, in `Attributes`_, but it bears repeating. You
+can rename a tag, change the values of its attributes, add new
+attributes, and delete attributes::
+
+ soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
+ tag = soup.b
+
+ tag.name = "blockquote"
+ tag['class'] = 'verybold'
+ tag['id'] = 1
+ tag
+ # <blockquote class="verybold" id="1">Extremely bold</blockquote>
+
+ del tag['class']
+ del tag['id']
+ tag
+ # <blockquote>Extremely bold</blockquote>
+
+
+Modifying ``.string``
+---------------------
+
+If you set a tag's ``.string`` attribute, the tag's contents are
+replaced with the string you give::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+
+ tag = soup.a
+ tag.string = "New link text."
+ tag
+ # <a href="http://example.com/">New link text.</a>
+
+Be careful: if the tag contained other tags, they and all their
+contents will be destroyed.
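+Here's a small sketch of what that destruction looks like:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Assigning to .string replaces everything inside the tag,
# including the nested <i> tag, which is destroyed.
tag.string = "New link text."
print(tag)
# <a href="http://example.com/">New link text.</a>
print(tag.find("i"))
# None
```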
+
+``append()``
+------------
+
+You can add to a tag's contents with ``Tag.append()``. It works just
+like calling ``.append()`` on a Python list::
+
+ soup = BeautifulSoup("<a>Foo</a>")
+ soup.a.append("Bar")
+
+ soup
+ # <html><head></head><body><a>FooBar</a></body></html>
+ soup.a.contents
+ # [u'Foo', u'Bar']
+
+``BeautifulSoup.new_tag()``
+---------------------------
+
+If you need to add a string to a document, no problem--you can just
+pass a Python string in to ``append()``. But what if you need to
+create a whole new tag? The best solution is to call the factory
+method ``BeautifulSoup.new_tag()``::
+
+ soup = BeautifulSoup("<b></b>")
+ original_tag = soup.b
+
+ new_tag = soup.new_tag("a", href="http://www.example.com")
+ original_tag.append(new_tag)
+ original_tag
+ # <b><a href="http://www.example.com"></a></b>
+
+ new_tag.string = "Link text."
+ original_tag
+ # <b><a href="http://www.example.com">Link text.</a></b>
+
+Only the first argument, the tag name, is required.
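+As a sketch, you can also create a bare tag and fill in its attributes
+afterwards, treating the new tag like a dictionary:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")

# Only the tag name is required; attributes can be set afterwards,
# just like assigning to dictionary keys.
new_tag = soup.new_tag("a")
new_tag["href"] = "http://www.example.com"
print(new_tag)
# <a href="http://www.example.com"></a>
```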
+
+``insert()``
+------------
+
+``Tag.insert()`` is just like ``Tag.append()``, except the new element
+doesn't necessarily go at the end of its parent's
+``.contents``. It'll be inserted at whatever numeric position you
+say. It works just like ``.insert()`` on a Python list::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ tag = soup.a
+
+ tag.insert(1, "but did not endorse ")
+ tag
+ # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
+ tag.contents
+ # [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
+
+``clear()``
+-----------
+
+``Tag.clear()`` removes the contents of a tag::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ tag = soup.a
+
+ tag.clear()
+ tag
+ # <a href="http://example.com/"></a>
+
+``extract()``
+-------------
+
+``PageElement.extract()`` removes a tag or string from the tree, and
+returns it::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ a_tag = soup.a
+
+ i_tag = soup.i.extract()
+
+ a_tag
+ # <a href="http://example.com/">I linked to</a>
+
+ i_tag
+ # <i>example.com</i>
+
+ print(i_tag.parent)
+ # None
+
+At this point you effectively have two parse trees: one rooted at the
+``BeautifulSoup`` object you used to parse the document, and one rooted
+at the tag that was extracted. You can go on to call ``extract`` on
+a child of the element you extracted::
+
+ my_string = i_tag.string.extract()
+ my_string
+ # u'example.com'
+
+ print(my_string.parent)
+ # None
+ i_tag
+ # <i></i>
+
+
+``decompose()``
+---------------
+
+``Tag.decompose()`` removes a tag from the tree, then `completely
+destroys it and its contents`::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ a_tag = soup.a
+
+ soup.i.decompose()
+
+ a_tag
+ # <a href="http://example.com/">I linked to</a>
+
+
+.. _replace_with:
+
+``replace_with()``
+------------------
+
+``PageElement.replace_with()`` removes a tag or string from the tree,
+and replaces it with the tag or string of your choice::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ a_tag = soup.a
+
+ new_tag = soup.new_tag("b")
+ new_tag.string = "example.net"
+ a_tag.i.replace_with(new_tag)
+
+ a_tag
+ # <a href="http://example.com/">I linked to <b>example.net</b></a>
+
+The removed tag or string is gone forever.
+
+``replace_with_children()``
+---------------------------
+
+``Tag.replace_with_children()`` replaces a tag with whatever's inside
+that tag. It's good for stripping out markup::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ a_tag = soup.a
+
+ a_tag.i.replace_with_children()
+ a_tag
+ # <a href="http://example.com/">I linked to example.com</a>
+
+
+Output
+======
+
+Pretty-printing
+---------------
+
+The ``prettify()`` method will turn a Beautiful Soup parse tree into a
+nicely formatted bytestring, with each HTML/XML tag on its own line::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ soup.prettify()
+ # '<html>\n <head>\n </head>\n <body>\n <a href="http://example.com/">\n...'
+
+ print(soup.prettify())
+ # <html>
+ # <head>
+ # </head>
+ # <body>
+ # <a href="http://example.com/">
+ # I linked to
+ # <i>
+ # example.com
+ # </i>
+ # </a>
+ # </body>
+ # </html>
+
+You can call ``prettify()`` on the top-level ``BeautifulSoup`` object,
+or on any of its ``Tag`` objects::
+
+ print(soup.a.prettify())
+ # <a href="http://example.com/">
+ # I linked to
+ # <i>
+ # example.com
+ # </i>
+ # </a>
+
+Non-pretty printing
+-------------------
+
+If you just want a string, with no fancy formatting, you can call
+``unicode()`` or ``str()`` on a ``BeautifulSoup`` object, or a ``Tag``
+within it::
+
+ str(soup)
+ # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
+
+ unicode(soup.a)
+ # u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
+
+The ``str()`` function returns a string encoded in UTF-8. See
+`Encodings`_ for other options.
+
+Substituting HTML entities
+--------------------------
+
+If you give Beautiful Soup a document that contains HTML or XML
+entities such as "&ldquo;", they'll be converted to Unicode
+characters::
+
+ soup = BeautifulSoup("&ldquo;Hello,&rdquo; he said.")
+ unicode(soup)
+ # u'<html><head></head><body>\u201cHello,\u201d he said.</body></html>'
+
+If you then convert the document to a string, the Unicode characters
+will be encoded as UTF-8::
+
+ str(soup)
+ # '<html><head></head><body>\xe2\x80\x9cHello,\xe2\x80\x9d he said.</body></html>'
+
+You can get the HTML entities back (or create them where they didn't
+exist) by calling ``.encode()`` and passing in
+``substitute_html_entities=True``::
+
+ soup.encode(substitute_html_entities=True)
+ # '<html><head></head><body>&ldquo;Hello,&rdquo; he said.</body></html>'
+
+``get_text()``
+--------------
+
+If you only want the text part of a document or tag, you can use the
+``get_text()`` method. It returns all the text in a document or
+beneath a tag, as a single Unicode string::
+
+ markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
+ soup = BeautifulSoup(markup)
+
+ soup.get_text()
+ # u'\nI linked to example.com\n'
+ soup.i.get_text()
+ # u'example.com'
+
+You can specify a string to be used to join the bits of text
+together::
+
+ soup.get_text("|")
+ # u'\nI linked to |example.com|\n'
+
+You can tell Beautiful Soup to strip whitespace from the beginning and
+end of each bit of text::
+
+ soup.get_text("|", strip=True)
+ # u'I linked to|example.com'
+
+But at that point you might want to use the :ref:`.stripped_strings <string-generators>`
+generator instead, and process the text yourself::
+
+ [text for text in soup.stripped_strings]
+ # [u'I linked to', u'example.com']
+
+Choosing a parser
+=================
+
+If you just need to parse some HTML, you can dump the markup into the
+``BeautifulSoup`` constructor, and it'll probably be fine. Beautiful
+Soup will pick a parser for you and parse the data. But there are a
+few additional arguments you can pass in to the constructor to change
+which parser is used.
+
+The first argument to the ``BeautifulSoup`` constructor is a string or
+an open filehandle--the markup you want parsed. The second argument is
+`how` you'd like the markup parsed.
+
+If you don't specify anything, you'll get the best HTML parser that's
+installed. Beautiful Soup ranks lxml's parser as being the best, then
+html5lib's, then Python's built-in parser. You can override this by
+specifying one of the following:
+
+* What type of markup you want to parse. Currently supported are
+ "html", "xml", and "html5".
+
+* The name of the parser library you want to use. Currently supported
+ options are "lxml", "html5lib", and "html.parser" (Python's
+ built-in HTML parser).
+
+Some examples::
+
+ BeautifulSoup(markup, "lxml")
+ BeautifulSoup(markup, "xml")
+ BeautifulSoup(markup, "html5")
+
+You can specify a list of the parser features you want, instead of
+just one. Right now this is mostly useful for distinguishing between
+lxml's HTML parser and its XML parser::
+
+ BeautifulSoup(markup, ["html", "lxml"])
+ BeautifulSoup(markup, ["xml", "lxml"])
+
+If you don't have an appropriate parser installed, Beautiful Soup will
+ignore your request and pick a different parser. For instance, right
+now the only supported XML parser is lxml, so if you don't have lxml
+installed, asking for an XML parser won't give you one, and asking for
+"lxml" won't work either.
+
+Why would you use one parser over another? Because different parsers
+will create different parse trees from the same document. The biggest
+differences are between HTML parsers and XML parsers. Here's a short
+document, parsed as HTML::
+
+ BeautifulSoup("<a><b /></a>")
+ # <html><head></head><body><a><b></b></a></body></html>
+
+Since an empty <b /> tag is not valid HTML, the parser turns it into a
+<b></b> tag pair.
+
+Here's the same document parsed as XML (running this requires that you
+have lxml installed). Note that the empty <b /> tag is left alone, and
+that the document is given an XML declaration instead of being put
+into an <html> tag.::
+
+ BeautifulSoup("<a><b /></a>", "xml")
+ # <?xml version="1.0" encoding="utf-8"?>
+ # <a><b /></a>
+
+There are also differences between HTML parsers. If you give Beautiful
+Soup a perfectly-formed HTML document, these differences won't
+matter. One parser may be faster than another, but they'll all give
+you a data structure that looks exactly like the original HTML
+document.
+
+But if the document is not perfectly-formed, different parsers will
+give different results. Here's a short, invalid document parsed using
+lxml's HTML parser. Note that the dangling </p> tag is simply
+ignored::
+
+ BeautifulSoup("<a></p>", "lxml")
+ # <html><body><a></a></body></html>
+
+Here's the same document parsed using html5lib::
+
+ BeautifulSoup("<a></p>", "html5lib")
+ # <html><head></head><body><a><p></p></a></body></html>
+
+Instead of ignoring the dangling </p> tag, html5lib pairs it with an
+opening <p> tag. This parser also adds an empty <head> tag to the
+document.
+
+Here's the same document parsed with Python's built-in HTML
+parser::
+
+ BeautifulSoup("<a></p>", "html.parser")
+ # <a></a>
+
+Like html5lib, this parser ignores the closing </p> tag. Unlike
+html5lib, this parser makes no attempt to create a well-formed HTML
+document by adding a <body> tag. Unlike lxml, it doesn't even bother
+to add an <html> tag.
+
+Since the document "<a></p>" is invalid, none of these techniques is
+the "correct" way to handle it. The html5lib parser uses techniques
+that are part of the HTML5 standard, so it has the best claim on being
+the "correct" way, but all three techniques are legitimate.
+
+Differences between parsers can affect your script. If you're planning
+on distributing your script to other people, you might want to specify
+in the ``BeautifulSoup`` constructor which parser you used during
+development. That will reduce the chances that your users parse a
+document differently from the way you parse it.
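+For example, naming the parser explicitly pins the behavior down; this
+sketch uses "html.parser" because it ships with Python, so your users
+don't need lxml or html5lib installed:

```python
from bs4 import BeautifulSoup

markup = "<a></p>"

# Naming the parser explicitly ensures everyone who runs this script
# gets the same parse tree for imperfect markup like this.
soup = BeautifulSoup(markup, "html.parser")
print(soup)
# <a></a>
```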
+
+
+Encodings
+=========
+
+Any HTML or XML document is written in a specific encoding like ASCII
+or UTF-8. But when you load that document into Beautiful Soup, you'll
+discover it's been converted to Unicode::
+
+ markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
+ soup = BeautifulSoup(markup)
+ soup.h1
+ # <h1>Sacré bleu!</h1>
+ soup.h1.string
+ # u'Sacr\xe9 bleu!'
+
+It's not magic. (That sure would be nice.) Beautiful Soup uses a
+sub-library called `Unicode, Dammit`_ to detect a document's encoding
+and convert it to Unicode. The autodetected encoding is available as
+the ``.original_encoding`` attribute of the ``BeautifulSoup`` object::
+
+ soup.original_encoding
+ # 'utf-8'
+
+Unicode, Dammit guesses correctly most of the time, but sometimes it
+makes mistakes. Sometimes it guesses correctly, but only after a
+byte-by-byte search of the document that takes a very long time. If
+you happen to know a document's encoding ahead of time, you can avoid
+mistakes and delays by passing it to the ``BeautifulSoup`` constructor
+as ``from_encoding``.
+
+Here's a document written in ISO-8859-8. The document is so short that
+Unicode, Dammit can't get a good lock on it, and misidentifies it as
+ISO-8859-7::
+
+ markup = b"<h1>\xed\xe5\xec\xf9</h1>"
+ soup = BeautifulSoup(markup)
+ soup.h1
+ # <h1>νεμω</h1>
+ soup.original_encoding
+ # 'ISO-8859-7'
+
+We can fix this by passing in the correct ``from_encoding``::
+
+ soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
+ soup.h1
+ # <h1>םולש</h1>
+ soup.original_encoding
+ # 'iso8859-8'
+
+Output encoding
+---------------
+
+When you write out a document from Beautiful Soup, you get a UTF-8
+document, even if the document wasn't in UTF-8 to begin with. Here's a
+document written in the Latin-1 encoding::
+
+ markup = b'''
+ <html>
+ <head>
+ <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
+ </head>
+ <body>
+ <p>Sacr\xe9 bleu!</p>
+ </body>
+ </html>
+ '''
+
+ soup = BeautifulSoup(markup)
+ print(soup.prettify())
+ # <html>
+ # <head>
+ # <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
+ # </head>
+ # <body>
+ # <p>
+ # Sacré bleu!
+ # </p>
+ # </body>
+ # </html>
+
+Note that the <meta> tag has been rewritten to reflect the fact that
+the document is now in UTF-8.
+
+If you don't want UTF-8, you can pass an encoding into ``prettify()``::
+
+ print(soup.prettify("latin-1"))
+ # <html>
+ # <head>
+ # <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
+ # ...
+
+You can also call encode() on the ``BeautifulSoup`` object, or any
+element in the soup, just as if it were a Python string::
+
+ soup.p.encode("latin-1")
+ # '<p>Sacr\xe9 bleu!</p>'
+
+ soup.p.encode("utf-8")
+ # '<p>Sacr\xc3\xa9 bleu!</p>'
+
+Unicode, Dammit
+---------------
+
+You can use Unicode, Dammit without using Beautiful Soup. It's useful
+whenever you have data in an unknown encoding and you just want it to
+become Unicode::
+
+ from bs4 import UnicodeDammit
+ dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
+ print(dammit.unicode_markup)
+ # Sacré bleu!
+ dammit.original_encoding
+ # 'utf-8'
+
+The more data you give Unicode, Dammit, the more accurately it will
+guess. If you have your own suspicions as to what the encoding might
+be, you can pass them in as a list::
+
+ dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
+ print(dammit.unicode_markup)
+ # Sacré bleu!
+ dammit.original_encoding
+ # 'latin-1'
+
+Unicode, Dammit has one special feature that Beautiful Soup doesn't
+use. You can use it to convert Microsoft smart quotes to HTML or XML
+entities::
+
+ markup = b"<p>I just \x93love\x94 Microsoft Word</p>"
+
+ UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
+ # u'<p>I just &ldquo;love&rdquo; Microsoft Word</p>'
+
+ UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
+ # u'<p>I just &#x201C;love&#x201D; Microsoft Word</p>'
+
+You might find this feature useful, but Beautiful Soup doesn't use
+it. Beautiful Soup prefers the default behavior, which is to convert
+Microsoft smart quotes to Unicode characters along with everything
+else::
+
+ UnicodeDammit(markup, ["windows-1252"]).unicode_markup
+ # u'<p>I just \u201clove\u201d Microsoft Word</p>'
+
+Parsing only part of a document
+===============================
+
+Let's say you want to use Beautiful Soup to look at a document's <a>
+tags. It's a waste of time and memory to parse the entire document and
+then go over it again looking for <a> tags. It would be much faster to
+ignore everything that wasn't an <a> tag in the first place. The
+``SoupStrainer`` class allows you to choose which parts of an incoming
+document are parsed. You just create a ``SoupStrainer`` and pass it in
+to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.
+
+(Note that *this feature won't work if you're using the html5lib
+parser*. If you use html5lib, the whole document will be parsed, no
+matter what. In the examples below, I'll be forcing Beautiful Soup to
+use Python's built-in parser.)
+
+``SoupStrainer``
+----------------
+
+The ``SoupStrainer`` class takes the same arguments as a typical
+method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs
+<attrs>`, :ref:`text <text>`, and :ref:`**kwargs <kwargs>`. Here are
+three ``SoupStrainer`` objects::
+
+ from bs4 import SoupStrainer
+
+ only_a_tags = SoupStrainer("a")
+
+ only_tags_with_id_link2 = SoupStrainer(id="link2")
+
+ def is_short_string(string):
+ return len(string) < 10
+
+ only_short_strings = SoupStrainer(text=is_short_string)
+
+I'm going to bring back the "three sisters" document one more time,
+and we'll see what the document looks like when it's parsed with these
+three ``SoupStrainer`` objects::
+
+ html_doc = """
+ <html><head><title>The Dormouse's story</title></head>
+
+ <p class="title"><b>The Dormouse's story</b></p>
+
+ <p class="story">Once upon a time there were three little sisters; and their names were
+ <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
+ <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
+ <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
+ and they lived at the bottom of a well.</p>
+
+ <p class="story">...</p>
+ """
+
+ print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
+ # <a class="sister" href="http://example.com/elsie" id="link1">
+ # Elsie
+ # </a>
+ # <a class="sister" href="http://example.com/lacie" id="link2">
+ # Lacie
+ # </a>
+ # <a class="sister" href="http://example.com/tillie" id="link3">
+ # Tillie
+ # </a>
+
+ print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
+ # <a class="sister" href="http://example.com/lacie" id="link2">
+ # Lacie
+ # </a>
+
+ print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
+ # Elsie
+ # ,
+ # Lacie
+ # and
+ # Tillie
+ # ...
+ #
+
+You can also pass a ``SoupStrainer`` into any of the methods covered
+in `Searching the tree`_. This probably isn't terribly useful, but I
+thought I'd mention it::
+
+ soup = BeautifulSoup(html_doc)
+ soup.find_all(only_short_strings)
+ # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
+ # u'\n\n', u'...', u'\n']
+
+Troubleshooting
+===============
+
+Parsing XML
+-----------
+
+By default, Beautiful Soup parses documents as HTML. To parse a
+document as XML, pass in "xml" as the second argument to the
+``BeautifulSoup`` constructor::
+
+ soup = BeautifulSoup(markup, "xml")
+
+You'll need to :ref:`have lxml installed <parser-installation>`.
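As a quick sketch of the difference (the markup here is made up), an
XML document parsed this way keeps its tags exactly as written, with no
HTML-specific handling:

```python
from bs4 import BeautifulSoup

markup = "<doc><title>Sense and Sensibility</title></doc>"

# "xml" tells Beautiful Soup to use lxml's XML parser.
soup = BeautifulSoup(markup, "xml")
print(soup.title.string)
# Sense and Sensibility
```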
+
+Improving Performance
+---------------------
+
+Beautiful Soup will never be as fast as the parsers it sits on top
+of. If response time is critical, if you're paying for computer time
+by the hour, or if there's any other reason why computer time is more
+valuable than programmer time, you should forget about Beautiful Soup
+and work directly atop `lxml <http://lxml.de/>`_.
+
+That said, there are things you can do to speed up Beautiful Soup. If
+you're not using lxml as the underlying parser, my advice is to
+:ref:`start <parser-installation>`. Beautiful Soup parses documents
+significantly faster using lxml than using html.parser or html5lib.
+
Sometimes `Unicode, Dammit`_ can only detect the encoding of a file by
+doing a byte-by-byte examination of the file. This slows Beautiful
+Soup to a crawl. My tests indicate that this only happened on 2.x
+versions of Python, and that it happened most often with documents
+using Russian or Chinese encodings. If this is happening to you, you
+can fix it by using Python 3 for your script. Or, if you happen to
+know a document's encoding, you can pass it into the
+``BeautifulSoup`` constructor as ``from_encoding``.
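For instance, if you already know a document uses the KOI8-R encoding,
passing ``from_encoding`` skips encoding detection entirely (the byte
string below is a made-up sample; it spells "Привет"):

```python
from bs4 import BeautifulSoup

# Hypothetical sample data: the Russian word "Привет" in KOI8-R.
markup = b"<p>\xf0\xd2\xc9\xd7\xc5\xd4</p>"

# Declaring the encoding up front avoids the slow byte-by-byte
# detection step.
soup = BeautifulSoup(markup, "html.parser", from_encoding="koi8-r")
print(soup.p.string)
```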
+
`Parsing only part of a document`_ won't save you much time parsing
the document, but it can save a lot of memory, and it'll make
searching the document much faster.
+
+Beautiful Soup 3
+================
+
Beautiful Soup 3.2.0 is the last release of the Beautiful Soup 3
series, the old version of the library. It's currently the version
packaged with all major Linux distributions::
+
+ $ sudo apt-get install python-beautifulsoup
+ $ [XXX need syntax for RPM]
+
It's also published through PyPI as ``BeautifulSoup``::
+
+ $ easy_install beautifulsoup
+
+You can also `download a tarball of Beautiful Soup 3.2.0
+<http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.0.tar.gz>`_.
+
+If you ran ``easy_install beautifulsoup`` but your code doesn't work,
+you installed Beautiful Soup 3 by mistake. You need to run
+``easy_install beautifulsoup4``.
+
+`The documentation for Beautiful Soup 3 is archived online
+<http://www.crummy.com/software/BeautifulSoup/documentation.3.html>`_. If
+your first language is Chinese, it might be easier for you to read
+`the Chinese translation of the Beautiful Soup 3 documentation
+<http://www.crummy.com/software/BeautifulSoup/documentation.3.zh.html>`_,
+then read this document to find out about the changes made in
+Beautiful Soup 4.
+
+Porting code to BS4
+-------------------
+
+Most code written against Beautiful Soup 3 will work against Beautiful
+Soup 4 with one simple change. All you should have to do is change the
+package name from ``BeautifulSoup`` to ``bs4``. So this::
+
+ from BeautifulSoup import BeautifulSoup
+
+becomes this::
+
+ from bs4 import BeautifulSoup
+
+If you get the ``ImportError`` "No module named BeautifulSoup", your
+problem is that you're trying to run Beautiful Soup 3 code, but you
+only have Beautiful Soup 4 installed.
+
+If you get the ``ImportError`` "No module named bs4", your problem is
+that you're trying to run Beautiful Soup 4 code, but you only have
+Beautiful Soup 3 installed.
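If your code has to run in environments where either version might be
installed, one common pattern is a guarded import. This is a sketch,
and the fallback is only useful if your code sticks to the API the two
versions share:

```python
# Prefer Beautiful Soup 4, fall back to Beautiful Soup 3.
try:
    from bs4 import BeautifulSoup            # Beautiful Soup 4
except ImportError:
    from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3

# No parser argument, since the BS3 constructor doesn't accept one.
soup = BeautifulSoup("<p>Hello</p>")
print(soup.find("p").string)
# Hello
```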
+
+Although BS4 is almost entirely backwards-compatible with BS3, most of
+its methods have been deprecated and given new names for PEP 8
+compliance. There are numerous other renames and changes, a few of
+which break backwards compatibility.
+
+Here are the changes:
+
+Method names
+^^^^^^^^^^^^
+
+* ``replaceWith`` -> ``replace_with``
+* ``replaceWithChildren`` -> ``replace_with_children``
+* ``findAll`` -> ``find_all``
+* ``findAllNext`` -> ``find_all_next``
+* ``findAllPrevious`` -> ``find_all_previous``
+* ``findNext`` -> ``find_next``
+* ``findNextSibling`` -> ``find_next_sibling``
+* ``findNextSiblings`` -> ``find_next_siblings``
+* ``findParent`` -> ``find_parent``
+* ``findParents`` -> ``find_parents``
+* ``findPrevious`` -> ``find_previous``
+* ``findPreviousSibling`` -> ``find_previous_sibling``
+* ``findPreviousSiblings`` -> ``find_previous_siblings``
+* ``nextSibling`` -> ``next_sibling``
+* ``previousSibling`` -> ``previous_sibling``
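Because the old names survive as deprecated aliases, both spellings
give the same results. A quick sketch (the markup is made up):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a id="link1">Elsie</a><a id="link2">Lacie</a>', "html.parser")

# The PEP 8-compliant name...
new_style = soup.find_all("a")

# ...and the old BS3 name, which still works as a deprecated alias.
old_style = soup.findAll("a")

print(new_style == old_style)
# True
```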
+
+Some arguments to the Beautiful Soup constructor were renamed for the
+same reasons:
+
+* ``BeautifulSoup(parseOnlyThese=...)`` -> ``BeautifulSoup(parse_only=...)``
+* ``BeautifulSoup(fromEncoding=...)`` -> ``BeautifulSoup(from_encoding=...)``
+
+I renamed one method for compatibility with Python 3:
+
+* ``Tag.has_key()`` -> ``Tag.has_attr()``
+
+I renamed one attribute to use more accurate terminology:
+
+* ``Tag.isSelfClosing`` -> ``Tag.is_empty_element``
+
I renamed three attributes to avoid using words that have special
meaning to Python. Unlike the others, these changes are *not backwards
compatible.* If you used these attributes in BS3, your code will break
on BS4 until you change them.
+
+* ``UnicodeDammit.unicode`` -> ``UnicodeDammit.unicode_markup``
+* ``Tag.next`` -> ``Tag.next_element``
+* ``Tag.previous`` -> ``Tag.previous_element``
+
+Generators
+^^^^^^^^^^
+
+I gave the generators PEP 8-compliant names, and transformed them into
+properties:
+
+* ``childGenerator()`` -> ``children``
+* ``nextGenerator()`` -> ``next_elements``
+* ``nextSiblingGenerator()`` -> ``next_siblings``
+* ``previousGenerator()`` -> ``previous_elements``
+* ``previousSiblingGenerator()`` -> ``previous_siblings``
+* ``recursiveChildGenerator()`` -> ``descendants``
+* ``parentGenerator()`` -> ``parents``
+
+So instead of this::
+
+ for parent in tag.parentGenerator():
+ ...
+
+You can write this::
+
+ for parent in tag.parents:
+ ...
+
+(But the old code will still work.)
+
+Some of the generators used to yield None after they were done, and
+then stop. That was a bug. Now the generators just stop.
+
+There are two new generators, :ref:`.strings and
+.stripped_strings <string-generators>`. ``.strings`` yields
+NavigableString objects, and ``.stripped_strings`` yields Python
+strings that have had whitespace stripped.
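A small sketch of the difference between the two, using made-up
markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> One </p><p> Two </p>", "html.parser")

# .strings yields every NavigableString, surrounding whitespace
# included.
print(list(soup.strings))

# .stripped_strings strips each string and skips any that are
# entirely whitespace.
print(list(soup.stripped_strings))
# ['One', 'Two']
```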
+
+XML
+^^^
+
+There is no longer a ``BeautifulStoneSoup`` class for parsing XML. To
+parse XML you pass in "xml" as the second argument to the
+``BeautifulSoup`` constructor. For the same reason, the
+``BeautifulSoup`` constructor no longer recognizes the ``isHTML``
+argument.
+
+Beautiful Soup's handling of empty-element XML tags has been
+improved. Previously when you parsed XML you had to explicitly say
+which tags were considered empty-element tags. The ``selfClosingTags``
+argument to the constructor is no longer recognized. Instead,
+Beautiful Soup considers any empty tag to be an empty-element tag. If
+you add a child to an empty-element tag, it stops being an
+empty-element tag.
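A sketch of this behavior (the tag names are made up; parsing as XML
requires lxml):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<root><leaf/></root>", "xml")
leaf = soup.leaf

# An empty tag is considered an empty-element tag...
print(leaf.is_empty_element)
# True

# ...until you give it a child.
leaf.append("some text")
print(leaf.is_empty_element)
# False
```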
+
+Miscellaneous
+^^^^^^^^^^^^^
+
+:ref:`Tag.string <.string>` now operates recursively. If tag A
+contains a single tag B and nothing else, then A.string is the same as
+B.string. (Previously, it was None.)
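For example (the markup here is made up):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>text</b></a>", "html.parser")

# <a> contains a single tag <b> and nothing else, so a.string
# delegates to b.string instead of being None.
print(soup.a.string)
# text
```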
+
+An incoming HTML or XML entity is always converted into the
+corresponding Unicode character. The ``BeautifulSoup`` constructor no
+longer recognizes the ``smartQuotesTo`` or ``convertEntities``
+arguments. (`Unicode, Dammit`_ still has ``smart_quotes_to``, but its
+default is now to turn smart quotes into Unicode.)
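So entities in incoming markup simply come out as Unicode characters,
with no configuration needed. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>&ldquo;Dash it all!&rdquo;</p>", "html.parser")

# The &ldquo; and &rdquo; entities arrive as the Unicode curly-quote
# characters U+201C and U+201D.
print(soup.p.string)
```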
+
The ``BeautifulSoup`` constructor no longer recognizes the
``markupMassage`` argument. It's now the parser's responsibility to
handle markup correctly.