>>> soup.find(text="bad")
u'bad'
>>> soup.i
<i>HTML</i>
>>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
>>> print soup.prettify()
<?xml version="1.0" encoding="utf-8"?>
<tag1>
 Some
 <tag2 />
 bad
 <tag3>
  XML
 </tag3>
</tag1>
= About Beautiful Soup 4 =
This is a nearly-complete rewrite that removes Beautiful Soup's custom
HTML parser in favor of a system that lets you write a little glue
code and plug in any HTML or XML parser you want.
Beautiful Soup 4.0 comes with glue code for four parsers:
* Python's standard HTMLParser
* lxml's HTML and XML parsers
* html5lib's HTML parser
HTMLParser is the default, but I recommend you install one of the
other parsers, or you'll have problems handling real-world markup.
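To give a rough idea of what "glue code" means here, the sketch below
feeds events from Python's standard HTMLParser into a tiny tree. It is
a toy illustration only: Node, GlueParser and build_tree are
hypothetical names, not Beautiful Soup's actual tree-builder interface.

```python
from html.parser import HTMLParser


class Node:
    """A minimal tree node: a tag name, a parent, and a list of children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []  # child Nodes and plain strings, in document order


class GlueParser(HTMLParser):
    """Hypothetical glue: turns HTMLParser events into a Node tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Open a new node under the current one and descend into it.
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open tag with this name, if there is one.
        node = self.current
        while node is not self.root and node.name != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def build_tree(markup):
    parser = GlueParser()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return parser.root


tree = build_tree("<p>Some<b>bad<i>HTML")
p = tree.children[0]
b = p.children[1]
i = b.children[1]
print(p.name, b.name, i.name, i.children)
```

A real tree builder also has to deal with attributes, encodings and
error recovery, which is why the choice of underlying parser matters.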
== The module name has changed ==
Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':
>>> from bs4 import BeautifulSoup
== It works with Python 3 ==
Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.
Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.
== Better method names ==
Methods and attributes have been renamed to comply with PEP 8. The old names
still work. Here are the renames:
* replaceWith -> replace_with
* replaceWithChildren -> replace_with_children
* findAll -> find_all
* findAllNext -> find_all_next
* findAllPrevious -> find_all_previous
* findNext -> find_next
* findNextSibling -> find_next_sibling
* findNextSiblings -> find_next_siblings
* findParent -> find_parent
* findParents -> find_parents
* findPrevious -> find_previous
* findPreviousSibling -> find_previous_sibling
* findPreviousSiblings -> find_previous_siblings
* nextSibling -> next_sibling
* previousSibling -> previous_sibling
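The renames in the list above are mechanical: each camelCase name
becomes its lowercase, underscore-separated equivalent. As a rough
sketch of how a migration script might compute the new names, here is
a small helper. The function to_pep8 is a hypothetical name, not part
of Beautiful Soup, and it does not cover the other renames in this
section, such as has_key() or Tag.next, which do not follow this
pattern.

```python
import re

# Old name -> new name, copied from the list above.
RENAMES = {
    "replaceWith": "replace_with",
    "replaceWithChildren": "replace_with_children",
    "findAll": "find_all",
    "findAllNext": "find_all_next",
    "findAllPrevious": "find_all_previous",
    "findNext": "find_next",
    "findNextSibling": "find_next_sibling",
    "findNextSiblings": "find_next_siblings",
    "findParent": "find_parent",
    "findParents": "find_parents",
    "findPrevious": "find_previous",
    "findPreviousSibling": "find_previous_sibling",
    "findPreviousSiblings": "find_previous_siblings",
    "nextSibling": "next_sibling",
    "previousSibling": "previous_sibling",
}


def to_pep8(name):
    """Insert an underscore before each capital letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


# Check the rule against every rename in the list.
mismatches = [old for old, new in RENAMES.items() if to_pep8(old) != new]
print(mismatches)
```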
One method has been renamed for compatibility with Python 3:
* Tag.has_key() -> Tag.has_attr()
(The old name was misleading anyway, because has_key() looked at
a tag's attributes, while the "in" operator looked at a tag's contents.)
Some attributes have also been renamed, mostly to avoid using words
that have meaning to Python, like "unicode" and "next":
* Tag.isSelfClosing -> Tag.is_empty_element (backwards compatible)
* UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
(not backwards compatible)
* Tag.next -> Tag.next_element (not backwards compatible)
* Tag.previous -> Tag.previous_element (not backwards compatible)
So have some arguments to the Beautiful Soup constructor:
* BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
* BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
You can use the old names, but you'll get a deprecation warning.
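The general pattern of accepting an old keyword while flagging it can
be sketched with Python's standard warnings module. This is an
illustration under assumptions, not Beautiful Soup's constructor code:
normalize_arguments is a hypothetical helper, and the warning category
used here is a guess at what a library might choose.

```python
import warnings

# Old keyword names and their replacements, as listed above.
RENAMED_ARGUMENTS = {
    "parseOnlyThese": "parse_only",
    "fromEncoding": "from_encoding",
}


def normalize_arguments(**kwargs):
    """Hypothetical helper: map old keyword names to new ones, warning on each."""
    normalized = {}
    for name, value in kwargs.items():
        new_name = RENAMED_ARGUMENTS.get(name)
        if new_name is not None:
            warnings.warn(
                '"%s" has been renamed to "%s"' % (name, new_name),
                DeprecationWarning,
                stacklevel=2,
            )
            name = new_name
        normalized[name] = value
    return normalized


# Record warnings so the example can show that exactly one was raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    args = normalize_arguments(fromEncoding="utf-8")

print(args)
print(len(caught))
```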
== Generators are now properties ==
The generators have been given more sensible (and PEP 8-compliant)
names, and turned into properties:
* childGenerator() -> children
* nextGenerator() -> next_elements
* nextSiblingGenerator() -> next_siblings
* previousGenerator() -> previous_elements
* previousSiblingGenerator() -> previous_siblings
* recursiveChildGenerator() -> recursive_children
* parentGenerator() -> parents
So instead of this:
for parent in tag.parentGenerator():
    ...
You can write this:
for parent in tag.parents:
    ...
(But the old code will still work.)
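Here is a short example of a few of the new properties in use. It
assumes Python 3 and passes "html.parser" so that only the built-in
parser is needed. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# "html.parser" selects Python's built-in parser, so nothing else is needed.
markup = "<div><p><a>1</a><b>2</b><i>3</i></p></div>"
soup = BeautifulSoup(markup, "html.parser")

# These are properties, so there are no parentheses after the name.
parent_names = [parent.name for parent in soup.a.parents]
sibling_names = [sibling.name for sibling in soup.a.next_siblings]
child_names = [child.name for child in soup.p.children]

print(parent_names)   # ['p', 'div', '[document]']
print(sibling_names)  # ['b', 'i']
print(child_names)    # ['a', 'b', 'i']
```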
== tag.string is recursive ==
tag.string now operates recursively. If tag A contains a single tag B
and nothing else, then A.string is the same as B.string. So:
<a><b>foo</b></a>
The value of a.string used to be None, and now it's "foo".
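A quick way to see the rule, assuming Python 3 and the built-in
"html.parser". The second tag, <c>, is made up for contrast: it has
two children, so its .string is still None.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>foo</b></a><c>one<d>two</d></c>", "html.parser")

single_child = soup.a.string  # <a> contains only <b>, so this is b.string
two_children = soup.c.string  # <c> contains a string and a tag

print(single_child)  # foo
print(two_children)  # None
```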
== Empty-element tags ==
Beautiful Soup's handling of empty-element tags (aka self-closing
tags) has been improved, especially when parsing XML. Previously you
had to explicitly specify a list of empty-element tags when parsing
XML. You can still do that, but if you don't, Beautiful Soup now
considers any empty tag to be an empty-element tag.
The determination of empty-element-ness is now made at runtime rather
than parse time. If you add a child to an empty-element tag, it stops
being an empty-element tag.
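The runtime behavior can be seen with a small example. It assumes
Python 3 and uses the built-in "html.parser" with a <br/> tag, rather
than an XML document, so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one<br/>two</p>", "html.parser")
br = soup.br

before = br.is_empty_element  # the tag has no children yet
br.append("text")             # give the tag a child
after = br.is_empty_element   # evaluated again, now with a child

print(before, after)  # True False
```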
== Entities are always converted to Unicode ==
An HTML or XML entity is always converted into the corresponding
Unicode character. There are no longer any smartQuotesTo or
convertEntities arguments. (Unicode, Dammit still has smart_quotes_to,
but its default is now to turn smart quotes into Unicode.)
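For example, assuming Python 3 and the built-in "html.parser", named
entities in the markup come back as ordinary Unicode characters in the
parsed text. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>&ldquo;caf&eacute;&rdquo; &amp; more</p>", "html.parser")

# The entities have already been replaced by the characters they stand for.
text = soup.p.string
print(text)
```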
== CDATA sections are normal text, if they're understood at all. ==
Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:
<p><![CDATA[foo]]></p> => <p></p>
A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like