= Introduction =

  >>> from bs4 import BeautifulSoup
  >>> soup = BeautifulSoup("<p>Some<b>bad<i>HTML")
  >>> print soup.prettify()
  <html>
   <body>
    <p>
     Some
     <b>
      bad
      <i>
       HTML
      </i>
     </b>
    </p>
   </body>
  </html>
  >>> soup.find(text="bad")
  u'bad'

  >>> soup.i
  <i>HTML</i>

  >>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
  >>> print soup.prettify()
  <?xml version="1.0" encoding="utf-8"?>
  <tag1>
   Some
   <tag2/>
   bad
   <tag3>
    XML
   </tag3>
  </tag1>

= About Beautiful Soup 4 =

This is a nearly-complete rewrite that removes Beautiful Soup's custom
HTML parser in favor of a system that lets you write a little glue
code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install one of the
other parsers, or you'll have problems handling real-world markup.

== The module name has changed ==

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

  >>> from bs4 import BeautifulSoup

== It works with Python 3 ==

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

== Better method names ==

Methods and attributes have been renamed to comply with PEP 8. The old
names still work.
Here are the renames:

 * replaceWith -> replace_with
 * replaceWithChildren -> replace_with_children
 * findAll -> find_all
 * findAllNext -> find_all_next
 * findAllPrevious -> find_all_previous
 * findNext -> find_next
 * findNextSibling -> find_next_sibling
 * findNextSiblings -> find_next_siblings
 * findParent -> find_parent
 * findParents -> find_parents
 * findPrevious -> find_previous
 * findPreviousSibling -> find_previous_sibling
 * findPreviousSiblings -> find_previous_siblings
 * nextSibling -> next_sibling
 * previousSibling -> previous_sibling

One method has been renamed for compatibility with Python 3:

 * Tag.has_key() -> Tag.has_attr()

 (The old name was misleading anyway, because has_key() looked at a
 tag's attributes while __contains__ (the "in" operator) looked at a
 tag's contents.)

Some attributes have also been renamed, mostly to avoid using words
that have meaning to Python, like "unicode" and "next":

 * Tag.isSelfClosing -> Tag.is_empty_element (backwards compatible)
 * UnicodeDammit.unicode -> UnicodeDammit.unicode_markup (not backwards compatible)
 * Tag.next -> Tag.next_element (not backwards compatible)
 * Tag.previous -> Tag.previous_element (not backwards compatible)

So have some arguments to the BeautifulSoup constructor:

 * BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
 * BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)

You can still use the old argument names, but you'll get a deprecation
warning.

== Generators are now properties ==

The generators have been given more sensible (and PEP 8-compliant)
names, and turned into properties:

 * childGenerator() -> children
 * nextGenerator() -> next_elements
 * nextSiblingGenerator() -> next_siblings
 * previousGenerator() -> previous_elements
 * previousSiblingGenerator() -> previous_siblings
 * recursiveChildGenerator() -> recursive_children
 * parentGenerator() -> parents

So instead of this:

 for parent in tag.parentGenerator():
     ...

You can write this:

 for parent in tag.parents:
     ...

(But the old code will still work.)

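
As a rough illustration of the renamed methods and the generator
properties listed above, here is a minimal sketch. The sample markup
and the variable names are made up for this example, and it assumes
the "html.parser" tree builder (Python's built-in parser) so that no
extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, parsed with Python's built-in parser.
soup = BeautifulSoup("<div><p>Some <b>bold</b> text</p></div>", "html.parser")

# PEP 8 name: find_all() replaces findAll().
bold_tags = soup.find_all("b")
b = bold_tags[0]

# .parents is a property, not a method call. It yields each enclosing
# element, innermost first, ending with the BeautifulSoup object itself.
parent_names = [parent.name for parent in b.parents]

# .children yields only the direct children of a tag.
child_names = [child.name for child in soup.div.children]

# .next_siblings yields whatever follows the tag at the same level.
following = [str(sibling) for sibling in b.next_siblings]

print(parent_names)
print(child_names)
print(following)
```

Since the properties are generators, wrap them in list() (or use a
comprehension, as here) if you need to index into the results or
iterate over them more than once.
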
== tag.string is recursive ==

tag.string now operates recursively. If tag A contains a single tag B
and nothing else, then A.string is the same as B.string. So:

 <a><b>foo</b></a>

The value of a.string used to be None, and now it's "foo".

== Empty-element tags ==

Beautiful Soup's handling of empty-element tags (aka self-closing
tags) has been improved, especially when parsing XML. Previously you
had to explicitly specify a list of empty-element tags when parsing
XML. You can still do that, but if you don't, Beautiful Soup now
considers any empty tag to be an empty-element tag.

The determination of empty-element-ness is now made at runtime rather
than parse time. If you add a child to an empty-element tag, it stops
being an empty-element tag.

== Entities are always converted to Unicode ==

An HTML or XML entity is always converted into the corresponding
Unicode character. There are no longer any smartQuotesTo or
convertEntities arguments. (Unicode, Dammit still has smart_quotes_to,
but its default is now to turn smart quotes into Unicode.)

== CDATA sections are normal text, if they're understood at all. ==

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>
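
The XML case above can be checked with a short sketch like the
following. It assumes lxml may or may not be installed; the helper
name cdata_as_text is hypothetical and exists only for this example.
When lxml is missing, Beautiful Soup raises FeatureNotFound, and the
helper returns None rather than failing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<p><![CDATA[foo]]></p>"


def cdata_as_text(markup):
    """Parse markup with the "xml" tree builder and return the text of <p>.

    Returns None if no XML parser (lxml) is available.
    """
    try:
        soup = BeautifulSoup(markup, "xml")
    except FeatureNotFound:
        return None
    # The CDATA section should come back as ordinary text inside <p>.
    return soup.p.string


text = cdata_as_text(markup)
print(text)
```
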

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

== Miscellaneous other stuff ==

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

 <?xml version="1.0" encoding="utf-8"?>
 <markup>
  ...
 </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.

= Running the unit tests =

Here's how to run the tests on Python 2.7:

 $ cd bs4
 $ python2.7 -m unittest discover -s bs4

Here's how to do it with Python 3.2:

 $ ./convert-py3k
 $ cd py3k/bs4
 $ python3 -m unittest discover -s bs4

The script test-all-versions will run the tests twice, once on Python
2.7 and once on Python 3.