>>> soup.find(text="bad")
u'bad'
>>> soup.i
<i>HTML</i>
>>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
>>> print soup.prettify()
<?xml version="1.0" encoding="utf-8"?>
<tag1>
 Some
 <tag2/>
 bad
 <tag3>
  XML
 </tag3>
</tag1>
= About Beautiful Soup 4 =
Beautiful Soup 4 is a nearly complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.
Beautiful Soup 4.0 comes with glue code for four parsers:
* Python's standard HTMLParser
* lxml's HTML and XML parsers
* html5lib's HTML parser
HTMLParser is the default, but I recommend you install one of the
other parsers, or you'll have problems handling real-world markup.
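One way to act on that recommendation is to use the best parser that happens to be installed. This is a minimal sketch, assuming Python 3 and the parser names "lxml", "html5lib", and "html.parser" as values for the second argument of the BeautifulSoup constructor. The helper pick_parser and its preference order are hypothetical, not part of Beautiful Soup.

```python
import importlib.util


def pick_parser():
    """Return the name of the preferred HTML parser that is installed."""
    # The preference order below is an assumption of this sketch.
    for module_name in ("lxml", "html5lib"):
        if importlib.util.find_spec(module_name) is not None:
            return module_name
    # Fall back to the parser that ships with Python.
    return "html.parser"


parser_name = pick_parser()
```

The result can then be passed along when building the soup, for example BeautifulSoup(markup, parser_name).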
== The module name has changed ==
Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':
>>> from bs4 import BeautifulSoup
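Code that has to run against either version can try the new module name first and fall back to the old one. This is a sketch only; the name BS_MAJOR_VERSION is hypothetical and exists just to record which import succeeded.

```python
try:
    # Beautiful Soup 4: the module is called bs4.
    from bs4 import BeautifulSoup
    BS_MAJOR_VERSION = 4
except ImportError:
    try:
        # Beautiful Soup 3: the module shares the class name.
        from BeautifulSoup import BeautifulSoup
        BS_MAJOR_VERSION = 3
    except ImportError:
        # Neither version is installed.
        BeautifulSoup = None
        BS_MAJOR_VERSION = None
```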
== It works with Python 3 ==
Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.
Special thanks to Thomas Kluyver for getting Python 3 support to the
finish line.
== Better method names ==
Methods and attributes have been renamed to comply with PEP 8. The old names
still work. Here are the renames:
* replaceWith -> replace_with
* replaceWithChildren -> replace_with_children
* findAll -> find_all
* findAllNext -> find_all_next
* findAllPrevious -> find_all_previous
* findNext -> find_next
* findNextSibling -> find_next_sibling
* findNextSiblings -> find_next_siblings
* findParent -> find_parent
* findParents -> find_parents
* findPrevious -> find_previous
* findPreviousSibling -> find_previous_sibling
* findPreviousSiblings -> find_previous_siblings
* nextSibling -> next_sibling
* previousSibling -> previous_sibling
One method has been renamed for compatibility with Python 3:
* Tag.has_key() -> Tag.has_attr()
(The old name was misleading anyway, because has_key() looked at
a tag's attributes while the "in" operator (__contains__) looked at
a tag's contents.)
Some attributes have also been renamed:
* Tag.isSelfClosing -> Tag.is_empty_element
* UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
* Tag.next -> Tag.next_element
* Tag.previous -> Tag.previous_element
So have some arguments to popular methods:
* BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
* BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
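To illustrate a few of the renames above, here is a short sketch that uses only the new names. The markup is made up for the example, and the "html.parser" argument assumes the standard-library parser is selected explicitly.

```python
from bs4 import BeautifulSoup

markup = '<p><a href="/one">one</a><a>two</a></p>'
soup = BeautifulSoup(markup, "html.parser")

links = soup.find_all("a")                       # formerly findAll
first = links[0]
second = first.next_sibling                      # formerly nextSibling
has_href = [a.has_attr("href") for a in links]   # formerly has_key
following = first.next_element                   # formerly Tag.next
```

Here second is the other a tag, has_href records which links carry an href attribute, and following is the string inside the first link.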
== Generators are now properties ==
The generators have been given more sensible (and PEP 8-compliant)
names, and turned into properties:
* childGenerator() -> children
* nextGenerator() -> next_elements
* nextSiblingGenerator() -> next_siblings
* previousGenerator() -> previous_elements
* previousSiblingGenerator() -> previous_siblings
* recursiveChildGenerator() -> descendants
* parentGenerator() -> parents
So instead of this:
for parent in tag.parentGenerator():
    ...
You can write this:
for parent in tag.parents:
    ...
(But the old code will still work.)
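A slightly fuller sketch of the property style, using made-up markup and the standard-library parser. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

markup = "<html><body><p><b>text</b></p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# Walk upward from the <b> tag through each enclosing element.
ancestor_names = [parent.name for parent in soup.b.parents]

# Iterate over the direct children of <body>.
child_names = [child.name for child in soup.body.children]
```

The first three entries of ancestor_names are the enclosing p, body, and html tags, in that order.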
== tag.string is recursive ==
tag.string now operates recursively. If tag A contains a single tag B
and nothing else, then A.string is the same as B.string. So:
<a><b>foo</b></a>
The value of a.string used to be None, and now it's "foo".
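The recursive behavior can be checked with a short sketch. The markup is invented for the example, and the second case is included to show that the recursion only applies when a tag has exactly one child.

```python
from bs4 import BeautifulSoup

# <a> contains a single <b> and nothing else, so .string recurses.
nested = BeautifulSoup("<a><b>foo</b></a>", "html.parser")
outer_string = nested.a.string

# <a> contains a tag and a string, so .string has no single answer.
mixed = BeautifulSoup("<a><b>foo</b>bar</a>", "html.parser")
mixed_string = mixed.a.string
```

In the first case outer_string is "foo". In the second case mixed_string is None.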
== Empty-element tags ==
Beautiful Soup's handling of empty-element tags (aka self-closing
tags) has been improved, especially when parsing XML. Previously you
had to explicitly specify a list of empty-element tags when parsing
XML. You can still do that, but if you don't, Beautiful Soup now
considers any empty tag to be an empty-element tag.
The determination of empty-element-ness is now made at runtime rather
than parse time. If you add a child to an empty-element tag, it stops
being an empty-element tag.
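The runtime check can be observed through the is_empty_element attribute. A minimal sketch, assuming the standard-library HTML parser and made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>line<br/>break</p>", "html.parser")
br = soup.br

# With no children, <br> counts as an empty-element tag.
empty_before = br.is_empty_element

# Adding a child changes the answer the next time it is asked.
br.append("child")
empty_after = br.is_empty_element
```

empty_before is True and empty_after is False, because the check looks at the tag's current contents rather than at a decision recorded during parsing.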
== Entities are always converted to Unicode ==
An HTML or XML entity is always converted into the corresponding
Unicode character. There are no longer any smartQuotesTo or
convertEntities arguments. (Unicode, Dammit still has smart_quotes_to,
but its default is now to turn smart quotes into Unicode.)
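As a quick illustration of entity conversion, the sketch below parses a made-up fragment that mixes named and numeric entities, again assuming the standard-library parser:

```python
from bs4 import BeautifulSoup

markup = "<p>caf&eacute; &amp; cr&#232;me</p>"
soup = BeautifulSoup(markup, "html.parser")

# The string holds real Unicode characters, not entity references.
text = soup.p.string
```

The value of text contains the characters themselves (e with an acute accent, an ampersand, e with a grave accent) rather than the entity references.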
== CDATA sections are normal text, if they're understood at all. ==
Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:
<p><![CDATA[foo]]></p> => <p></p>
A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like