= 4.0 =

== Better method names ==

Methods have been renamed to comply with PEP 8. The old names still
work. Here are the renames:

 * findAll -> find_all
 * findAllNext -> find_all_next
 * findAllPrevious -> find_all_previous
 * findNext -> find_next
 * findNextSibling -> find_next_sibling
 * findNextSiblings -> find_next_siblings
 * findParent -> find_parent
 * findParents -> find_parents
 * findPrevious -> find_previous
 * findPreviousSibling -> find_previous_sibling
 * findPreviousSiblings -> find_previous_siblings

Some attributes have also been renamed:

 * Tag.isSelfClosing -> Tag.is_empty_element

So have some arguments to popular methods:

 * BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
 * BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
 * Tag.encode(prettyPrint=...) -> Tag.encode(pretty_print=...)

== Generators are now properties ==

The generators have been given more sensible (and PEP 8-compliant)
names, and turned into properties:

 * childGenerator() -> children
 * nextGenerator() -> next_elements
 * nextSiblingGenerator() -> next_siblings
 * previousGenerator() -> previous_elements
 * previousSiblingGenerator() -> previous_siblings
 * recursiveChildGenerator() -> recursive_children
 * parentGenerator() -> parents

So instead of this:

 for parent in tag.parentGenerator():
     ...

You can write this:

 for parent in tag.parents:
     ...

(But the old code will still work.)

== tag.string is recursive ==

tag.string now operates recursively. If tag A contains a single tag B
and nothing else, then A.string is the same as B.string. So:

 <a><b>foo</b></a>

The value of a.string used to be None, and now it's "foo".

== Empty-element tags ==

Beautiful Soup's handling of empty-element tags (aka self-closing
tags) has been improved, especially when parsing XML. Previously you
had to explicitly specify a list of empty-element tags when parsing
XML. You can still do that, but if you don't, Beautiful Soup now
considers any empty tag to be an empty-element tag.
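
The empty-element rule can be sketched in a few lines. This is only a
rough model, not Beautiful Soup's code: SketchTag, empty_element_tags
and render() are hypothetical names made up for the illustration. The
point is that emptiness is computed from the tag's current children
each time it is asked for.

```python
class SketchTag:
    """Hypothetical stand-in for a parsed tag (not Beautiful Soup's Tag)."""

    def __init__(self, name, empty_element_tags=None):
        self.name = name
        self.contents = []
        # None means "no explicit list of empty-element tags was given".
        self.empty_element_tags = empty_element_tags

    @property
    def is_empty_element(self):
        # Computed on every access, so it follows later changes to the tree.
        if self.contents:
            return False
        if self.empty_element_tags is None:
            # No explicit list: any tag without children counts as empty.
            return True
        return self.name in self.empty_element_tags

    def append(self, child):
        self.contents.append(child)

    def render(self):
        if self.is_empty_element:
            return "<%s/>" % self.name
        inner = "".join(str(child) for child in self.contents)
        return "<%s>%s</%s>" % (self.name, inner, self.name)


tag = SketchTag("item")
before = tag.render()   # no children yet
tag.append("text")
after = tag.render()    # a child was added, so the tag is no longer empty
print(before, after)
```

With an explicit list, a childless tag that is not on the list is
rendered with separate open and close tags instead.
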
The determination of empty-element-ness is now made at runtime rather
than parse time. If you add a child to an empty-element tag, it stops
being an empty-element tag.

== Entities are always converted to Unicode ==

An HTML or XML entity is always converted into the corresponding
Unicode character. There are no longer any smartQuotesTo or
convert_entities arguments. (Unicode Dammit still has smart_quotes_to,
but the default is now to turn smart quotes into Unicode.)

== CDATA sections are normal text, if they're understood at all. ==

Currently, both HTML parsers ignore CDATA sections in markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>
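
The same kind of flattening can be seen outside Beautiful Soup. As a
rough illustration, assuming Python 3 and the standard library's
xml.etree.ElementTree (which is not the lxml-backed parser described
above), a CDATA section comes back as plain text and is not restored
when the tree is written out again:

```python
import xml.etree.ElementTree as ET

# Parse a document whose only content is a CDATA section.
root = ET.fromstring("<p><![CDATA[foo]]></p>")

# The CDATA wrapper is gone; only its text remains.
text = root.text

# Serializing the tree again produces ordinary character data.
serialized = ET.tostring(root, encoding="unicode")
print(text, serialized)
```
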

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

== Miscellaneous other stuff ==

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

 ...

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.

= 3.1.0 =

A hybrid version that supports Python 2.4 and can be automatically
converted to run under Python 3.0. There are three
backwards-incompatible changes you should be aware of, but no new
features or deliberate behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a byte string and decode() to get
Unicode, and you'll be ready (well, readier) for Python 3.

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

 baz ', '">

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse a string with &eacute; in an attribute value and
leave the &eacute; intact. In Python 3, the &eacute; is always
converted to \xe9 during parsing.

= 3.0.7a =

Added an import that makes BS work in Python 2.3.

= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside
 and