Diffstat (limited to 'CHANGELOG')
-rw-r--r-- | CHANGELOG | 155 |
1 files changed, 1 insertions, 154 deletions
@@ -1,159 +1,6 @@
 = 4.0 =
 
-This is a nearly-complete rewrite that removes Beautiful Soup's custom
-HTML parser in favor of a system that lets you write a little glue
-code and plug in whatever HTML or XML parser you want.
-
-Beautiful Soup 4.0 comes with glue code for four parsers: an Python's
-HTMLParser, lxml's HTML and XML parsers, and html5lib's HTML
-parser. HTMLParser is the default, but I recommend you install one of
-the other parsers, or you'll have problems handling real-world HTML.
-
-== The module name has changed ==
-
-Previously you imported the BeautifulSoup class from a module also
-called BeautifulSoup. To save keystrokes and make it clear which
-version of the API is in use, the module is now called 'bs4':
-
- >>> from bs4 import BeautifulSoup
-
-== It works with Python 3 ==
-
-Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
-so bad that it barely worked at all. Beautiful Soup 4 works with
-Python 3, and since its parser is pluggable, you don't sacrifice
-quality.
-
-Special thanks to Thomas Kluyver for getting Python 3 support to the
-finish line.
-
-== Better method names ==
-
-Methods and attributes have been renamed to comply with PEP 8. The old names
-still work. Here are the renames:
-
- * replaceWith -> replace_with
- * replaceWithChildren -> replace_with_children
- * findAll -> find_all
- * findAllNext -> find_all_next
- * findAllPrevious -> find_all_previous
- * findNext -> find_next
- * findNextSibling -> find_next_sibling
- * findNextSiblings -> find_next_siblings
- * findParent -> find_parent
- * findParents -> find_parents
- * findPrevious -> find_previous
- * findPreviousSibling -> find_previous_sibling
- * findPreviousSiblings -> find_previous_siblings
- * nextSibling -> next_sibling
- * previousSibling -> previous_sibling
-
-Methods have been renamed for compatibility with Python 3.
-
- * Tag.has_key() -> Tag.has_attr()
-
- (This was misleading, anyway, because has_key() looked at
- a tag's attributes and __in__ looked at a tag's contents.)
-
-Some attributes have also been renamed:
-
- * Tag.isSelfClosing -> Tag.is_empty_element
- * UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
- * Tag.next -> Tag.next_element
- * Tag.previous -> Tag.previous_element
-
-So have some arguments to popular methods:
-
- * BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
- * BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
-
-== Generators are now properties ==
-
-The generators have been given more sensible (and PEP 8-compliant)
-names, and turned into properties:
-
- * childGenerator() -> children
- * nextGenerator() -> next_elements
- * nextSiblingGenerator() -> next_siblings
- * previousGenerator() -> previous_elements
- * previousSiblingGenerator() -> previous_siblings
- * recursiveChildGenerator() -> recursive_children
- * parentGenerator() -> parents
-
-So instead of this:
-
- for parent in tag.parentGenerator():
-     ...
-
-You can write this:
-
- for parent in tag.parents:
-     ...
-
-(But the old code will still work.)
-
-== tag.string is recursive ==
-
-tag.string now operates recursively. If tag A contains a single tag B
-and nothing else, then A.string is the same as B.string. So:
-
-<a><b>foo</b></a>
-
-The value of a.string used to be None, and now it's "foo".
-
-== Empty-element tags ==
-
-Beautiful Soup's handling of empty-element tags (aka self-closing
-tags) has been improved, especially when parsing XML. Previously you
-had to explicitly specify a list of empty-element tags when parsing
-XML. You can still do that, but if you don't, Beautiful Soup now
-considers any empty tag to be an empty-element tag.
-
-The determination of empty-element-ness is now made at runtime rather
-than parse time. If you add a child to an empty-element tag, it stops
-being an empty-element tag.
-
-== Entities are always converted to Unicode ==
-
-An HTML or XML entity is always converted into the corresponding
-Unicode character. There are no longer any smartQuotesTo or
-convertEntities arguments. (Unicode, Dammit still has smart_quotes_to,
-but its default is now to turn smart quotes into Unicode.)
-
-== CDATA sections are normal text, if they're understood at all. ==
-
-Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
-markup:
-
- <p><![CDATA[foo]]></p> => <p></p>
-
-A future version of html5lib will turn CDATA sections into text nodes,
-but only within tags like <svg> and <math>:
-
- <svg><![CDATA[foo]]></svg> => <p>foo</p>
-
-The default XML parser (which uses lxml behind the scenes) turns CDATA
-sections into ordinary text elements:
-
- <p><![CDATA[foo]]></p> => <p>foo</p>
-
-In theory it's possible to preserve the CDATA sections when using the
-XML parser, but I don't see how to get it to work in practice.
-
-== Miscellaneous other stuff ==
-
-If the BeautifulSoup instance has .is_xml set to True, an appropriate
-XML declaration will be emitted when the tree is transformed into a
-string:
-
- <?xml version="1.0" encoding="utf-8">
- <markup>
-  ...
- </markup>
-
-The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
-builders set it to False. If you want to parse XHTML with an HTML
-parser, you can set it manually.
+See README.TXT.
 
 = 3.2.0 =
 
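
The method renames listed in the removed "Better method names" section appear to follow a mechanical camelCase-to-snake_case pattern. The following is a minimal sketch that checks that observation using only the standard library. The names camel_to_snake and RENAMES are hypothetical, introduced here for illustration; they are not part of Beautiful Soup.

```python
import re


def camel_to_snake(name: str) -> str:
    """Hypothetical helper (not a Beautiful Soup API): put '_' before
    every interior capital letter, then lowercase the whole name."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


# Old -> new names copied from the removed changelog text.
RENAMES = {
    "replaceWith": "replace_with",
    "replaceWithChildren": "replace_with_children",
    "findAll": "find_all",
    "findAllNext": "find_all_next",
    "findAllPrevious": "find_all_previous",
    "findNext": "find_next",
    "findNextSibling": "find_next_sibling",
    "findNextSiblings": "find_next_siblings",
    "findParent": "find_parent",
    "findParents": "find_parents",
    "findPrevious": "find_previous",
    "findPreviousSibling": "find_previous_sibling",
    "findPreviousSiblings": "find_previous_siblings",
    "nextSibling": "next_sibling",
    "previousSibling": "previous_sibling",
}

# Collect any rename that the mechanical rule does not reproduce.
mismatches = {
    old: new for old, new in RENAMES.items() if camel_to_snake(old) != new
}

# Attribute renames such as isSelfClosing -> is_empty_element are not
# mechanical: the rule alone would produce a different name.
mechanical_guess = camel_to_snake("isSelfClosing")
```

The mechanical rule is only a convenience for the PEP 8 method renames. Renames such as Tag.has_key() -> Tag.has_attr() or parseOnlyThese -> parse_only change the wording as well as the case, so they still need to be looked up in the list.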