field | value | date |
---|---|---|
author | Leonard Richardson <leonard.richardson@canonical.com> | 2011-02-27 20:21:18 -0500 |
committer | Leonard Richardson <leonard.richardson@canonical.com> | 2011-02-27 20:21:18 -0500 |
commit | 57c9ba5583abb7209a1613a54c5c55f7c779a88f | |
tree | 466ab17969cb3cd40b2359e24535d2929bb898a9 | |
parent | 63dc8117e8f396b25688623c7f1920b4f0911373 | |
Prep for an alpha release.
-rw-r--r-- | CHANGELOG | 25 |
-rw-r--r-- | README.txt | 5 |
2 files changed, 24 insertions, 6 deletions
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,5 +1,22 @@
 = 4.0 =
 
+This is a nearly-complete rewrite that removes Beautiful Soup's custom
+HTML parser in favor of a system that lets you write a little glue
+code and plug in whatever HTML or XML parser you want.
+
+Beautiful Soup 4.0 comes with glue code for four parsers: an Python's
+HTMLParser, lxml's HTML and XML parsers, and html5lib's HTML
+parser. HTMLParser is the default, but I recommend you install one of
+the other parsers, or you'll have problems handling real-world HTML.
+
+== The module name has changed ==
+
+Previously you imported the BeautifulSoup class from a module also
+called BeautifulSoup. To save keystrokes and make it clear which
+version of the API is in use, the module is now called 'bs4':
+
+ >>> from bs4 import BeautifulSoup
+
 == Better method names ==
 
 Methods have been renamed to comply with PEP 8. The old names still
@@ -25,7 +42,6 @@ So have some arguments to popular methods:
 
  * BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
  * BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
- * Tag.encode(prettyPrint=...) -> Tag.encode(pretty_print=...)
 
 == Generators are now properties ==
 
@@ -77,12 +93,13 @@ being an empty-element tag.
 
 An HTML or XML entity is always converted into the corresponding
 Unicode character. There are no longer any smartQuotesTo or
-convert_entities arguments. (Unicode Dammit still has smart_quotes_to,
-but the default is now to turn smart quotes into Unicode.)
+convertEntities arguments. (Unicode, Dammit still has smart_quotes_to,
+but its default is now to turn smart quotes into Unicode.)
 
 == CDATA sections are normal text, if they're understood at all. ==
 
-Currently, both HTML parsers ignore CDATA sections in markup:
+Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
+markup:
 
  <p><![CDATA[foo]]></p> => <p></p>
 
--- a/README.txt
+++ b/README.txt
@@ -1,8 +1,9 @@
 = About Beautiful Soup 4 =
 
 Earlier versions of Beautiful Soup included a custom HTML
-parser. Beautiful Soup 4 does not include a parser. You'll need to
-install either lxml or html5lib.
+parser. Beautiful Soup 4 uses Python's default HTMLParser, which does
+fairly poorly on real-world HTML. By installing lxml or html5lib you
+can get more accurate parsing and possibly better performance as well.
 
 = Introduction =
 
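
The module rename described in the CHANGELOG hunk above can be absorbed in calling code with an import fallback. What follows is a minimal sketch and not part of this commit: the helper name `load_beautiful_soup` is hypothetical, and it assumes only what the changelog states, namely that the class is called `BeautifulSoup` and lives in a module named `bs4` (new) or `BeautifulSoup` (old).

```python
import importlib


def load_beautiful_soup(module_names=("bs4", "BeautifulSoup")):
    """Return (module_name, BeautifulSoup class) for the first usable module.

    Hypothetical helper: tries the new 'bs4' module name first, then the
    older 'BeautifulSoup' module. Returns (None, None) if no candidate
    can be imported or none exposes a BeautifulSoup attribute.
    """
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Module is not installed; try the next candidate name.
            continue
        cls = getattr(module, "BeautifulSoup", None)
        if cls is not None:
            return name, cls
    return None, None


if __name__ == "__main__":
    name, BeautifulSoup = load_beautiful_soup()
    if BeautifulSoup is None:
        print("Beautiful Soup is not installed")
    else:
        print("Using the BeautifulSoup class from module %r" % name)
```

Returning the module name alongside the class lets the caller tell which API version it got, which matters because the renamed keyword arguments (`parse_only`, `from_encoding`) listed in the changelog apply to the `bs4` version.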