Diffstat (limited to 'CHANGELOG')
-rw-r--r-- | CHANGELOG | 74 |
1 files changed, 72 insertions, 2 deletions
@@ -26,9 +26,79 @@
 Added PageElement.insert_after().
 Raise an exception when the user tries to do something nonsensical like insert a tag into itself.
-= 4.0 =
+= 4.0.0b3 =
+
+Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
+Soup's custom HTML parser in favor of a system that lets you write a
+little glue code and plug in any HTML or XML parser you want.
+
+Beautiful Soup 4.0 comes with glue code for four parsers:
+
+ * Python's standard HTMLParser (html.parser in Python 3)
+ * lxml's HTML and XML parsers
+ * html5lib's HTML parser
+
+HTMLParser is the default, but I recommend you install lxml if you
+can.
+
+For complete documentation, see the Sphinx documentation in
+bs4/doc/source/. What follows is a summary of the changes from
+Beautiful Soup 3.
+
+=== The module name has changed ===
+
+Previously you imported the BeautifulSoup class from a module also
+called BeautifulSoup. To save keystrokes and make it clear which
+version of the API is in use, the module is now called 'bs4':
+
+ >>> from bs4 import BeautifulSoup
+
+=== It works with Python 3 ===
+
+Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
+so bad that it barely worked at all. Beautiful Soup 4 works with
+Python 3, and since its parser is pluggable, you don't sacrifice
+quality.
+
+Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
+support to the finish line. Ezio Melotti is also to thank for greatly
+improving the HTML parser that comes with Python 3.2.
+
+=== CDATA sections are normal text, if they're understood at all. ===
+
+Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
+markup:
+
+ <p><![CDATA[foo]]></p> => <p></p>
+
+A future version of html5lib will turn CDATA sections into text nodes,
+but only within tags like <svg> and <math>:
+
+ <svg><![CDATA[foo]]></svg> => <svg>foo</svg>
+
+The default XML parser (which uses lxml behind the scenes) turns CDATA
+sections into ordinary text elements:
+
+ <p><![CDATA[foo]]></p> => <p>foo</p>
+
+In theory it's possible to preserve the CDATA sections when using the
+XML parser, but I don't see how to get it to work in practice.
+
+=== Miscellaneous other stuff ===
+
+If the BeautifulSoup instance has .is_xml set to True, an appropriate
+XML declaration will be emitted when the tree is transformed into a
+string:
+
+ <?xml version="1.0" encoding="utf-8"?>
+ <markup>
+ ...
+ </markup>
+
+The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
+builders set it to False. If you want to parse XHTML with an HTML
+parser, you can set it manually.
-Nearly complete rewrite
 = 3.2.0 =
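
As a rough illustration of the CDATA behavior the changelog describes for XML parsing, the sketch below parses the same markup and checks that the CDATA content comes back as ordinary text. It is an assumption-laden stand-in: it uses Python's standard xml.etree.ElementTree, not Beautiful Soup or lxml, so it only shows that a typical XML parser drops the CDATA wrapper. It says nothing about how the bs4 tree builders are implemented.

```python
# Sketch only: xml.etree.ElementTree is a stand-in XML parser here;
# this is not Beautiful Soup's or lxml's code path.
import xml.etree.ElementTree as ET

markup = "<p><![CDATA[foo]]></p>"
element = ET.fromstring(markup)

# The CDATA wrapper is not preserved; its content is plain element text.
text = element.text

# Serializing the element again yields ordinary text, with no CDATA section.
serialized = ET.tostring(element, encoding="unicode")

print(text)
print(serialized)
```

The round trip loses the CDATA section, which matches the changelog's remark that preserving CDATA through the XML parser is hard in practice.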