Diffstat (limited to 'README.txt')

-rw-r--r--  README.txt | 178

1 file changed, 171 insertions, 7 deletions
@@ -1,10 +1,3 @@
-= About Beautiful Soup 4 =
-
-Earlier versions of Beautiful Soup included a custom HTML
-parser. Beautiful Soup 4 uses Python's default HTMLParser, which does
-fairly poorly on real-world HTML. By installing lxml or html5lib you
-can get more accurate parsing and possibly better performance as well.
-
 = Introduction =
 
 >>> from bs4 import BeautifulSoup
@@ -29,4 +22,175 @@ can get more accurate parsing and possibly better performance as well.
 >>> soup.i
 <i>HTML</i>
 
+>>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
+>>> print soup.prettify()
+<?xml version="1.0" encoding="utf-8"?>
+<tag1>
+ Some
+ <tag2 />
+ bad
+ <tag3>
+  XML
+ </tag3>
+</tag1>
+
+= About Beautiful Soup 4 =
+
+This is a nearly-complete rewrite that removes Beautiful Soup's custom
+HTML parser in favor of a system that lets you write a little glue
+code and plug in any HTML or XML parser you want.
+
+Beautiful Soup 4.0 comes with glue code for four parsers:
+
+ * Python's standard HTMLParser
+ * lxml's HTML and XML parsers
+ * html5lib's HTML parser
+
+HTMLParser is the default, but I recommend you install one of the
+other parsers, or you'll have problems handling real-world markup.
+
+== The module name has changed ==
+
+Previously you imported the BeautifulSoup class from a module also
+called BeautifulSoup. To save keystrokes and make it clear which
+version of the API is in use, the module is now called 'bs4':
+
+>>> from bs4 import BeautifulSoup
+
+== It works with Python 3 ==
+
+Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
+so bad that it barely worked at all. Beautiful Soup 4 works with
+Python 3, and since its parser is pluggable, you don't sacrifice
+quality.
+
+Special thanks to Thomas Kluyver for getting Python 3 support to the
+finish line.
+
+== Better method names ==
+
+Methods and attributes have been renamed to comply with PEP 8. The
+old names still work.
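For example, the old camelCase spellings and the new PEP 8 spellings can be used interchangeably. A minimal sketch, assuming bs4 is installed and using Python's built-in html.parser builder; the markup here is invented:

```python
from bs4 import BeautifulSoup

# A throwaway document, just for illustration.
soup = BeautifulSoup("<p>One</p><p>Two</p>", "html.parser")

# The new, PEP 8-compliant name...
new_style = soup.find_all("p")

# ...and the old camelCase name, which still works.
old_style = soup.findAll("p")

# Both spellings find the same tags.
assert new_style == old_style
assert [p.string for p in new_style] == ["One", "Two"]
```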
+Here are the renames:
+
+ * replaceWith -> replace_with
+ * replaceWithChildren -> replace_with_children
+ * findAll -> find_all
+ * findAllNext -> find_all_next
+ * findAllPrevious -> find_all_previous
+ * findNext -> find_next
+ * findNextSibling -> find_next_sibling
+ * findNextSiblings -> find_next_siblings
+ * findParent -> find_parent
+ * findParents -> find_parents
+ * findPrevious -> find_previous
+ * findPreviousSibling -> find_previous_sibling
+ * findPreviousSiblings -> find_previous_siblings
+ * nextSibling -> next_sibling
+ * previousSibling -> previous_sibling
+
+Methods have been renamed for compatibility with Python 3.
+
+ * Tag.has_key() -> Tag.has_attr()
+
+   (This was misleading, anyway, because has_key() looked at a tag's
+   attributes while the 'in' operator looked at a tag's contents.)
+
+Some attributes have also been renamed:
+
+ * Tag.isSelfClosing -> Tag.is_empty_element
+ * UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
+ * Tag.next -> Tag.next_element
+ * Tag.previous -> Tag.previous_element
+
+So have some arguments to popular methods:
+
+ * BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
+ * BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
+
+== Generators are now properties ==
+
+The generators have been given more sensible (and PEP 8-compliant)
+names, and turned into properties:
+
+ * childGenerator() -> children
+ * nextGenerator() -> next_elements
+ * nextSiblingGenerator() -> next_siblings
+ * previousGenerator() -> previous_elements
+ * previousSiblingGenerator() -> previous_siblings
+ * recursiveChildGenerator() -> recursive_children
+ * parentGenerator() -> parents
+
+So instead of this:
+
+ for parent in tag.parentGenerator():
+     ...
+
+You can write this:
+
+ for parent in tag.parents:
+     ...
+
+(But the old code will still work.)
+
+== tag.string is recursive ==
+
+tag.string now operates recursively. If tag A contains a single tag B
+and nothing else, then A.string is the same as B.string.
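A quick sketch of that behavior, assuming bs4 is installed, with the stdlib html.parser builder:

```python
from bs4 import BeautifulSoup

# <a> contains a single <b> and nothing else, so a.string
# recurses into <b> instead of returning None.
soup = BeautifulSoup("<a><b>foo</b></a>", "html.parser")
assert soup.a.string == "foo"

# With more than one child, .string is still None.
soup2 = BeautifulSoup("<a><b>foo</b><b>bar</b></a>", "html.parser")
assert soup2.a.string is None
```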
+So, given:
+
+ <a><b>foo</b></a>
+
+The value of a.string used to be None, and now it's "foo".
+
+== Empty-element tags ==
+
+Beautiful Soup's handling of empty-element tags (aka self-closing
+tags) has been improved, especially when parsing XML. Previously you
+had to explicitly specify a list of empty-element tags when parsing
+XML. You can still do that, but if you don't, Beautiful Soup now
+considers any empty tag to be an empty-element tag.
+
+The determination of empty-element-ness is now made at runtime rather
+than at parse time. If you add a child to an empty-element tag, it
+stops being an empty-element tag.
+
+== Entities are always converted to Unicode ==
+
+An HTML or XML entity is always converted into the corresponding
+Unicode character. There are no longer any smartQuotesTo or
+convertEntities arguments. (Unicode, Dammit still has smart_quotes_to,
+but its default is now to turn smart quotes into Unicode.)
+
+== CDATA sections are normal text, if they're understood at all. ==
+
+Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
+markup:
+
+ <p><![CDATA[foo]]></p> => <p></p>
+
+A future version of html5lib will turn CDATA sections into text nodes,
+but only within tags like <svg> and <math>:
+
+ <svg><![CDATA[foo]]></svg> => <svg>foo</svg>
+
+The default XML parser (which uses lxml behind the scenes) turns CDATA
+sections into ordinary text elements:
+
+ <p><![CDATA[foo]]></p> => <p>foo</p>
+
+In theory it's possible to preserve the CDATA sections when using the
+XML parser, but I don't see how to get it to work in practice.
+
+== Miscellaneous other stuff ==
+
+If the BeautifulSoup instance has .is_xml set to True, an appropriate
+XML declaration will be emitted when the tree is transformed into a
+string:
+
+ <?xml version="1.0" encoding="utf-8"?>
+ <markup>
+  ...
+ </markup>
+
+The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
+builders set it to False.
+If you want to parse XHTML with an HTML parser, you can set it
+manually.
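To illustrate, a minimal sketch of toggling .is_xml by hand, assuming bs4 is installed and using the stdlib html.parser builder; the markup is invented:

```python
from bs4 import BeautifulSoup

# Parse some XHTML-ish markup with an HTML tree builder.
soup = BeautifulSoup("<html><body><p>Hello</p></body></html>", "html.parser")

# HTML builders leave .is_xml False, so no XML declaration is emitted.
assert not soup.is_xml
assert not str(soup).startswith("<?xml")

# Flip it on manually, and the output gains an XML declaration.
soup.is_xml = True
assert str(soup).startswith('<?xml version="1.0"')
```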