= 4.0.0b7 () =

* Upon decoding to string, any characters that can't be represented in
  your chosen encoding will be converted into numeric XML entity
  references.

* Issue a warning if characters were replaced with REPLACEMENT
  CHARACTER during Unicode conversion.

* Restored compatibility with Python 2.6.

= 4.0.0b6 (20120216) =

* Multi-valued attributes like "class" always have a list of values,
  even if there's only one value in the list.

* Added a number of multi-valued attributes defined in HTML5.

* Stopped generating a space before the slash that closes an
  empty-element tag. This may come back if I add a special XHTML mode
  (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
  useless.

* Passing text along with tag-specific arguments to a find* method,
  like find("a", text="Click here"), will find tags that contain the
  given text as their .string. Previously, the tag-specific arguments
  were ignored and only strings were searched.

* Fixed a bug that caused the html5lib tree builder to build a
  partially disconnected tree. Generally cleaned up the html5lib tree
  builder.

* If you restrict a multi-valued attribute like "class" to a string
  that contains spaces, Beautiful Soup will only consider it a match
  if the values correspond to that specific string.

= 4.0.0b5 (20120209) =

* Rationalized Beautiful Soup's treatment of CSS class. A tag
  belonging to multiple CSS classes is treated as having a list of
  values for the 'class' attribute. Searching for a CSS class will
  match *any* of the CSS classes. This actually affects all attributes
  that the HTML standard defines as taking multiple values (class,
  rel, rev, archive, accept-charset, and headers), but 'class' is by
  far the most common. [bug=41034]

* If you pass anything other than a dictionary as the second argument
  to one of the find* methods, it'll assume you want to use that
  object to search against a tag's CSS classes. Previously this only
  worked if you passed in a string.
* Fixed a bug that caused a crash when you passed a dictionary as an
  attribute value (possibly because you mistyped "attrs").
  [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style
  <meta charset> tags. [bug=837268]

* If Unicode, Dammit can't figure out a consistent encoding for a
  page, it will try each of its guesses again, with errors="replace"
  instead of errors="strict". This may mean that some data gets
  replaced with REPLACEMENT CHARACTER, but at least most of it will
  get turned into Unicode. [bug=754903]

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
  on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
  empty string. [bug=728697]

* Improved Unicode, Dammit's behavior when you give it Unicode to
  begin with.

= 4.0.0b4 (20120208) =

* Added BeautifulSoup.new_string() to go along with
  BeautifulSoup.new_tag().

* BeautifulSoup.new_tag() will follow the rules of whatever
  tree-builder was used to create the original BeautifulSoup object. A
  new <p> tag will look like "<p/>" if the soup object was created to
  parse XML, but it will look like "<p></p>" if the soup object was
  created to parse HTML.

* We pass in strict=False to html.parser on Python 3, greatly
  improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
  strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the more
  general "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
  entities unless the user prevents it.

* Added PageElement.insert_before() and PageElement.insert_after(),
  which let you put an element into the parse tree with respect to
  some other element.

* Raise an exception when the user tries to do something nonsensical
  like insert a tag into itself.

= 4.0.0b3 (20120203) =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.
Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser (html.parser in Python 3)
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can. For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

    >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality. Special thanks to Thomas Kluyver and Ezio Melotti for getting
Python 3 support to the finish line.
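A minimal sketch of the 4.0.0b4 tree-modification calls named above
(new_tag(), new_string(), insert_before(), insert_after()), assuming
the bs4 package with the stdlib html.parser builder; the markup here
is illustrative, not from the changelog:

```python
from bs4 import BeautifulSoup

# Build a small document with the stdlib html.parser tree builder.
soup = BeautifulSoup("<p>world</p>", "html.parser")

# new_tag() creates a detached tag; new_string() a detached string.
tag = soup.new_tag("b")
tag.string = "hello"

# insert_before()/insert_after() place an element into the parse tree
# relative to some other element.
soup.p.string.insert_before(tag)                    # <b> before "world"
soup.p.b.insert_after(soup.new_string(" there, "))  # string after <b>

print(soup.p)  # <p><b>hello</b> there, world</p>
```

Because the soup was built for HTML, a new empty tag serializes as
"<b></b>" rather than the XML-style "<b/>".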
Ezio Melotti is also to thank for greatly improving the HTML parser
that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <script> and <style>.
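The multi-valued "class" handling and the text= change described in
4.0.0b5/b6 above can be sketched as follows (assuming the bs4 package
with the stdlib html.parser builder; the markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="menu main"><a href="#">Click here</a></p>',
    "html.parser")

# Multi-valued attributes like "class" are always lists of values.
p = soup.find("p")
print(p["class"])  # ['menu', 'main']

# A non-dictionary second argument to find* is matched against CSS
# classes, and a single class matches tags carrying *any* of theirs.
assert soup.find("p", "menu") is p

# text= combined with tag-specific arguments finds tags whose .string
# is the given text, instead of searching bare strings only.
link = soup.find("a", text="Click here")
print(link.name)  # a
```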