Diffstat (limited to 'NEWS.txt')
-rw-r--r-- | NEWS.txt | 231
1 files changed, 231 insertions, 0 deletions
diff --git a/NEWS.txt b/NEWS.txt
new file mode 100644
index 0000000..aa08b6e
--- /dev/null
+++ b/NEWS.txt
@@ -0,0 +1,231 @@

= 4.0.0b4 =

* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()

* BeautifulSoup.new_tag() will follow the rules of whatever
  tree-builder was used to create the original BeautifulSoup object. A
  new <p> tag will look like "<p />" if the soup object was created to
  parse XML, but it will look like "<p></p>" if the soup object was
  created to parse HTML.

* We pass in strict=False to html.parser on Python 3, greatly
  improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
  strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the
  "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
  entities unless the user prevents it.

* Added PageElement.insert_before().

* Added PageElement.insert_after().

* Raise an exception when the user tries to do something nonsensical,
  like insert a tag into itself.


= 4.0.0b3 =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser (html.parser in Python 3)
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.

For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

 >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

 <?xml version="1.0" encoding="utf-8"?>
 <markup>
  ...
 </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.
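
To make the 4.0 changes above concrete, here is a minimal sketch of the
new API. It assumes only the bs4 package and the default html.parser
tree builder; the names used (new_tag(), new_string(), insert_before(),
insert_after()) are the ones described above, but exact serialization
details may differ between beta releases.

  from bs4 import BeautifulSoup   # the module is now called 'bs4'

  # Parse with the default tree builder (Python's html.parser).
  soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

  # new_tag() and new_string() create elements that follow the rules of
  # the tree builder behind the original soup (HTML here, so an empty
  # <em> serializes as "<em></em>" rather than "<em />").
  em = soup.new_tag("em")
  em.append(soup.new_string("brave new "))

  # Position the new elements relative to an existing tag.
  soup.b.insert_before(em)
  soup.b.insert_after(soup.new_string("!"))

  print(soup)  # <p>Hello <em>brave new </em><b>world</b>!</p>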


= 3.2.0 =

The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.


= 3.1.0 =

A hybrid version that supports 2.4 and can be automatically converted
to run under Python 3.0. There are three backwards-incompatible
changes you should be aware of, but no new features or deliberate
behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a string and decode() to get Unicode, and
you'll be ready (well, readier) for Python 3.

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

 <a href="foo</a>, </a><a href="bar">baz</a>
 <a b="<a>">', '<a b="<a>"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

 <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.


= 3.0.7a =

Added an import that makes BS work in Python 2.3.


= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML
namespace.


= 3.0.6 =

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are easy
for the garbage collector to collect.

Tag.extract() now returns the tag that was extracted.

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.
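
As a quick illustration of the extract() and decompose() behavior
described in the 3.0.6 entries above, here is a sketch written against
the modern bs4 spelling of the API (the methods themselves predate
bs4); treat it as an example, not the exact code these entries refer
to.

  from bs4 import BeautifulSoup

  soup = BeautifulSoup("<div><p>keep</p><p>drop</p></div>", "html.parser")
  first, second = soup.find_all("p")

  # extract() detaches the tag from the tree and returns it, so it can
  # be inspected or re-inserted somewhere else.
  removed = first.extract()
  print(removed)  # <p>keep</p>

  # decompose() detaches the tag and destroys it and its children,
  # leaving small pieces for the garbage collector.
  second.decompose()
  print(soup)     # <div></div>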


= 3.0.5 =

Soup objects can now be pickled, and copied with copy.deepcopy.

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles
Radford)

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).

Fixed an underlying bug in SGMLParser that thinks ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate. Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
XHTML_ENTITIES.

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)

Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)


= 3.0.4 =

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.
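
For context on the UnicodeDammit entries above, here is a minimal
sketch of how the class is typically used, written with the bs4 import
path (the Beautiful Soup 3 class lives in its own module and has
slightly different attribute names); the byte string and candidate
encodings are made up for illustration.

  from bs4 import UnicodeDammit

  # Bytes in an unknown encoding (here, UTF-8 for "Sacré bleu!").
  raw = b"Sacr\xc3\xa9 bleu!"

  # UnicodeDammit tries the suggested encodings, decodes the data, and
  # reports which encoding it used, which is what lets it double as a
  # general-purpose data scrubber.
  dammit = UnicodeDammit(raw, ["utf-8"])
  print(dammit.unicode_markup)     # Sacré bleu!
  print(dammit.original_encoding)  # utf-8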