= 4.0 beta 4 =

Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()

BeautifulSoup.new_tag() will follow the rules of whatever tree-builder
was used to create the original BeautifulSoup object. A new <p> tag
will look like "<p />" if the soup object was created to parse XML,
but it will look like "<p></p>" if the soup object was created to
parse HTML.
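
A rough sketch of the difference (this sketch assumes lxml is
installed for the XML case; the exact whitespace in the empty-tag
form may vary):

    from bs4 import BeautifulSoup

    xml_soup = BeautifulSoup("<root/>", ["lxml", "xml"])  # XML tree builder
    print(xml_soup.new_tag("p"))       # empty-element form, e.g. <p />

    html_soup = BeautifulSoup("<html></html>")            # default HTML builder
    print(html_soup.new_tag("p"))      # <p></p>
    print(html_soup.new_string("some text"))  # a new NavigableString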

We pass in strict=False to html.parser on Python 3, greatly improving
html.parser's ability to handle bad HTML.

Monkeypatch a serious bug in html.parser that made strict=False
disastrous on Python 3.2.2.

Replaced the "substitute_html_entities" argument with the "formatter" argument.

Bare ampersands and angle brackets are always converted to XML
entities unless the user prevents it.
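
A hedged sketch of how the new argument is used; the formatter value
shown is an assumption based on later 4.x documentation, not spelled
out in this entry:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p>Tom & Jerry</p>")
    print(soup.p.decode())                 # bare "&" is escaped to "&amp;"
    print(soup.p.decode(formatter=None))   # assumption: turns substitution off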

Added PageElement.insert_before().

Added PageElement.insert_after().

Raise an exception when the user tries to do something nonsensical
like insert a tag into itself.
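
A minimal sketch of the two new methods (the exact exception type
raised for a nonsensical insertion isn't specified here, so the
sketch only notes that one is raised):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<p><b>middle</b></p>")
    tag = soup.b
    tag.insert_before(soup.new_string("before "))
    tag.insert_after(soup.new_string(" after"))
    # soup.p now renders roughly as <p>before <b>middle</b> after</p>

    # Nonsensical operations, like inserting a tag into itself,
    # raise an exception:
    # tag.append(tag)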

= 4.0.0b3 =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser (html.parser in Python 3)
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.
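
A sketch of what plugging in a parser looks like, assuming (as in
later 4.x releases) that a builder can be chosen by passing its name
as the second argument:

    from bs4 import BeautifulSoup

    markup = "<p>Some <b>bad<p>HTML"
    soup_default = BeautifulSoup(markup)            # Python's built-in HTMLParser
    soup_lxml = BeautifulSoup(markup, "lxml")       # assumes lxml is installed
    soup_html5 = BeautifulSoup(markup, "html5lib")  # assumes html5lib is installed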

For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

    >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti also deserves thanks for
greatly improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

    <?xml version="1.0" encoding="utf-8"?>
    <markup>
     ...
    </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.
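
A hedged sketch (assuming lxml is installed for the XML case):

    from bs4 import BeautifulSoup

    xml_soup = BeautifulSoup("<markup><child/></markup>", ["lxml", "xml"])
    print(xml_soup.is_xml)  # True, so str(xml_soup) starts with an XML declaration

    xhtml = "<html><body><p>text</p></body></html>"
    soup = BeautifulSoup(xhtml)   # HTML builder: is_xml is False by default
    soup.is_xml = True            # force the XML declaration when serializing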


= 3.2.0 =

The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.

= 3.1.0 =

A hybrid version that supports Python 2.4 and can be automatically
converted to run under Python 3.0. There are three backwards-incompatible
changes you should be aware of, but no new features or deliberate
behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a string and decode() to get Unicode, and
you'll be ready (well, readier) for Python 3.
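
A sketch of the new convention (Beautiful Soup 3 API; the keyword
arguments encode() accepts aren't restated in this entry):

    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup("<p>caf&eacute;</p>")
    as_bytes = soup.encode()   # byte string
    as_text = soup.decode()    # Unicode string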

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

  <a href="foo</a>, </a><a href="bar">baz</a>
  <a b="<a>">

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

 <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.


= 3.0.7a =

Added an import that makes BS work in Python 2.3.


= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.


= 3.0.6 =

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are
easy for the garbage collector to collect.

Tag.extract() now returns the tag that was extracted.
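
A short sketch of the difference between the two (Beautiful Soup 3 API):

    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup("<div><p>keep me</p><p>drop me</p></div>")
    kept = soup.find("p").extract()   # detached from the tree and returned
    soup.find("p").decompose()        # disconnected and broken up for the GC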

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.


= 3.0.5 =

Soup objects can now be pickled, and copied with copy.deepcopy.

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles
Radford)

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).

Fixed an underlying bug in SGMLParser that assumed ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate. Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
XHTML_ENTITIES.

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)

Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)


= 3.0.4 =

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.
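
A sketch of stand-alone use; the attribute names follow the Beautiful
Soup 3 API and are an assumption of this sketch, not restated in the
entry itself:

    from BeautifulSoup import UnicodeDammit

    dammit = UnicodeDammit("Sacr\xe9 bleu!")
    print(dammit.unicode)            # the data, converted to Unicode
    print(dammit.originalEncoding)   # the encoding it guessed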

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.