= Introduction =
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>Some<b>bad<i>HTML")
>>> print(soup.prettify())
<html>
 <body>
  <p>
   Some
   <b>
    bad
    <i>
     HTML
    </i>
   </b>
  </p>
 </body>
</html>
>>> soup.find(text="bad")
u'bad'
>>> soup.i
<i>HTML</i>
>>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
>>> print(soup.prettify())
<?xml version="1.0" encoding="utf-8"?>
<tag1>
 Some
 <tag2 />
 bad
 <tag3>
  XML
 </tag3>
</tag1>
= About Beautiful Soup 4 =
This is a nearly-complete rewrite that removes Beautiful Soup's custom
HTML parser in favor of a system that lets you write a little glue
code and plug in any HTML or XML parser you want.
Beautiful Soup 4.0 comes with glue code for four parsers:
* Python's standard HTMLParser (html.parser in Python 3)
* lxml's HTML and XML parsers
* html5lib's HTML parser
HTMLParser is the default, but I recommend you install one of the
other parsers, or you'll have problems handling real-world markup.
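As a rough illustration of how a parser is selected, the sketch below
passes the parser's name as the second argument to the constructor. It
assumes only the standard library's parser is available, so it uses
"html.parser"; if lxml or html5lib is installed, the names "lxml",
"xml", and "html5lib" can be passed the same way. The markup string is
made up for the example.

```python
from bs4 import BeautifulSoup

markup = "<p>Some<b>bad<i>HTML"  # deliberately unclosed tags

# Name the parser explicitly instead of relying on the default.
soup = BeautifulSoup(markup, "html.parser")

# The dangling tags are closed when the document ends.
innermost_text = soup.i.string
bold_text = soup.b.get_text()
print(innermost_text)
print(bold_text)
```

Different parsers may build different trees from the same broken
markup, so naming the parser keeps results the same across machines.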
For complete documentation, see the Sphinx documentation in
docs/source. What follows is a summary of the changes from Beautiful
Soup 3.
== The module name has changed ==
Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':
>>> from bs4 import BeautifulSoup
== It works with Python 3 ==
Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.
Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.
== Better method names ==
Methods and attributes have been renamed to comply with PEP 8. The old names
still work. Here are the renames:
* replaceWith -> replace_with
* replaceWithChildren -> replace_with_children
* findAll -> find_all
* findAllNext -> find_all_next
* findAllPrevious -> find_all_previous
* findNext -> find_next
* findNextSibling -> find_next_sibling
* findNextSiblings -> find_next_siblings
* findParent -> find_parent
* findParents -> find_parents
* findPrevious -> find_previous
* findPreviousSibling -> find_previous_sibling
* findPreviousSiblings -> find_previous_siblings
* nextSibling -> next_sibling
* previousSibling -> previous_sibling
One method has been renamed for compatibility with Python 3:
* Tag.has_key() -> Tag.has_attr()
(The old name was misleading anyway, because has_key() looked at
a tag's attributes while the "in" operator (__contains__) looked at
a tag's contents.)
Some attributes have also been renamed, mostly to avoid using words
that have meaning to Python, like "unicode" and "next":
* Tag.isSelfClosing -> Tag.is_empty_element (backwards compatible)
* UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
(not backwards compatible)
* Tag.next -> Tag.next_element (not backwards compatible)
* Tag.previous -> Tag.previous_element (not backwards compatible)
So have some arguments to the Beautiful Soup constructor:
* BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
* BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
You can still use the old names, but you'll get a deprecation warning.
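The two renamed constructor arguments might be used like this. This is
only a sketch: the byte string, the choice of Latin-1, and the variable
names are assumptions made for the example, and it uses "html.parser"
to avoid depending on an extra parser.

```python
from bs4 import BeautifulSoup, SoupStrainer

# from_encoding tells Beautiful Soup how to decode a byte string.
latin1_markup = u"<p>caf\xe9</p>".encode("latin-1")
soup = BeautifulSoup(latin1_markup, "html.parser", from_encoding="latin-1")
decoded = soup.p.string
print(decoded == u"caf\xe9")

# parse_only restricts parsing to the parts matched by a SoupStrainer.
only_bold = SoupStrainer("b")
partial = BeautifulSoup("<p>One<b>two</b><b>three</b></p>", "html.parser",
                        parse_only=only_bold)
bold_strings = [str(tag.string) for tag in partial.find_all("b")]
print(bold_strings)
print(partial.find("p"))
```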
== Generators are now properties ==
The generators have been given more sensible (and PEP 8-compliant)
names, and turned into properties:
* childGenerator() -> children
* nextGenerator() -> next_elements
* nextSiblingGenerator() -> next_siblings
* previousGenerator() -> previous_elements
* previousSiblingGenerator() -> previous_siblings
* recursiveChildGenerator() -> descendants
* parentGenerator() -> parents
So instead of this:
for parent in tag.parentGenerator():
    ...
You can write this:
for parent in tag.parents:
    ...
(But the old code will still work.)
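To make the properties concrete, here is a short sketch that iterates
over three of them. The markup is invented for the example, and
"html.parser" is assumed so that only Beautiful Soup is required.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p>Some <b>bold</b> text</p></body></html>",
    "html.parser")

# .parents walks upward from a tag toward the top of the tree.
parent_names = [parent.name for parent in soup.b.parents]
print(parent_names[:3])

# .children yields only direct children; .descendants recurses.
child_count = len(list(soup.p.children))
descendant_count = len(list(soup.p.descendants))
print(child_count)
print(descendant_count)
```

Because these are generators, wrap them in list() if you need to
count the items or loop over them more than once.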
== tag.string is recursive ==
tag.string now operates recursively. If tag A contains a single tag B
and nothing else, then A.string is the same as B.string. So:
<a><b>foo</b></a>
The value of a.string used to be None, and now it's "foo".
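The sketch below shows the recursive case next to a case where the
rule does not apply. The second markup string is an assumption added
for contrast; "html.parser" is used to keep the example free of
extra dependencies.

```python
from bs4 import BeautifulSoup

# <a> contains a single tag and nothing else, so .string recurses into <b>.
nested = BeautifulSoup("<a><b>foo</b></a>", "html.parser")
nested_string = nested.a.string
print(nested_string)

# <a> has two children here, so there is no single string to return.
mixed = BeautifulSoup("<a><b>foo</b>bar</a>", "html.parser")
mixed_string = mixed.a.string
print(mixed_string)
```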
== Empty-element tags ==
Beautiful Soup's handling of empty-element tags (aka self-closing
tags) has been improved, especially when parsing XML. Previously you
had to explicitly specify a list of empty-element tags when parsing
XML. You can still do that, but if you don't, Beautiful Soup now
considers any empty tag to be an empty-element tag.
The determination of empty-element-ness is now made at runtime rather
than parse time. If you add a child to an empty-element tag, it stops
being an empty-element tag.
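The runtime check can be seen in a small sketch. The text above is
mostly about XML, but the XML tree builder needs lxml, so this example
assumes an HTML <br> tag parsed with "html.parser" instead. The
variable names are made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>line one<br/>line two</p>", "html.parser")
br = soup.br

# With no children, <br> counts as an empty-element tag.
empty_before = br.is_empty_element
markup_before = str(br)
print(empty_before)
print(markup_before)

# Adding a child changes the answer, because the check happens at runtime.
br.append("oops")
empty_after = br.is_empty_element
markup_after = str(br)
print(empty_after)
print(markup_after)
```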
== Entities are always converted to Unicode ==
An HTML or XML entity is always converted into the corresponding
Unicode character. There are no longer any smartQuotesTo or
convertEntities arguments. (Unicode, Dammit still has smart_quotes_to,
but its default is now to turn smart quotes into Unicode.)
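A quick sketch of the conversion follows, using named and numeric
entities. The markup is invented for the example and "html.parser" is
assumed. The result is compared rather than printed directly, to avoid
depending on the terminal's encoding.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>&eacute;t&eacute; &amp; &#233;</p>", "html.parser")

# Named and numeric entities both become ordinary Unicode characters.
text = soup.p.string
print(text == u"\xe9t\xe9 & \xe9")
print(len(text))
```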
== CDATA sections are normal text, if they're understood at all. ==
Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:
<p><![CDATA[foo]]></p> => <p></p>
A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:
<svg><![CDATA[foo]]></svg> => <svg>foo</svg>
The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:
<p><![CDATA[foo]]></p> => <p>foo</p>
In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.
== Miscellaneous other stuff ==
If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:
<?xml version="1.0" encoding="utf-8"?>
<markup>
...
</markup>
The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.
= Running the unit tests =
Here's how to run the tests on Python 2.7:
$ cd bs4
$ python2.7 -m unittest discover -s bs4
Here's how to do it with Python 3.2:
$ ./convert-py3k
$ cd py3k/bs4
$ python3 -m unittest discover -s bs4
The script test-all-versions will run the tests twice, once on Python
2.7 and once on Python 3.