Diffstat (limited to 'doc/source/index.rst')
-rw-r--r--  doc/source/index.rst | 931 ++++++++++++++++++++++----------------------
1 file changed, 458 insertions(+), 473 deletions(-)
diff --git a/doc/source/index.rst b/doc/source/index.rst
index f655327..76a32e9 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -54,8 +54,7 @@ Quick Start
Here's an HTML document I'll be using as an example throughout this
document. It's part of a story from `Alice in Wonderland`::
- html_doc = """
- <html><head><title>The Dormouse's story</title></head>
+ html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
@@ -186,7 +185,7 @@ works on Python 2 and Python 3. Make sure you use the right version of
:kbd:`$ pip install beautifulsoup4`
-(The ``BeautifulSoup`` package is probably `not` what you want. That's
+(The ``BeautifulSoup`` package is `not` what you want. That's
the previous major release, `Beautiful Soup 3`_. Lots of software uses
BS3, so it's still available, but if you're writing new code you
should install ``beautifulsoup4``.)
@@ -307,14 +306,14 @@ constructor. You can pass in a string or an open filehandle::
from bs4 import BeautifulSoup
with open("index.html") as fp:
- soup = BeautifulSoup(fp)
+ soup = BeautifulSoup(fp, 'html.parser')
- soup = BeautifulSoup("<html>a web page</html>")
+ soup = BeautifulSoup("<html>a web page</html>", 'html.parser')
First, the document is converted to Unicode, and HTML entities are
converted to Unicode characters::
- print(BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>"))
+ print(BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>", "html.parser"))
# <html><head></head><body>Sacré bleu!</body></html>
Beautiful Soup then parses the document using the best available
@@ -336,7 +335,7 @@ and ``Comment``.
A ``Tag`` object corresponds to an XML or HTML tag in the original document::
- soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
+ soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
@@ -351,7 +350,7 @@ Name
Every tag has a name, accessible as ``.name``::
tag.name
- # u'b'
+ # 'b'
If you change a tag's name, the change will be reflected in any HTML
markup generated by Beautiful Soup::
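# A sketch of the example that follows here, reusing the ``tag`` from above:
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>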
@@ -368,13 +367,14 @@ id="boldest">`` has an attribute "id" whose value is
"boldest". You can access a tag's attributes by treating the tag like
a dictionary::
+ tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']
- # u'boldest'
+ # 'boldest'
You can access that dictionary directly as ``.attrs``::
tag.attrs
- # {u'id': 'boldest'}
+ # {'id': 'boldest'}
You can add, remove, and modify a tag's attributes. Again, this is
done by treating the tag as a dictionary::
@@ -387,11 +387,11 @@ done by treating the tag as a dictionary::
del tag['id']
del tag['another-attribute']
tag
- # <b></b>
+ # <b>bold</b>
tag['id']
# KeyError: 'id'
- print(tag.get('id'))
+ tag.get('id')
# None
.. _multivalue:
@@ -406,26 +406,26 @@ one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
``headers``, and ``accesskey``. Beautiful Soup presents the value(s)
of a multi-valued attribute as a list::
- css_soup = BeautifulSoup('<p class="body"></p>')
+ css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']
- # ["body"]
+ # ['body']
- css_soup = BeautifulSoup('<p class="body strikeout"></p>')
+ css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']
- # ["body", "strikeout"]
+ # ['body', 'strikeout']
If an attribute `looks` like it has more than one value, but it's not
a multi-valued attribute as defined by any version of the HTML
standard, Beautiful Soup will leave the attribute alone::
- id_soup = BeautifulSoup('<p id="my id"></p>')
+ id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']
# 'my id'
When you turn a tag back into a string, multiple attribute values are
consolidated::
- rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
+ rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
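# Sketch of the elided output: the two values are joined with a space
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>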
@@ -435,34 +435,34 @@ consolidated::
You can disable this by passing ``multi_valued_attributes=None`` as a
keyword argument into the ``BeautifulSoup`` constructor::
- no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html', multi_valued_attributes=None)
- no_list_soup.p['class']
- # u'body strikeout'
+ no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None)
+ no_list_soup.p['class']
+ # 'body strikeout'
You can use ``get_attribute_list`` to get a value that's always a
list, whether or not it's a multi-valued attribute::
- id_soup.p.get_attribute_list('id')
- # ["my id"]
+ id_soup.p.get_attribute_list('id')
+ # ["my id"]
If you parse a document as XML, there are no multi-valued attributes::
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
- # u'body strikeout'
+ # 'body strikeout'
Again, you can configure this using the ``multi_valued_attributes`` argument::
- class_is_multi= { '*' : 'class'}
- xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
- xml_soup.p['class']
- # [u'body', u'strikeout']
+ class_is_multi = {'*': 'class'}
+ xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
+ xml_soup.p['class']
+ # ['body', 'strikeout']
You probably won't need to do this, but if you do, use the defaults as
a guide. They implement the rules described in the HTML specification::
- from bs4.builder import builder_registry
- builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES
+ from bs4.builder import builder_registry
+ builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES
``NavigableString``
@@ -471,28 +471,31 @@ a guide. They implement the rules described in the HTML specification::
A string corresponds to a bit of text within a tag. Beautiful Soup
uses the ``NavigableString`` class to contain these bits of text::
+ soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
+ tag = soup.b
tag.string
- # u'Extremely bold'
+ # 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
A ``NavigableString`` is just like a Python Unicode string, except
that it also supports some of the features described in `Navigating
the tree`_ and `Searching the tree`_. You can convert a
-``NavigableString`` to a Unicode string with ``unicode()``::
+``NavigableString`` to a Unicode string with ``unicode()`` (in
+Python 2) or ``str()`` (in Python 3)::
- unicode_string = unicode(tag.string)
+ unicode_string = str(tag.string)
unicode_string
- # u'Extremely bold'
+ # 'Extremely bold'
type(unicode_string)
- # <type 'unicode'>
+ # <class 'str'>
You can't edit a string in place, but you can replace one string with
another, using :ref:`replace_with()`::
tag.string.replace_with("No longer bold")
tag
- # <blockquote>No longer bold</blockquote>
+ # <b class="boldest">No longer bold</b>
``NavigableString`` supports most of the features described in
`Navigating the tree`_ and `Searching the tree`_, but not all of
@@ -518,13 +521,13 @@ You can also pass a ``BeautifulSoup`` object into one of the methods
defined in `Modifying the tree`_, just as you would a :ref:`Tag`. This
lets you do things like combine two parsed documents::
- doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document", "xml")
- footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
- doc.find(text="INSERT FOOTER HERE").replace_with(footer)
- # u'INSERT FOOTER HERE'
- print(doc)
- # <?xml version="1.0" encoding="utf-8"?>
- # <document><content/><footer>Here's the footer</footer></document>
+ doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
+ footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
+ doc.find(text="INSERT FOOTER HERE").replace_with(footer)
+ # 'INSERT FOOTER HERE'
+ print(doc)
+ # <?xml version="1.0" encoding="utf-8"?>
+ # <document><content/><footer>Here's the footer</footer></document>
Since the ``BeautifulSoup`` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
@@ -532,7 +535,7 @@ useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::
soup.name
- # u'[document]'
+ # '[document]'
Comments and other special strings
----------------------------------
@@ -543,7 +546,7 @@ leftover bits. The main one you'll probably encounter
is the comment::
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
- soup = BeautifulSoup(markup)
+ soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
@@ -551,7 +554,7 @@ is the comment::
The ``Comment`` object is just a special type of ``NavigableString``::
comment
- # u'Hey, buddy. Want to buy a used parser'
+ # 'Hey, buddy. Want to buy a used parser?'
But when it appears as part of an HTML document, a ``Comment`` is
displayed with special formatting::
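# A sketch, continuing with the comment soup from above:
print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>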
@@ -666,13 +669,13 @@ A tag's children are available in a list called ``.contents``::
# <head><title>The Dormouse's story</title></head>
head_tag.contents
- [<title>The Dormouse's story</title>]
+ # [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
- # [u'The Dormouse's story']
+ # ['The Dormouse's story']
The ``BeautifulSoup`` object itself has children. In this case, the
<html> tag is the child of the ``BeautifulSoup`` object.::
@@ -680,7 +683,7 @@ The ``BeautifulSoup`` object itself has children. In this case, the
len(soup.contents)
# 1
soup.contents[0].name
- # u'html'
+ # 'html'
A string does not have ``.contents``, because it can't contain
anything::
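# A sketch: a NavigableString has no .contents of its own
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'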
@@ -725,7 +728,7 @@ descendants::
len(list(soup.children))
# 1
len(list(soup.descendants))
- # 25
+ # 26
.. _.string:
@@ -736,7 +739,7 @@ If a tag has only one child, and that child is a ``NavigableString``,
the child is made available as ``.string``::
title_tag.string
- # u'The Dormouse's story'
+ # 'The Dormouse's story'
If a tag's only child is another tag, and `that` tag has a
``.string``, then the parent tag is considered to have the same
@@ -746,7 +749,7 @@ If a tag's only child is another tag, and `that` tag has a
# [<title>The Dormouse's story</title>]
head_tag.string
- # u'The Dormouse's story'
+ # 'The Dormouse's story'
If a tag contains more than one thing, then it's not clear what
``.string`` should refer to, so ``.string`` is defined to be
@@ -765,36 +768,38 @@ just the strings. Use the ``.strings`` generator::
for string in soup.strings:
print(repr(string))
- # u"The Dormouse's story"
- # u'\n\n'
- # u"The Dormouse's story"
- # u'\n\n'
- # u'Once upon a time there were three little sisters; and their names were\n'
- # u'Elsie'
- # u',\n'
- # u'Lacie'
- # u' and\n'
- # u'Tillie'
- # u';\nand they lived at the bottom of a well.'
- # u'\n\n'
- # u'...'
- # u'\n'
+ # '\n'
+ # "The Dormouse's story"
+ # '\n'
+ # '\n'
+ # "The Dormouse's story"
+ # '\n'
+ # 'Once upon a time there were three little sisters; and their names were\n'
+ # 'Elsie'
+ # ',\n'
+ # 'Lacie'
+ # ' and\n'
+ # 'Tillie'
+ # ';\nand they lived at the bottom of a well.'
+ # '\n'
+ # '...'
+ # '\n'
These strings tend to have a lot of extra whitespace, which you can
remove by using the ``.stripped_strings`` generator instead::
for string in soup.stripped_strings:
print(repr(string))
- # u"The Dormouse's story"
- # u"The Dormouse's story"
- # u'Once upon a time there were three little sisters; and their names were'
- # u'Elsie'
- # u','
- # u'Lacie'
- # u'and'
- # u'Tillie'
- # u';\nand they lived at the bottom of a well.'
- # u'...'
+ # "The Dormouse's story"
+ # "The Dormouse's story"
+ # 'Once upon a time there were three little sisters; and their names were'
+ # 'Elsie'
+ # ','
+ # 'Lacie'
+ # 'and'
+ # 'Tillie'
+ # ';\n and they lived at the bottom of a well.'
+ # '...'
Here, strings consisting entirely of whitespace are ignored, and
whitespace at the beginning and end of strings is removed.
@@ -851,25 +856,19 @@ buried deep within the document, to the very top of the document::
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
- if parent is None:
- print(parent)
- else:
- print(parent.name)
+ print(parent.name)
# p
# body
# html
# [document]
- # None
Going sideways
--------------
Consider a simple document like this::
- sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>")
+ sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>", 'html.parser')
print(sibling_soup.prettify())
- # <html>
- # <body>
# <a>
# <b>
# text1
@@ -878,8 +877,6 @@ Consider a simple document like this::
# text2
# </c>
# </a>
- # </body>
- # </html>
The <b> tag and the <c> tag are at the same level: they're both direct
children of the same tag. We call them `siblings`. When a document is
@@ -912,7 +909,7 @@ The strings "text1" and "text2" are `not` siblings, because they don't
have the same parent::
sibling_soup.b.string
- # u'text1'
+ # 'text1'
print(sibling_soup.b.string.next_sibling)
# None
@@ -921,9 +918,9 @@ In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a
tag will usually be a string containing whitespace. Going back to the
"three sisters" document::
- <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
- <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
- <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
+ # <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
+ # <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
+ # <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
You might think that the ``.next_sibling`` of the first <a> tag would
be the second <a> tag. But actually, it's a string: the comma and
@@ -934,7 +931,7 @@ newline that separate the first <a> tag from the second::
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
link.next_sibling
- # u',\n'
+ # ',\n '
The second <a> tag is actually the ``.next_sibling`` of the comma::
@@ -951,29 +948,27 @@ You can iterate over a tag's siblings with ``.next_siblings`` or
for sibling in soup.a.next_siblings:
print(repr(sibling))
- # u',\n'
+ # ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
- # u' and\n'
+ # ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- # u'; and they lived at the bottom of a well.'
- # None
+ # '; and they lived at the bottom of a well.'
for sibling in soup.find(id="link3").previous_siblings:
print(repr(sibling))
# ' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
- # u',\n'
+ # ',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
- # u'Once upon a time there were three little sisters; and their names were\n'
- # None
+ # 'Once upon a time there were three little sisters; and their names were\n'
Going back and forth
--------------------
Take a look at the beginning of the "three sisters" document::
- <html><head><title>The Dormouse's story</title></head>
- <p class="title"><b>The Dormouse's story</b></p>
+ # <html><head><title>The Dormouse's story</title></head>
+ # <p class="title"><b>The Dormouse's story</b></p>
An HTML parser takes this string of characters and turns it into a
series of events: "open an <html> tag", "open a <head> tag", "open a
@@ -999,14 +994,14 @@ interrupted by the start of the <a> tag.::
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
last_a_tag.next_sibling
- # '; and they lived at the bottom of a well.'
+ # ';\nand they lived at the bottom of a well.'
But the ``.next_element`` of that <a> tag, the thing that was parsed
immediately after the <a> tag, is `not` the rest of that sentence:
it's the word "Tillie"::
last_a_tag.next_element
- # u'Tillie'
+ # 'Tillie'
That's because in the original markup, the word "Tillie" appeared
before that semicolon. The parser encountered an <a> tag, then the
@@ -1019,7 +1014,7 @@ The ``.previous_element`` attribute is the exact opposite of
immediately before this one::
last_a_tag.previous_element
- # u' and\n'
+ # ' and\n'
last_a_tag.previous_element.next_element
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
@@ -1031,13 +1026,12 @@ forward or backward in the document as it was parsed::
for element in last_a_tag.next_elements:
print(repr(element))
- # u'Tillie'
- # u';\nand they lived at the bottom of a well.'
- # u'\n\n'
+ # 'Tillie'
+ # ';\nand they lived at the bottom of a well.'
+ # '\n'
# <p class="story">...</p>
- # u'...'
- # u'\n'
- # None
+ # '...'
+ # '\n'
Searching the tree
==================
@@ -1188,8 +1182,10 @@ If you pass in a function to filter on a specific attribute like
value, not the whole tag. Here's a function that finds all ``a`` tags
whose ``href`` attribute *does not* match a regular expression::
+ import re
def not_lacie(href):
return href and not re.compile("lacie").search(href)
+
soup.find_all(href=not_lacie)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
@@ -1204,7 +1200,8 @@ objects::
and isinstance(tag.previous_element, NavigableString))
for tag in soup.find_all(surrounded_by_strings):
- print tag.name
+ print(tag.name)
+ # body
# p
# a
# a
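For reference, the helper above reads in full roughly like this (a
sketch; the excerpt only shows its closing line)::

from bs4 import NavigableString

def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))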
@@ -1216,7 +1213,7 @@ Now we're ready to look at the search methods in detail.
``find_all()``
--------------
-Signature: find_all(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
+Method signature: find_all(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
<recursive>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
The ``find_all()`` method looks through a tag's descendants and
@@ -1239,7 +1236,7 @@ examples in `Kinds of filters`_, but here are a few more::
import re
soup.find(string=re.compile("sisters"))
- # u'Once upon a time there were three little sisters; and their names were\n'
+ # 'Once upon a time there were three little sisters; and their names were\n'
Some of these should look familiar, but others are new. What does it
mean to pass in a value for ``string``, or ``id``? Why does
@@ -1297,12 +1294,12 @@ You can filter multiple attributes at once by passing in more than one
keyword argument::
soup.find_all(href=re.compile("elsie"), id='link1')
- # [<a class="sister" href="http://example.com/elsie" id="link1">three</a>]
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Some attributes, like the data-* attributes in HTML 5, have names that
can't be used as the names of keyword arguments::
- data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
+ data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
@@ -1318,7 +1315,7 @@ because Beautiful Soup uses the ``name`` argument to contain the name
of the tag itself. Instead, you can give a value to 'name' in the
``attrs`` argument::
- name_soup = BeautifulSoup('<input name="email"/>')
+ name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
name_soup.find_all(name="email")
# []
name_soup.find_all(attrs={"name": "email"})
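# [<input name="email"/>]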
@@ -1359,7 +1356,7 @@ values for its "class" attribute. When you search for a tag that
matches a certain CSS class, you're matching against `any` of its CSS
classes::
- css_soup = BeautifulSoup('<p class="body strikeout"></p>')
+ css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]
@@ -1403,20 +1400,20 @@ regular expression`_, `a list`_, `a function`_, or `the value True`_.
Here are some examples::
soup.find_all(string="Elsie")
- # [u'Elsie']
+ # ['Elsie']
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
- # [u'Elsie', u'Lacie', u'Tillie']
+ # ['Elsie', 'Lacie', 'Tillie']
soup.find_all(string=re.compile("Dormouse"))
- [u"The Dormouse's story", u"The Dormouse's story"]
+ # ["The Dormouse's story", "The Dormouse's story"]
def is_the_only_string_within_a_tag(s):
"""Return True if this string is the only child of its parent tag."""
return (s == s.parent.string)
soup.find_all(string=is_the_only_string_within_a_tag)
- # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
+ # ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']
Although ``string`` is for finding strings, you can combine it with
arguments that find tags: Beautiful Soup will find all tags whose
@@ -1509,7 +1506,7 @@ These two lines are also equivalent::
``find()``
----------
-Signature: find(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
+Method signature: find(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
<recursive>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
The ``find_all()`` method scans the entire document looking for
@@ -1546,9 +1543,9 @@ names`_? That trick works by repeatedly calling ``find()``::
``find_parents()`` and ``find_parent()``
----------------------------------------
-Signature: find_parents(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_parents(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
-Signature: find_parent(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_parent(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
I spent a lot of time above covering ``find_all()`` and
``find()``. The Beautiful Soup API defines ten other methods for
@@ -1564,22 +1561,22 @@ do the opposite: they work their way `up` the tree, looking at a tag's
(or a string's) parents. Let's try them out, starting from a string
buried deep in the "three daughters" document::
- a_string = soup.find(string="Lacie")
- a_string
- # u'Lacie'
+ a_string = soup.find(string="Lacie")
+ a_string
+ # 'Lacie'
- a_string.find_parents("a")
- # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+ a_string.find_parents("a")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
- a_string.find_parent("p")
- # <p class="story">Once upon a time there were three little sisters; and their names were
- # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
- # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
- # and they lived at the bottom of a well.</p>
+ a_string.find_parent("p")
+ # <p class="story">Once upon a time there were three little sisters; and their names were
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
+ # and they lived at the bottom of a well.</p>
- a_string.find_parents("p", class="title")
- # []
+ a_string.find_parents("p", class_="title")
+ # []
One of the three <a> tags is the direct parent of the string in
question, so our search finds it. One of the three <p> tags is an
@@ -1597,9 +1594,9 @@ each one against the provided filter to see if it matches.
``find_next_siblings()`` and ``find_next_sibling()``
----------------------------------------------------
-Signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
-Signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
These methods use :ref:`.next_siblings <sibling-generators>` to
iterate over the rest of an element's siblings in the tree. The
@@ -1621,9 +1618,9 @@ and ``find_next_sibling()`` only returns the first one::
``find_previous_siblings()`` and ``find_previous_sibling()``
------------------------------------------------------------
-Signature: find_previous_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_previous_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
-Signature: find_previous_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_previous_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
These methods use :ref:`.previous_siblings <sibling-generators>` to iterate over an element's
siblings that precede it in the tree. The ``find_previous_siblings()``
@@ -1646,9 +1643,9 @@ method returns all the siblings that match, and
``find_all_next()`` and ``find_next()``
---------------------------------------
-Signature: find_all_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_all_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
-Signature: find_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
These methods use :ref:`.next_elements <element-generators>` to
iterate over whatever tags and strings come after it in the
@@ -1660,8 +1657,8 @@ document. The ``find_all_next()`` method returns all matches, and
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_all_next(string=True)
- # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
- # u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']
+ # ['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
+ # ';\nand they lived at the bottom of a well.', '\n', '...', '\n']
first_link.find_next("p")
# <p class="story">...</p>
@@ -1676,9 +1673,9 @@ show up later in the document than the starting element.
``find_all_previous()`` and ``find_previous()``
-----------------------------------------------
-Signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
-Signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
These methods use :ref:`.previous_elements <element-generators>` to
iterate over the tags and strings that came before it in the
@@ -1837,9 +1834,9 @@ selectors.::
soup.select("child")
# [<ns1:child>I'm in namespace 1</ns1:child>, <ns2:child>I'm in namespace 2</ns2:child>]
- soup.select("ns1|child", namespaces=namespaces)
+ soup.select("ns1|child", namespaces=soup.namespaces)
# [<ns1:child>I'm in namespace 1</ns1:child>]
-
+
When handling a CSS selector that uses namespaces, Beautiful Soup
uses the namespace abbreviations it found when parsing the
document. You can override this by passing in your own dictionary of
@@ -1869,7 +1866,7 @@ I covered this earlier, in `Attributes`_, but it bears repeating. You
can rename a tag, change the values of its attributes, add new
attributes, and delete attributes::
- soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
+ soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.name = "blockquote"
@@ -1889,13 +1886,13 @@ Modifying ``.string``
If you set a tag's ``.string`` attribute to a new string, the tag's contents are
replaced with that string::
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
- tag = soup.a
- tag.string = "New link text."
- tag
- # <a href="http://example.com/">New link text.</a>
+ tag = soup.a
+ tag.string = "New link text."
+ tag
+ # <a href="http://example.com/">New link text.</a>
Be careful: if the tag contained other tags, they and all their
contents will be destroyed.
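For instance (a sketch in the style of the earlier examples), the <i>
tag below is destroyed along with everything inside it::

soup = BeautifulSoup('<a href="http://example.com/">I linked to <i>example.com</i></a>', 'html.parser')
soup.a.string = "New link text."
soup.a
# <a href="http://example.com/">New link text.</a>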
@@ -1906,13 +1903,13 @@ contents will be destroyed.
You can add to a tag's contents with ``Tag.append()``. It works just
like calling ``.append()`` on a Python list::
- soup = BeautifulSoup("<a>Foo</a>")
- soup.a.append("Bar")
+ soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
+ soup.a.append("Bar")
- soup
- # <html><head></head><body><a>FooBar</a></body></html>
- soup.a.contents
- # [u'Foo', u'Bar']
+ soup
+ # <a>FooBar</a>
+ soup.a.contents
+ # ['Foo', 'Bar']
``extend()``
------------
@@ -1921,13 +1918,13 @@ Starting in Beautiful Soup 4.7.0, ``Tag`` also supports a method
called ``.extend()``, which works just like calling ``.extend()`` on a
Python list::
- soup = BeautifulSoup("<a>Soup</a>")
- soup.a.extend(["'s", " ", "on"])
+ soup = BeautifulSoup("<a>Soup</a>", 'html.parser')
+ soup.a.extend(["'s", " ", "on"])
- soup
- # <html><head></head><body><a>Soup's on</a></body></html>
- soup.a.contents
- # [u'Soup', u''s', u' ', u'on']
+ soup
+ # <a>Soup's on</a>
+ soup.a.contents
+ # ['Soup', "'s", ' ', 'on']
``NavigableString()`` and ``.new_tag()``
-------------------------------------------------
@@ -1936,43 +1933,43 @@ If you need to add a string to a document, no problem--you can pass a
Python string in to ``append()``, or you can call the ``NavigableString``
constructor::
- soup = BeautifulSoup("<b></b>")
- tag = soup.b
- tag.append("Hello")
- new_string = NavigableString(" there")
- tag.append(new_string)
- tag
- # <b>Hello there.</b>
- tag.contents
- # [u'Hello', u' there']
+ soup = BeautifulSoup("<b></b>", 'html.parser')
+ tag = soup.b
+ tag.append("Hello")
+ new_string = NavigableString(" there")
+ tag.append(new_string)
+ tag
+ # <b>Hello there</b>
+ tag.contents
+ # ['Hello', ' there']
If you want to create a comment or some other subclass of
``NavigableString``, just call the constructor::
- from bs4 import Comment
- new_comment = Comment("Nice to see you.")
- tag.append(new_comment)
- tag
- # <b>Hello there<!--Nice to see you.--></b>
- tag.contents
- # [u'Hello', u' there', u'Nice to see you.']
+ from bs4 import Comment
+ new_comment = Comment("Nice to see you.")
+ tag.append(new_comment)
+ tag
+ # <b>Hello there<!--Nice to see you.--></b>
+ tag.contents
+ # ['Hello', ' there', 'Nice to see you.']
`(This is a new feature in Beautiful Soup 4.4.0.)`
What if you need to create a whole new tag? The best solution is to
call the factory method ``BeautifulSoup.new_tag()``::
- soup = BeautifulSoup("<b></b>")
- original_tag = soup.b
+ soup = BeautifulSoup("<b></b>", 'html.parser')
+ original_tag = soup.b
- new_tag = soup.new_tag("a", href="http://www.example.com")
- original_tag.append(new_tag)
- original_tag
- # <b><a href="http://www.example.com"></a></b>
+ new_tag = soup.new_tag("a", href="http://www.example.com")
+ original_tag.append(new_tag)
+ original_tag
+ # <b><a href="http://www.example.com"></a></b>
- new_tag.string = "Link text."
- original_tag
- # <b><a href="http://www.example.com">Link text.</a></b>
+ new_tag.string = "Link text."
+ original_tag
+ # <b><a href="http://www.example.com">Link text.</a></b>
Only the first argument, the tag name, is required.
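If an attribute name can't be used as a Python keyword argument, you
can pass an ``attrs`` dictionary instead (a sketch using the
constructor's ``attrs`` parameter)::

new_tag = soup.new_tag("input", attrs={"data-foo": "value"})
new_tag
# <input data-foo="value"/>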
@@ -1984,15 +1981,15 @@ doesn't necessarily go at the end of its parent's
``.contents``. It'll be inserted at whatever numeric position you
say. It works just like ``.insert()`` on a Python list::
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- tag = soup.a
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ tag = soup.a
- tag.insert(1, "but did not endorse ")
- tag
- # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
- tag.contents
- # [u'I linked to ', u'but did not endorse', <i>example.com</i>]
+ tag.insert(1, "but did not endorse ")
+ tag
+ # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
+ tag.contents
+ # ['I linked to ', 'but did not endorse', <i>example.com</i>]
``insert_before()`` and ``insert_after()``
------------------------------------------
@@ -2000,36 +1997,36 @@ say. It works just like ``.insert()`` on a Python list::
The ``insert_before()`` method inserts tags or strings immediately
before something else in the parse tree::
- soup = BeautifulSoup("<b>stop</b>")
- tag = soup.new_tag("i")
- tag.string = "Don't"
- soup.b.string.insert_before(tag)
- soup.b
- # <b><i>Don't</i>stop</b>
+ soup = BeautifulSoup("<b>leave</b>", 'html.parser')
+ tag = soup.new_tag("i")
+ tag.string = "Don't"
+ soup.b.string.insert_before(tag)
+ soup.b
+ # <b><i>Don't</i>leave</b>
The ``insert_after()`` method inserts tags or strings immediately
following something else in the parse tree::
- div = soup.new_tag('div')
- div.string = 'ever'
- soup.b.i.insert_after(" you ", div)
- soup.b
- # <b><i>Don't</i> you <div>ever</div> stop</b>
- soup.b.contents
- # [<i>Don't</i>, u' you', <div>ever</div>, u'stop']
+ div = soup.new_tag('div')
+ div.string = 'ever'
+ soup.b.i.insert_after(" you ", div)
+ soup.b
+ # <b><i>Don't</i> you <div>ever</div> leave</b>
+ soup.b.contents
+ # [<i>Don't</i>, ' you ', <div>ever</div>, 'leave']
``clear()``
-----------
``Tag.clear()`` removes the contents of a tag::
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- tag = soup.a
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ tag = soup.a
- tag.clear()
- tag
- # <a href="http://example.com/"></a>
+ tag.clear()
+ tag
+ # <a href="http://example.com/"></a>
``extract()``
-------------
@@ -2037,34 +2034,34 @@ following something else in the parse tree::
``PageElement.extract()`` removes a tag or string from the tree. It
returns the tag or string that was extracted::
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- a_tag = soup.a
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ a_tag = soup.a
- i_tag = soup.i.extract()
+ i_tag = soup.i.extract()
- a_tag
- # <a href="http://example.com/">I linked to</a>
+ a_tag
+ # <a href="http://example.com/">I linked to</a>
- i_tag
- # <i>example.com</i>
+ i_tag
+ # <i>example.com</i>
- print(i_tag.parent)
- None
+ print(i_tag.parent)
+ # None
At this point you effectively have two parse trees: one rooted at the
``BeautifulSoup`` object you used to parse the document, and one rooted
at the tag that was extracted. You can go on to call ``extract`` on
a child of the element you extracted::
- my_string = i_tag.string.extract()
- my_string
- # u'example.com'
+ my_string = i_tag.string.extract()
+ my_string
+ # 'example.com'
- print(my_string.parent)
- # None
- i_tag
- # <i></i>
+ print(my_string.parent)
+ # None
+ i_tag
+ # <i></i>
``decompose()``
@@ -2073,25 +2070,25 @@ a child of the element you extracted::
``Tag.decompose()`` removes a tag from the tree, then `completely
destroys it and its contents`::
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- a_tag = soup.a
- i_tag = soup.i
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ a_tag = soup.a
+ i_tag = soup.i
- i_tag.decompose()
- a_tag
- # <a href="http://example.com/">I linked to</a>
+ i_tag.decompose()
+ a_tag
+ # <a href="http://example.com/">I linked to</a>
The behavior of a decomposed ``Tag`` or ``NavigableString`` is not
defined and you should not use it for anything. If you're not sure
whether something has been decomposed, you can check its
``.decomposed`` property `(new in Beautiful Soup 4.9.0)`::
- i_tag.decomposed
- # True
+ i_tag.decomposed
+ # True
- a_tag.decomposed
- # False
+ a_tag.decomposed
+ # False
.. _replace_with():
@@ -2102,16 +2099,16 @@ whether something has been decomposed, you can check its
``PageElement.replace_with()`` removes a tag or string from the tree,
and replaces it with the tag or string of your choice::
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- a_tag = soup.a
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ a_tag = soup.a
- new_tag = soup.new_tag("b")
- new_tag.string = "example.net"
- a_tag.i.replace_with(new_tag)
+ new_tag = soup.new_tag("b")
+ new_tag.string = "example.net"
+ a_tag.i.replace_with(new_tag)
- a_tag
- # <a href="http://example.com/">I linked to <b>example.net</b></a>
+ a_tag
+ # <a href="http://example.com/">I linked to <b>example.net</b></a>
``replace_with()`` returns the tag or string that was replaced, so
that you can examine it or add it back to another part of the tree.
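The returned element is detached from the tree, so you can keep it
around or move it somewhere else (a sketch in the style of the example
above)::

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
old_tag = soup.a.i.replace_with("example.net")
old_tag
# <i>example.com</i>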
@@ -2122,11 +2119,11 @@ that you can examine it or add it back to another part of the tree.
``PageElement.wrap()`` wraps an element in the tag you specify. It
returns the new wrapper::
- soup = BeautifulSoup("<p>I wish I was bold.</p>")
+ soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>
- soup.p.wrap(soup.new_tag("div")
+ soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>
This method is new in Beautiful Soup 4.0.5.
@@ -2137,13 +2134,13 @@ This method is new in Beautiful Soup 4.0.5.
``Tag.unwrap()`` is the opposite of ``wrap()``. It replaces a tag with
whatever's inside that tag. It's good for stripping out markup::
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- a_tag = soup.a
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ a_tag = soup.a
- a_tag.i.unwrap()
- a_tag
- # <a href="http://example.com/">I linked to example.com</a>
+ a_tag.i.unwrap()
+ a_tag
+ # <a href="http://example.com/">I linked to example.com</a>
Like ``replace_with()``, ``unwrap()`` returns the tag
that was replaced.
@@ -2153,27 +2150,27 @@ that was replaced.
After calling a bunch of methods that modify the parse tree, you may end up with two or more ``NavigableString`` objects next to each other. Beautiful Soup doesn't have any problems with this, but since it can't happen in a freshly parsed document, you might not expect behavior like the following::
- soup = BeautifulSoup("<p>A one</p>")
- soup.p.append(", a two")
+ soup = BeautifulSoup("<p>A one</p>", 'html.parser')
+ soup.p.append(", a two")
- soup.p.contents
- # [u'A one', u', a two']
+ soup.p.contents
+ # ['A one', ', a two']
- print(soup.p.encode())
- # <p>A one, a two</p>
+ print(soup.p.encode())
+ # b'<p>A one, a two</p>'
- print(soup.p.prettify())
- # <p>
- # A one
- # , a two
- # </p>
+ print(soup.p.prettify())
+ # <p>
+ # A one
+ # , a two
+ # </p>
You can call ``Tag.smooth()`` to clean up the parse tree by consolidating adjacent strings::
soup.smooth()
soup.p.contents
- # [u'A one, a two']
+ # ['A one, a two']
print(soup.p.prettify())
# <p>
@@ -2194,35 +2191,35 @@ The ``prettify()`` method will turn a Beautiful Soup parse tree into a
nicely formatted Unicode string, with a separate line for each
tag and each string::
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- soup.prettify()
- # '<html>\n <head>\n </head>\n <body>\n <a href="http://example.com/">\n...'
-
- print(soup.prettify())
- # <html>
- # <head>
- # </head>
- # <body>
- # <a href="http://example.com/">
- # I linked to
- # <i>
- # example.com
- # </i>
- # </a>
- # </body>
- # </html>
+ markup = '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ soup.prettify()
+ # '<html>\n <head>\n </head>\n <body>\n <a href="http://example.com/">\n...'
+
+ print(soup.prettify())
+ # <html>
+ # <head>
+ # </head>
+ # <body>
+ # <a href="http://example.com/">
+ # I linked to
+ # <i>
+ # example.com
+ # </i>
+ # </a>
+ # </body>
+ # </html>
You can call ``prettify()`` on the top-level ``BeautifulSoup`` object,
or on any of its ``Tag`` objects::
- print(soup.a.prettify())
- # <a href="http://example.com/">
- # I linked to
- # <i>
- # example.com
- # </i>
- # </a>
+ print(soup.a.prettify())
+ # <a href="http://example.com/">
+ # I linked to
+ # <i>
+ # example.com
+ # </i>
+ # </a>
Since it adds whitespace (in the form of newlines), ``prettify()``
changes the meaning of an HTML document and should not be used to
@@ -2233,14 +2230,14 @@ Non-pretty printing
-------------------
If you just want a string, with no fancy formatting, you can call
-``unicode()`` or ``str()`` on a ``BeautifulSoup`` object, or a ``Tag``
-within it::
+``str()`` on a ``BeautifulSoup`` object (``unicode()`` in Python 2),
+or on a ``Tag`` within it::
str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
- unicode(soup.a)
- # u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ str(soup.a)
+ # '<a href="http://example.com/">I linked to <i>example.com</i></a>'
In Python 3, ``str()`` returns a string of Unicode characters. If you
need a bytestring in a specific encoding, call ``encode()``. See
`Encodings`_ for other options.
@@ -2256,26 +2253,26 @@ Output formatters
If you give Beautiful Soup a document that contains HTML entities like
"&lquot;", they'll be converted to Unicode characters::
- soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
- unicode(soup)
- # u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'
+ soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", 'html.parser')
+ str(soup)
+ # '“Dammit!” he said.'
-If you then convert the document to a string, the Unicode characters
+If you then convert the document to a bytestring, the Unicode characters
will be encoded as UTF-8. You won't get the HTML entities back::
- str(soup)
- # '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'
+ soup.encode("utf8")
+ # b'\xe2\x80\x9cDammit!\xe2\x80\x9d he said.'
By default, the only characters that are escaped upon output are bare
ampersands and angle brackets. These get turned into "&amp;", "&lt;",
and "&gt;", so that Beautiful Soup doesn't inadvertently generate
invalid HTML or XML::
- soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
+ soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>", 'html.parser')
soup.p
# <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
- soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
+ soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
soup.a
# <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>
@@ -2288,56 +2285,44 @@ The default is ``formatter="minimal"``. Strings will only be processed
enough to ensure that Beautiful Soup generates valid HTML/XML::
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
- soup = BeautifulSoup(french)
+ soup = BeautifulSoup(french, 'html.parser')
print(soup.prettify(formatter="minimal"))
- # <html>
- # <body>
- # <p>
- # Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
- # </p>
- # </body>
- # </html>
+ # <p>
+ # Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
+ # </p>
If you pass in ``formatter="html"``, Beautiful Soup will convert
Unicode characters to HTML entities whenever possible::
print(soup.prettify(formatter="html"))
- # <html>
- # <body>
- # <p>
- # Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
- # </p>
- # </body>
- # </html>
+ # <p>
+ # Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
+ # </p>
If you pass in ``formatter="html5"``, it's the same as
``formatter="html"``, but Beautiful Soup will
omit the closing slash in HTML void tags like "br"::
- soup = BeautifulSoup("<br>")
+ br = BeautifulSoup("<br>", 'html.parser').br
- print(soup.encode(formatter="html"))
- # <html><body><br/></body></html>
+ print(br.encode(formatter="html"))
+ # b'<br/>'
- print(soup.encode(formatter="html5"))
- # <html><body><br></body></html>
+ print(br.encode(formatter="html5"))
+ # b'<br>'
If you pass in ``formatter=None``, Beautiful Soup will not modify
strings at all on output. This is the fastest option, but it may lead
to Beautiful Soup generating invalid HTML/XML, as in these examples::
print(soup.prettify(formatter=None))
- # <html>
- # <body>
- # <p>
- # Il a dit <<Sacré bleu!>>
- # </p>
- # </body>
- # </html>
+ # <p>
+ # Il a dit <<Sacré bleu!>>
+ # </p>
- link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
+ link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
print(link_soup.a.encode(formatter=None))
- # <a href="http://example.com/?foo=val1&bar=val2">A link</a>
+ # b'<a href="http://example.com/?foo=val1&bar=val2">A link</a>'
If you need more sophisticated control over your output, you can
use Beautiful Soup's ``Formatter`` class. Here's a formatter that
@@ -2347,16 +2332,13 @@ attribute value::
from bs4.formatter import HTMLFormatter
def uppercase(str):
return str.upper()
+
formatter = HTMLFormatter(uppercase)
print(soup.prettify(formatter=formatter))
- # <html>
- # <body>
- # <p>
- # IL A DIT <<SACRÉ BLEU!>>
- # </p>
- # </body>
- # </html>
+ # <p>
+ # IL A DIT <<SACRÉ BLEU!>>
+ # </p>
print(link_soup.a.prettify(formatter=formatter))
# <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
@@ -2367,7 +2349,7 @@ Subclassing ``HTMLFormatter`` or ``XMLFormatter`` will give you even
more control over the output. For example, Beautiful Soup sorts the
attributes in every tag by default::
- attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>')
+ attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>', 'html.parser')
print(attr_soup.p.encode())
# <p a="3" m="2" z="1"></p>
@@ -2380,8 +2362,9 @@ whenever it appears::
def attributes(self, tag):
for k, v in tag.attrs.items():
if k == 'm':
- continue
+ continue
yield k, v
+
print(attr_soup.p.encode(formatter=UnsortedAttributes()))
# <p z="1" a="3"></p>
@@ -2393,9 +2376,9 @@ all the strings in the document or something, but it will ignore the
return value::
from bs4.element import CData
- soup = BeautifulSoup("<a></a>")
+ soup = BeautifulSoup("<a></a>", 'html.parser')
soup.a.string = CData("one < three")
- print(soup.a.prettify(formatter="xml"))
+ print(soup.a.prettify(formatter="html"))
# <a>
# <![CDATA[one < three]]>
# </a>
@@ -2408,31 +2391,31 @@ If you only want the human-readable text inside a document or tag, you can use t
``get_text()`` method. It returns all the text in a document or
beneath a tag, as a single Unicode string::
- markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
- soup = BeautifulSoup(markup)
+ markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
+ soup = BeautifulSoup(markup, 'html.parser')
- soup.get_text()
- u'\nI linked to example.com\n'
- soup.i.get_text()
- u'example.com'
+ soup.get_text()
+ # '\nI linked to example.com\n'
+ soup.i.get_text()
+ # 'example.com'
You can specify a string to be used to join the bits of text
together::
# soup.get_text("|")
- u'\nI linked to |example.com|\n'
+ '\nI linked to |example.com|\n'
You can tell Beautiful Soup to strip whitespace from the beginning and
end of each bit of text::
# soup.get_text("|", strip=True)
- u'I linked to|example.com'
+ 'I linked to|example.com'
But at that point you might want to use the :ref:`.stripped_strings <string-generators>`
generator instead, and process the text yourself::
[text for text in soup.stripped_strings]
- # [u'I linked to', u'example.com']
+ # ['I linked to', 'example.com']
*As of Beautiful Soup version 4.9.0, when lxml or html.parser are in
use, the contents of <script>, <style>, and <template>
@@ -2549,11 +2532,11 @@ or UTF-8. But when you load that document into Beautiful Soup, you'll
discover it's been converted to Unicode::
markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
- soup = BeautifulSoup(markup)
+ soup = BeautifulSoup(markup, 'html.parser')
soup.h1
# <h1>Sacré bleu!</h1>
soup.h1.string
- # u'Sacr\xe9 bleu!'
+ # 'Sacr\xe9 bleu!'
It's not magic. (That sure would be nice.) Beautiful Soup uses a
sub-library called `Unicode, Dammit`_ to detect a document's encoding
@@ -2575,29 +2558,29 @@ Unicode, Dammit can't get a lock on it, and misidentifies it as
ISO-8859-7::
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
- soup = BeautifulSoup(markup)
- soup.h1
- <h1>νεμω</h1>
- soup.original_encoding
- 'ISO-8859-7'
+ soup = BeautifulSoup(markup, 'html.parser')
+ print(soup.h1)
+ # <h1>νεμω</h1>
+ print(soup.original_encoding)
+ # iso-8859-7
We can fix this by passing in the correct ``from_encoding``::
- soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
- soup.h1
- <h1>םולש</h1>
- soup.original_encoding
- 'iso8859-8'
+ soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8")
+ print(soup.h1)
+ # <h1>םולש</h1>
+ print(soup.original_encoding)
+ # iso8859-8
If you don't know what the correct encoding is, but you know that
Unicode, Dammit is guessing wrong, you can pass the wrong guesses in
as ``exclude_encodings``::
- soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
- soup.h1
- <h1>םולש</h1>
- soup.original_encoding
- 'WINDOWS-1255'
+ soup = BeautifulSoup(markup, 'html.parser', exclude_encodings=["iso-8859-7"])
+ print(soup.h1)
+ # <h1>םולש</h1>
+ print(soup.original_encoding)
+ # WINDOWS-1255
Windows-1255 isn't 100% correct, but that encoding is a compatible
superset of ISO-8859-8, so it's close enough. (``exclude_encodings``
@@ -2633,7 +2616,7 @@ document written in the Latin-1 encoding::
</html>
'''
- soup = BeautifulSoup(markup)
+ soup = BeautifulSoup(markup, 'html.parser')
print(soup.prettify())
# <html>
# <head>
@@ -2661,17 +2644,17 @@ You can also call encode() on the ``BeautifulSoup`` object, or any
element in the soup, just as if it were a Python string::
soup.p.encode("latin-1")
- # '<p>Sacr\xe9 bleu!</p>'
+ # b'<p>Sacr\xe9 bleu!</p>'
soup.p.encode("utf-8")
- # '<p>Sacr\xc3\xa9 bleu!</p>'
+ # b'<p>Sacr\xc3\xa9 bleu!</p>'
Any characters that can't be represented in your chosen encoding will
be converted into numeric XML entity references. Here's a document
that includes the Unicode character SNOWMAN::
markup = u"<b>\N{SNOWMAN}</b>"
- snowman_soup = BeautifulSoup(markup)
+ snowman_soup = BeautifulSoup(markup, 'html.parser')
tag = snowman_soup.b
The SNOWMAN character can be part of a UTF-8 document (it looks like
@@ -2679,13 +2662,13 @@ The SNOWMAN character can be part of a UTF-8 document (it looks like
ASCII, so it's converted into "&#9731;" for those encodings::
print(tag.encode("utf-8"))
- # <b>☃</b>
+ # b'<b>\xe2\x98\x83</b>'
- print tag.encode("latin-1")
- # <b>&#9731;</b>
+ print(tag.encode("latin-1"))
+ # b'<b>&#9731;</b>'
- print tag.encode("ascii")
- # <b>&#9731;</b>
+ print(tag.encode("ascii"))
+ # b'<b>&#9731;</b>'
Unicode, Dammit
---------------
@@ -2725,15 +2708,15 @@ entities::
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
- # u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'
+ # '<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
- # u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'
+ # '<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'
You can also convert Microsoft smart quotes to ASCII quotes::
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
- # u'<p>I just "love" Microsoft Word\'s smart quotes</p>'
+ # '<p>I just "love" Microsoft Word\'s smart quotes</p>'
Hopefully you'll find this feature useful, but Beautiful Soup doesn't
use it. Beautiful Soup prefers the default behavior, which is to
@@ -2741,7 +2724,7 @@ convert Microsoft smart quotes to Unicode characters along with
everything else::
UnicodeDammit(markup, ["windows-1252"]).unicode_markup
- # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'
+ # '<p>I just “love” Microsoft Word’s smart quotes</p>'
Inconsistent encodings
^^^^^^^^^^^^^^^^^^^^^^
@@ -2798,31 +2781,31 @@ the original document each Tag was found. You can access this
information as ``Tag.sourceline`` (line number) and ``Tag.sourcepos``
(position of the start tag within a line)::
- markup = "<p\n>Paragraph 1</p>\n <p>Paragraph 2</p>"
- soup = BeautifulSoup(markup, 'html.parser')
- for tag in soup.find_all('p'):
- print(tag.sourceline, tag.sourcepos, tag.string)
- # (1, 0, u'Paragraph 1')
- # (2, 3, u'Paragraph 2')
+ markup = "<p\n>Paragraph 1</p>\n <p>Paragraph 2</p>"
+ soup = BeautifulSoup(markup, 'html.parser')
+ for tag in soup.find_all('p'):
+ print(repr((tag.sourceline, tag.sourcepos, tag.string)))
+ # (1, 0, 'Paragraph 1')
+ # (3, 4, 'Paragraph 2')
Note that the two parsers mean slightly different things by
``sourceline`` and ``sourcepos``. For html.parser, these numbers
represent the position of the initial less-than sign. For html5lib,
these numbers represent the position of the final greater-than sign::
- soup = BeautifulSoup(markup, 'html5lib')
- for tag in soup.find_all('p'):
- print(tag.sourceline, tag.sourcepos, tag.string)
- # (2, 1, u'Paragraph 1')
- # (3, 7, u'Paragraph 2')
+ soup = BeautifulSoup(markup, 'html5lib')
+ for tag in soup.find_all('p'):
+ print(repr((tag.sourceline, tag.sourcepos, tag.string)))
+ # (2, 0, 'Paragraph 1')
+ # (3, 6, 'Paragraph 2')
You can shut off this feature by passing ``store_line_numbers=False``
into the ``BeautifulSoup`` constructor::
- markup = "<p\n>Paragraph 1</p>\n <p>Paragraph 2</p>"
- soup = BeautifulSoup(markup, 'html.parser', store_line_numbers=False)
- soup.p.sourceline
- # None
+ markup = "<p\n>Paragraph 1</p>\n <p>Paragraph 2</p>"
+ soup = BeautifulSoup(markup, 'html.parser', store_line_numbers=False)
+ print(soup.p.sourceline)
+ # None
`This feature is new in 4.8.1, and the parsers based on lxml don't
support it.`
@@ -2839,16 +2822,16 @@ in different parts of the object tree, because they both look like
markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, 'html.parser')
first_b, second_b = soup.find_all('b')
- print first_b == second_b
+ print(first_b == second_b)
# True
- print first_b.previous_element == second_b.previous_element
+ print(first_b.previous_element == second_b.previous_element)
# False
If you want to see whether two variables refer to exactly the same
object, use `is`::
- print first_b is second_b
+ print(first_b is second_b)
# False
Copying Beautiful Soup objects
@@ -2859,23 +2842,23 @@ You can use ``copy.copy()`` to create a copy of any ``Tag`` or
import copy
p_copy = copy.copy(soup.p)
- print p_copy
+ print(p_copy)
# <p>I want <b>pizza</b> and more <b>pizza</b>!</p>
The copy is considered equal to the original, since it represents the
same markup as the original, but it's not the same object::
- print soup.p == p_copy
+ print(soup.p == p_copy)
# True
- print soup.p is p_copy
+ print(soup.p is p_copy)
# False
The only real difference is that the copy is completely detached from
the original Beautiful Soup object tree, just as if ``extract()`` had
been called on it::
- print p_copy.parent
+ print(p_copy.parent)
# None
This is because two different ``Tag`` objects can't occupy the same
@@ -2922,7 +2905,7 @@ three ``SoupStrainer`` objects::
only_tags_with_id_link2 = SoupStrainer(id="link2")
def is_short_string(string):
- return len(string) < 10
+ return string is not None and len(string) < 10
only_short_strings = SoupStrainer(string=is_short_string)
@@ -2930,8 +2913,7 @@ I'm going to bring back the "three sisters" document one more time,
and we'll see what the document looks like when it's parsed with these
three ``SoupStrainer`` objects::
- html_doc = """
- <html><head><title>The Dormouse's story</title></head>
+ html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
@@ -2973,10 +2955,10 @@ You can also pass a ``SoupStrainer`` into any of the methods covered
in `Searching the tree`_. This probably isn't terribly useful, but I
thought I'd mention it::
- soup = BeautifulSoup(html_doc)
+ soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all(only_short_strings)
- # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
- # u'\n\n', u'...', u'\n']
+ # ['\n\n', '\n\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
+ # '\n\n', '...', '\n']
Customizing multi-valued attributes
-----------------------------------
@@ -2985,22 +2967,22 @@ In an HTML document, an attribute like ``class`` is given a list of
values, and an attribute like ``id`` is given a single value, because
the HTML specification treats those attributes differently::
- markup = '<a class="cls1 cls2" id="id1 id2">'
- soup = BeautifulSoup(markup)
- soup.a['class']
- # ['cls1', 'cls2']
- soup.a['id']
- # 'id1 id2'
+ markup = '<a class="cls1 cls2" id="id1 id2">'
+ soup = BeautifulSoup(markup, 'html.parser')
+ soup.a['class']
+ # ['cls1', 'cls2']
+ soup.a['id']
+ # 'id1 id2'
You can turn this off by passing in
``multi_valued_attributes=None``. Then all attributes will be given a
single value::
- soup = BeautifulSoup(markup, multi_valued_attributes=None)
- soup.a['class']
- # 'cls1 cls2'
- soup.a['id']
- # 'id1 id2'
+ soup = BeautifulSoup(markup, 'html.parser', multi_valued_attributes=None)
+ soup.a['class']
+ # 'cls1 cls2'
+ soup.a['id']
+ # 'id1 id2'
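+
+For instance, a dictionary can spell out exactly which attributes are
+treated as multi-valued, and on which tags. Here's a sketch; my
+``multi`` dictionary is a made-up subset of the defaults, not
+Beautiful Soup's real configuration::
+
+ # 'class' is multi-valued on every tag; 'rel' only on <a> tags.
+ multi = {'*': ['class'], 'a': ['rel']}
+ soup = BeautifulSoup(markup, 'html.parser', multi_valued_attributes=multi)
+ soup.a['class']
+ # ['cls1', 'cls2']
+ soup.a['id']
+ # 'id1 id2'
+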
As that sketch shows, you can customize this behavior quite a bit by
passing in a dictionary for ``multi_valued_attributes``. If you need
this, look at
@@ -3018,38 +3000,38 @@ When using the ``html.parser`` parser, you can use the
Beautiful Soup does when it encounters a tag that defines the same
attribute more than once::
- markup = '<a href="http://url1/" href="http://url2/">'
+ markup = '<a href="http://url1/" href="http://url2/">'
The default behavior is to use the last value found for the tag::
- soup = BeautifulSoup(markup, 'html.parser')
- soup.a['href']
- # http://url2/
+ soup = BeautifulSoup(markup, 'html.parser')
+ soup.a['href']
+ # http://url2/
- soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='replace')
- soup.a['href']
- # http://url2/
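+
+This is also the behavior of ``on_duplicate_attribute='replace'``,
+which simply names the default explicitly::
+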
+ soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='replace')
+ soup.a['href']
+ # http://url2/
With ``on_duplicate_attribute='ignore'`` you can tell Beautiful Soup
to use the `first` value found and ignore the rest::
- soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='ignore')
- soup.a['href']
- # http://url1/
+ soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='ignore')
+ soup.a['href']
+ # http://url1/
(lxml and html5lib always do it this way; their behavior can't be
configured from within Beautiful Soup.)
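+
+For example, assuming you have lxml installed::
+
+ # lxml keeps the first value of a duplicated attribute.
+ soup = BeautifulSoup(markup, 'lxml')
+ soup.a['href']
+ # http://url1/
+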
If you need more, you can pass in a function that's called on each duplicate value::
- def accumulate(attributes_so_far, key, value):
- if not isinstance(attributes_so_far[key], list):
- attributes_so_far[key] = [attributes_so_far[key]]
- attributes_so_far[key].append(value)
+ def accumulate(attributes_so_far, key, value):
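+     # On the first duplicate, the value seen so far is still a
+     # single string; wrap it in a list, then append each new value.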
+ if not isinstance(attributes_so_far[key], list):
+ attributes_so_far[key] = [attributes_so_far[key]]
+ attributes_so_far[key].append(value)
- soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute=accumulate)
- soup.a['href']
- # ["http://url1/", "http://url2/"]
+ soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute=accumulate)
+ soup.a['href']
+ # ["http://url1/", "http://url2/"]
`(This is a new feature in Beautiful Soup 4.9.1.)`
@@ -3062,26 +3044,28 @@ contain that information. Instead of that default behavior, you can
tell Beautiful Soup to instantiate `subclasses` of ``Tag`` or
``NavigableString``, subclasses you define with custom behavior::
- from bs4 import Tag, NavigableString
- class MyTag(Tag):
- pass
-
- class MyString(NavigableString):
- pass
-
- markup = "<div>some text</div>"
- soup = BeautifulSoup(markup)
- isinstance(soup.div, MyTag)
- # False
- isinstance(soup.div.string, MyString)
- # False
-
- my_classes = { Tag: MyTag, NavigableString: MyString }
- soup = BeautifulSoup(markup, element_classes=my_classes)
- isinstance(soup.div, MyTag)
- # True
- isinstance(soup.div.string, MyString)
- # True
+ from bs4 import Tag, NavigableString
+ class MyTag(Tag):
+ pass
+
+
+ class MyString(NavigableString):
+ pass
+
+
+ markup = "<div>some text</div>"
+ soup = BeautifulSoup(markup, 'html.parser')
+ isinstance(soup.div, MyTag)
+ # False
+ isinstance(soup.div.string, MyString)
+ # False
+
+ my_classes = { Tag: MyTag, NavigableString: MyString }
+ soup = BeautifulSoup(markup, 'html.parser', element_classes=my_classes)
+ isinstance(soup.div, MyTag)
+ # True
+ isinstance(soup.div.string, MyString)
+ # True
This can be useful when incorporating Beautiful Soup into a test
framework.
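+
+For example, here's a sketch of a ``Tag`` subclass that counts how
+many times each kind of tag gets instantiated, something a test could
+assert against afterwards. The ``CountingTag`` class and its ``seen``
+counter are made up for this example::
+
+ from collections import Counter
+ from bs4 import BeautifulSoup, Tag
+
+ # A made-up subclass that tallies each tag name as the tree is built.
+ class CountingTag(Tag):
+     seen = Counter()
+
+     def __init__(self, *args, **kwargs):
+         super().__init__(*args, **kwargs)
+         CountingTag.seen[self.name] += 1
+
+ soup = BeautifulSoup("<div><p>a</p><p>b</p></div>", 'html.parser',
+                      element_classes={Tag: CountingTag})
+ CountingTag.seen['p']
+ # 2
+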
@@ -3105,6 +3089,7 @@ missing a parser that Beautiful Soup could be using::
from bs4.diagnose import diagnose
with open("bad.html") as fp:
data = fp.read()
+
diagnose(data)
# Diagnostic running on Beautiful Soup 4.2.0
@@ -3154,7 +3139,7 @@ Version mismatch problems
-------------------------
* ``SyntaxError: Invalid syntax`` (on the line ``ROOT_TAG_NAME =
  u'[document]'``): Caused by running the Python 2 version of
Beautiful Soup under Python 3, without converting the code.
* ``ImportError: No module named HTMLParser`` - Caused by running the
@@ -3210,7 +3195,7 @@ Miscellaneous
-------------
* ``UnicodeEncodeError: 'charmap' codec can't encode character
- u'\xfoo' in position bar`` (or just about any other
+ '\xfoo' in position bar`` (or just about any other
``UnicodeEncodeError``) - This problem shows up in two main
situations. First, when you try to print a Unicode character that
your console doesn't know how to display. (See `this page on the
@@ -3222,8 +3207,8 @@ Miscellaneous
* ``KeyError: [attr]`` - Caused by accessing ``tag['attr']`` when the
tag in question doesn't define the ``attr`` attribute. The most
- common errors are ``KeyError: 'href'`` and ``KeyError:
- 'class'``. Use ``tag.get('attr')`` if you're not sure ``attr`` is
+ common errors are ``KeyError: 'href'`` and ``KeyError: 'class'``.
+ Use ``tag.get('attr')`` if you're not sure ``attr`` is
defined, just as you would with a Python dictionary.
* ``AttributeError: 'ResultSet' object has no attribute 'foo'`` - This
@@ -3323,11 +3308,11 @@ Most code written against Beautiful Soup 3 will work against Beautiful
Soup 4 with one simple change. All you should have to do is change the
package name from ``BeautifulSoup`` to ``bs4``. So this::
- from BeautifulSoup import BeautifulSoup
+ from BeautifulSoup import BeautifulSoup
becomes this::
- from bs4 import BeautifulSoup
+ from bs4 import BeautifulSoup
* If you get the ``ImportError`` "No module named BeautifulSoup", your
problem is that you're trying to run Beautiful Soup 3 code, but you