author | Leonard Richardson <leonardr@segfault.org> | 2020-07-29 22:43:48 -0400
committer | Leonard Richardson <leonardr@segfault.org> | 2020-07-29 22:43:48 -0400
commit | bd479f6ba3ed9db76d26cf36f12f1e9744f85ce4 (patch)
tree | 3eaea193cfff6a82ce28eb30f9db2bd47127b003 /doc
parent | 89bbbf3626a783cc15484cedbb4c5a663d95e824 (diff)
Ran through all of the documentation code examples using Python 3, corrected discrepancies and errors, and updated representations.
Diffstat (limited to 'doc')
-rw-r--r-- | doc/source/index.rst | 931
1 file changed, 458 insertions, 473 deletions
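The dominant pattern in the diff below is the addition of an explicit parser
argument ('html.parser') to nearly every ``BeautifulSoup()`` call in the
examples. A minimal sketch of the motivation, not part of the commit itself,
assuming a bs4 4.9.x install where ``GuessedAtParserWarning`` is exported
from the ``bs4`` package::

 import warnings
 from bs4 import BeautifulSoup, GuessedAtParserWarning

 markup = "<p>Hello</p>"

 # With no parser named, Beautiful Soup picks the "best" parser installed
 # on this machine and warns that the choice (and therefore the exact
 # output) can vary from system to system.
 with warnings.catch_warnings(record=True) as caught:
     warnings.simplefilter("always")
     soup = BeautifulSoup(markup)
 assert any(w.category is GuessedAtParserWarning for w in caught)

 # Naming the parser makes every documentation example reproducible.
 soup = BeautifulSoup(markup, "html.parser")
 print(soup.p.string)
 # Hello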
diff --git a/doc/source/index.rst b/doc/source/index.rst
index f655327..76a32e9 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -54,8 +54,7 @@ Quick Start
 Here's an HTML document I'll be using as an example throughout this
 document. It's part of a story from `Alice in Wonderland`::
 
- html_doc = """
- <html><head><title>The Dormouse's story</title></head>
+ html_doc = """<html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 
@@ -186,7 +185,7 @@ works on Python 2 and Python 3. Make sure you use the right version of
 :kbd:`$ pip install beautifulsoup4`
 
-(The ``BeautifulSoup`` package is probably `not` what you want. That's
+(The ``BeautifulSoup`` package is `not` what you want. That's
 the previous major release, `Beautiful Soup 3`_. Lots of software uses
 BS3, so it's still available, but if you're writing new code you
 should install ``beautifulsoup4``.)
 
@@ -307,14 +306,14 @@ constructor. You can pass in a string or an open filehandle::
 from bs4 import BeautifulSoup
 
 with open("index.html") as fp:
-    soup = BeautifulSoup(fp)
+    soup = BeautifulSoup(fp, 'html.parser')
 
-soup = BeautifulSoup("<html>a web page</html>")
+soup = BeautifulSoup("<html>a web page</html>", 'html.parser')
 
 First, the document is converted to Unicode, and HTML entities are
 converted to Unicode characters::
 
-print(BeautifulSoup("<html><head></head><body>Sacré bleu!</body></html>"))
+print(BeautifulSoup("<html><head></head><body>Sacré bleu!</body></html>", "html.parser"))
 # <html><head></head><body>Sacré bleu!</body></html>
 
 Beautiful Soup then parses the document using the best available
@@ -336,7 +335,7 @@ and ``Comment``.
 A ``Tag`` object corresponds to an XML or HTML tag in the original
 document::
 
- soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
+ soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
 tag = soup.b
 type(tag)
 # <class 'bs4.element.Tag'>
 
@@ -351,7 +350,7 @@ Name
 Every tag has a name, accessible as ``.name``::
 
 tag.name
- # u'b'
+ # 'b'
 
 If you change a tag's name, the change will be reflected in any HTML
 markup generated by Beautiful Soup::
 
@@ -368,13 +367,14 @@ id="boldest">`` has an attribute "id" whose value is "boldest". You
 can access a tag's attributes by treating the tag like a
 dictionary::
 
+ tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
 tag['id']
- # u'boldest'
+ # 'boldest'
 
 You can access that dictionary directly as ``.attrs``::
 
 tag.attrs
- # {u'id': 'boldest'}
+ # {'id': 'boldest'}
 
 You can add, remove, and modify a tag's attributes. Again, this is
 done by treating the tag as a dictionary::
 
@@ -387,11 +387,11 @@ done by treating the tag as a dictionary::
 del tag['id']
 del tag['another-attribute']
 tag
- # <b></b>
+ # <b>bold</b>
 
 tag['id']
 # KeyError: 'id'
- print(tag.get('id'))
+ tag.get('id')
 # None
 
 .. _multivalue:
 
@@ -406,26 +406,26 @@ one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
 ``headers``, and ``accesskey``.
 Beautiful Soup presents the value(s) of a multi-valued attribute as
 a list::
 
- css_soup = BeautifulSoup('<p class="body"></p>')
+ css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
 css_soup.p['class']
- # ["body"]
+ # ['body']
 
- css_soup = BeautifulSoup('<p class="body strikeout"></p>')
+ css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
 css_soup.p['class']
- # ["body", "strikeout"]
+ # ['body', 'strikeout']
 
 If an attribute `looks` like it has more than one value, but it's not
 a multi-valued attribute as defined by any version of the HTML
 standard, Beautiful Soup will leave the attribute alone::
 
- id_soup = BeautifulSoup('<p id="my id"></p>')
+ id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
 id_soup.p['id']
 # 'my id'
 
 When you turn a tag back into a string, multiple attribute values are
 consolidated::
 
- rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
+ rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
 rel_soup.a['rel']
 # ['index']
 rel_soup.a['rel'] = ['index', 'contents']
 
@@ -435,34 +435,34 @@ consolidated::
 You can disable this by passing ``multi_valued_attributes=None`` as a
 keyword argument into the ``BeautifulSoup`` constructor::
 
- no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html', multi_valued_attributes=None)
- no_list_soup.p['class']
- # u'body strikeout'
+ no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None)
+ no_list_soup.p['class']
+ # 'body strikeout'
 
 You can use ```get_attribute_list`` to get a value that's always a
 list, whether or not it's a multi-valued atribute::
 
- id_soup.p.get_attribute_list('id')
- # ["my id"]
+ id_soup.p.get_attribute_list('id')
+ # ["my id"]
 
 If you parse a document as XML, there are no multi-valued attributes::
 
 xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
 xml_soup.p['class']
- # u'body strikeout'
+ # 'body strikeout'
 
 Again, you can configure this using the ``multi_valued_attributes``
 argument::
 
- class_is_multi= { '*' : 'class'}
- xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
- xml_soup.p['class']
- # [u'body', u'strikeout']
+ class_is_multi= { '*' : 'class'}
+ xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
+ xml_soup.p['class']
+ # ['body', 'strikeout']
 
 You probably won't need to do this, but if you do, use the defaults as
 a guide. They implement the rules described in the HTML specification::
 
- from bs4.builder import builder_registry
- builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES
+ from bs4.builder import builder_registry
+ builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES
 
 
 ``NavigableString``
@@ -471,28 +471,31 @@ a guide. They implement the rules described in the HTML specification::
 
 A string corresponds to a bit of text within a tag. Beautiful Soup
 uses the ``NavigableString`` class to contain these bits of text::
 
+ soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
+ tag = soup.b
 tag.string
- # u'Extremely bold'
+ # 'Extremely bold'
 type(tag.string)
 # <class 'bs4.element.NavigableString'>
 
 A ``NavigableString`` is just like a Python Unicode string, except
 that it also supports some of the features described in `Navigating
 the tree`_ and `Searching the tree`_.
 You can convert a
-``NavigableString`` to a Unicode string with ``unicode()``::
+``NavigableString`` to a Unicode string with ``unicode()`` (in
+Python 2) or ``str`` (in Python 3)::
 
- unicode_string = unicode(tag.string)
+ unicode_string = str(tag.string)
 unicode_string
- # u'Extremely bold'
+ # 'Extremely bold'
 type(unicode_string)
- # <type 'unicode'>
+ # <type 'str'>
 
 You can't edit a string in place, but you can replace one string with
 another, using :ref:`replace_with()`::
 
 tag.string.replace_with("No longer bold")
 tag
- # <blockquote>No longer bold</blockquote>
+ # <b class="boldest">No longer bold</b>
 
 ``NavigableString`` supports most of the features described in
 `Navigating the tree`_ and `Searching the tree`_, but not all of
 
@@ -518,13 +521,13 @@ You can also pass a ``BeautifulSoup`` object into one of the methods
 defined in `Modifying the tree`_, just as you would a :ref:`Tag`. This
 lets you do things like combine two parsed documents::
 
- doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document", "xml")
- footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
- doc.find(text="INSERT FOOTER HERE").replace_with(footer)
- # u'INSERT FOOTER HERE'
- print(doc)
- # <?xml version="1.0" encoding="utf-8"?>
- # <document><content/><footer>Here's the footer</footer></document>
+ doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document", "xml")
+ footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
+ doc.find(text="INSERT FOOTER HERE").replace_with(footer)
+ # 'INSERT FOOTER HERE'
+ print(doc)
+ # <?xml version="1.0" encoding="utf-8"?>
+ # <document><content/><footer>Here's the footer</footer></document>
 
 Since the ``BeautifulSoup`` object doesn't correspond to an actual
 HTML or XML tag, it has no name and no attributes. But sometimes it's
@@ -532,7 +535,7 @@ useful to look at its ``.name``, so it's been given the special
 ``.name`` "[document]"::
 
 soup.name
- # u'[document]'
+ # '[document]'
 
 Comments and other special strings
 ----------------------------------
 
@@ -543,7 +546,7 @@ leftover bits. The main one you'll probably encounter
 is the comment::
 
 markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
- soup = BeautifulSoup(markup)
+ soup = BeautifulSoup(markup, 'html.parser')
 comment = soup.b.string
 type(comment)
 # <class 'bs4.element.Comment'>
 
@@ -551,7 +554,7 @@ is the comment::
 The ``Comment`` object is just a special type of ``NavigableString``::
 
 comment
- # u'Hey, buddy. Want to buy a used parser'
+ # 'Hey, buddy. Want to buy a used parser'
 
 But when it appears as part of an HTML document, a ``Comment`` is
 displayed with special formatting::
 
@@ -666,13 +669,13 @@ A tag's children are available in a list called ``.contents``::
 
 # <head><title>The Dormouse's story</title></head>
 
 head_tag.contents
- [<title>The Dormouse's story</title>]
+ # [<title>The Dormouse's story</title>]
 
 title_tag = head_tag.contents[0]
 title_tag
 # <title>The Dormouse's story</title>
 title_tag.contents
- # [u'The Dormouse's story']
+ # ['The Dormouse's story']
 
 The ``BeautifulSoup`` object itself has children. In this case, the
 <html> tag is the child of the ``BeautifulSoup`` object.::
 
@@ -680,7 +683,7 @@ The ``BeautifulSoup`` object itself has children. In this case, the
 len(soup.contents)
 # 1
 soup.contents[0].name
- # u'html'
+ # 'html'
 
 A string does not have ``.contents``, because it can't contain
 anything::
 
@@ -725,7 +728,7 @@ descendants::
 
 len(list(soup.children))
 # 1
 len(list(soup.descendants))
- # 25
+ # 26
 .. _.string:
 
@@ -736,7 +739,7 @@ If a tag has only one child, and that child is a ``NavigableString``,
 the child is made available as ``.string``::
 
 title_tag.string
- # u'The Dormouse's story'
+ # 'The Dormouse's story'
 
 If a tag's only child is another tag, and `that` tag has a
 ``.string``, then the parent tag is considered to have the same
 
@@ -746,7 +749,7 @@ If a tag's only child is another tag, and `that` tag has a
 # [<title>The Dormouse's story</title>]
 
 head_tag.string
- # u'The Dormouse's story'
+ # 'The Dormouse's story'
 
 If a tag contains more than one thing, then it's not clear what
 ``.string`` should refer to, so ``.string`` is defined to be
 
@@ -765,36 +768,38 @@ just the strings. Use the ``.strings`` generator::
 
 for string in soup.strings:
     print(repr(string))
- # u"The Dormouse's story"
- # u'\n\n'
- # u"The Dormouse's story"
- # u'\n\n'
- # u'Once upon a time there were three little sisters; and their names were\n'
- # u'Elsie'
- # u',\n'
- # u'Lacie'
- # u' and\n'
- # u'Tillie'
- # u';\nand they lived at the bottom of a well.'
- # u'\n\n'
- # u'...'
- # u'\n'
+ '\n'
+ # "The Dormouse's story"
+ # '\n'
+ # '\n'
+ # "The Dormouse's story"
+ # '\n'
+ # 'Once upon a time there were three little sisters; and their names were\n'
+ # 'Elsie'
+ # ',\n'
+ # 'Lacie'
+ # ' and\n'
+ # 'Tillie'
+ # ';\nand they lived at the bottom of a well.'
+ # '\n'
+ # '...'
+ # '\n'
 
 These strings tend to have a lot of extra whitespace, which you can
 remove by using the ``.stripped_strings`` generator instead::
 
 for string in soup.stripped_strings:
     print(repr(string))
- # u"The Dormouse's story"
- # u"The Dormouse's story"
- # u'Once upon a time there were three little sisters; and their names were'
- # u'Elsie'
- # u','
- # u'Lacie'
- # u'and'
- # u'Tillie'
- # u';\nand they lived at the bottom of a well.'
- # u'...'
+ # "The Dormouse's story"
+ # "The Dormouse's story"
+ # 'Once upon a time there were three little sisters; and their names were'
+ # 'Elsie'
+ # ','
+ # 'Lacie'
+ # 'and'
+ # 'Tillie'
+ # ';\n and they lived at the bottom of a well.'
+ # '...'
 
 Here, strings consisting entirely of whitespace are ignored, and
 whitespace at the beginning and end of strings is removed.
 
@@ -851,25 +856,19 @@ buried deep within the document, to the very top of the document::
 
 link
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
 for parent in link.parents:
-    if parent is None:
-        print(parent)
-    else:
-        print(parent.name)
+    print(parent.name)
 # p
 # body
 # html
 # [document]
- # None
 
 Going sideways
 --------------
 
 Consider a simple document like this::
 
- sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>")
+ sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>", 'html.parser')
 print(sibling_soup.prettify())
- # <html>
- # <body>
 # <a>
 # <b>
 # text1
@@ -878,8 +877,6 @@ Consider a simple document like this::
 # text2
 # </c>
 # </a>
- # </body>
- # </html>
 
 The <b> tag and the <c> tag are at the same level: they're both direct
 children of the same tag. We call them `siblings`. When a document is
 
@@ -912,7 +909,7 @@ The strings "text1" and "text2" are `not` siblings, because they don't
 have the same parent::
 
 sibling_soup.b.string
- # u'text1'
+ # 'text1'
 
 print(sibling_soup.b.string.next_sibling)
 # None
 
 In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a
 tag will usually be a string containing whitespace.
 Going back to the "three sisters" document::
 
- <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
- <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
- <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
+ # <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
+ # <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
+ # <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
 
 You might think that the ``.next_sibling`` of the first <a> tag would
 be the second <a> tag. But actually, it's a string: the comma and
 newline that separate the first <a> tag from the second::
 
@@ -934,7 +931,7 @@ newline that separate the first <a> tag from the second::
 
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
 
 link.next_sibling
- # u',\n'
+ # ',\n '
 
 The second <a> tag is actually the ``.next_sibling`` of the comma::
 
@@ -951,29 +948,27 @@ You can iterate over a tag's siblings with ``.next_siblings`` or
 
 for sibling in soup.a.next_siblings:
     print(repr(sibling))
- # u',\n'
+ # ',\n'
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
- # u' and\n'
+ # ' and\n'
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- # u'; and they lived at the bottom of a well.'
- # None
+ # '; and they lived at the bottom of a well.'
 
 for sibling in soup.find(id="link3").previous_siblings:
     print(repr(sibling))
 # ' and\n'
 # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
- # u',\n'
+ # ',\n'
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
- # u'Once upon a time there were three little sisters; and their names were\n'
- # None
+ # 'Once upon a time there were three little sisters; and their names were\n'
 
 Going back and forth
 --------------------
 
 Take a look at the beginning of the "three sisters" document::
 
- <html><head><title>The Dormouse's story</title></head>
- <p class="title"><b>The Dormouse's story</b></p>
+ # <html><head><title>The Dormouse's story</title></head>
+ # <p class="title"><b>The Dormouse's story</b></p>
 
 An HTML parser takes this string of characters and turns it into a
 series of events: "open an <html> tag", "open a <head> tag", "open a
 
@@ -999,14 +994,14 @@ interrupted by the start of the <a> tag.::
 
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
 
 last_a_tag.next_sibling
- # '; and they lived at the bottom of a well.'
+ # ';\nand they lived at the bottom of a well.'
 
 But the ``.next_element`` of that <a> tag, the thing that was parsed
 immediately after the <a> tag, is `not` the rest of that sentence:
 it's the word "Tillie"::
 
 last_a_tag.next_element
- # u'Tillie'
+ # 'Tillie'
 
 That's because in the original markup, the word "Tillie" appeared
 before that semicolon. The parser encountered an <a> tag, then the
 
@@ -1019,7 +1014,7 @@ The ``.previous_element`` attribute is the exact opposite of
 immediately before this one::
 
 last_a_tag.previous_element
- # u' and\n'
+ # ' and\n'
 
 last_a_tag.previous_element.next_element
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
 
@@ -1031,13 +1026,12 @@ forward or backward in the document as it was parsed::
 
 for element in last_a_tag.next_elements:
     print(repr(element))
- # u'Tillie'
- # u';\nand they lived at the bottom of a well.'
- # u'\n\n'
+ # 'Tillie'
+ # ';\nand they lived at the bottom of a well.'
+ # '\n'
 # <p class="story">...</p>
- # u'...'
- # u'\n'
- # None
+ # '...'
+ # '\n'
 
 Searching the tree
 ==================
 
@@ -1188,8 +1182,10 @@ If you pass in a function to filter on a specific attribute like
 value, not the whole tag. Here's a function that finds all ``a`` tags
 whose ``href`` attribute *does not* match a regular expression::
 
+ import re
 def not_lacie(href):
     return href and not re.compile("lacie").search(href)
+
 soup.find_all(href=not_lacie)
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 
@@ -1204,7 +1200,8 @@ objects::
 
     and isinstance(tag.previous_element, NavigableString))
 
 for tag in soup.find_all(surrounded_by_strings):
-    print tag.name
+    print(tag.name)
+ # body
 # p
 # a
 # a
 
@@ -1216,7 +1213,7 @@ Now we're ready to look at the search methods in detail.
 
 ``find_all()``
 --------------
 
-Signature: find_all(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
+Method signature: find_all(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
 <recursive>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
 
 The ``find_all()`` method looks through a tag's descendants and
 
@@ -1239,7 +1236,7 @@ examples in `Kinds of filters`_, but here are a few more::
 
 import re
 soup.find(string=re.compile("sisters"))
- # u'Once upon a time there were three little sisters; and their names were\n'
+ # 'Once upon a time there were three little sisters; and their names were\n'
 
 Some of these should look familiar, but others are new. What does it
 mean to pass in a value for ``string``, or ``id``? Why does
 
@@ -1297,12 +1294,12 @@ You can filter multiple attributes at once by passing in more than one
 keyword argument::
 
 soup.find_all(href=re.compile("elsie"), id='link1')
- # [<a class="sister" href="http://example.com/elsie" id="link1">three</a>]
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
 
 Some attributes, like the data-* attributes in HTML 5, have names that
 can't be used as the names of keyword arguments::
 
- data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
+ data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
 data_soup.find_all(data-foo="value")
 # SyntaxError: keyword can't be an expression
 
@@ -1318,7 +1315,7 @@ because Beautiful Soup uses the ``name`` argument to contain the name
 of the tag itself. Instead, you can give a value to 'name' in the
 ``attrs`` argument::
 
- name_soup = BeautifulSoup('<input name="email"/>')
+ name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
 name_soup.find_all(name="email")
 # []
 name_soup.find_all(attrs={"name": "email"})
 
@@ -1359,7 +1356,7 @@ values for its "class" attribute. When you search for a tag that
 matches a certain CSS class, you're matching against `any` of its CSS
 classes::
 
- css_soup = BeautifulSoup('<p class="body strikeout"></p>')
+ css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
 css_soup.find_all("p", class_="strikeout")
 # [<p class="body strikeout"></p>]
 
@@ -1403,20 +1400,20 @@ regular expression`_, `a list`_, `a function`_, or `the value True`_.
 Here are some examples::
 
 soup.find_all(string="Elsie")
- # [u'Elsie']
+ # ['Elsie']
 
 soup.find_all(string=["Tillie", "Elsie", "Lacie"])
- # [u'Elsie', u'Lacie', u'Tillie']
+ # ['Elsie', 'Lacie', 'Tillie']
 
 soup.find_all(string=re.compile("Dormouse"))
- [u"The Dormouse's story", u"The Dormouse's story"]
+ # ["The Dormouse's story", "The Dormouse's story"]
 
 def is_the_only_string_within_a_tag(s):
     """Return True if this string is the only child of its parent tag."""
     return (s == s.parent.string)
 
 soup.find_all(string=is_the_only_string_within_a_tag)
- # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
+ # ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']
 
 Although ``string`` is for finding strings, you can combine it with
 arguments that find tags: Beautiful Soup will find all tags whose
 
@@ -1509,7 +1506,7 @@ These two lines are also equivalent::
 
 ``find()``
 ----------
 
-Signature: find(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
+Method signature: find(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive
 <recursive>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
 
 The ``find_all()`` method scans the entire document looking for
 
@@ -1546,9 +1543,9 @@ names`_? That trick works by repeatedly calling ``find()``::
 
 ``find_parents()`` and ``find_parent()``
 ----------------------------------------
 
-Signature: find_parents(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_parents(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
 
-Signature: find_parent(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_parent(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
 
 I spent a lot of time above covering ``find_all()`` and
 ``find()``. The Beautiful Soup API defines ten other methods for
 
@@ -1564,22 +1561,22 @@ do the opposite: they work their way `up` the tree, looking at a tag's
 (or a string's) parents.
 Let's try them out, starting from a string buried deep in the "three
 daughters" document::
 
- a_string = soup.find(string="Lacie")
- a_string
- # u'Lacie'
+ a_string = soup.find(string="Lacie")
+ a_string
+ # 'Lacie'
 
- a_string.find_parents("a")
- # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+ a_string.find_parents("a")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
 
- a_string.find_parent("p")
- # <p class="story">Once upon a time there were three little sisters; and their names were
- # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
- # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
- # and they lived at the bottom of a well.</p>
+ a_string.find_parent("p")
+ # <p class="story">Once upon a time there were three little sisters; and their names were
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
+ # and they lived at the bottom of a well.</p>
 
- a_string.find_parents("p", class="title")
- # []
+ a_string.find_parents("p", class_="title")
+ # []
 
 One of the three <a> tags is the direct parent of the string in
 question, so our search finds it. One of the three <p> tags is an
 
@@ -1597,9 +1594,9 @@ each one against the provided filter to see if it matches.
 
 ``find_next_siblings()`` and ``find_next_sibling()``
 ----------------------------------------------------
 
-Signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
 
-Signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
 
 These methods use :ref:`.next_siblings <sibling-generators>` to
 iterate over the rest of an element's siblings in the tree. The
 
@@ -1621,9 +1618,9 @@ and ``find_next_sibling()`` only returns the first one::
 
 ``find_previous_siblings()`` and ``find_previous_sibling()``
 ------------------------------------------------------------
 
-Signature: find_previous_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_previous_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
 
-Signature: find_previous_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_previous_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
 
 These methods use :ref:`.previous_siblings <sibling-generators>` to
 iterate over an element's siblings that precede it in the tree.
 The ``find_previous_siblings()``
@@ -1646,9 +1643,9 @@ method returns all the siblings that match, and
 
 ``find_all_next()`` and ``find_next()``
 ---------------------------------------
 
-Signature: find_all_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_all_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
 
-Signature: find_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
 
 These methods use :ref:`.next_elements <element-generators>` to
 iterate over whatever tags and strings that come after it in the
 
@@ -1660,8 +1657,8 @@ document. The ``find_all_next()`` method returns all matches, and
 
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
 
 first_link.find_all_next(string=True)
- # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
- #  u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']
+ # ['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
+ #  ';\nand they lived at the bottom of a well.', '\n', '...', '\n']
 
 first_link.find_next("p")
 # <p class="story">...</p>
 
@@ -1676,9 +1673,9 @@ show up later in the document than the starting element.
 
 ``find_all_previous()`` and ``find_previous()``
 -----------------------------------------------
 
-Signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`)
 
-Signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
+Method signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, :ref:`**kwargs <kwargs>`)
 
 These methods use :ref:`.previous_elements <element-generators>` to
 iterate over the tags and strings that came before it in the
 
@@ -1837,9 +1834,9 @@ selectors.::
 
 soup.select("child")
 # [<ns1:child>I'm in namespace 1</ns1:child>, <ns2:child>I'm in namespace 2</ns2:child>]
 
- soup.select("ns1|child", namespaces=namespaces)
+ soup.select("ns1|child", namespaces=soup.namespaces)
 # [<ns1:child>I'm in namespace 1</ns1:child>]
-
+
 When handling a CSS selector that uses namespaces, Beautiful Soup
 uses the namespace abbreviations it found when parsing the
 document. You can override this by passing in your own dictionary of
 
@@ -1869,7 +1866,7 @@ I covered this earlier, in `Attributes`_, but it bears repeating. You
 can rename a tag, change the values of its attributes, add new
 attributes, and delete attributes::
 
- soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
+ soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
 tag = soup.b
 
 tag.name = "blockquote"
 
@@ -1889,13 +1886,13 @@ Modifying ``.string``
 
 If you set a tag's ``.string`` attribute to a new string, the tag's
 contents are replaced with that string::
 
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
 
- tag = soup.a
- tag.string = "New link text."
- tag
- # <a href="http://example.com/">New link text.</a>
+ tag = soup.a
+ tag.string = "New link text."
+ tag
+ # <a href="http://example.com/">New link text.</a>
 
 Be careful: if the tag contained other tags, they and all their
 contents will be destroyed.
 
@@ -1906,13 +1903,13 @@ contents will be destroyed.
 
 You can add to a tag's contents with ``Tag.append()``. It works just
 like calling ``.append()`` on a Python list::
 
- soup = BeautifulSoup("<a>Foo</a>")
- soup.a.append("Bar")
+ soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
+ soup.a.append("Bar")
 
- soup
- # <html><head></head><body><a>FooBar</a></body></html>
- soup.a.contents
- # [u'Foo', u'Bar']
+ soup
+ # <a>FooBar</a>
+ soup.a.contents
+ # ['Foo', 'Bar']
 
 ``extend()``
 ------------
 
 Starting in Beautiful Soup 4.7.0, ``Tag`` also supports a method
 called ``.extend()``, which works just like calling ``.extend()`` on a
 Python list::
 
- soup = BeautifulSoup("<a>Soup</a>")
- soup.a.extend(["'s", " ", "on"])
+ soup = BeautifulSoup("<a>Soup</a>", 'html.parser')
+ soup.a.extend(["'s", " ", "on"])
 
- soup
- # <html><head></head><body><a>Soup's on</a></body></html>
- soup.a.contents
- # [u'Soup', u''s', u' ', u'on']
+ soup
+ # <a>Soup's on</a>
+ soup.a.contents
+ # ['Soup', ''s', ' ', 'on']
 
 ``NavigableString()`` and ``.new_tag()``
 -------------------------------------------------
 
 If you need to add a string to a document, no problem--you can pass a
 Python string in to ``append()``, or you can call the
 ``NavigableString`` constructor::
 
- soup = BeautifulSoup("<b></b>")
- tag = soup.b
- tag.append("Hello")
- new_string = NavigableString(" there")
- tag.append(new_string)
- tag
- # <b>Hello there.</b>
- tag.contents
- # [u'Hello', u' there']
+ soup = BeautifulSoup("<b></b>", 'html.parser')
+ tag = soup.b
+ tag.append("Hello")
+ new_string = NavigableString(" there")
+ tag.append(new_string)
+ tag
+ # <b>Hello there.</b>
+ tag.contents
+ # ['Hello', ' there']
 
 If you want to create a comment or some other subclass of
 ``NavigableString``, just call the constructor::
 
- from bs4 import Comment
- new_comment = Comment("Nice to see you.")
- tag.append(new_comment)
- tag
- # <b>Hello there<!--Nice to see you.--></b>
- tag.contents
- # [u'Hello', u' there', u'Nice to see you.']
+ from bs4 import Comment
+ new_comment = Comment("Nice to see you.")
+ tag.append(new_comment)
+ tag
+ # <b>Hello there<!--Nice to see you.--></b>
+ tag.contents
+ # ['Hello', ' there', 'Nice to see you.']
 
 `(This is a new feature in Beautiful Soup 4.4.0.)`
 
 What if you need to create a whole new tag? The best solution is to
 call the factory method ``BeautifulSoup.new_tag()``::
 
- soup = BeautifulSoup("<b></b>")
- original_tag = soup.b
+ soup = BeautifulSoup("<b></b>", 'html.parser')
+ original_tag = soup.b
 
- new_tag = soup.new_tag("a", href="http://www.example.com")
- original_tag.append(new_tag)
- original_tag
- # <b><a href="http://www.example.com"></a></b>
+ new_tag = soup.new_tag("a", href="http://www.example.com")
+ original_tag.append(new_tag)
+ original_tag
+ # <b><a href="http://www.example.com"></a></b>
 
- new_tag.string = "Link text."
- original_tag
- # <b><a href="http://www.example.com">Link text.</a></b>
+ new_tag.string = "Link text."
+ original_tag
+ # <b><a href="http://www.example.com">Link text.</a></b>
 
 Only the first argument, the tag name, is required.
 
@@ -1984,15 +1981,15 @@ doesn't necessarily go at the end of its parent's ``.contents``. It'll
 be inserted at whatever numeric position you say.
 It works just like ``.insert()`` on a Python list::
 
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- tag = soup.a
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ tag = soup.a
 
- tag.insert(1, "but did not endorse ")
- tag
- # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
- tag.contents
- # [u'I linked to ', u'but did not endorse', <i>example.com</i>]
+ tag.insert(1, "but did not endorse ")
+ tag
+ # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
+ tag.contents
+ # ['I linked to ', 'but did not endorse', <i>example.com</i>]
 
 ``insert_before()`` and ``insert_after()``
 ------------------------------------------
@@ -2000,36 +1997,36 @@ The ``insert_before()`` method inserts tags or strings immediately
 before something else in the parse tree::
 
- soup = BeautifulSoup("<b>stop</b>")
- tag = soup.new_tag("i")
- tag.string = "Don't"
- soup.b.string.insert_before(tag)
- soup.b
- # <b><i>Don't</i>stop</b>
+ soup = BeautifulSoup("<b>leave</b>", 'html.parser')
+ tag = soup.new_tag("i")
+ tag.string = "Don't"
+ soup.b.string.insert_before(tag)
+ soup.b
+ # <b><i>Don't</i>leave</b>
 
 The ``insert_after()`` method inserts tags or strings immediately
 following something else in the parse tree::
 
- div = soup.new_tag('div')
- div.string = 'ever'
- soup.b.i.insert_after(" you ", div)
- soup.b
- # <b><i>Don't</i> you <div>ever</div> stop</b>
- soup.b.contents
- # [<i>Don't</i>, u' you', <div>ever</div>, u'stop']
+ div = soup.new_tag('div')
+ div.string = 'ever'
+ soup.b.i.insert_after(" you ", div)
+ soup.b
+ # <b><i>Don't</i> you <div>ever</div> leave</b>
+ soup.b.contents
+ # [<i>Don't</i>, ' you', <div>ever</div>, 'leave']
 
 ``clear()``
 -----------
 
 ``Tag.clear()`` removes the contents of a tag::
 
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- tag = soup.a
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ tag = soup.a
 
- tag.clear()
- tag
- # <a href="http://example.com/"></a>
+ tag.clear()
+ tag
+ # <a href="http://example.com/"></a>
 
 ``extract()``
 -------------
@@ -2037,34 +2034,34 @@ following something else in the parse tree::
 
 ``PageElement.extract()`` removes a tag or string from the tree. It
 returns the tag or string that was extracted::
 
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- a_tag = soup.a
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ a_tag = soup.a
 
- i_tag = soup.i.extract()
+ i_tag = soup.i.extract()
 
- a_tag
- # <a href="http://example.com/">I linked to</a>
+ a_tag
+ # <a href="http://example.com/">I linked to</a>
 
- i_tag
- # <i>example.com</i>
+ i_tag
+ # <i>example.com</i>
 
- print(i_tag.parent)
- None
+ print(i_tag.parent)
+ # None
 
 At this point you effectively have two parse trees: one rooted at the
 ``BeautifulSoup`` object you used to parse the document, and one rooted
 at the tag that was extracted.
 You can go on to call ``extract`` on
 a child of the element you extracted::
 
- my_string = i_tag.string.extract()
- my_string
- # u'example.com'
+ my_string = i_tag.string.extract()
+ my_string
+ # 'example.com'
 
- print(my_string.parent)
- # None
- i_tag
- # <i></i>
+ print(my_string.parent)
+ # None
+ i_tag
+ # <i></i>
 
 
 ``decompose()``
 ---------------
@@ -2073,25 +2070,25 @@ a child of the element you extracted::
 
 ``Tag.decompose()`` removes a tag from the tree, then `completely
 destroys it and its contents`::
 
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- a_tag = soup.a
- i_tag = soup.i
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ a_tag = soup.a
+ i_tag = soup.i
 
- i_tag.decompose()
- a_tag
- # <a href="http://example.com/">I linked to</a>
+ i_tag.decompose()
+ a_tag
+ # <a href="http://example.com/">I linked to</a>
 
 The behavior of a decomposed ``Tag`` or ``NavigableString`` is not
 defined and you should not use it for anything. If you're not sure
 whether something has been decomposed, you can check its
 ``.decomposed`` property `(new in Beautiful Soup 4.9.0)`::
 
- i_tag.decomposed
- # True
+ i_tag.decomposed
+ # True
 
- a_tag.decomposed
- # False
+ a_tag.decomposed
+ # False
 
 .. _replace_with():
 
 ``replace_with()``
 ------------------
@@ -2102,16 +2099,16 @@ whether something has been decomposed, you can check its
 
 ``PageElement.replace_with()`` removes a tag or string from the
 tree, and replaces it with the tag or string of your choice::
 
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- a_tag = soup.a
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ a_tag = soup.a
 
- new_tag = soup.new_tag("b")
- new_tag.string = "example.net"
- a_tag.i.replace_with(new_tag)
+ new_tag = soup.new_tag("b")
+ new_tag.string = "example.net"
+ a_tag.i.replace_with(new_tag)
 
- a_tag
- # <a href="http://example.com/">I linked to <b>example.net</b></a>
+ a_tag
+ # <a href="http://example.com/">I linked to <b>example.net</b></a>
 
 ``replace_with()`` returns the tag or string that was replaced, so
 that you can examine it or add it back to another part of the tree.
 
 ``wrap()``
 ----------
@@ -2122,11 +2119,11 @@ that you can examine it or add it back to another part of the tree.
 
 ``PageElement.wrap()`` wraps an element in the tag you specify. It
 returns the new wrapper::
 
- soup = BeautifulSoup("<p>I wish I was bold.</p>")
+ soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
 soup.p.string.wrap(soup.new_tag("b"))
 # <b>I wish I was bold.</b>
 
- soup.p.wrap(soup.new_tag("div")
+ soup.p.wrap(soup.new_tag("div"))
 # <div><p><b>I wish I was bold.</b></p></div>
 
 This method is new in Beautiful Soup 4.0.5.
 
 ``unwrap()``
 ------------
@@ -2137,13 +2134,13 @@ This method is new in Beautiful Soup 4.0.5.
 
 ``Tag.unwrap()`` is the opposite of ``wrap()``. It replaces a tag with
 whatever's inside that tag. It's good for stripping out markup::
 
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- a_tag = soup.a
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ a_tag = soup.a
 
- a_tag.i.unwrap()
- a_tag
- # <a href="http://example.com/">I linked to example.com</a>
+ a_tag.i.unwrap()
+ a_tag
+ # <a href="http://example.com/">I linked to example.com</a>
 
 Like ``replace_with()``, ``unwrap()`` returns the tag
 that was replaced.
@@ -2153,27 +2150,27 @@ that was replaced.
 
 After calling a bunch of methods that modify the parse tree, you
 may end up with two or more ``NavigableString`` objects next to each
 other. Beautiful Soup doesn't have any problems with this, but since it
 can't happen in a freshly parsed document, you might not expect
 behavior like the following::
 
- soup = BeautifulSoup("<p>A one</p>")
- soup.p.append(", a two")
+ soup = BeautifulSoup("<p>A one</p>", 'html.parser')
+ soup.p.append(", a two")
 
- soup.p.contents
- # [u'A one', u', a two']
+ soup.p.contents
+ # ['A one', ', a two']
 
- print(soup.p.encode())
- # <p>A one, a two</p>
+ print(soup.p.encode())
+ # b'<p>A one, a two</p>'
 
- print(soup.p.prettify())
- # <p>
- #  A one
- #  , a two
- # </p>
+ print(soup.p.prettify())
+ # <p>
+ #  A one
+ #  , a two
+ # </p>
 
 You can call ``Tag.smooth()`` to clean up the parse tree by consolidating
 adjacent strings::
 
 soup.smooth()
 soup.p.contents
- # [u'A one, a two']
+ # ['A one, a two']
 
 print(soup.p.prettify())
 # <p>
 #  A one, a two
 # </p>
 
@@ -2194,35 +2191,35 @@ The ``prettify()`` method will turn a Beautiful Soup parse tree into a
 nicely formatted Unicode string, with a separate line for each
 tag and each string::
 
- markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
- soup = BeautifulSoup(markup)
- soup.prettify()
- # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
 
- print(soup.prettify())
- # <html>
- #  <head>
- #  </head>
- #  <body>
- #   <a href="http://example.com/">
- #    I linked to
- #    <i>
- #     example.com
- #    </i>
- #   </a>
- #  </body>
- # </html>
+ markup = '<html><head><body><a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup, 'html.parser')
+ soup.prettify()
+ # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
 
+ print(soup.prettify())
+ # <html>
+ #  <head>
+ #  </head>
+ #  <body>
+ #   <a href="http://example.com/">
+ #    I linked to
+ #    <i>
+ #     example.com
+ #    </i>
+ #   </a>
+ #  </body>
+ # </html>
 
 You can call ``prettify()`` on the top-level ``BeautifulSoup`` object,
 or on any of its ``Tag`` objects::
 
- print(soup.a.prettify())
- # <a href="http://example.com/">
- #  I linked to
- #  <i>
- #   example.com
- #  </i>
- # </a>
+ print(soup.a.prettify())
+ # <a href="http://example.com/">
+ #  I linked to
+ #  <i>
+ #   example.com
+ #  </i>
+ # </a>
 
 Since it adds whitespace (in the form of newlines), ``prettify()``
 changes the meaning of an HTML document and should not be used to
 
@@ -2233,14 +2230,14 @@ Non-pretty printing
 -------------------
 
 If you just want a string, with no fancy formatting, you can call
-``unicode()`` or ``str()`` on a ``BeautifulSoup`` object, or a ``Tag``
-within it::
+``str()`` on a ``BeautifulSoup`` object (``unicode()`` in Python 2),
+or on a ``Tag`` within it::
 
 str(soup)
 # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
 
- unicode(soup.a)
- # u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ str(soup.a)
+ # '<a href="http://example.com/">I linked to <i>example.com</i></a>'
 
 The ``str()`` function returns a string encoded in UTF-8. See
 `Encodings`_ for other options.
@@ -2256,26 +2253,26 @@ Output formatters
 
 If you give Beautiful Soup a document that contains HTML entities like
 "&lquot;", they'll be converted to Unicode characters::
 
- soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
- unicode(soup)
- # u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'
+ soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", 'html.parser')
+ str(soup)
+ # '“Dammit!” he said.'
 
-If you then convert the document to a string, the Unicode characters
+If you then convert the document to a bytestring, the Unicode characters
 will be encoded as UTF-8. You won't get the HTML entities back::
 
- str(soup)
- # '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'
+ soup.encode("utf8")
+ # b'\xe2\x80\x9cDammit!\xe2\x80\x9d he said.'
 
 By default, the only characters that are escaped upon output are bare
 ampersands and angle brackets. These get turned into "&amp;", "&lt;",
 and "&gt;", so that Beautiful Soup doesn't inadvertently generate
 invalid HTML or XML::
 
- soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
+ soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>", 'html.parser')
 soup.p
 # <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
 
- soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
+ soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
 soup.a
 # <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>
 
@@ -2288,56 +2285,44 @@ The default is ``formatter="minimal"``. Strings will only be processed
 enough to ensure that Beautiful Soup generates valid HTML/XML::
 
 french = "<p>Il a dit <<Sacré bleu!>></p>"
- soup = BeautifulSoup(french)
+ soup = BeautifulSoup(french, 'html.parser')
 print(soup.prettify(formatter="minimal"))
- # <html>
- #  <body>
- #   <p>
- #    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
- #   </p>
- #  </body>
- # </html>
+ # <p>
+ #  Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
+ # </p>
 
 If you pass in ``formatter="html"``, Beautiful Soup will convert
 Unicode characters to HTML entities whenever possible::
 
 print(soup.prettify(formatter="html"))
- # <html>
- #  <body>
- #   <p>
- #    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
- #   </p>
- #  </body>
- # </html>
+ # <p>
+ #  Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
+ # </p>
 
 If you pass in ``formatter="html5"``, it's the same as
 ``formatter="html"``, but Beautiful Soup will omit the closing slash in
 HTML void tags like "br"::
 
- soup = BeautifulSoup("<br>")
+ br = BeautifulSoup("<br>", 'html.parser').br
 
- print(soup.encode(formatter="html"))
- # <html><body><br/></body></html>
+ print(br.encode(formatter="html"))
+ # b'<br/>'
 
- print(soup.encode(formatter="html5"))
- # <html><body><br></body></html>
+ print(br.encode(formatter="html5"))
+ # b'<br>'
 
 If you pass in ``formatter=None``, Beautiful Soup will not modify
 strings at all on output.
 This is the fastest option, but it may lead to Beautiful Soup
 generating invalid HTML/XML, as in these examples::
 
 print(soup.prettify(formatter=None))
- # <html>
- #  <body>
- #   <p>
- #    Il a dit <<Sacré bleu!>>
- #   </p>
- #  </body>
- # </html>
+ # <p>
+ #  Il a dit <<Sacré bleu!>>
+ # </p>
 
- link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
+ link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')
 print(link_soup.a.encode(formatter=None))
- # <a href="http://example.com/?foo=val1&bar=val2">A link</a>
+ # b'<a href="http://example.com/?foo=val1&bar=val2">A link</a>'
 
 If you need more sophisticated control over your output, you can
 use Beautiful Soup's ``Formatter`` class. Here's a formatter that
 converts strings to uppercase, whether they occur in a text node or in an
 attribute value::
 
 from bs4.formatter import HTMLFormatter
 def uppercase(str):
     return str.upper()
+
 formatter = HTMLFormatter(uppercase)
 
 print(soup.prettify(formatter=formatter))
- # <html>
- #  <body>
- #   <p>
- #    IL A DIT <<SACRÉ BLEU!>>
- #   </p>
- #  </body>
- # </html>
+ # <p>
+ #  IL A DIT <<SACRÉ BLEU!>>
+ # </p>
 
 print(link_soup.a.prettify(formatter=formatter))
 # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
 #  A link
 # </a>
 
@@ -2367,7 +2349,7 @@ Subclassing ``HTMLFormatter`` or ``XMLFormatter`` will give you even
 more control over the output. For example, Beautiful Soup sorts the
 attributes in every tag by default::
 
- attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>')
+ attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>', 'html.parser')
 print(attr_soup.p.encode())
 # <p a="3" m="2" z="1"></p>
 
@@ -2380,8 +2362,9 @@ whenever it appears::
 
 def attributes(self, tag):
     for k, v in tag.attrs.items():
         if k == 'm':
-            continue
+            continue
         yield k, v
+
 print(attr_soup.p.encode(formatter=UnsortedAttributes()))
 # <p z="1" a="3"></p>
 
@@ -2393,9 +2376,9 @@ all the strings in the document or something, but it will ignore the
 return value::
 
 from bs4.element import CData
- soup = BeautifulSoup("<a></a>")
+ soup = BeautifulSoup("<a></a>", 'html.parser')
 soup.a.string = CData("one < three")
- print(soup.a.prettify(formatter="xml"))
+ print(soup.a.prettify(formatter="html"))
 # <a>
 #  <![CDATA[one < three]]>
 # </a>
 
@@ -2408,31 +2391,31 @@ If you only want the human-readable text inside a document or tag, you
It returns all the text in a document or beneath a tag, as a single Unicode string:: - markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>' - soup = BeautifulSoup(markup) + markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>' + soup = BeautifulSoup(markup, 'html.parser') - soup.get_text() - u'\nI linked to example.com\n' - soup.i.get_text() - u'example.com' + soup.get_text() + '\nI linked to example.com\n' + soup.i.get_text() + 'example.com' You can specify a string to be used to join the bits of text together:: # soup.get_text("|") - u'\nI linked to |example.com|\n' + '\nI linked to |example.com|\n' You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:: # soup.get_text("|", strip=True) - u'I linked to|example.com' + 'I linked to|example.com' But at that point you might want to use the :ref:`.stripped_strings <string-generators>` generator instead, and process the text yourself:: [text for text in soup.stripped_strings] - # [u'I linked to', u'example.com'] + # ['I linked to', 'example.com'] *As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> @@ -2549,11 +2532,11 @@ or UTF-8. But when you load that document into Beautiful Soup, you'll discover it's been converted to Unicode:: markup = "<h1>Sacr\xc3\xa9 bleu!</h1>" - soup = BeautifulSoup(markup) + soup = BeautifulSoup(markup, 'html.parser') soup.h1 # <h1>Sacré bleu!</h1> soup.h1.string - # u'Sacr\xe9 bleu!' + # 'Sacr\xe9 bleu!' It's not magic. (That sure would be nice.) Beautiful Soup uses a sub-library called `Unicode, Dammit`_ to detect a document's encoding @@ -2575,29 +2558,29 @@ Unicode, Dammit can't get a lock on it, and misidentifies it as ISO-8859-7:: markup = b"<h1>\xed\xe5\xec\xf9</h1>" - soup = BeautifulSoup(markup) - soup.h1 - <h1>νεμω</h1> - soup.original_encoding - 'ISO-8859-7' + soup = BeautifulSoup(markup, 'html.parser') + print(soup.h1) + # <h1>νεμω</h1> + print(soup.original_encoding) + # iso-8859-7 We can fix this by passing in the correct ``from_encoding``:: - soup = BeautifulSoup(markup, from_encoding="iso-8859-8") - soup.h1 - <h1>םולש</h1> - soup.original_encoding - 'iso8859-8' + soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8") + print(soup.h1) + # <h1>םולש</h1> + print(soup.original_encoding) + # iso8859-8 If you don't know what the correct encoding is, but you know that Unicode, Dammit is guessing wrong, you can pass the wrong guesses in as ``exclude_encodings``:: - soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"]) - soup.h1 - <h1>םולש</h1> - soup.original_encoding - 'WINDOWS-1255' + soup = BeautifulSoup(markup, 'html.parser', exclude_encodings=["iso-8859-7"]) + print(soup.h1) + # <h1>םולש</h1> + print(soup.original_encoding) + # WINDOWS-1255 Windows-1255 isn't 100% correct, but that encoding is a compatible superset of ISO-8859-8, so it's close enough. 
 (``exclude_encodings``
@@ -2633,7 +2616,7 @@ document written in the Latin-1 encoding::
 
 </html>
 '''
- soup = BeautifulSoup(markup)
+ soup = BeautifulSoup(markup, 'html.parser')
 print(soup.prettify())
 # <html>
 #  <head>
 
@@ -2661,17 +2644,17 @@ You can also call encode() on the ``BeautifulSoup`` object, or any
 element in the soup, just as if it were a Python string::
 
 soup.p.encode("latin-1")
- # '<p>Sacr\xe9 bleu!</p>'
+ # b'<p>Sacr\xe9 bleu!</p>'
 
 soup.p.encode("utf-8")
- # '<p>Sacr\xc3\xa9 bleu!</p>'
+ # b'<p>Sacr\xc3\xa9 bleu!</p>'
 
 Any characters that can't be represented in your chosen encoding will
 be converted into numeric XML entity references. Here's a document
 that includes the Unicode character SNOWMAN::
 
 markup = u"<b>\N{SNOWMAN}</b>"
- snowman_soup = BeautifulSoup(markup)
+ snowman_soup = BeautifulSoup(markup, 'html.parser')
 tag = snowman_soup.b
 
@@ -2679,13 +2662,13 @@ The SNOWMAN character can be part of a UTF-8 document (it looks like
 ☃) but there's no representation for that character in ISO-Latin-1 or
 ASCII, so it's converted into "&#9731;" for those encodings::
 
 print(tag.encode("utf-8"))
- # <b>☃</b>
+ # b'<b>\xe2\x98\x83</b>'
 
- print tag.encode("latin-1")
- # <b>&#9731;</b>
+ print(tag.encode("latin-1"))
+ # b'<b>&#9731;</b>'
 
- print tag.encode("ascii")
- # <b>&#9731;</b>
+ print(tag.encode("ascii"))
+ # b'<b>&#9731;</b>'
 
 Unicode, Dammit
 ---------------
 
@@ -2725,15 +2708,15 @@ entities::
 
 markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
 
 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
- # u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'
+ # '<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'
 
 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
- # u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'
+ # '<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'
 
 You can also convert Microsoft smart quotes to ASCII quotes::
 
 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
- # u'<p>I just "love" Microsoft Word\'s smart quotes</p>'
+ # '<p>I just "love" Microsoft Word\'s smart quotes</p>'
 
 Hopefully you'll find this feature useful, but Beautiful Soup doesn't
 use it. Beautiful Soup prefers the default behavior, which is to
 convert Microsoft smart quotes to Unicode characters along with
 everything else::
 
 UnicodeDammit(markup, ["windows-1252"]).unicode_markup
- # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'
+ # '<p>I just “love” Microsoft Word’s smart quotes</p>'
 
 Inconsistent encodings
 ^^^^^^^^^^^^^^^^^^^^^^
 
@@ -2798,31 +2781,31 @@ the original document each Tag was found. You can access this
 information as ``Tag.sourceline`` (line number) and ``Tag.sourcepos``
 (position of the start tag within a line)::
 
- markup = "<p\n>Paragraph 1</p>\n <p>Paragraph 2</p>"
- soup = BeautifulSoup(markup, 'html.parser')
- for tag in soup.find_all('p'):
-     print(tag.sourceline, tag.sourcepos, tag.string)
- # (1, 0, u'Paragraph 1')
- # (2, 3, u'Paragraph 2')
+ markup = "<p\n>Paragraph 1</p>\n <p>Paragraph 2</p>"
+ soup = BeautifulSoup(markup, 'html.parser')
+ for tag in soup.find_all('p'):
+     print(repr((tag.sourceline, tag.sourcepos, tag.string)))
+ # (1, 0, 'Paragraph 1')
+ # (3, 4, 'Paragraph 2')
 
 Note that the two parsers mean slightly different things by
 ``sourceline`` and ``sourcepos``. For html.parser, these numbers
 represent the position of the initial less-than sign.
 For html5lib,
 these numbers represent the position of the final greater-than sign::
 
- soup = BeautifulSoup(markup, 'html5lib')
- for tag in soup.find_all('p'):
-     print(tag.sourceline, tag.sourcepos, tag.string)
- # (2, 1, u'Paragraph 1')
- # (3, 7, u'Paragraph 2')
+ soup = BeautifulSoup(markup, 'html5lib')
+ for tag in soup.find_all('p'):
+     print(repr((tag.sourceline, tag.sourcepos, tag.string)))
+ # (2, 0, 'Paragraph 1')
+ # (3, 6, 'Paragraph 2')
 
 You can shut off this feature by passing ``store_line_numbers=False`
 into the ``BeautifulSoup`` constructor::
 
- markup = "<p\n>Paragraph 1</p>\n <p>Paragraph 2</p>"
- soup = BeautifulSoup(markup, 'html.parser', store_line_numbers=False)
- soup.p.sourceline
- # None
+ markup = "<p\n>Paragraph 1</p>\n <p>Paragraph 2</p>"
+ soup = BeautifulSoup(markup, 'html.parser', store_line_numbers=False)
+ print(soup.p.sourceline)
+ # None
 
 `This feature is new in 4.8.1, and the parsers based on lxml
 don't support it.`
 
@@ -2839,16 +2822,16 @@ in different parts of the object tree, because they both look like
 
 markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
 soup = BeautifulSoup(markup, 'html.parser')
 first_b, second_b = soup.find_all('b')
- print first_b == second_b
+ print(first_b == second_b)
 # True
 
- print first_b.previous_element == second_b.previous_element
+ print(first_b.previous_element == second_b.previous_element)
 # False
 
 If you want to see whether two variables refer to exactly the same
 object, use `is`::
 
- print first_b is second_b
+ print(first_b is second_b)
 # False
 
 Copying Beautiful Soup objects
 ------------------------------
 
@@ -2859,23 +2842,23 @@ You can use ``copy.copy()`` to create a copy of any ``Tag`` or
 
 import copy
 p_copy = copy.copy(soup.p)
- print p_copy
+ print(p_copy)
 # <p>I want <b>pizza</b> and more <b>pizza</b>!</p>
 
 The copy is considered equal to the original, since it represents the
 same markup as the original, but it's not the same object::
 
- print soup.p == p_copy
+ print(soup.p == p_copy)
 # True
 
- print soup.p is p_copy
+ print(soup.p is p_copy)
 # False
 
 The only real difference is that the copy is completely detached from
 the original Beautiful Soup object tree, just as if ``extract()`` had
 been called on it::
 
- print p_copy.parent
+ print(p_copy.parent)
 # None
 
 This is because two different ``Tag`` objects can't occupy the same
 
@@ -2922,7 +2905,7 @@ three ``SoupStrainer`` objects::
 
 only_tags_with_id_link2 = SoupStrainer(id="link2")
 
 def is_short_string(string):
-    return len(string) < 10
+    return string is not None and len(string) < 10
 
 only_short_strings = SoupStrainer(string=is_short_string)
 
@@ -2930,8 +2913,7 @@ I'm going to bring back the "three sisters" document one more time,
 and we'll see what the document looks like when it's parsed with
 these three ``SoupStrainer`` objects::
 
- html_doc = """
- <html><head><title>The Dormouse's story</title></head>
+ html_doc = """<html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 
@@ -2973,10 +2955,10 @@ You can also pass a ``SoupStrainer`` into any of the methods covered
 in `Searching the tree`_.
@@ -2973,10 +2955,10 @@ You can also pass a ``SoupStrainer`` into any of the methods covered
 in `Searching the tree`_. This probably isn't terribly useful, but I
 thought I'd mention it::

-    soup = BeautifulSoup(html_doc)
+    soup = BeautifulSoup(html_doc, 'html.parser')
     soup.find_all(only_short_strings)
-    # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
-    #  u'\n\n', u'...', u'\n']
+    # ['\n\n', '\n\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
+    #  '\n\n', '...', '\n']

 Customizing multi-valued attributes
 -----------------------------------

@@ -2985,22 +2967,22 @@ In an HTML document, an attribute like ``class`` is given a list of
 values, and an attribute like ``id`` is given a single value, because
 the HTML specification treats those attributes differently::

-    markup = '<a class="cls1 cls2" id="id1 id2">'
-    soup = BeautifulSoup(markup)
-    soup.a['class']
-    # ['cls1', 'cls2']
-    soup.a['id']
-    # 'id1 id2'
+    markup = '<a class="cls1 cls2" id="id1 id2">'
+    soup = BeautifulSoup(markup, 'html.parser')
+    soup.a['class']
+    # ['cls1', 'cls2']
+    soup.a['id']
+    # 'id1 id2'

 You can turn this off by passing in ``multi_valued_attributes=None``.
 Then all attributes will be given a single value::

-    soup = BeautifulSoup(markup, multi_valued_attributes=None)
-    soup.a['class']
-    # 'cls1 cls2'
-    soup.a['id']
-    # 'id1 id2'
+    soup = BeautifulSoup(markup, 'html.parser', multi_valued_attributes=None)
+    soup.a['class']
+    # 'cls1 cls2'
+    soup.a['id']
+    # 'id1 id2'

 You can customize this behavior quite a bit by passing in a
 dictionary for ``multi_valued_attributes``. If you need this, look at

@@ -3018,38 +3000,38 @@ When using the ``html.parser`` parser, you can use the
 Beautiful Soup does when it encounters a tag that defines the same
 attribute more than once::

-    markup = '<a href="http://url1/" href="http://url2/">'
+    markup = '<a href="http://url1/" href="http://url2/">'

 The default behavior is to use the last value found for the tag::

-    soup = BeautifulSoup(markup, 'html.parser')
-    soup.a['href']
-    # http://url2/
+    soup = BeautifulSoup(markup, 'html.parser')
+    soup.a['href']
+    # http://url2/

-    soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='replace')
-    soup.a['href']
-    # http://url2/
+    soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='replace')
+    soup.a['href']
+    # http://url2/

 With ``on_duplicate_attribute='ignore'`` you can tell Beautiful Soup
 to use the `first` value found and ignore the rest::

-    soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='ignore')
-    soup.a['href']
-    # http://url1/
+    soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='ignore')
+    soup.a['href']
+    # http://url1/

 (lxml and html5lib always do it this way; their behavior can't be
 configured from within Beautiful Soup.)

 If you need more, you can pass in a function that's called on each
 duplicate value::

-    def accumulate(attributes_so_far, key, value):
-        if not isinstance(attributes_so_far[key], list):
-            attributes_so_far[key] = [attributes_so_far[key]]
-        attributes_so_far[key].append(value)
+    def accumulate(attributes_so_far, key, value):
+        if not isinstance(attributes_so_far[key], list):
+            attributes_so_far[key] = [attributes_so_far[key]]
+        attributes_so_far[key].append(value)

-    soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute=accumulate)
-    soup.a['href']
-    # ["http://url1/", "http://url2/"]
+    soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute=accumulate)
+    soup.a['href']
+    # ["http://url1/", "http://url2/"]

 `(This is a new feature in Beautiful Soup 4.9.1.)`
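A note on the ``multi_valued_attributes`` dictionary mentioned in the
previous section: the sketch below assumes the dictionary maps a tag
name (or ``'*'``, meaning any tag) to a list of attribute names that
should be treated as multi-valued. That format is an assumption here,
modeled on the default policy Beautiful Soup ships for HTML::

    from bs4 import BeautifulSoup

    # Assumed format: '*' applies to every tag; other keys name specific tags.
    multi = {'*': ['class'], 'b': ['rel']}
    markup = '<b class="cls1 cls2" rel="r1 r2" id="id1 id2">'
    soup = BeautifulSoup(markup, 'html.parser', multi_valued_attributes=multi)
    soup.b['class']
    # ['cls1', 'cls2']
    soup.b['rel']
    # ['r1', 'r2']
    soup.b['id']
    # 'id1 id2'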
@@ -3062,26 +3044,28 @@ contain that information.

 Instead of that default behavior, you can tell Beautiful Soup to
 instantiate `subclasses` of ``Tag`` or ``NavigableString``, subclasses
 you define with custom behavior::

-    from bs4 import Tag, NavigableString
-    class MyTag(Tag):
-        pass
-
-    class MyString(NavigableString):
-        pass
-
-    markup = "<div>some text</div>"
-    soup = BeautifulSoup(markup)
-    isinstance(soup.div, MyTag)
-    # False
-    isinstance(soup.div.string, MyString)
-    # False
-
-    my_classes = { Tag: MyTag, NavigableString: MyString }
-    soup = BeautifulSoup(markup, element_classes=my_classes)
-    isinstance(soup.div, MyTag)
-    # True
-    isinstance(soup.div.string, MyString)
-    # True
+    from bs4 import Tag, NavigableString
+    class MyTag(Tag):
+        pass
+
+
+    class MyString(NavigableString):
+        pass
+
+
+    markup = "<div>some text</div>"
+    soup = BeautifulSoup(markup, 'html.parser')
+    isinstance(soup.div, MyTag)
+    # False
+    isinstance(soup.div.string, MyString)
+    # False
+
+    my_classes = { Tag: MyTag, NavigableString: MyString }
+    soup = BeautifulSoup(markup, 'html.parser', element_classes=my_classes)
+    isinstance(soup.div, MyTag)
+    # True
+    isinstance(soup.div.string, MyString)
+    # True

 This can be useful when incorporating Beautiful Soup into a test
 framework.

@@ -3105,6 +3089,7 @@ missing a parser that Beautiful Soup could be using::

     from bs4.diagnose import diagnose
     with open("bad.html") as fp:
         data = fp.read()
+
     diagnose(data)

     # Diagnostic running on Beautiful Soup 4.2.0

@@ -3154,7 +3139,7 @@ Version mismatch problems
 -------------------------

 * ``SyntaxError: Invalid syntax`` (on the line ``ROOT_TAG_NAME =
-  u'[document]'``): Caused by running the Python 2 version of
+  '[document]'``): Caused by running the Python 2 version of
   Beautiful Soup under Python 3, without converting the code.

 * ``ImportError: No module named HTMLParser`` - Caused by running the

@@ -3210,7 +3195,7 @@ Miscellaneous
 -------------

 * ``UnicodeEncodeError: 'charmap' codec can't encode character
-  u'\xfoo' in position bar`` (or just about any other
+  '\xfoo' in position bar`` (or just about any other
   ``UnicodeEncodeError``) - This problem shows up in two main
   situations. First, when you try to print a Unicode character that
   your console doesn't know how to display. (See `this page on the

@@ -3222,8 +3207,8 @@ Miscellaneous

 * ``KeyError: [attr]`` - Caused by accessing ``tag['attr']`` when the
   tag in question doesn't define the ``attr`` attribute. The most
-  common errors are ``KeyError: 'href'`` and ``KeyError:
-  'class'``. Use ``tag.get('attr')`` if you're not sure ``attr`` is
+  common errors are ``KeyError: 'href'`` and ``KeyError: 'class'``.
+  Use ``tag.get('attr')`` if you're not sure ``attr`` is
   defined, just as you would with a Python dictionary.

 * ``AttributeError: 'ResultSet' object has no attribute 'foo'`` - This

@@ -3323,11 +3308,11 @@ Most code written against Beautiful Soup 3 will work against Beautiful
 Soup 4 with one simple change. All you should have to do is change the
 package name from ``BeautifulSoup`` to ``bs4``. So this::

-    from BeautifulSoup import BeautifulSoup
+    from BeautifulSoup import BeautifulSoup

 becomes this::

-    from bs4 import BeautifulSoup
+    from bs4 import BeautifulSoup

 * If you get the ``ImportError`` "No module named BeautifulSoup", your
   problem is that you're trying to run Beautiful Soup 3 code, but you