From bd479f6ba3ed9db76d26cf36f12f1e9744f85ce4 Mon Sep 17 00:00:00 2001 From: Leonard Richardson Date: Wed, 29 Jul 2020 22:43:48 -0400 Subject: Ran through all of the documentation code examples using Python 3, corrected discrepancies and errors, and updated representations. --- doc/source/index.rst | 931 +++++++++++++++++++++++++-------------------------- 1 file changed, 458 insertions(+), 473 deletions(-) (limited to 'doc') diff --git a/doc/source/index.rst b/doc/source/index.rst index f655327..76a32e9 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -54,8 +54,7 @@ Quick Start Here's an HTML document I'll be using as an example throughout this document. It's part of a story from `Alice in Wonderland`:: - html_doc = """ - The Dormouse's story + html_doc = """The Dormouse's story

The Dormouse's story

@@ -186,7 +185,7 @@ works on Python 2 and Python 3. Make sure you use the right version of :kbd:`$ pip install beautifulsoup4` -(The ``BeautifulSoup`` package is probably `not` what you want. That's +(The ``BeautifulSoup`` package is `not` what you want. That's the previous major release, `Beautiful Soup 3`_. Lots of software uses BS3, so it's still available, but if you're writing new code you should install ``beautifulsoup4``.) @@ -307,14 +306,14 @@ constructor. You can pass in a string or an open filehandle:: from bs4 import BeautifulSoup with open("index.html") as fp: - soup = BeautifulSoup(fp) + soup = BeautifulSoup(fp, 'html.parser') - soup = BeautifulSoup("a web page") + soup = BeautifulSoup("a web page", 'html.parser') First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:: - print(BeautifulSoup("Sacré bleu!")) + print(BeautifulSoup("Sacré bleu!", "html.parser")) # Sacré bleu! Beautiful Soup then parses the document using the best available @@ -336,7 +335,7 @@ and ``Comment``. A ``Tag`` object corresponds to an XML or HTML tag in the original document:: - soup = BeautifulSoup('Extremely bold') + soup = BeautifulSoup('Extremely bold', 'html.parser') tag = soup.b type(tag) # @@ -351,7 +350,7 @@ Name Every tag has a name, accessible as ``.name``:: tag.name - # u'b' + # 'b' If you change a tag's name, the change will be reflected in any HTML markup generated by Beautiful Soup:: @@ -368,13 +367,14 @@ id="boldest">`` has an attribute "id" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:: + tag = BeautifulSoup('bold', 'html.parser').b tag['id'] - # u'boldest' + # 'boldest' You can access that dictionary directly as ``.attrs``:: tag.attrs - # {u'id': 'boldest'} + # {'id': 'boldest'} You can add, remove, and modify a tag's attributes. 
Again, this is done by treating the tag as a dictionary:: @@ -387,11 +387,11 @@ done by treating the tag as a dictionary:: del tag['id'] del tag['another-attribute'] tag - # + # bold tag['id'] # KeyError: 'id' - print(tag.get('id')) + tag.get('id') # None .. _multivalue: @@ -406,26 +406,26 @@ one CSS class). Others include ``rel``, ``rev``, ``accept-charset``, ``headers``, and ``accesskey``. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:: - css_soup = BeautifulSoup('

') + css_soup = BeautifulSoup('

', 'html.parser') css_soup.p['class'] - # ["body"] + # ['body'] - css_soup = BeautifulSoup('

') + css_soup = BeautifulSoup('

', 'html.parser') css_soup.p['class'] - # ["body", "strikeout"] + # ['body', 'strikeout'] If an attribute `looks` like it has more than one value, but it's not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:: - id_soup = BeautifulSoup('

') + id_soup = BeautifulSoup('

', 'html.parser') id_soup.p['id'] # 'my id' When you turn a tag back into a string, multiple attribute values are consolidated:: - rel_soup = BeautifulSoup('

Back to the homepage

') + rel_soup = BeautifulSoup('

Back to the homepage

', 'html.parser') rel_soup.a['rel'] # ['index'] rel_soup.a['rel'] = ['index', 'contents'] @@ -435,34 +435,34 @@ consolidated:: You can disable this by passing ``multi_valued_attributes=None`` as a keyword argument into the ``BeautifulSoup`` constructor:: - no_list_soup = BeautifulSoup('

', 'html', multi_valued_attributes=None) - no_list_soup.p['class'] - # u'body strikeout' + no_list_soup = BeautifulSoup('

', 'html.parser', multi_valued_attributes=None) + no_list_soup.p['class'] + # 'body strikeout' You can use ``get_attribute_list`` to get a value that's always a list, whether or not it's a multi-valued attribute:: - id_soup.p.get_attribute_list('id') - # ["my id"] + id_soup.p.get_attribute_list('id') + # ["my id"] If you parse a document as XML, there are no multi-valued attributes:: xml_soup = BeautifulSoup('

', 'xml') xml_soup.p['class'] - # u'body strikeout' + # 'body strikeout' Again, you can configure this using the ``multi_valued_attributes`` argument:: - class_is_multi= { '*' : 'class'} - xml_soup = BeautifulSoup('

', 'xml', multi_valued_attributes=class_is_multi) - xml_soup.p['class'] - # [u'body', u'strikeout'] + class_is_multi= { '*' : 'class'} + xml_soup = BeautifulSoup('

', 'xml', multi_valued_attributes=class_is_multi) + xml_soup.p['class'] + # ['body', 'strikeout'] You probably won't need to do this, but if you do, use the defaults as a guide. They implement the rules described in the HTML specification:: - from bs4.builder import builder_registry - builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES + from bs4.builder import builder_registry + builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES ``NavigableString`` @@ -471,28 +471,31 @@ a guide. They implement the rules described in the HTML specification:: A string corresponds to a bit of text within a tag. Beautiful Soup uses the ``NavigableString`` class to contain these bits of text:: + soup = BeautifulSoup('Extremely bold', 'html.parser') + tag = soup.b tag.string - # u'Extremely bold' + # 'Extremely bold' type(tag.string) # A ``NavigableString`` is just like a Python Unicode string, except that it also supports some of the features described in `Navigating the tree`_ and `Searching the tree`_. You can convert a -``NavigableString`` to a Unicode string with ``unicode()``:: +``NavigableString`` to a Unicode string with ``unicode()`` (in +Python 2) or ``str`` (in Python 3):: - unicode_string = unicode(tag.string) + unicode_string = str(tag.string) unicode_string - # u'Extremely bold' + # 'Extremely bold' type(unicode_string) - # + # You can't edit a string in place, but you can replace one string with another, using :ref:`replace_with()`:: tag.string.replace_with("No longer bold") tag - #
No longer bold
+ # No longer bold ``NavigableString`` supports most of the features described in `Navigating the tree`_ and `Searching the tree`_, but not all of @@ -518,13 +521,13 @@ You can also pass a ``BeautifulSoup`` object into one of the methods defined in `Modifying the tree`_, just as you would a :ref:`Tag`. This lets you do things like combine two parsed documents:: - doc = BeautifulSoup("INSERT FOOTER HEREHere's the footer", "xml") - doc.find(text="INSERT FOOTER HERE").replace_with(footer) - # u'INSERT FOOTER HERE' - print(doc) - # - #
Here's the footer
+ doc = BeautifulSoup("INSERT FOOTER HEREHere's the footer", "xml") + doc.find(text="INSERT FOOTER HERE").replace_with(footer) + # 'INSERT FOOTER HERE' + print(doc) + # + #
Here's the footer
Since the ``BeautifulSoup`` object doesn't correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it's @@ -532,7 +535,7 @@ useful to look at its ``.name``, so it's been given the special ``.name`` "[document]":: soup.name - # u'[document]' + # '[document]' Comments and other special strings ---------------------------------- @@ -543,7 +546,7 @@ leftover bits. The main one you'll probably encounter is the comment:: markup = "" - soup = BeautifulSoup(markup) + soup = BeautifulSoup(markup, 'html.parser') comment = soup.b.string type(comment) # @@ -551,7 +554,7 @@ is the comment:: The ``Comment`` object is just a special type of ``NavigableString``:: comment - # u'Hey, buddy. Want to buy a used parser' + # 'Hey, buddy. Want to buy a used parser' But when it appears as part of an HTML document, a ``Comment`` is displayed with special formatting:: @@ -666,13 +669,13 @@ A tag's children are available in a list called ``.contents``:: # The Dormouse's story head_tag.contents - [The Dormouse's story] + # [The Dormouse's story] title_tag = head_tag.contents[0] title_tag # The Dormouse's story title_tag.contents - # [u'The Dormouse's story'] + # ['The Dormouse's story'] The ``BeautifulSoup`` object itself has children. In this case, the tag is the child of the ``BeautifulSoup`` object.:: @@ -680,7 +683,7 @@ The ``BeautifulSoup`` object itself has children. In this case, the len(soup.contents) # 1 soup.contents[0].name - # u'html' + # 'html' A string does not have ``.contents``, because it can't contain anything:: @@ -725,7 +728,7 @@ descendants:: len(list(soup.children)) # 1 len(list(soup.descendants)) - # 25 + # 26 .. 
_.string: @@ -736,7 +739,7 @@ If a tag has only one child, and that child is a ``NavigableString``, the child is made available as ``.string``:: title_tag.string - # u'The Dormouse's story' + # 'The Dormouse's story' If a tag's only child is another tag, and `that` tag has a ``.string``, then the parent tag is considered to have the same @@ -746,7 +749,7 @@ If a tag's only child is another tag, and `that` tag has a # [The Dormouse's story] head_tag.string - # u'The Dormouse's story' + # 'The Dormouse's story' If a tag contains more than one thing, then it's not clear what ``.string`` should refer to, so ``.string`` is defined to be @@ -765,36 +768,38 @@ just the strings. Use the ``.strings`` generator:: for string in soup.strings: print(repr(string)) - # u"The Dormouse's story" - # u'\n\n' - # u"The Dormouse's story" - # u'\n\n' - # u'Once upon a time there were three little sisters; and their names were\n' - # u'Elsie' - # u',\n' - # u'Lacie' - # u' and\n' - # u'Tillie' - # u';\nand they lived at the bottom of a well.' - # u'\n\n' - # u'...' - # u'\n' + '\n' + # "The Dormouse's story" + # '\n' + # '\n' + # "The Dormouse's story" + # '\n' + # 'Once upon a time there were three little sisters; and their names were\n' + # 'Elsie' + # ',\n' + # 'Lacie' + # ' and\n' + # 'Tillie' + # ';\nand they lived at the bottom of a well.' + # '\n' + # '...' + # '\n' These strings tend to have a lot of extra whitespace, which you can remove by using the ``.stripped_strings`` generator instead:: for string in soup.stripped_strings: print(repr(string)) - # u"The Dormouse's story" - # u"The Dormouse's story" - # u'Once upon a time there were three little sisters; and their names were' - # u'Elsie' - # u',' - # u'Lacie' - # u'and' - # u'Tillie' - # u';\nand they lived at the bottom of a well.' - # u'...' 
+ # "The Dormouse's story" + # "The Dormouse's story" + # 'Once upon a time there were three little sisters; and their names were' + # 'Elsie' + # ',' + # 'Lacie' + # 'and' + # 'Tillie' + # ';\n and they lived at the bottom of a well.' + # '...' Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed. @@ -851,25 +856,19 @@ buried deep within the document, to the very top of the document:: link # Elsie for parent in link.parents: - if parent is None: - print(parent) - else: - print(parent.name) + print(parent.name) # p # body # html # [document] - # None Going sideways -------------- Consider a simple document like this:: - sibling_soup = BeautifulSoup("text1text2") + sibling_soup = BeautifulSoup("text1text2", 'html.parser') print(sibling_soup.prettify()) - # - # # # # text1 @@ -878,8 +877,6 @@ Consider a simple document like this:: # text2 # # - # - # The tag and the tag are at the same level: they're both direct children of the same tag. We call them `siblings`. When a document is @@ -912,7 +909,7 @@ The strings "text1" and "text2" are `not` siblings, because they don't have the same parent:: sibling_soup.b.string - # u'text1' + # 'text1' print(sibling_soup.b.string.next_sibling) # None @@ -921,9 +918,9 @@ In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a tag will usually be a string containing whitespace. Going back to the "three sisters" document:: - Elsie - Lacie - Tillie + # Elsie + # Lacie + # Tillie You might think that the ``.next_sibling`` of the first tag would be the second tag. 
But actually, it's a string: the comma and @@ -934,7 +931,7 @@ newline that separate the first tag from the second:: # Elsie link.next_sibling - # u',\n' + # ',\n ' The second tag is actually the ``.next_sibling`` of the comma:: @@ -951,29 +948,27 @@ You can iterate over a tag's siblings with ``.next_siblings`` or for sibling in soup.a.next_siblings: print(repr(sibling)) - # u',\n' + # ',\n' # Lacie - # u' and\n' + # ' and\n' # Tillie - # u'; and they lived at the bottom of a well.' - # None + # '; and they lived at the bottom of a well.' for sibling in soup.find(id="link3").previous_siblings: print(repr(sibling)) # ' and\n' # Lacie - # u',\n' + # ',\n' # Elsie - # u'Once upon a time there were three little sisters; and their names were\n' - # None + # 'Once upon a time there were three little sisters; and their names were\n' Going back and forth -------------------- Take a look at the beginning of the "three sisters" document:: - The Dormouse's story -

The Dormouse's story

+ # The Dormouse's story + #

The Dormouse's story

An HTML parser takes this string of characters and turns it into a series of events: "open an tag", "open a tag", "open a @@ -999,14 +994,14 @@ interrupted by the start of the tag.:: # Tillie last_a_tag.next_sibling - # '; and they lived at the bottom of a well.' + # ';\nand they lived at the bottom of a well.' But the ``.next_element`` of that tag, the thing that was parsed immediately after the tag, is `not` the rest of that sentence: it's the word "Tillie":: last_a_tag.next_element - # u'Tillie' + # 'Tillie' That's because in the original markup, the word "Tillie" appeared before that semicolon. The parser encountered an tag, then the @@ -1019,7 +1014,7 @@ The ``.previous_element`` attribute is the exact opposite of immediately before this one:: last_a_tag.previous_element - # u' and\n' + # ' and\n' last_a_tag.previous_element.next_element # Tillie @@ -1031,13 +1026,12 @@ forward or backward in the document as it was parsed:: for element in last_a_tag.next_elements: print(repr(element)) - # u'Tillie' - # u';\nand they lived at the bottom of a well.' - # u'\n\n' + # 'Tillie' + # ';\nand they lived at the bottom of a well.' + # '\n' #

...

- # u'...' - # u'\n' - # None + # '...' + # '\n' Searching the tree ================== @@ -1188,8 +1182,10 @@ If you pass in a function to filter on a specific attribute like value, not the whole tag. Here's a function that finds all ``a`` tags whose ``href`` attribute *does not* match a regular expression:: + import re def not_lacie(href): return href and not re.compile("lacie").search(href) + soup.find_all(href=not_lacie) # [Elsie, # Tillie] @@ -1204,7 +1200,8 @@ objects:: and isinstance(tag.previous_element, NavigableString)) for tag in soup.find_all(surrounded_by_strings): - print tag.name + print(tag.name) + # body # p # a # a @@ -1216,7 +1213,7 @@ Now we're ready to look at the search methods in detail. ``find_all()`` -------------- -Signature: find_all(:ref:`name `, :ref:`attrs `, :ref:`recursive +Method signature: find_all(:ref:`name `, :ref:`attrs `, :ref:`recursive `, :ref:`string `, :ref:`limit `, :ref:`**kwargs `) The ``find_all()`` method looks through a tag's descendants and @@ -1239,7 +1236,7 @@ examples in `Kinds of filters`_, but here are a few more:: import re soup.find(string=re.compile("sisters")) - # u'Once upon a time there were three little sisters; and their names were\n' + # 'Once upon a time there were three little sisters; and their names were\n' Some of these should look familiar, but others are new. What does it mean to pass in a value for ``string``, or ``id``? Why does @@ -1297,12 +1294,12 @@ You can filter multiple attributes at once by passing in more than one keyword argument:: soup.find_all(href=re.compile("elsie"), id='link1') - # [three] + # [Elsie] Some attributes, like the data-* attributes in HTML 5, have names that can't be used as the names of keyword arguments:: - data_soup = BeautifulSoup('
foo!
') + data_soup = BeautifulSoup('
foo!
', 'html.parser') data_soup.find_all(data-foo="value") # SyntaxError: keyword can't be an expression @@ -1318,7 +1315,7 @@ because Beautiful Soup uses the ``name`` argument to contain the name of the tag itself. Instead, you can give a value to 'name' in the ``attrs`` argument:: - name_soup = BeautifulSoup('') + name_soup = BeautifulSoup('', 'html.parser') name_soup.find_all(name="email") # [] name_soup.find_all(attrs={"name": "email"}) @@ -1359,7 +1356,7 @@ values for its "class" attribute. When you search for a tag that matches a certain CSS class, you're matching against `any` of its CSS classes:: - css_soup = BeautifulSoup('

') + css_soup = BeautifulSoup('

', 'html.parser') css_soup.find_all("p", class_="strikeout") # [

] @@ -1403,20 +1400,20 @@ regular expression`_, `a list`_, `a function`_, or `the value True`_. Here are some examples:: soup.find_all(string="Elsie") - # [u'Elsie'] + # ['Elsie'] soup.find_all(string=["Tillie", "Elsie", "Lacie"]) - # [u'Elsie', u'Lacie', u'Tillie'] + # ['Elsie', 'Lacie', 'Tillie'] soup.find_all(string=re.compile("Dormouse")) - [u"The Dormouse's story", u"The Dormouse's story"] + # ["The Dormouse's story", "The Dormouse's story"] def is_the_only_string_within_a_tag(s): """Return True if this string is the only child of its parent tag.""" return (s == s.parent.string) soup.find_all(string=is_the_only_string_within_a_tag) - # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...'] + # ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...'] Although ``string`` is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose @@ -1509,7 +1506,7 @@ These two lines are also equivalent:: ``find()`` ---------- -Signature: find(:ref:`name `, :ref:`attrs `, :ref:`recursive +Method signature: find(:ref:`name `, :ref:`attrs `, :ref:`recursive `, :ref:`string `, :ref:`**kwargs `) The ``find_all()`` method scans the entire document looking for @@ -1546,9 +1543,9 @@ names`_? That trick works by repeatedly calling ``find()``:: ``find_parents()`` and ``find_parent()`` ---------------------------------------- -Signature: find_parents(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`limit `, :ref:`**kwargs `) +Method signature: find_parents(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`limit `, :ref:`**kwargs `) -Signature: find_parent(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`**kwargs `) +Method signature: find_parent(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`**kwargs `) I spent a lot of time above covering ``find_all()`` and ``find()``. 
The Beautiful Soup API defines ten other methods for @@ -1564,22 +1561,22 @@ do the opposite: they work their way `up` the tree, looking at a tag's (or a string's) parents. Let's try them out, starting from a string buried deep in the "three daughters" document:: - a_string = soup.find(string="Lacie") - a_string - # u'Lacie' + a_string = soup.find(string="Lacie") + a_string + # 'Lacie' - a_string.find_parents("a") - # [Lacie] + a_string.find_parents("a") + # [Lacie] - a_string.find_parent("p") - #

Once upon a time there were three little sisters; and their names were - # Elsie, - # Lacie and - # Tillie; - # and they lived at the bottom of a well.

+ a_string.find_parent("p") + #

Once upon a time there were three little sisters; and their names were + # Elsie, + # Lacie and + # Tillie; + # and they lived at the bottom of a well.

- a_string.find_parents("p", class="title") - # [] + a_string.find_parents("p", class_="title") + # [] One of the three tags is the direct parent of the string in question, so our search finds it. One of the three

tags is an @@ -1597,9 +1594,9 @@ each one against the provided filter to see if it matches. ``find_next_siblings()`` and ``find_next_sibling()`` ---------------------------------------------------- -Signature: find_next_siblings(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`limit `, :ref:`**kwargs `) +Method signature: find_next_siblings(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`limit `, :ref:`**kwargs `) -Signature: find_next_sibling(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`**kwargs `) +Method signature: find_next_sibling(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`**kwargs `) These methods use :ref:`.next_siblings ` to iterate over the rest of an element's siblings in the tree. The @@ -1621,9 +1618,9 @@ and ``find_next_sibling()`` only returns the first one:: ``find_previous_siblings()`` and ``find_previous_sibling()`` ------------------------------------------------------------ -Signature: find_previous_siblings(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`limit `, :ref:`**kwargs `) +Method signature: find_previous_siblings(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`limit `, :ref:`**kwargs `) -Signature: find_previous_sibling(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`**kwargs `) +Method signature: find_previous_sibling(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`**kwargs `) These methods use :ref:`.previous_siblings ` to iterate over an element's siblings that precede it in the tree. 
The ``find_previous_siblings()`` @@ -1646,9 +1643,9 @@ method returns all the siblings that match, and ``find_all_next()`` and ``find_next()`` --------------------------------------- -Signature: find_all_next(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`limit `, :ref:`**kwargs `) +Method signature: find_all_next(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`limit `, :ref:`**kwargs `) -Signature: find_next(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`**kwargs `) +Method signature: find_next(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`**kwargs `) These methods use :ref:`.next_elements ` to iterate over whatever tags and strings that come after it in the @@ -1660,8 +1657,8 @@ document. The ``find_all_next()`` method returns all matches, and # Elsie first_link.find_all_next(string=True) - # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie', - # u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n'] + # ['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', + # ';\nand they lived at the bottom of a well.', '\n', '...', '\n'] first_link.find_next("p") #

...

@@ -1676,9 +1673,9 @@ show up later in the document than the starting element. ``find_all_previous()`` and ``find_previous()`` ----------------------------------------------- -Signature: find_all_previous(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`limit `, :ref:`**kwargs `) +Method signature: find_all_previous(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`limit `, :ref:`**kwargs `) -Signature: find_previous(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`**kwargs `) +Method signature: find_previous(:ref:`name `, :ref:`attrs `, :ref:`string `, :ref:`**kwargs `) These methods use :ref:`.previous_elements ` to iterate over the tags and strings that came before it in the @@ -1837,9 +1834,9 @@ selectors.:: soup.select("child") # [I'm in namespace 1, I'm in namespace 2] - soup.select("ns1|child", namespaces=namespaces) + soup.select("ns1|child", namespaces=soup.namespaces) # [I'm in namespace 1] - + When handling a CSS selector that uses namespaces, Beautiful Soup uses the namespace abbreviations it found when parsing the document. You can override this by passing in your own dictionary of @@ -1869,7 +1866,7 @@ I covered this earlier, in `Attributes`_, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes:: - soup = BeautifulSoup('Extremely bold') + soup = BeautifulSoup('Extremely bold', 'html.parser') tag = soup.b tag.name = "blockquote" @@ -1889,13 +1886,13 @@ Modifying ``.string`` If you set a tag's ``.string`` attribute to a new string, the tag's contents are replaced with that string:: - markup = 'I linked to example.com' - soup = BeautifulSoup(markup) + markup = 'I linked to example.com' + soup = BeautifulSoup(markup, 'html.parser') - tag = soup.a - tag.string = "New link text." - tag - # New link text. + tag = soup.a + tag.string = "New link text." + tag + # New link text. Be careful: if the tag contained other tags, they and all their contents will be destroyed. 
@@ -1906,13 +1903,13 @@ contents will be destroyed. You can add to a tag's contents with ``Tag.append()``. It works just like calling ``.append()`` on a Python list:: - soup = BeautifulSoup("Foo") - soup.a.append("Bar") + soup = BeautifulSoup("Foo", 'html.parser') + soup.a.append("Bar") - soup - # FooBar - soup.a.contents - # [u'Foo', u'Bar'] + soup + # FooBar + soup.a.contents + # ['Foo', 'Bar'] ``extend()`` ------------ Starting in Beautiful Soup 4.7.0, ``Tag`` also supports a method called ``.extend()``, which works just like calling ``.extend()`` on a Python list:: - soup = BeautifulSoup("Soup") - soup.a.extend(["'s", " ", "on"]) + soup = BeautifulSoup("Soup", 'html.parser') + soup.a.extend(["'s", " ", "on"]) - soup - # Soup's on - soup.a.contents - # [u'Soup', u''s', u' ', u'on'] + soup + # Soup's on + soup.a.contents + # ['Soup', "'s", ' ', 'on'] ``NavigableString()`` and ``.new_tag()`` ------------------------------------------------- If you need to add a string to a document, no problem--you can pass a Python string in to ``append()``, or you can call the ``NavigableString`` constructor:: - soup = BeautifulSoup("") - tag = soup.b - tag.append("Hello") - new_string = NavigableString(" there") - tag.append(new_string) - tag - # Hello there. - tag.contents - # [u'Hello', u' there'] + soup = BeautifulSoup("", 'html.parser') + tag = soup.b + tag.append("Hello") + new_string = NavigableString(" there") + tag.append(new_string) + tag + # Hello there. 
+ tag.contents + # ['Hello', ' there'] If you want to create a comment or some other subclass of ``NavigableString``, just call the constructor:: - from bs4 import Comment - new_comment = Comment("Nice to see you.") - tag.append(new_comment) - tag - # Hello there - tag.contents - # [u'Hello', u' there', u'Nice to see you.'] + from bs4 import Comment + new_comment = Comment("Nice to see you.") + tag.append(new_comment) + tag + # Hello there + tag.contents + # ['Hello', ' there', 'Nice to see you.'] `(This is a new feature in Beautiful Soup 4.4.0.)` What if you need to create a whole new tag? The best solution is to call the factory method ``BeautifulSoup.new_tag()``:: - soup = BeautifulSoup("") - original_tag = soup.b + soup = BeautifulSoup("", 'html.parser') + original_tag = soup.b - new_tag = soup.new_tag("a", href="http://www.example.com") - original_tag.append(new_tag) - original_tag - # + new_tag = soup.new_tag("a", href="http://www.example.com") + original_tag.append(new_tag) + original_tag + # - new_tag.string = "Link text." - original_tag - # Link text. + new_tag.string = "Link text." + original_tag + # Link text. Only the first argument, the tag name, is required. @@ -1984,15 +1981,15 @@ doesn't necessarily go at the end of its parent's ``.contents``. It'll be inserted at whatever numeric position you say. 
It works just like ``.insert()`` on a Python list:: - markup = 'I linked to example.com' - soup = BeautifulSoup(markup) - tag = soup.a + markup = 'I linked to example.com' + soup = BeautifulSoup(markup, 'html.parser') + tag = soup.a - tag.insert(1, "but did not endorse ") - tag - # I linked to but did not endorse example.com - tag.contents - # [u'I linked to ', u'but did not endorse', example.com] + tag.insert(1, "but did not endorse ") + tag + # I linked to but did not endorse example.com + tag.contents + # ['I linked to ', 'but did not endorse', example.com] ``insert_before()`` and ``insert_after()`` ------------------------------------------ @@ -2000,36 +1997,36 @@ say. It works just like ``.insert()`` on a Python list:: The ``insert_before()`` method inserts tags or strings immediately before something else in the parse tree:: - soup = BeautifulSoup("stop") - tag = soup.new_tag("i") - tag.string = "Don't" - soup.b.string.insert_before(tag) - soup.b - # Don'tstop + soup = BeautifulSoup("leave", 'html.parser') + tag = soup.new_tag("i") + tag.string = "Don't" + soup.b.string.insert_before(tag) + soup.b + # Don'tleave The ``insert_after()`` method inserts tags or strings immediately following something else in the parse tree:: - div = soup.new_tag('div') - div.string = 'ever' - soup.b.i.insert_after(" you ", div) - soup.b - # Don't you
ever
stop
- soup.b.contents - # [Don't, u' you',
ever
, u'stop'] + div = soup.new_tag('div') + div.string = 'ever' + soup.b.i.insert_after(" you ", div) + soup.b + # Don't you
ever
leave
+ soup.b.contents + # [Don't, ' you',
ever
, 'leave'] ``clear()`` ----------- ``Tag.clear()`` removes the contents of a tag:: - markup = 'I linked to example.com' - soup = BeautifulSoup(markup) - tag = soup.a + markup = 'I linked to example.com' + soup = BeautifulSoup(markup, 'html.parser') + tag = soup.a - tag.clear() - tag - # + tag.clear() + tag + # ``extract()`` ------------- @@ -2037,34 +2034,34 @@ following something else in the parse tree:: ``PageElement.extract()`` removes a tag or string from the tree. It returns the tag or string that was extracted:: - markup = 'I linked to example.com' - soup = BeautifulSoup(markup) - a_tag = soup.a + markup = 'I linked to example.com' + soup = BeautifulSoup(markup, 'html.parser') + a_tag = soup.a - i_tag = soup.i.extract() + i_tag = soup.i.extract() - a_tag - # I linked to + a_tag + # I linked to - i_tag - # example.com + i_tag + # example.com - print(i_tag.parent) - None + print(i_tag.parent) + # None At this point you effectively have two parse trees: one rooted at the ``BeautifulSoup`` object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call ``extract`` on a child of the element you extracted:: - my_string = i_tag.string.extract() - my_string - # u'example.com' + my_string = i_tag.string.extract() + my_string + # 'example.com' - print(my_string.parent) - # None - i_tag - # + print(my_string.parent) + # None + i_tag + # ``decompose()`` @@ -2073,25 +2070,25 @@ a child of the element you extracted:: ``Tag.decompose()`` removes a tag from the tree, then `completely destroys it and its contents`:: - markup = 'I linked to example.com' - soup = BeautifulSoup(markup) - a_tag = soup.a - i_tag = soup.i + markup = 'I linked to example.com' + soup = BeautifulSoup(markup, 'html.parser') + a_tag = soup.a + i_tag = soup.i - i_tag.decompose() - a_tag - # I linked to + i_tag.decompose() + a_tag + # I linked to The behavior of a decomposed ``Tag`` or ``NavigableString`` is not defined and you should not use it for anything. 
If you're not sure whether something has been decomposed, you can check its ``.decomposed`` property `(new in Beautiful Soup 4.9.0)`:: - i_tag.decomposed - # True + i_tag.decomposed + # True - a_tag.decomposed - # False + a_tag.decomposed + # False .. _replace_with(): @@ -2102,16 +2099,16 @@ whether something has been decomposed, you can check its ``PageElement.replace_with()`` removes a tag or string from the tree, and replaces it with the tag or string of your choice:: - markup = 'I linked to example.com' - soup = BeautifulSoup(markup) - a_tag = soup.a + markup = 'I linked to example.com' + soup = BeautifulSoup(markup, 'html.parser') + a_tag = soup.a - new_tag = soup.new_tag("b") - new_tag.string = "example.net" - a_tag.i.replace_with(new_tag) + new_tag = soup.new_tag("b") + new_tag.string = "example.net" + a_tag.i.replace_with(new_tag) - a_tag - # I linked to example.net + a_tag + # I linked to example.net ``replace_with()`` returns the tag or string that was replaced, so that you can examine it or add it back to another part of the tree. @@ -2122,11 +2119,11 @@ that you can examine it or add it back to another part of the tree. ``PageElement.wrap()`` wraps an element in the tag you specify. It returns the new wrapper:: - soup = BeautifulSoup("

I wish I was bold.

") + soup = BeautifulSoup("

I wish I was bold.

", 'html.parser') soup.p.string.wrap(soup.new_tag("b")) # I wish I was bold. - soup.p.wrap(soup.new_tag("div") + soup.p.wrap(soup.new_tag("div")) #

I wish I was bold.

This method is new in Beautiful Soup 4.0.5. @@ -2137,13 +2134,13 @@ This method is new in Beautiful Soup 4.0.5. ``Tag.unwrap()`` is the opposite of ``wrap()``. It replaces a tag with whatever's inside that tag. It's good for stripping out markup:: - markup = 'I linked to example.com' - soup = BeautifulSoup(markup) - a_tag = soup.a + markup = 'I linked to example.com' + soup = BeautifulSoup(markup, 'html.parser') + a_tag = soup.a - a_tag.i.unwrap() - a_tag - # I linked to example.com + a_tag.i.unwrap() + a_tag + # I linked to example.com Like ``replace_with()``, ``unwrap()`` returns the tag that was replaced. @@ -2153,27 +2150,27 @@ that was replaced. After calling a bunch of methods that modify the parse tree, you may end up with two or more ``NavigableString`` objects next to each other. Beautiful Soup doesn't have any problems with this, but since it can't happen in a freshly parsed document, you might not expect behavior like the following:: - soup = BeautifulSoup("

A one

") - soup.p.append(", a two") + soup = BeautifulSoup("

A one

", 'html.parser') + soup.p.append(", a two") - soup.p.contents - # [u'A one', u', a two'] + soup.p.contents + # ['A one', ', a two'] - print(soup.p.encode()) - #

A one, a two

+ print(soup.p.encode()) + # b'

A one, a two

' - print(soup.p.prettify()) - #

- # A one - # , a two - #

+ print(soup.p.prettify()) + #

+ # A one + # , a two + #

You can call ``Tag.smooth()`` to clean up the parse tree by consolidating adjacent strings:: soup.smooth() soup.p.contents - # [u'A one, a two'] + # ['A one, a two'] print(soup.p.prettify()) #

@@ -2194,35 +2191,35 @@ The ``prettify()`` method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:: - markup = 'I linked to example.com' - soup = BeautifulSoup(markup) - soup.prettify() - # '\n \n \n \n \n...' - - print(soup.prettify()) - # - # - # - # - # - # I linked to - # - # example.com - # - # - # - # + markup = 'I linked to example.com' + soup = BeautifulSoup(markup, 'html.parser') + soup.prettify() + # '\n \n \n \n \n...' + + print(soup.prettify()) + # + # + # + # + # + # I linked to + # + # example.com + # + # + # + # You can call ``prettify()`` on the top-level ``BeautifulSoup`` object, or on any of its ``Tag`` objects:: - print(soup.a.prettify()) - # - # I linked to - # - # example.com - # - # + print(soup.a.prettify()) + # + # I linked to + # + # example.com + # + # Since it adds whitespace (in the form of newlines), ``prettify()`` changes the meaning of an HTML document and should not be used to @@ -2233,14 +2230,14 @@ Non-pretty printing ------------------- If you just want a string, with no fancy formatting, you can call -``unicode()`` or ``str()`` on a ``BeautifulSoup`` object, or a ``Tag`` -within it:: +``str()`` on a ``BeautifulSoup`` object (``unicode()`` in Python 2), +or on a ``Tag`` within it:: str(soup) # 'I linked to example.com' - unicode(soup.a) - # u'I linked to example.com' + str(soup.a) + # 'I linked to example.com' The ``str()`` function returns a string encoded in UTF-8. See `Encodings`_ for other options. @@ -2256,26 +2253,26 @@ Output formatters If you give Beautiful Soup a document that contains HTML entities like "&lquot;", they'll be converted to Unicode characters:: - soup = BeautifulSoup("“Dammit!” he said.") - unicode(soup) - # u'\u201cDammit!\u201d he said.' + soup = BeautifulSoup("“Dammit!” he said.", 'html.parser') + str(soup) + # '“Dammit!” he said.' 
-If you then convert the document to a string +If you then convert the document to a bytestring, the Unicode characters will be encoded as UTF-8. You won't get the HTML entities back:: - str(soup) - # '\xe2\x80\x9cDammit!\xe2\x80\x9d he said.' + soup.encode("utf8") + # b'\xe2\x80\x9cDammit!\xe2\x80\x9d he said.' By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into "&amp;", "&lt;", and "&gt;", so that Beautiful Soup doesn't inadvertently generate invalid HTML or XML:: - soup = BeautifulSoup("

The law firm of Dewey, Cheatem, & Howe

") + soup = BeautifulSoup("

The law firm of Dewey, Cheatem, & Howe

", 'html.parser') soup.p #

The law firm of Dewey, Cheatem, & Howe

- soup = BeautifulSoup('A link') + soup = BeautifulSoup('A link', 'html.parser') soup.a # A link @@ -2288,56 +2285,44 @@ The default is ``formatter="minimal"``. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:: french = "

Il a dit <<Sacré bleu!>>

" - soup = BeautifulSoup(french) + soup = BeautifulSoup(french, 'html.parser') print(soup.prettify(formatter="minimal")) - # - # - #

- # Il a dit <<Sacré bleu!>> - #

- # - # + #

+ # Il a dit <<Sacré bleu!>> + #

If you pass in ``formatter="html"``, Beautiful Soup will convert Unicode characters to HTML entities whenever possible:: print(soup.prettify(formatter="html")) - # - # - #

- # Il a dit <<Sacré bleu!>> - #

- # - # + #

+ # Il a dit <<Sacré bleu!>> + #

If you pass in ``formatter="html5"``, it's the same as ``formatter="html"``, but Beautiful Soup will omit the closing slash in HTML void tags like "br":: - soup = BeautifulSoup("
") + br = BeautifulSoup("
", 'html.parser').br - print(soup.encode(formatter="html")) - #
+ print(br.encode(formatter="html")) + # b'
' - print(soup.encode(formatter="html5")) - #
+ print(br.encode(formatter="html5")) + # b'
' If you pass in ``formatter=None``, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:: print(soup.prettify(formatter=None)) - # - # - #

- # Il a dit <> - #

- # - # + #

+ # Il a dit <> + #

- link_soup = BeautifulSoup('A link') + link_soup = BeautifulSoup('A link', 'html.parser') print(link_soup.a.encode(formatter=None)) - # A link + # b'A link' If you need more sophisticated control over your output, you can use Beautiful Soup's ``Formatter`` class. Here's a formatter that @@ -2347,16 +2332,13 @@ attribute value:: from bs4.formatter import HTMLFormatter def uppercase(str): return str.upper() + formatter = HTMLFormatter(uppercase) print(soup.prettify(formatter=formatter)) - # - # - #

- # IL A DIT <> - #

- # - # + #

+ # IL A DIT <> + #

print(link_soup.a.prettify(formatter=formatter)) # @@ -2367,7 +2349,7 @@ Subclassing ``HTMLFormatter`` or ``XMLFormatter`` will give you even more control over the output. For example, Beautiful Soup sorts the attributes in every tag by default:: - attr_soup = BeautifulSoup(b'

') + attr_soup = BeautifulSoup(b'

', 'html.parser') print(attr_soup.p.encode()) #

@@ -2380,8 +2362,9 @@ whenever it appears:: def attributes(self, tag): for k, v in tag.attrs.items(): if k == 'm': - continue + continue yield k, v + print(attr_soup.p.encode(formatter=UnsortedAttributes())) #

@@ -2393,9 +2376,9 @@ all the strings in the document or something, but it will ignore the return value:: from bs4.element import CData - soup = BeautifulSoup("
") + soup = BeautifulSoup("", 'html.parser') soup.a.string = CData("one < three") - print(soup.a.prettify(formatter="xml")) + print(soup.a.prettify(formatter="html")) # # # @@ -2408,31 +2391,31 @@ If you only want the human-readable text inside a document or tag, you can use t ``get_text()`` method. It returns all the text in a document or beneath a tag, as a single Unicode string:: - markup = '\nI linked to example.com\n' - soup = BeautifulSoup(markup) + markup = '\nI linked to example.com\n' + soup = BeautifulSoup(markup, 'html.parser') - soup.get_text() - u'\nI linked to example.com\n' - soup.i.get_text() - u'example.com' + soup.get_text() + '\nI linked to example.com\n' + soup.i.get_text() + 'example.com' You can specify a string to be used to join the bits of text together:: # soup.get_text("|") - u'\nI linked to |example.com|\n' + '\nI linked to |example.com|\n' You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:: # soup.get_text("|", strip=True) - u'I linked to|example.com' + 'I linked to|example.com' But at that point you might want to use the :ref:`.stripped_strings ` generator instead, and process the text yourself:: [text for text in soup.stripped_strings] - # [u'I linked to', u'example.com'] + # ['I linked to', 'example.com'] *As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of