-rw-r--r--  doc/source/conf.py   |   2
-rw-r--r--  doc/source/index.rst | 307
2 files changed, 180 insertions(+), 129 deletions(-)
diff --git a/doc/source/conf.py b/doc/source/conf.py index 7ba53ac..e32d6b8 100644 --- a/doc/source/conf.py +++ b/doc/source/conf.py @@ -41,7 +41,7 @@ master_doc = 'index' # General information about the project. project = u'Beautiful Soup' -copyright = u'2004-2020, Leonard Richardson' +copyright = u'2004-2023, Leonard Richardson' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the diff --git a/doc/source/index.rst b/doc/source/index.rst index 37ec7d8..a916413 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -3,6 +3,8 @@ Beautiful Soup Documentation ============================ +.. py:module:: bs4 + .. image:: 6.1.jpg :align: right :alt: "The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself." @@ -70,7 +72,7 @@ document. It's part of a story from `Alice in Wonderland`:: """ Running the "three sisters" document through Beautiful Soup gives us a -``BeautifulSoup`` object, which represents the document as a nested +:py:class:`BeautifulSoup` object, which represents the document as a nested data structure:: from bs4 import BeautifulSoup @@ -184,7 +186,7 @@ right version of ``pip`` or ``easy_install`` for your Python version :kbd:`$ pip install beautifulsoup4` -(The ``BeautifulSoup`` package is `not` what you want. That's +(The :py:class:`BeautifulSoup` package is `not` what you want. That's the previous major release, `Beautiful Soup 3`_. Lots of software uses BS3, so it's still available, but if you're writing new code you should install ``beautifulsoup4``.) @@ -201,7 +203,7 @@ package the entire library with your application. You can download the tarball, copy its ``bs4`` directory into your application's codebase, and use Beautiful Soup without installing it at all. -I use Python 3.8 to develop Beautiful Soup, but it should work with +I use Python 3.10 to develop Beautiful Soup, but it should work with other recent versions. .. _parser-installation: @@ -254,10 +256,7 @@ This table summarizes the advantages and disadvantages of each parser library: | | | * Creates valid HTML5 | | +----------------------+--------------------------------------------+--------------------------------+--------------------------+ -If you can, I recommend you install and use lxml for speed. If you're -using a very old version of Python -- earlier than 3.2.2 -- it's -`essential` that you install lxml or html5lib. Python's built-in HTML -parser is just not very good in those old versions. +If you can, I recommend you install and use lxml for speed. Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See `Differences @@ -266,7 +265,7 @@ between parsers`_ for details. Making the soup =============== -To parse a document, pass it into the ``BeautifulSoup`` +To parse a document, pass it into the :py:class:`BeautifulSoup` constructor. You can pass in a string or an open filehandle:: from bs4 import BeautifulSoup @@ -291,15 +290,14 @@ Kinds of objects Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four -`kinds` of objects: ``Tag``, ``NavigableString``, ``BeautifulSoup``, -and ``Comment``. +`kinds` of objects: :py:class:`Tag`, :py:class:`NavigableString`, :py:class:`BeautifulSoup`, +and :py:class:`Comment`. -.. _Tag: +.. 
py:class:: Tag -``Tag`` -------- +A :py:class:`Tag` object corresponds to an XML or HTML tag in the original document. -A ``Tag`` object corresponds to an XML or HTML tag in the original document:: +:: soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser') tag = soup.b @@ -311,7 +309,7 @@ in `Navigating the tree`_ and `Searching the tree`_. For now, the most important features of a tag are its name and attributes. Name -^^^^ +---- Every tag has a name, accessible as ``.name``:: @@ -326,7 +324,7 @@ markup generated by Beautiful Soup:: # <blockquote class="boldest">Extremely bold</blockquote> Attributes -^^^^^^^^^^ +---------- A tag may have any number of attributes. The tag ``<b id="boldest">`` has an attribute "id" whose value is @@ -363,7 +361,7 @@ done by treating the tag as a dictionary:: .. _multivalue: Multi-valued attributes -&&&&&&&&&&&&&&&&&&&&&&& +----------------------- HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common @@ -400,7 +398,7 @@ consolidated:: You can force all attributes to be parsed as strings by passing ``multi_valued_attributes=None`` as a keyword argument into the -``BeautifulSoup`` constructor:: +:py:class:`BeautifulSoup` constructor:: no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None) no_list_soup.p['class'] @@ -432,11 +430,12 @@ a guide. They implement the rules described in the HTML specification:: builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES -``NavigableString`` -------------------- +.. py:class:: NavigableString + +----------------------------- A string corresponds to a bit of text within a tag. Beautiful Soup -uses the ``NavigableString`` class to contain these bits of text:: +uses the :py:class:`NavigableString` class to contain these bits of text:: soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser') tag = soup.b @@ -445,10 +444,10 @@ uses the ``NavigableString`` class to contain these bits of text:: type(tag.string) # <class 'bs4.element.NavigableString'> -A ``NavigableString`` is just like a Python Unicode string, except +A :py:class:`NavigableString` is just like a Python Unicode string, except that it also supports some of the features described in `Navigating the tree`_ and `Searching the tree`_. You can convert a -``NavigableString`` to a Unicode string with ``str``:: +:py:class:`NavigableString` to a Unicode string with ``str``:: unicode_string = str(tag.string) unicode_string @@ -463,27 +462,28 @@ another, using :ref:`replace_with()`:: tag # <b class="boldest">No longer bold</b> -``NavigableString`` supports most of the features described in +:py:class:`NavigableString` supports most of the features described in `Navigating the tree`_ and `Searching the tree`_, but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support the ``.contents`` or ``.string`` attributes, or the ``find()`` method. -If you want to use a ``NavigableString`` outside of Beautiful Soup, +If you want to use a :py:class:`NavigableString` outside of Beautiful Soup, you should call ``unicode()`` on it to turn it into a normal Python Unicode string. If you don't, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you're done using Beautiful Soup. This is a big waste of memory. -``BeautifulSoup`` ------------------ +.. 
py:class:: BeautifulSoup + +--------------------------- -The ``BeautifulSoup`` object represents the parsed document as a +The :py:class:`BeautifulSoup` object represents the parsed document as a whole. For most purposes, you can treat it as a :ref:`Tag` object. This means it supports most of the methods described in `Navigating the tree`_ and `Searching the tree`_. -You can also pass a ``BeautifulSoup`` object into one of the methods +You can also pass a :py:class:`BeautifulSoup` object into one of the methods defined in `Modifying the tree`_, just as you would a :ref:`Tag`. This lets you do things like combine two parsed documents:: @@ -495,7 +495,7 @@ lets you do things like combine two parsed documents:: # <?xml version="1.0" encoding="utf-8"?> # <document><content/><footer>Here's the footer</footer></document> -Since the ``BeautifulSoup`` object doesn't correspond to an actual +Since the :py:class:`BeautifulSoup` object doesn't correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it's useful to look at its ``.name``, so it's been given the special ``.name`` "[document]":: @@ -503,13 +503,17 @@ useful to look at its ``.name``, so it's been given the special soup.name # '[document]' -Comments and other special strings ----------------------------------- +Comments +-------- + +:py:class:`Tag`, :py:class:`NavigableString`, and +:py:class:`BeautifulSoup` cover almost everything you'll see in an +HTML or XML file, but there are a few leftover bits. The main one +you'll probably encounter is the :py:class:`Comment`. -``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost -everything you'll see in an HTML or XML file, but there are a few -leftover bits. The main one you'll probably encounter -is the comment:: +.. py:class:: Comment + +:: markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>" soup = BeautifulSoup(markup, 'html.parser') @@ -517,12 +521,12 @@ is the comment:: type(comment) # <class 'bs4.element.Comment'> -The ``Comment`` object is just a special type of ``NavigableString``:: +The :py:class:`Comment` object is just a special type of :py:class:`NavigableString`:: comment # 'Hey, buddy. Want to buy a used parser' -But when it appears as part of an HTML document, a ``Comment`` is +But when it appears as part of an HTML document, a :py:class:`Comment` is displayed with special formatting:: print(soup.b.prettify()) @@ -530,32 +534,64 @@ displayed with special formatting:: # <!--Hey, buddy. Want to buy a used parser?--> # </b> -Beautiful Soup also defines classes called ``Stylesheet``, ``Script``, -and ``TemplateString``, for embedded CSS stylesheets (any strings -found inside a ``<style>`` tag), embedded Javascript (any strings -found in a ``<script>`` tag), and HTML templates (any strings inside a -``<template>`` tag). These classes work exactly the same way as -``NavigableString``; their only purpose is to make it easier to pick -out the main body of the page, by ignoring strings that represent -something else. `(These classes are new in Beautiful Soup 4.9.0, and -the html5lib parser doesn't use them.)` +Special strings for HTML documents +---------------------------------- -Beautiful Soup defines classes for anything else that might show up in -an XML document: ``CData``, ``ProcessingInstruction``, -``Declaration``, and ``Doctype``. Like ``Comment``, these classes -are subclasses of ``NavigableString`` that add something extra to the -string. 
Here's an example that replaces the comment with a CDATA -block:: +Beautiful Soup defines a few :py:class:`NavigableString` subclasses to +contain strings found inside specific HTML tags. This makes it easier +to pick out the main body of the page, by ignoring strings that +probably represent programming directives found within the +page. `(These classes are new in Beautiful Soup 4.9.0, and the +html5lib parser doesn't use them.)` - from bs4 import CData - cdata = CData("A CDATA block") - comment.replace_with(cdata) +.. py:class:: Stylesheet + +A :py:class:`NavigableString` subclass that represents embedded CSS +stylesheets; that is, any strings found inside a ``<style>`` tag +during document parsing. + +.. py:class:: Script + +A :py:class:`NavigableString` subclass that represents embedded +Javascript; that is, any strings found inside a ``<script>`` tag +during document parsing. + +.. py:class:: Template + +A :py:class:`NavigableString` subclass that represents embedded HTML +templates; that is, any strings found inside a ``<template>`` tag during +document parsing. + +Special strings for XML documents +--------------------------------- + +Beautiful Soup defines some :py:class:`NavigableString` classes for +holding special types of strings that can be found in XML +documents. Like :py:class:`Comment`, these classes are subclasses of +:py:class:`NavigableString` that add something extra to the string on +output. + +.. py:class:: Declaration + +A :py:class:`NavigableString` subclass representing the `declaration +<https://www.w3.org/TR/REC-xml/#sec-prolog-dtd>`_ at the beginning of +an XML document. + +.. py:class:: Doctype + +A :py:class:`NavigableString` subclass representing the `document type +declaration <https://www.w3.org/TR/REC-xml/#dt-doctype>`_ which may +be found near the beginning of an XML document. + +.. py:class:: CData + +A :py:class:`NavigableString` subclass that represents a `CData section <https://www.w3.org/TR/REC-xml/#sec-cdata-sect>`_. + +.. py:class:: ProcessingInstruction + +A :py:class:`NavigableString` subclass that represents the contents +of an `XML processing instruction <https://www.w3.org/TR/REC-xml/#sec-pi>`_. - print(soup.b.prettify()) - # <b> - # <![CDATA[A CDATA block]]> - # </b> - Navigating the tree =================== @@ -643,8 +679,8 @@ A tag's children are available in a list called ``.contents``:: title_tag.contents # ['The Dormouse's story'] -The ``BeautifulSoup`` object itself has children. In this case, the -<html> tag is the child of the ``BeautifulSoup`` object.:: +The :py:class:`BeautifulSoup` object itself has children. In this case, the +<html> tag is the child of the :py:class:`BeautifulSoup` object.:: len(soup.contents) # 1 @@ -693,7 +729,7 @@ its direct children, and so on:: # The Dormouse's story The <head> tag has only one child, but it has two descendants: the -<title> tag and the <title> tag's child. The ``BeautifulSoup`` object +<title> tag and the <title> tag's child. 
The :py:class:`BeautifulSoup` object only has one direct child (the <html> tag), but it has a whole lot of descendants:: @@ -707,7 +743,7 @@ descendants:: ``.string`` ^^^^^^^^^^^ -If a tag has only one child, and that child is a ``NavigableString``, +If a tag has only one child, and that child is a :py:class:`NavigableString`, the child is made available as ``.string``:: title_tag.string @@ -803,14 +839,14 @@ it:: title_tag.string.parent # <title>The Dormouse's story</title> -The parent of a top-level tag like <html> is the ``BeautifulSoup`` object +The parent of a top-level tag like <html> is the :py:class:`BeautifulSoup` object itself:: html_tag = soup.html type(html_tag.parent) # <class 'bs4.BeautifulSoup'> -And the ``.parent`` of a ``BeautifulSoup`` object is defined as None:: +And the ``.parent`` of a :py:class:`BeautifulSoup` object is defined as None:: print(soup.parent) # None @@ -1463,7 +1499,7 @@ Calling a tag is like calling ``find_all()`` Because ``find_all()`` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the -``BeautifulSoup`` object or a ``Tag`` object as though it were a +:py:class:`BeautifulSoup` object or a :py:class:`Tag` object as though it were a function, then it's the same as calling ``find_all()`` on that object. These two lines of code are equivalent:: @@ -1676,7 +1712,7 @@ tag it contains. CSS selectors through the ``.css`` property ------------------------------------------- -``BeautifulSoup`` and ``Tag`` objects support CSS selectors through +:py:class:`BeautifulSoup` and :py:class:`Tag` objects support CSS selectors through their ``.css`` property. The actual selector implementation is handled by the `Soup Sieve <https://facelessuser.github.io/soupsieve/>`_ package, available on PyPI as ``soupsieve``. If you installed @@ -1787,7 +1823,7 @@ first tag that matches a selector:: # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> As a convenience, you can call ``select()`` and ``select_one()`` can -directly on the ``BeautifulSoup`` or ``Tag`` object, omitting the +directly on the :py:class:`BeautifulSoup` or :py:class:`Tag` object, omitting the ``.css`` property:: soup.select('a[href$="tillie"]') @@ -1808,7 +1844,7 @@ Advanced Soup Sieve features Soup Sieve offers a substantial API beyond the ``select()`` and ``select_one()`` methods, and you can access most of that API through -the ``.css`` attribute of ``Tag`` or ``BeautifulSoup``. What follows +the ``.css`` attribute of :py:class:`Tag` or :py:class:`BeautifulSoup`. What follows is just a list of the supported methods; see `the Soup Sieve documentation <https://facelessuser.github.io/soupsieve/>`_ for full documentation. 
@@ -1819,7 +1855,7 @@ returns a generator instead of a list:: [tag['id'] for tag in soup.css.iselect(".sister")] # ['link1', 'link2', 'link3'] -The ``closest()`` method returns the nearest parent of a given ``Tag`` +The ``closest()`` method returns the nearest parent of a given :py:class:`Tag` that matches a CSS selector, similar to Beautiful Soup's ``find_parent()`` method:: @@ -1832,7 +1868,7 @@ that matches a CSS selector, similar to Beautiful Soup's # and they lived at the bottom of a well.</p> The ``match()`` method returns a boolean depending on whether or not a -specific ``Tag`` matches a selector:: +specific :py:class:`Tag` matches a selector:: # elsie.css.match("#link1") True @@ -1953,8 +1989,8 @@ like calling ``.append()`` on a Python list:: ``extend()`` ------------ -Starting in Beautiful Soup 4.7.0, ``Tag`` also supports a method -called ``.extend()``, which adds every element of a list to a ``Tag``, +Starting in Beautiful Soup 4.7.0, :py:class:`Tag` also supports a method +called ``.extend()``, which adds every element of a list to a :py:class:`Tag`, in order:: soup = BeautifulSoup("<a>Soup</a>", 'html.parser') @@ -1969,7 +2005,7 @@ in order:: ------------------------------------------------- If you need to add a string to a document, no problem--you can pass a -Python string in to ``append()``, or you can call the ``NavigableString`` +Python string in to ``append()``, or you can call the :py:class:`NavigableString` constructor:: from bs4 import NavigableString @@ -1984,7 +2020,7 @@ constructor:: # ['Hello', ' there'] If you want to create a comment or some other subclass of -``NavigableString``, just call the constructor:: +:py:class:`NavigableString`, just call the constructor:: from bs4 import Comment new_comment = Comment("Nice to see you.") @@ -2090,7 +2126,7 @@ returns the tag or string that was extracted:: # None At this point you effectively have two parse trees: one rooted at the -``BeautifulSoup`` object you used to parse the document, and one rooted +:py:class:`BeautifulSoup` object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call ``extract`` on a child of the element you extracted:: @@ -2119,7 +2155,7 @@ destroys it and its contents`:: a_tag # <a href="http://example.com/">I linked to</a> -The behavior of a decomposed ``Tag`` or ``NavigableString`` is not +The behavior of a decomposed :py:class:`Tag` or :py:class:`NavigableString` is not defined and you should not use it for anything. If you're not sure whether something has been decomposed, you can check its ``.decomposed`` property `(new in Beautiful Soup 4.9.0)`:: @@ -2201,7 +2237,7 @@ that was replaced. ``smooth()`` --------------------------- -After calling a bunch of methods that modify the parse tree, you may end up with two or more ``NavigableString`` objects next to each other. Beautiful Soup doesn't have any problems with this, but since it can't happen in a freshly parsed document, you might not expect behavior like the following:: +After calling a bunch of methods that modify the parse tree, you may end up with two or more :py:class:`NavigableString` objects next to each other. 
Beautiful Soup doesn't have any problems with this, but since it can't happen in a freshly parsed document, you might not expect behavior like the following:: soup = BeautifulSoup("<p>A one</p>", 'html.parser') soup.p.append(", a two") @@ -2263,8 +2299,8 @@ tag and each string:: # </body> # </html> -You can call ``prettify()`` on the top-level ``BeautifulSoup`` object, -or on any of its ``Tag`` objects:: +You can call ``prettify()`` on the top-level :py:class:`BeautifulSoup` object, +or on any of its :py:class:`Tag` objects:: print(soup.a.prettify()) # <a href="http://example.com/"> @@ -2283,7 +2319,7 @@ Non-pretty printing ------------------- If you just want a string, with no fancy formatting, you can call -``str()`` on a ``BeautifulSoup`` object, or on a ``Tag`` within it:: +``str()`` on a :py:class:`BeautifulSoup` object, or on a :py:class:`Tag` within it:: str(soup) # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>' @@ -2388,10 +2424,19 @@ to Beautiful Soup generating invalid HTML/XML, as in these examples:: print(link_soup.a.encode(formatter=None)) # b'<a href="http://example.com/?foo=val1&bar=val2">A link</a>' +Formatter objects +^^^^^^^^^^^^^^^^^ + If you need more sophisticated control over your output, you can -use Beautiful Soup's ``Formatter`` class. Here's a formatter that -converts strings to uppercase, whether they occur in a text node or in an -attribute value:: +instantiate one of Beautiful Soup's formatter classes and pass that +object in as ``formatter``. + +.. py:class:: HTMLFormatter + +Used to customize the formatting rules for HTML documents. + +Here's a formatter that converts strings to uppercase, whether they +occur in a text node or in an attribute value:: from bs4.formatter import HTMLFormatter def uppercase(str): @@ -2416,10 +2461,17 @@ Here's a formatter that increases the indentation when pretty-printing:: # <a href="http://example.com/?foo=val1&bar=val2"> # A link # </a> - -Subclassing ``HTMLFormatter`` or ``XMLFormatter`` will give you even -more control over the output. For example, Beautiful Soup sorts the -attributes in every tag by default:: + +.. py:class:: XMLFormatter + +Used to customize the formatting rules for XML documents. + +Writing your own formatter +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Subclassing :py:class:`HTMLFormatter` or :py:class:`XMLFormatter` will +give you even more control over the output. For example, Beautiful +Soup sorts the attributes in every tag by default:: attr_soup = BeautifulSoup(b'<p z="1" m="2" a="3"></p>', 'html.parser') print(attr_soup.p.encode()) @@ -2440,7 +2492,7 @@ whenever it appears:: print(attr_soup.p.encode(formatter=UnsortedAttributes())) # <p z="1" a="3"></p> -One last caveat: if you create a ``CData`` object, the text inside +One last caveat: if you create a :py:class:`CData` object, the text inside that object is always presented `exactly as it appears, with no formatting`. Beautiful Soup will call your entity substitution function, just in case you've written a custom function that counts @@ -2504,12 +2556,12 @@ Specifying the parser to use ============================ If you just need to parse some HTML, you can dump the markup into the -``BeautifulSoup`` constructor, and it'll probably be fine. Beautiful +:py:class:`BeautifulSoup` constructor, and it'll probably be fine. Beautiful Soup will pick a parser for you and parse the data. But there are a few additional arguments you can pass in to the constructor to change which parser is used. 
-The first argument to the ``BeautifulSoup`` constructor is a string or +The first argument to the :py:class:`BeautifulSoup` constructor is a string or an open filehandle--the markup you want parsed. The second argument is `how` you'd like the markup parsed. @@ -2597,7 +2649,7 @@ the 'correct' way, but all three techniques are legitimate. Differences between parsers can affect your script. If you're planning on distributing your script to other people, or running it on multiple -machines, you should specify a parser in the ``BeautifulSoup`` +machines, you should specify a parser in the :py:class:`BeautifulSoup` constructor. That will reduce the chances that your users parse a document differently from the way you parse it. @@ -2618,7 +2670,7 @@ discover it's been converted to Unicode:: It's not magic. (That sure would be nice.) Beautiful Soup uses a sub-library called `Unicode, Dammit`_ to detect a document's encoding and convert it to Unicode. The autodetected encoding is available as -the ``.original_encoding`` attribute of the ``BeautifulSoup`` object:: +the ``.original_encoding`` attribute of the :py:class:`BeautifulSoup` object:: soup.original_encoding 'utf-8' @@ -2627,7 +2679,7 @@ Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly, but only after a byte-by-byte search of the document that takes a very long time. If you happen to know a document's encoding ahead of time, you can avoid -mistakes and delays by passing it to the ``BeautifulSoup`` constructor +mistakes and delays by passing it to the :py:class:`BeautifulSoup` constructor as ``from_encoding``. Here's a document written in ISO-8859-8. The document is so short that @@ -2668,7 +2720,7 @@ a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character "REPLACEMENT CHARACTER" (U+FFFD, �). If Unicode, Dammit needs to do this, it will set the ``.contains_replacement_characters`` attribute -to ``True`` on the ``UnicodeDammit`` or ``BeautifulSoup`` object. This +to ``True`` on the ``UnicodeDammit`` or :py:class:`BeautifulSoup` object. This lets you know that the Unicode representation is not an exact representation of the original--some data was lost. If a document contains �, but ``.contains_replacement_characters`` is ``False``, @@ -2717,7 +2769,7 @@ If you don't want UTF-8, you can pass an encoding into ``prettify()``:: # <meta content="text/html; charset=latin-1" http-equiv="Content-type" /> # ... -You can also call encode() on the ``BeautifulSoup`` object, or any +You can also call encode() on the :py:class:`BeautifulSoup` object, or any element in the soup, just as if it were a Python string:: soup.p.encode("latin-1") @@ -2841,7 +2893,7 @@ embedded in UTF-8 (or vice versa, I suppose), but this is the most common case. Note that you must know to call ``UnicodeDammit.detwingle()`` on your -data before passing it into ``BeautifulSoup`` or the ``UnicodeDammit`` +data before passing it into :py:class:`BeautifulSoup` or the ``UnicodeDammit`` constructor. Beautiful Soup assumes that a document has a single encoding, whatever it might be. 
If you pass it a document that contains both UTF-8 and Windows-1252, it's likely to think the whole @@ -2877,7 +2929,7 @@ these numbers represent the position of the final greater-than sign:: # (3, 6, 'Paragraph 2') You can shut off this feature by passing ``store_line_numbers=False` -into the ``BeautifulSoup`` constructor:: +into the :py:class:`BeautifulSoup` constructor:: markup = "<p\n>Paragraph 1</p>\n <p>Paragraph 2</p>" soup = BeautifulSoup(markup, 'html.parser', store_line_numbers=False) @@ -2890,7 +2942,7 @@ support it.` Comparing objects for equality ============================== -Beautiful Soup says that two ``NavigableString`` or ``Tag`` objects +Beautiful Soup says that two :py:class:`NavigableString` or :py:class:`Tag` objects are equal when they represent the same HTML or XML markup. In this example, the two <b> tags are treated as equal, even though they live in different parts of the object tree, because they both look like @@ -2914,8 +2966,8 @@ object, use `is`:: Copying Beautiful Soup objects ============================== -You can use ``copy.copy()`` to create a copy of any ``Tag`` or -``NavigableString``:: +You can use ``copy.copy()`` to create a copy of any :py:class:`Tag` or +:py:class:`NavigableString`:: import copy p_copy = copy.copy(soup.p) @@ -2938,7 +2990,7 @@ been called on it:: print(p_copy.parent) # None -This is because two different ``Tag`` objects can't occupy the same +This is because two different :py:class:`Tag` objects can't occupy the same space at the same time. Advanced parser customization @@ -2955,9 +3007,9 @@ Let's say you want to use Beautiful Soup look at a document's <a> tags. It's a waste of time and memory to parse the entire document and then go over it again looking for <a> tags. It would be much faster to ignore everything that wasn't an <a> tag in the first place. The -``SoupStrainer`` class allows you to choose which parts of an incoming -document are parsed. You just create a ``SoupStrainer`` and pass it in -to the ``BeautifulSoup`` constructor as the ``parse_only`` argument. +:py:class:`SoupStrainer` class allows you to choose which parts of an incoming +document are parsed. You just create a :py:class:`SoupStrainer` and pass it in +to the :py:class:`BeautifulSoup` constructor as the ``parse_only`` argument. (Note that *this feature won't work if you're using the html5lib parser*. If you use html5lib, the whole document will be parsed, no @@ -2967,13 +3019,12 @@ make it into the parse tree, it'll crash. To avoid confusion, in the examples below I'll be forcing Beautiful Soup to use Python's built-in parser.) -``SoupStrainer`` -^^^^^^^^^^^^^^^^ +.. py:class:: SoupStrainer -The ``SoupStrainer`` class takes the same arguments as a typical +The :py:class:`SoupStrainer` class takes the same arguments as a typical method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`string <string>`, and :ref:`**kwargs <kwargs>`. Here are -three ``SoupStrainer`` objects:: +three :py:class:`SoupStrainer` objects:: from bs4 import SoupStrainer @@ -2988,7 +3039,7 @@ three ``SoupStrainer`` objects:: I'm going to bring back the "three sisters" document one more time, and we'll see what the document looks like when it's parsed with these -three ``SoupStrainer`` objects:: +three :py:class:`SoupStrainer` objects:: html_doc = """<html><head><title>The Dormouse's story</title></head> <body> @@ -3028,7 +3079,7 @@ three ``SoupStrainer`` objects:: # ... 
# -You can also pass a ``SoupStrainer`` into any of the methods covered +You can also pass a :py:class:`SoupStrainer` into any of the methods covered in `Searching the tree`_. This probably isn't terribly useful, but I thought I'd mention it:: @@ -3116,10 +3167,10 @@ Instantiating custom subclasses ------------------------------- When a parser tells Beautiful Soup about a tag or a string, Beautiful -Soup will instantiate a ``Tag`` or ``NavigableString`` object to +Soup will instantiate a :py:class:`Tag` or :py:class:`NavigableString` object to contain that information. Instead of that default behavior, you can -tell Beautiful Soup to instantiate `subclasses` of ``Tag`` or -``NavigableString``, subclasses you define with custom behavior:: +tell Beautiful Soup to instantiate `subclasses` of :py:class:`Tag` or +:py:class:`NavigableString`, subclasses you define with custom behavior:: from bs4 import Tag, NavigableString class MyTag(Tag): @@ -3240,7 +3291,7 @@ Parsing XML By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass in "xml" as the second argument to the -``BeautifulSoup`` constructor:: +:py:class:`BeautifulSoup` constructor:: soup = BeautifulSoup(markup, "xml") @@ -3257,7 +3308,7 @@ Other parser problems installed, and then tried to run it on a computer that only has html5lib installed. See `Differences between parsers`_ for why this matters, and fix the problem by mentioning a specific parser library - in the ``BeautifulSoup`` constructor. + in the :py:class:`BeautifulSoup` constructor. * Because `HTML tags and attributes are case-insensitive <http://www.w3.org/TR/html5/syntax.html#syntax>`_, all three HTML @@ -3290,7 +3341,7 @@ Miscellaneous * ``AttributeError: 'ResultSet' object has no attribute 'foo'`` - This usually happens because you expected ``find_all()`` to return a - single tag or string. But ``find_all()`` returns a _list_ of tags + single tag or string. But ``find_all()`` returns a `list` of tags and strings--a ``ResultSet`` object. You need to iterate over the list and look at the ``.foo`` of each one. Or, if you really only want one result, you need to use ``find()`` instead of @@ -3362,7 +3413,7 @@ distributions: :kbd:`$ apt-get install python-beautifulsoup` -It's also published through PyPi as ``BeautifulSoup``.: +It's also published through PyPi as :py:class:`BeautifulSoup`.: :kbd:`$ easy_install BeautifulSoup` @@ -3383,7 +3434,7 @@ Porting code to BS4 Most code written against Beautiful Soup 3 will work against Beautiful Soup 4 with one simple change. All you should have to do is change the -package name from ``BeautifulSoup`` to ``bs4``. So this:: +package name from :py:class:`BeautifulSoup` to ``bs4``. So this:: from BeautifulSoup import BeautifulSoup @@ -3516,8 +3567,8 @@ XML There is no longer a ``BeautifulStoneSoup`` class for parsing XML. To parse XML you pass in "xml" as the second argument to the -``BeautifulSoup`` constructor. For the same reason, the -``BeautifulSoup`` constructor no longer recognizes the ``isHTML`` +:py:class:`BeautifulSoup` constructor. For the same reason, the +:py:class:`BeautifulSoup` constructor no longer recognizes the ``isHTML`` argument. Beautiful Soup's handling of empty-element XML tags has been @@ -3534,7 +3585,7 @@ Entities An incoming HTML or XML entity is always converted into the corresponding Unicode character. Beautiful Soup 3 had a number of overlapping ways of dealing with entities, which have been -removed. The ``BeautifulSoup`` constructor no longer recognizes the +removed. 
The :py:class:`BeautifulSoup` constructor no longer recognizes the ``smartQuotesTo`` or ``convertEntities`` arguments. (`Unicode, Dammit`_ still has ``smart_quotes_to``, but its default is now to turn smart quotes into Unicode.) The constants ``HTML_ENTITIES``, @@ -3557,9 +3608,9 @@ B.string. (Previously, it was None.) their values, not strings. This may affect the way you search by CSS class. -``Tag`` objects now implement the ``__hash__`` method, such that two -``Tag`` objects are considered equal if they generate the same -markup. This may change your script's behavior if you put ``Tag`` +:py:class:`Tag` objects now implement the ``__hash__`` method, such that two +:py:class:`Tag` objects are considered equal if they generate the same +markup. This may change your script's behavior if you put :py:class:`Tag` objects into a dictionary or set. If you pass one of the ``find*`` methods both :ref:`string <string>` `and` @@ -3570,7 +3621,7 @@ search for tags that match your tag-specific criteria and whose Beautiful Soup ignored the tag-specific arguments and looked for strings. -The ``BeautifulSoup`` constructor no longer recognizes the +The :py:class:`BeautifulSoup` constructor no longer recognizes the `markupMassage` argument. It's now the parser's responsibility to handle markup correctly. |
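
The "Special strings" sections added by this patch describe the new :py:class:`NavigableString` subclasses without showing them in use. The following sketch is not part of the patch; it assumes ``Stylesheet``, ``Script``, and ``Doctype`` are importable from ``bs4.element`` as in Beautiful Soup 4.9 and later, and the ``CData`` lines are adapted from the example the patch removes::

    from bs4 import BeautifulSoup, CData
    from bs4.element import Script, Stylesheet

    markup = """<html><head><style>p {color: red;}</style></head>
    <body><script>alert("hi");</script><p><!--a comment--></p></body></html>"""
    soup = BeautifulSoup(markup, 'html.parser')

    # Strings found inside <style> and <script> are parsed into the new
    # subclasses, so code that walks soup.strings can skip them when
    # extracting the main text of the page.
    type(soup.style.string)    # <class 'bs4.element.Stylesheet'>
    type(soup.script.string)   # <class 'bs4.element.Script'>

    # An HTML doctype is parsed into the Doctype subclass.
    doc = BeautifulSoup("<!DOCTYPE html><p>hi</p>", 'html.parser')
    type(doc.contents[0])      # <class 'bs4.element.Doctype'>

    # CData (adapted from the removed example): replace the comment with a
    # CDATA section, which is always written out exactly as given.
    comment = soup.p.string
    comment.replace_with(CData("A CDATA block"))
    print(soup.p)
    # <p><![CDATA[A CDATA block]]></p>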