From 305133e16a4fc035f4b2301a5eb9cdc40812e214 Mon Sep 17 00:00:00 2001
From: Leonard Richardson
Date: Wed, 15 Mar 2023 12:00:16 -0400
Subject: Add documentation references for the bs4 module itself as well as all currently documented classes.

---
 doc/source/conf.py   |   2 +-
 doc/source/index.rst | 307 ++++++++++++++++++++++++++++++---------------------
 2 files changed, 180 insertions(+), 129 deletions(-)

(limited to 'doc')

diff --git a/doc/source/conf.py b/doc/source/conf.py
index 7ba53ac..e32d6b8 100644
--- a/doc/source/conf.py
+++ b/doc/source/conf.py
@@ -41,7 +41,7 @@ master_doc = 'index'
 
 # General information about the project.
 project = u'Beautiful Soup'
-copyright = u'2004-2020, Leonard Richardson'
+copyright = u'2004-2023, Leonard Richardson'
 
 # The version info for the project you're documenting, acts as replacement for
 # |version| and |release|, also used in various other places throughout the
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 37ec7d8..a916413 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -3,6 +3,8 @@
 Beautiful Soup Documentation
 ============================
 
+.. py:module:: bs4
+
 .. image:: 6.1.jpg
    :align: right
    :alt: "The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."
@@ -70,7 +72,7 @@ document. It's part of a story from `Alice in Wonderland`::
  """
 
 Running the "three sisters" document through Beautiful Soup gives us a
-``BeautifulSoup`` object, which represents the document as a nested
+:py:class:`BeautifulSoup` object, which represents the document as a nested
 data structure::
 
  from bs4 import BeautifulSoup
@@ -184,7 +186,7 @@ right version of ``pip`` or ``easy_install`` for your Python version
 
 :kbd:`$ pip install beautifulsoup4`
 
-(The ``BeautifulSoup`` package is `not` what you want. That's
+(The :py:class:`BeautifulSoup` package is `not` what you want. That's
 the previous major release, `Beautiful Soup 3`_. Lots of software uses
 BS3, so it's still available, but if you're writing new code you
 should install ``beautifulsoup4``.)
@@ -201,7 +203,7 @@ package the entire library with your application. You can download the
 tarball, copy its ``bs4`` directory into your application's codebase,
 and use Beautiful Soup without installing it at all.
 
-I use Python 3.8 to develop Beautiful Soup, but it should work with
+I use Python 3.10 to develop Beautiful Soup, but it should work with
 other recent versions.
 
 .. _parser-installation:
@@ -254,10 +256,7 @@ This table summarizes the advantages and disadvantages of each parser library:
 |                      |                                            | * Creates valid HTML5          |                          |
 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
 
-If you can, I recommend you install and use lxml for speed. If you're
-using a very old version of Python -- earlier than 3.2.2 -- it's
-`essential` that you install lxml or html5lib. Python's built-in HTML
-parser is just not very good in those old versions.
+If you can, I recommend you install and use lxml for speed.
 
 Note that if a document is invalid, different parsers will generate
 different Beautiful Soup trees for it. See `Differences
@@ -266,7 +265,7 @@ between parsers`_ for details.
 Making the soup
 ===============
 
-To parse a document, pass it into the ``BeautifulSoup``
+To parse a document, pass it into the :py:class:`BeautifulSoup`
 constructor. You can pass in a string or an open filehandle::
 
  from bs4 import BeautifulSoup
@@ -291,15 +290,14 @@ Kinds of objects
 
 Beautiful Soup transforms a complex HTML document into a complex tree
 of Python objects. But you'll only ever have to deal with about four
-`kinds` of objects: ``Tag``, ``NavigableString``, ``BeautifulSoup``,
-and ``Comment``.
+`kinds` of objects: :py:class:`Tag`, :py:class:`NavigableString`, :py:class:`BeautifulSoup`,
+and :py:class:`Comment`.
 
-.. _Tag:
+.. py:class:: Tag
 
-``Tag``
--------
+A :py:class:`Tag` object corresponds to an XML or HTML tag in the original document.
 
-A ``Tag`` object corresponds to an XML or HTML tag in the original document::
+::
 
  soup = BeautifulSoup('Extremely bold', 'html.parser')
  tag = soup.b
@@ -311,7 +309,7 @@ in `Navigating the tree`_ and `Searching the tree`_. For now, the most
 important features of a tag are its name and attributes.
 
 Name
-^^^^
+----
 
 Every tag has a name, accessible as ``.name``::
 
@@ -326,7 +324,7 @@ markup generated by Beautiful Soup::
  # Extremely bold
 
 Attributes
-^^^^^^^^^^
+----------
 
 A tag may have any number of attributes. The tag ````
 has an attribute "id" whose value is
@@ -363,7 +361,7 @@ done by treating the tag as a dictionary::
 .. _multivalue:
 
 Multi-valued attributes
-&&&&&&&&&&&&&&&&&&&&&&&
+-----------------------
 
 HTML 4 defines a few attributes that can have multiple values. HTML 5
 removes a couple of them, but defines a few more. The most common
@@ -400,7 +398,7 @@ consolidated::
 
 You can force all attributes to be parsed as strings by passing
 ``multi_valued_attributes=None`` as a keyword argument into the
-``BeautifulSoup`` constructor::
+:py:class:`BeautifulSoup` constructor::
 
  no_list_soup = BeautifulSoup('', 'html.parser', multi_valued_attributes=None)
  no_list_soup.p['class']
@@ -432,11 +430,12 @@ a guide. They implement the rules described in the HTML specification::
  builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES
 
 
-``NavigableString``
--------------------
+.. py:class:: NavigableString
+
+-----------------------------
 
 A string corresponds to a bit of text within a tag. Beautiful Soup
-uses the ``NavigableString`` class to contain these bits of text::
+uses the :py:class:`NavigableString` class to contain these bits of text::
 
  soup = BeautifulSoup('Extremely bold', 'html.parser')
  tag = soup.b
@@ -445,10 +444,10 @@ uses the ``NavigableString`` class to contain these bits of text::
  type(tag.string)
  #
 
-A ``NavigableString`` is just like a Python Unicode string, except
+A :py:class:`NavigableString` is just like a Python Unicode string, except
 that it also supports some of the features described in `Navigating
 the tree`_ and `Searching the tree`_. You can convert a
-``NavigableString`` to a Unicode string with ``str``::
+:py:class:`NavigableString` to a Unicode string with ``str``::
 
  unicode_string = str(tag.string)
  unicode_string
@@ -463,27 +462,28 @@ another, using :ref:`replace_with()`::
  tag
  # No longer bold
 
-``NavigableString`` supports most of the features described in
+:py:class:`NavigableString` supports most of the features described in
 `Navigating the tree`_ and `Searching the tree`_, but not all of
 them. In particular, since a string can't contain anything (the way a
 tag may contain a string or another tag), strings don't support the
 ``.contents`` or ``.string`` attributes, or the ``find()`` method.
 
-If you want to use a ``NavigableString`` outside of Beautiful Soup,
+If you want to use a :py:class:`NavigableString` outside of Beautiful Soup,
 you should call ``unicode()`` on it to turn it into a normal Python
 Unicode string. If you don't, your string will carry around a
 reference to the entire Beautiful Soup parse tree, even when you're
 done using Beautiful Soup. This is a big waste of memory.
 
-``BeautifulSoup``
------------------
+.. py:class:: BeautifulSoup
+
+---------------------------
 
-The ``BeautifulSoup`` object represents the parsed document as a
+The :py:class:`BeautifulSoup` object represents the parsed document as a
 whole. For most purposes, you can treat it as a :ref:`Tag`
 object. This means it supports most of the methods described in
 `Navigating the tree`_ and `Searching the tree`_.
 
-You can also pass a ``BeautifulSoup`` object into one of the methods
+You can also pass a :py:class:`BeautifulSoup` object into one of the methods
 defined in `Modifying the tree`_, just as you would a :ref:`Tag`. This
 lets you do things like combine two parsed documents::
 
@@ -495,7 +495,7 @@ lets you do things like combine two parsed documents::
  #
  # Here's the footer
 
-Since the ``BeautifulSoup`` object doesn't correspond to an actual
+Since the :py:class:`BeautifulSoup` object doesn't correspond to an actual
 HTML or XML tag, it has no name and no attributes. But sometimes it's
 useful to look at its ``.name``, so it's been given the special
 ``.name`` "[document]"::
@@ -503,13 +503,17 @@ useful to look at its ``.name``, so it's been given the special
  soup.name
  # '[document]'
 
-Comments and other special strings
-----------------------------------
+Comments
+--------
+
+:py:class:`Tag`, :py:class:`NavigableString`, and
+:py:class:`BeautifulSoup` cover almost everything you'll see in an
+HTML or XML file, but there are a few leftover bits. The main one
+you'll probably encounter is the :py:class:`Comment`.
 
-``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
-everything you'll see in an HTML or XML file, but there are a few
-leftover bits. The main one you'll probably encounter
-is the comment::
+.. py:class:: Comment
+
+::
 
  markup = ""
  soup = BeautifulSoup(markup, 'html.parser')
@@ -517,12 +521,12 @@ is the comment::
  type(comment)
  #
 
-The ``Comment`` object is just a special type of ``NavigableString``::
+The :py:class:`Comment` object is just a special type of :py:class:`NavigableString`::
 
  comment
  # 'Hey, buddy. Want to buy a used parser'
 
-But when it appears as part of an HTML document, a ``Comment`` is
+But when it appears as part of an HTML document, a :py:class:`Comment` is
 displayed with special formatting::
 
  print(soup.b.prettify())
@@ -530,32 +534,64 @@ displayed with special formatting::
  #
  #
 
-Beautiful Soup also defines classes called ``Stylesheet``, ``Script``,
-and ``TemplateString``, for embedded CSS stylesheets (any strings
-found inside a ``