-rw-r--r-- doc/source/index.rst 96
1 files changed, 55 insertions, 41 deletions
diff --git a/doc/source/index.rst b/doc/source/index.rst
index d13bd17..9746fbd 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -175,22 +175,13 @@ should work with other recent versions.
.. _parser-installation:
-Be sure to install a good parser!
----------------------------------
-
-Beautiful Soup uses a plugin system that supports a number of popular
-Python parsers. If no third-party parsers are installed, Beautiful
-Soup uses the HTML parser that comes with Python. In recent releases
-of Python (2.7.3 and 3.2.2), this parser is excellent at handling bad
-HTML. Unfortunately, in older releases, it's not very good at all.
-
-Even if you're using a recent release of Python, I recommend you
-install the `lxml parser <http://lxml.de/>`_ if you can. Its
-reliability is good on both HTML and XML, and it's much faster than
-Python's built-in parser. Beautiful Soup will detect that you have
-lxml installed, and use it instead of Python's built-in parser.
+Choosing a parser
+-----------------
-Depending on your setup, you might install lxml with one of these commands:
+Beautiful Soup supports the HTML parser included in Python's standard
+library, but it also supports a number of third-party Python parsers.
+One is the `lxml parser <http://lxml.de/>`_. Depending on your setup,
+you might install lxml with one of these commands:
:kbd:`$ apt-get install python-lxml`
@@ -209,6 +200,39 @@ install html5lib with one of these commands:
:kbd:`$ pip install html5lib`
+This table summarizes the advantages and disadvantages of each parser library:
+
++----------------------+--------------------------------------------+--------------------------------+--------------------------+
+| Parser               | Typical usage                              | Advantages                     | Disadvantages            |
++----------------------+--------------------------------------------+--------------------------------+--------------------------+
+| Python's html.parser | ``BeautifulSoup(markup, "html.parser")``   | * Batteries included           | * Not very lenient       |
+|                      |                                            | * Decent speed                 |   (before Python 2.7.3   |
+|                      |                                            | * Lenient (as of Python 2.7.3  |   or 3.2.2)              |
+|                      |                                            |   and 3.2.2)                   |                          |
++----------------------+--------------------------------------------+--------------------------------+--------------------------+
+| lxml's HTML parser   | ``BeautifulSoup(markup, "lxml")``          | * Very fast                    | * External C dependency  |
+|                      |                                            | * Lenient                      |                          |
++----------------------+--------------------------------------------+--------------------------------+--------------------------+
+| lxml's XML parser    | ``BeautifulSoup(markup, ["lxml", "xml"])`` | * Very fast                    |                          |
+|                      | ``BeautifulSoup(markup, "xml")``           | * The only currently supported |                          |
+|                      |                                            |   XML parser                   |                          |
++----------------------+--------------------------------------------+--------------------------------+--------------------------+
+| html5lib             | ``BeautifulSoup(markup, "html5lib")``      | * Extremely lenient            | * Very slow              |
+|                      |                                            | * Parses pages the same way a  | * External Python        |
+|                      |                                            |   web browser does             |   dependency             |
+|                      |                                            | * Creates valid HTML5          | * Python 2 only          |
++----------------------+--------------------------------------------+--------------------------------+--------------------------+
+
+If you can, I recommend you install and use lxml for speed. If you're
+using a version of Python 2 earlier than 2.7.3, or a version of Python
+3 earlier than 3.2.2, it's *essential* that you install lxml or
+html5lib, because Python's built-in HTML parser is just not very good
+in older versions.
+
+Note that if a document is invalid, different parsers will generate
+different Beautiful Soup trees for the same document. See `Differences
+between parsers`_ for details.
+
Making the soup
===============
@@ -2013,8 +2037,8 @@ generator instead, and process the text yourself::
[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
-Choosing a parser
-=================
+Specifying the parser to use
+============================
If you just need to parse some HTML, you can dump the markup into the
``BeautifulSoup`` constructor, and it'll probably be fine. Beautiful
@@ -2038,28 +2062,21 @@ specifying one of the following:
options are "lxml", "html5lib", and "html.parser" (Python's
built-in HTML parser).
-Some examples::
-
- BeautifulSoup(markup, "lxml")
- BeautifulSoup(markup, "xml")
- BeautifulSoup(markup, "html5")
+The section `Choosing a parser`_ contrasts the supported parsers.
-You can specify a list of the parser features you want, instead of
-just one. Right now this is mostly useful for distinguishing between
-lxml's HTML parser and its XML parser::
+If you don't have an appropriate parser installed, Beautiful Soup will
+ignore your request and pick a different parser. Right now, the only
+supported XML parser is lxml. If you don't have lxml installed, asking
+for an XML parser won't give you one, and asking for "lxml" won't work
+either.
- BeautifulSoup(markup, ["html", "lxml"])
- BeautifulSoup(markup, ["xml", "lxml"])
+Differences between parsers
+---------------------------
-If you don't have an appropriate parser installed, Beautiful Soup will
-ignore your request and pick a different parser. For instance, right
-now the only supported XML parser is lxml, so if you don't have lxml
-installed, asking for an XML parser won't give you one, and asking for
-"lxml" won't work either.
-
-Why would you use one parser over another? Because different parsers
-will create different parse trees from the same document. The biggest
-differences are between HTML parsers and XML parsers. Here's a short
+Beautiful Soup presents the same interface to a number of different
+parsers, but each parser is different. Different parsers will create
+different parse trees from the same document. The biggest differences
+are between the HTML parsers and the XML parsers. Here's a short
document, parsed as HTML::
BeautifulSoup("<a><b /></a>")
@@ -2079,7 +2096,7 @@ into an <html> tag.::
There are also differences between HTML parsers. If you give Beautiful
Soup a perfectly-formed HTML document, these differences won't
-matter. One parser may be faster than another, but they'll all give
+matter. One parser will be faster than another, but they'll all give
you a data structure that looks exactly like the original HTML
document.
@@ -2122,7 +2139,6 @@ in the ``BeautifulSoup`` constructor which parser you used during
development. That will reduce the chances that your users parse a
document differently from the way you parse it.
-
Encodings
=========
@@ -2491,9 +2507,7 @@ You need a parser
Beautiful Soup 3 used Python's ``SGMLParser``, a module that was
deprecated and removed in Python 3.0. Beautiful Soup 4 uses
``html.parser`` by default, but you can plug in lxml or html5lib and
-use that instead. Until ``html.parser`` is improved to handle
-real-world HTML better, that's what I recommend you do. See `Be sure
-to install a good parser!`_
+use that instead. See `Choosing a parser`_ for a comparison.
Method names
^^^^^^^^^^^^
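
The parser recommendation added by this patch (lxml if available, otherwise html5lib, otherwise the built-in parser) could be made explicit in calling code rather than left to Beautiful Soup's automatic choice. The following is only a minimal sketch, not part of the patch. It assumes Python 3.4 or later, and ``pick_parser`` and ``PREFERRED`` are hypothetical names introduced here for illustration. It uses only the standard library to check which parser modules are importable.

```python
from importlib.util import find_spec

# Preference order following the recommendation in the patch:
# lxml for speed, then html5lib, then Python's built-in parser.
# Each pair is (importable module name, Beautiful Soup feature string).
# PREFERRED and pick_parser are hypothetical names, not part of bs4.
PREFERRED = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", "html.parser"),
]


def pick_parser(candidates=PREFERRED):
    """Return the feature string of the first candidate that is importable."""
    for module_name, feature in candidates:
        # find_spec returns None when a top-level module is not installed.
        if find_spec(module_name) is not None:
            return feature
    raise RuntimeError("no supported parser is installed")


if __name__ == "__main__":
    print(pick_parser())
```

The returned string would then be passed as the second constructor argument, for example ``BeautifulSoup(markup, pick_parser())``, so the parser in use is named explicitly. Pinning a single parser name, as the documentation suggests for shared code, remains the simpler choice when the deployment environment is known.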