author Leonard Richardson <leonardr@segfault.org> 2013-05-07 14:12:10 -0400
committer Leonard Richardson <leonardr@segfault.org> 2013-05-07 14:12:10 -0400
commit 39efcb4b7ab30145b3733ba820f3c0df0da35ace
tree ac8be4a47b4c16b936f94f25fa39a174872e80ce /doc/source/index.rst
parent 07bafa37e866876563ecd729c6a2adaa6d6d01ff
Fixed up diagnose() and added it to the docs.
Diffstat (limited to 'doc/source/index.rst')
-rw-r--r-- doc/source/index.rst | 68
1 files changed, 56 insertions, 12 deletions
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 79286ab..c106f00 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -2633,6 +2633,62 @@ thought I'd mention it::
Troubleshooting
===============
+``diagnose()``
+--------------
+
+If you're having trouble understanding what Beautiful Soup does to a
+document, pass it into the ``diagnose()`` function. (New in 4.2.0.)
+Beautiful Soup will print out a report showing you how different
+parsers handle the document, and tell you if you're missing a parser
+that Beautiful Soup could be using::
+
+ from bs4.diagnose import diagnose
+ data = open("bad.html").read()
+ diagnose(data)
+
+ # Diagnostic running on Beautiful Soup 4.2.0
+ # Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
+ # I noticed that html5lib is not installed. Installing it may help.
+ # Found lxml version 2.3.2.0
+ #
+ # Trying to parse your data with html.parser
+ # Here's what html.parser did with the document:
+ # ...
+
+Just looking at the output of ``diagnose()`` may show you how to solve the
+problem. Even if not, you can paste the output of ``diagnose()`` when
+asking for help.
+
+Errors when parsing a document
+------------------------------
+
+There are two different kinds of parse errors. There are crashes,
+where you feed a document to Beautiful Soup and it raises an
+exception, usually an ``HTMLParser.HTMLParseError``. And there is
+unexpected behavior, where a Beautiful Soup parse tree looks a lot
+different than the document used to create it.
+
+Almost none of these problems turn out to be problems with Beautiful
+Soup. This is not because Beautiful Soup is an amazingly well-written
+piece of software. It's because Beautiful Soup doesn't include any
+parsing code. Instead, it relies on external parsers. If one parser
+isn't working on a certain document, the best solution is to try a
+different parser. See `Installing a parser`_ for details and a parser
+comparison.
+
+The most common parse errors are ``HTMLParser.HTMLParseError:
+malformed start tag`` and ``HTMLParser.HTMLParseError: bad end
+tag``. These are both generated by Python's built-in HTML parser
+library, and the solution is to :ref:`install lxml or
+html5lib. <parser-installation>`
+
+The most common type of unexpected behavior is that you can't find a
+tag that you know is in the document. You saw it going in, but
+``find_all()`` returns ``[]`` or ``find()`` returns ``None``. This is
+another common problem with Python's built-in HTML parser, which
+sometimes skips tags it doesn't understand. Again, the solution is to
+:ref:`install lxml or html5lib. <parser-installation>`
+
Version mismatch problems
-------------------------
@@ -2678,18 +2734,6 @@ Other parser problems
parsers`_ for why this matters, and fix the problem by mentioning a
specific parser library in the ``BeautifulSoup`` constructor.
-* ``HTMLParser.HTMLParseError: malformed start tag`` or
- ``HTMLParser.HTMLParseError: bad end tag`` - Caused by
- giving Python's built-in HTML parser a document it can't handle. Any
- other ``HTMLParseError`` is probably the same problem. Solution:
- :ref:`Install lxml or html5lib. <parser-installation>`
-
-* If you can't find a tag that you know is in the document (that is,
- ``find_all()`` returned ``[]`` or ``find()`` returned ``None``),
- you're probably using Python's built-in HTML parser, which sometimes
- skips tags it doesn't understand. Solution: :ref:`Install lxml or
- html5lib. <parser-installation>`
-
* Because `HTML tags and attributes are case-insensitive
<http://www.w3.org/TR/html5/syntax.html#syntax>`_, all three HTML
parsers convert tag and attribute names to lowercase. That is, the