Diffstat (limited to 'doc/source')
-rw-r--r-- | doc/source/index.rst | 68 |
1 files changed, 56 insertions, 12 deletions
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 79286ab..c106f00 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -2633,6 +2633,62 @@ thought I'd mention it::
 Troubleshooting
 ===============
 
+``diagnose()``
+--------------
+
+If you're having trouble understanding what Beautiful Soup does to a
+document, pass it into the ``diagnose()`` function. (New in 4.2.0.)
+Beautiful Soup will print out a report showing you how different
+parsers handle the document, and tell you if you're missing a parser
+that Beautiful Soup could be using::
+
+ from bs4.diagnose import diagnose
+ data = open("bad.html").read()
+ diagnose(data)
+
+ # Diagnostic running on Beautiful Soup 4.2.0
+ # Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
+ # I noticed that html5lib is not installed. Installing it may help.
+ # Found lxml version 2.3.2.0
+ #
+ # Trying to parse your data with html.parser
+ # Here's what html.parser did with the document:
+ # ...
+
+Just looking at the output of diagnose() may show you how to solve the
+problem. Even if not, you can paste the output of ``diagnose()`` when
+asking for help.
+
+Errors when parsing a document
+------------------------------
+
+There are two different kinds of parse errors. There are crashes,
+where you feed a document to Beautiful Soup and it raises an
+exception, usually an ``HTMLParser.HTMLParseError``. And there is
+unexpected behavior, where a Beautiful Soup parse tree looks a lot
+different than the document used to create it.
+
+Almost none of these problems turn out to be problems with Beautiful
+Soup. This is not because Beautiful Soup is an amazingly well-written
+piece of software. It's because Beautiful Soup doesn't include any
+parsing code. Instead, it relies on external parsers. If one parser
+isn't working on a certain document, the best solution is to try a
+different parser. See `Installing a parser`_ for details and a parser
+comparison.
+
+The most common parse errors are ``HTMLParser.HTMLParseError:
+malformed start tag`` and ``HTMLParser.HTMLParseError: bad end
+tag``. These are both generated by Python's built-in HTML parser
+library, and the solution is to :ref:`install lxml or
+html5lib. <parser-installation>`
+
+The most common type of unexpected behavior is that you can't find a
+tag that you know is in the document. You saw it going in, but
+``find_all()`` returns ``[]`` or ``find()`` returns ``None``. This is
+another common problem with Python's built-in HTML parser, which
+sometimes skips tags it doesn't understand. Again, the solution is to
+:ref:`install lxml or html5lib. <parser-installation>`
+
 Version mismatch problems
 -------------------------
 
@@ -2678,18 +2734,6 @@ Other parser problems
   parsers`_ for why this matters, and fix the problem by mentioning a
   specific parser library in the ``BeautifulSoup`` constructor.
 
-* ``HTMLParser.HTMLParseError: malformed start tag`` or
-  ``HTMLParser.HTMLParseError: bad end tag`` - Caused by
-  giving Python's built-in HTML parser a document it can't handle. Any
-  other ``HTMLParseError`` is probably the same problem. Solution:
-  :ref:`Install lxml or html5lib. <parser-installation>`
-
-* If you can't find a tag that you know is in the document (that is,
-  ``find_all()`` returned ``[]`` or ``find()`` returned ``None``),
-  you're probably using Python's built-in HTML parser, which sometimes
-  skips tags it doesn't understand. Solution: :ref:`Install lxml or
-  html5lib. <parser-installation>`
-
 * Because `HTML tags and attributes are case-insensitive
   <http://www.w3.org/TR/html5/syntax.html#syntax>`_, all three HTML
   parsers convert tag and attribute names to lowercase. That is, the