From 197470f217ab994dd0ba8e143418a54d69df8523 Mon Sep 17 00:00:00 2001
From: Leonard Richardson
Date: Fri, 24 Apr 2020 22:13:30 -0400
Subject: If you encode a document with a Python-specific encoding like
'unicode_escape', that encoding is no longer mentioned in the final XML or
HTML document. Instead, encoding information is omitted or left blank.
[bug=1874955]
---
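Note (not part of the commit): the behaviour in the subject line can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not Beautiful Soup's actual implementation. The names `PYTHON_ONLY_CODECS`, `declared_encoding`, and `xml_declaration` are hypothetical, and the set of codec names is an illustrative subset.

```python
import codecs

# Assumption: an illustrative subset of codecs that only mean something
# inside Python and so should never be advertised in a document.
PYTHON_ONLY_CODECS = {"unicode_escape", "raw_unicode_escape", "idna", "punycode"}


def declared_encoding(encoding):
    """Return the encoding name to advertise, or None to omit it."""
    if encoding is None:
        return None
    # codecs.lookup() resolves aliases such as "utf8" or "unicode-escape"
    # to the codec's canonical name.
    name = codecs.lookup(encoding).name
    key = name.replace("-", "_").lower()
    return None if key in PYTHON_ONLY_CODECS else name


def xml_declaration(encoding):
    """Build an XML declaration, leaving out Python-specific encodings."""
    name = declared_encoding(encoding)
    if name is None:
        return '<?xml version="1.0"?>'
    return '<?xml version="1.0" encoding="%s"?>' % name


print(xml_declaration("utf8"))
print(xml_declaration("unicode_escape"))
```

The document bytes are still produced with whatever codec the caller asked for. Only the advertised name is dropped, since other tools could not interpret it.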
doc/source/index.rst | 44 +++++++++++++++++++++-----------------------
1 file changed, 21 insertions(+), 23 deletions(-)
(limited to 'doc/source')
diff --git a/doc/source/index.rst b/doc/source/index.rst
index dbc8c15..148b30f 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -290,10 +290,9 @@ This table summarizes the advantages and disadvantages of each parser library:
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
If you can, I recommend you install and use lxml for speed. If you're
-using a version of Python 2 earlier than 2.7.3, or a version of Python
-3 earlier than 3.2.2, it's `essential` that you install lxml or
-html5lib--Python's built-in HTML parser is just not very good in older
-versions.
+using a very old version of Python -- earlier than 2.7.3 or 3.2.2 --
+it's `essential` that you install lxml or html5lib. Python's built-in
+HTML parser is just not very good in those old versions.
Note that if a document is invalid, different parsers will generate
different Beautiful Soup trees for it. See `Differences
@@ -310,13 +309,13 @@ constructor. You can pass in a string or an open filehandle::
     with open("index.html") as fp:
         soup = BeautifulSoup(fp)
-    soup = BeautifulSoup("data")
+    soup = BeautifulSoup("a web page")
First, the document is converted to Unicode, and HTML entities are
converted to Unicode characters::
-    BeautifulSoup("Sacr&eacute; bleu!")
-    Sacré bleu!
+    print(BeautifulSoup("Sacr&eacute; bleu!"))
+    # Sacré bleu!
Beautiful Soup then parses the document using the best available
parser. It will use an HTML parser unless you specifically tell it to
@@ -2481,20 +2480,20 @@ Beautiful Soup presents the same interface to a number of different
parsers, but each parser is different. Different parsers will create
different parse trees from the same document. The biggest differences
are between the HTML parsers and the XML parsers. Here's a short
-document, parsed as HTML::
+document, parsed as HTML using the parser that comes with Python::
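-    BeautifulSoup("<a><b /></a>")
-    # <html><head></head><body><a><b></b></a></body></html>
+    BeautifulSoup("<a><b /></a>", "html.parser")
+    # <a><b></b></a>
-Since an empty <b /> tag is not valid HTML, the parser turns it into a
-<b></b> tag pair.
+Since a standalone <b/> tag is not valid HTML, html.parser turns it into
+a <b></b> tag pair.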
Here's the same document parsed as XML (running this requires that you
-have lxml installed). Note that the empty <b /> tag is left alone, and
+have lxml installed). Note that the standalone <b/> tag is left alone, and
that the document is given an XML declaration instead of being put
into an <html> tag.::
-    BeautifulSoup("<a><b /></a>", "xml")
+    print(BeautifulSoup("<a><b /></a>", "xml"))
     # <?xml version="1.0" encoding="utf-8"?>
     # <a><b/></a>
@@ -2506,8 +2505,8 @@ document.
But if the document is not perfectly-formed, different parsers will
give different results. Here's a short, invalid document parsed using
-lxml's HTML parser. Note that the dangling </p> tag is simply
-ignored::
+lxml's HTML parser. Note that the <a> tag gets wrapped in <body> and
+<html> tags, and the dangling </p> tag is simply ignored::
     BeautifulSoup("<a></p>", "lxml")
     # <html><body><a></a></body></html>
@@ -2518,8 +2517,8 @@ Here's the same document parsed using html5lib::
     # <html><head></head><body><a><p></p></a></body></html>
Instead of ignoring the dangling </p> tag, html5lib pairs it with an
-opening <p> tag. This parser also adds an empty <head> tag to the
-document.
+opening <p> tag. html5lib also adds an empty <head> tag; lxml didn't
+bother.
Here's the same document parsed with Python's built-in HTML
parser::
@@ -2528,14 +2527,13 @@ parser::
     # <a></a>
Like html5lib, this parser ignores the closing </p> tag. Unlike
-html5lib, this parser makes no attempt to create a well-formed HTML
-document by adding a <body> tag. Unlike lxml, it doesn't even bother
-to add an <html> tag.
+html5lib or lxml, this parser makes no attempt to create a
+well-formed HTML document by adding <html> or <body> tags.
Since the document "<a></p>" is invalid, none of these techniques is
-the "correct" way to handle it. The html5lib parser uses techniques
+the 'correct' way to handle it. The html5lib parser uses techniques
that are part of the HTML5 standard, so it has the best claim on being
-the "correct" way, but all three techniques are legitimate.
+the 'correct' way, but all three techniques are legitimate.
Differences between parsers can affect your script. If you're planning
on distributing your script to other people, or running it on multiple
--
cgit v1.2.3