From 197470f217ab994dd0ba8e143418a54d69df8523 Mon Sep 17 00:00:00 2001
From: Leonard Richardson
Date: Fri, 24 Apr 2020 22:13:30 -0400
Subject: If you encode a document with a Python-specific encoding like
 'unicode_escape', that encoding is no longer mentioned in the final XML or
 HTML document. Instead, encoding information is omitted or left blank.
 [bug=1874955]

---
 doc/source/index.rst | 44 +++++++++++++++++++++-----------------------
 1 file changed, 21 insertions(+), 23 deletions(-)

(limited to 'doc/source')

diff --git a/doc/source/index.rst b/doc/source/index.rst
index dbc8c15..148b30f 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -290,10 +290,9 @@ This table summarizes the advantages and disadvantages of each parser library:
 +----------------------+--------------------------------------------+--------------------------------+--------------------------+
 
 If you can, I recommend you install and use lxml for speed. If you're
-using a version of Python 2 earlier than 2.7.3, or a version of Python
-3 earlier than 3.2.2, it's `essential` that you install lxml or
-html5lib--Python's built-in HTML parser is just not very good in older
-versions.
+using a very old version of Python -- earlier than 2.7.3 or 3.2.2 --
+it's `essential` that you install lxml or html5lib. Python's built-in
+HTML parser is just not very good in those old versions.
 
 Note that if a document is invalid, different parsers will generate
 different Beautiful Soup trees for it. See `Differences
@@ -310,13 +309,13 @@ constructor. You can pass in a string or an open filehandle::
  with open("index.html") as fp:
      soup = BeautifulSoup(fp)
 
- soup = BeautifulSoup("<html>data</html>")
+ soup = BeautifulSoup("<html>a web page</html>")
 
 First, the document is converted to Unicode, and HTML entities are
 converted to Unicode characters::
 
- BeautifulSoup("Sacr&eacute; bleu!")
- <html><head></head><body>Sacré bleu!</body></html>
+ print(BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>"))
+ # <html><head></head><body>Sacré bleu!</body></html>
 
 Beautiful Soup then parses the document using the best available
 parser. It will use an HTML parser unless you specifically tell it to
@@ -2481,20 +2480,20 @@ Beautiful Soup presents the same interface to a number of different
 parsers, but each parser is different. Different parsers will create
 different parse trees from the same document. The biggest differences
 are between the HTML parsers and the XML parsers. Here's a short
-document, parsed as HTML::
+document, parsed as HTML using the parser that comes with Python::
 
- BeautifulSoup("<a><b /></a>")
- # <html><head></head><body><a><b></b></a></body></html>
+ BeautifulSoup("<a><b /></a>", "html.parser")
+ # <a><b></b></a>
 
-Since an empty <b /> tag is not valid HTML, the parser turns it into a
-<b></b> tag pair.
+Since a standalone <b/> tag is not valid HTML, html.parser turns it into
+a <b></b> tag pair.
 
 Here's the same document parsed as XML (running this requires that you
-have lxml installed). Note that the empty <b /> tag is left alone, and
+have lxml installed). Note that the standalone <b/> tag is left alone, and
 that the document is given an XML declaration instead of being put
 into an <html> tag.::
 
- BeautifulSoup("<a><b /></a>", "xml")
+ print(BeautifulSoup("<a><b /></a>", "xml"))
  # <?xml version="1.0" encoding="utf-8"?>
  # <a><b/></a>
 
@@ -2506,8 +2505,8 @@ document.
 
 But if the document is not perfectly-formed, different parsers will
 give different results. Here's a short, invalid document parsed using
-lxml's HTML parser. Note that the dangling </p> tag is simply
-ignored::
+lxml's HTML parser. Note that the <a> tag gets wrapped in <body> and
+<html> tags, and the dangling </p> tag is simply ignored::
 
  BeautifulSoup("<a></p>", "lxml")
  # <html><body><a></a></body></html>
@@ -2518,8 +2517,8 @@ Here's the same document parsed using html5lib::
  # <html><head></head><body><a><p></p></a></body></html>
 
 Instead of ignoring the dangling </p> tag, html5lib pairs it with an
-opening <p> tag. This parser also adds an empty <head> tag to the
-document.
+opening <p> tag. html5lib also adds an empty <head> tag; lxml didn't
+bother.
 
 Here's the same document parsed with Python's built-in HTML
 parser::
@@ -2528,14 +2527,13 @@ parser::
  # <a></a>
 
 Like html5lib, this parser ignores the closing </p> tag. Unlike
-html5lib, this parser makes no attempt to create a well-formed HTML
-document by adding a <body> tag. Unlike lxml, it doesn't even bother
-to add an <html> tag.
+html5lib or lxml, this parser makes no attempt to create a
+well-formed HTML document by adding <html> or <body> tags.
 
 Since the document "<a></p>" is invalid, none of these techniques is
-the "correct" way to handle it. The html5lib parser uses techniques
+the 'correct' way to handle it. The html5lib parser uses techniques
 that are part of the HTML5 standard, so it has the best claim on being
-the "correct" way, but all three techniques are legitimate.
+the 'correct' way, but all three techniques are legitimate.
 
 Differences between parsers can affect your script. If you're planning
 on distributing your script to other people, or running it on multiple
-- 
cgit v1.2.3
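
Reviewer note on the html.parser examples in this patch. The sketch below is not part of the commit and does not use Beautiful Soup. It uses only Python's standard-library ``html.parser.HTMLParser`` to log the tag events the built-in parser reports for the two sample documents discussed above. ``EventLogger`` and ``tag_events`` are hypothetical names introduced here for illustration. How Beautiful Soup turns these events into a tree is not shown, so treat this as a rough illustration of why no <html> or <body> wrapper appears, not as a description of the library's internals.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records start/end tag events from the stdlib parser."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Feed markup to the stdlib parser and return the tag events it reported."""
    logger = EventLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


# The invalid document: the parser reports the tags as written and
# never invents <html>, <head> or <body> events.
print(tag_events("<a></p>"))

# The standalone tag: the default handler reports it as a start event
# followed by an end event, i.e. as a tag pair.
print(tag_events("<a><b /></a>"))
```

The event list for the invalid document contains only the tags that appear in the markup, which is in line with the patch's statement that this parser adds no <html> or <body> tags.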