From 92b81a790f6b7dcd1b274ecc311366f40a0b4efb Mon Sep 17 00:00:00 2001
From: Leonard Richardson
Date: Thu, 2 Feb 2012 11:04:01 -0500
Subject: Added people to AUTHORS whose recognition is overdue.

---
 doc/source/index.rst | 79 ++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 58 insertions(+), 21 deletions(-)

(limited to 'doc/source')

diff --git a/doc/source/index.rst b/doc/source/index.rst
index ba923dc..625a6f5 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -19,6 +19,18 @@ violates your expectations.
 The examples in this documentation should work the same way in Python
 2.7 and Python 3.2.
 
+You might be looking for the documentation for `Beautiful Soup 3
+`_. If
+you want to learn about the differences between Beautiful Soup 3 and
+Beautiful Soup 4, see `Porting code to BS4`_.
+
+Getting help
+------------
+
+If you have questions about Beautiful Soup, or run into problems,
+`send mail to the discussion group
+`_.
+
 Quick Start
 ===========
 
@@ -151,7 +163,7 @@ BS3, so it's still available, but if you're writing new code you
 should install ``beautifulsoup4``.)
 
 You can also `download the Beautiful Soup 4 source tarball
-`_
+`_
 and install it with ``setup.py``. The license for Beautiful Soup
 allows you to package the entire library with your application, so you
 can also download the tarball and insert the ``bs4`` directory into
@@ -1951,7 +1963,7 @@ entities::
  # u'<p>I just “love” Microsoft Word</p>'
 
 You might find this feature useful, but Beautiful Soup doesn't use
-it. Beautiful Soup prefers the default behavior, which is toconvert
+it. Beautiful Soup prefers the default behavior, which is to convert
 Microsoft smart quotes to Unicode characters along with everything
 else::
 
@@ -2073,7 +2085,7 @@ you're not using lxml as the underlying parser, my advice is to
 :ref:`start `. Beautiful Soup parses documents
 significantly faster using lxml than using html.parser or html5lib.
 
-Sometimes `Unicode, Dammit` can only detect the encoding of a file by
+Sometimes `Unicode, Dammit`_ can only detect the encoding of a file by
 doing a byte-by-byte examination of the file. This slows Beautiful
 Soup to a crawl. My tests indicate that this only happened on 2.x
 versions of Python, and that it happened most often with documents
@@ -2127,20 +2139,30 @@ becomes this::
 
  from bs4 import BeautifulSoup
 
-If you get the ``ImportError`` "No module named BeautifulSoup", your
-problem is that you're trying to run Beautiful Soup 3 code, but you
-only have Beautiful Soup 4 installed.
+* If you get the ``ImportError`` "No module named BeautifulSoup", your
+  problem is that you're trying to run Beautiful Soup 3 code, but you
+  only have Beautiful Soup 4 installed.
+
+* If you get the ``ImportError`` "No module named bs4", your problem
+  is that you're trying to run Beautiful Soup 4 code, but you only
+  have Beautiful Soup 3 installed.
+
+Although BS4 is mostly backwards-compatible with BS3, most of its
+methods have been deprecated and given new names for `PEP 8 compliance
+`_. There are numerous other
+renames and changes, and a few of them break backwards compatibility.
 
-If you get the ``ImportError`` "No module named bs4", your problem is
-that you're trying to run Beautiful Soup 4 code, but you only have
-Beautiful Soup 3 installed.
+Here's what you'll need to know to convert your BS3 code and habits to BS4:
 
-Although BS4 is almost entirely backwards-compatible with BS3, most of
-its methods have been deprecated and given new names for PEP 8
-compliance. There are numerous other renames and changes, a few of
-which break backwards compatibility.
+You need a parser
+^^^^^^^^^^^^^^^^^
 
-Here are the changes:
+Beautiful Soup 3 used Python's ``SGMLParser``, a module that was
+deprecated and removed in Python 3.0. Beautiful Soup 4 uses
+``html.parser`` by default, but you can plug in lxml or html5lib and
+use that instead. Until ``html.parser`` is improved to handle
+real-world HTML better, that's what I recommend you do. See `Be sure
+to install a good parser!`_
 
 Method names
 ^^^^^^^^^^^^
@@ -2210,7 +2232,7 @@ You can write this::
 
 (But the old code will still work.)
 
-Some of the generators used to yield None after they were done, and
+Some of the generators used to yield ``None`` after they were done, and
 then stop. That was a bug. Now the generators just stop.
 
 There are two new generators, :ref:`.strings and
@@ -2235,6 +2257,22 @@ Beautiful Soup considers any empty tag to be an empty-element tag. If
 you add a child to an empty-element tag, it stops being an
 empty-element tag.
 
+Entities
+^^^^^^^^
+
+An incoming HTML or XML entity is always converted into the
+corresponding Unicode character. Beautiful Soup 3 had a number of
+overlapping ways of dealing with entities, which have been
+removed. The ``BeautifulSoup`` constructor no longer recognizes the
+``smartQuotesTo`` or ``convertEntities`` arguments. (`Unicode,
+Dammit`_ still has ``smart_quotes_to``, but its default is now to turn
+smart quotes into Unicode.)
+
+If you want to turn those Unicode characters back into HTML entities
+on output, rather than turning them into UTF-8 characters, you need to
+use ``.encode``, as described in `Substituting HTML entities`. This
+may change before the final release.
+
 Miscellaneous
 ^^^^^^^^^^^^^
 
@@ -2242,12 +2280,11 @@ Miscellaneous
 contains a single tag B and nothing else, then A.string is the same as
 B.string. (Previously, it was None.)
 
-An incoming HTML or XML entity is always converted into the
-corresponding Unicode character. The ``BeautifulSoup`` constructor no
-longer recognizes the ``smartQuotesTo`` or ``convertEntities``
-arguments. (`Unicode, Dammit`_ still has ``smart_quotes_to``, but its
-default is now to turn smart quotes into Unicode.)
-
 The ``BeautifulSoup`` constructor no longer recognizes the
 `markupMassage` argument. It's now the parser's responsibility to
 handle markup correctly.
+
+The rarely-used alternate parser classes like
+``ICantBelieveItsBeautifulSoup`` and ``BeautifulSOAP`` have been
+removed. It's now the parser's decision how to handle ambiguous
+markup.
-- 
cgit v1.2.3
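
Note on the porting section touched by this patch: the two ``ImportError`` cases it describes ("No module named BeautifulSoup" versus "No module named bs4") can be handled in calling code with an import fallback. The following is only a sketch under that assumption; ``load_beautiful_soup`` is a hypothetical helper name and is not part of either Beautiful Soup release.

```python
def load_beautiful_soup():
    """Return (BeautifulSoup class, major version), or (None, None).

    Hypothetical helper, not part of Beautiful Soup itself.
    """
    try:
        # Beautiful Soup 4 lives in the ``bs4`` package.
        from bs4 import BeautifulSoup
        return BeautifulSoup, 4
    except ImportError:
        pass
    try:
        # Beautiful Soup 3 lives in the ``BeautifulSoup`` module.
        from BeautifulSoup import BeautifulSoup
        return BeautifulSoup, 3
    except ImportError:
        # Neither release is installed.
        return None, None


soup_class, major_version = load_beautiful_soup()
if soup_class is None:
    print("Beautiful Soup is not installed")
else:
    print("Using Beautiful Soup %d" % major_version)
```

Code written this way still has to cope with the renamed methods and removed constructor arguments listed in the patch, so the fallback only gets past the import step.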