Merged in big encoding-detection refactoring branch.

author: Leonard Richardson <leonard.richardson@canonical.com> 2013-06-02 22:19:37 -0400
committer: Leonard Richardson <leonard.richardson@canonical.com> 2013-06-02 22:19:37 -0400
commit: 4a9444ac0b74fbf84cf86b9fcf6055c85e65f62a (patch)
tree: 570cbcb2c9ab9cf458edee87490afeffd8377560 /doc/source
parent: 11dad27424b319a2034f59f5a7f48286551102d0 (diff)
parent: 4f9a654766df9ddd05e3ef274b4715b42668724f (diff)
1 files changed, 5 insertions, 13 deletions
diff --git a/doc/source/index.rst b/doc/source/index.rst
index a91854c..1b38df7 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -2478,9 +2478,11 @@ become Unicode::
  dammit.original_encoding
  # 'utf-8'
 
-The more data you give Unicode, Dammit, the more accurately it will
-guess. If you have your own suspicions as to what the encoding might
-be, you can pass them in as a list::
+Unicode, Dammit's guesses will get a lot more accurate if you install
+the ``chardet`` or ``cchardet`` Python libraries. The more data you
+give Unicode, Dammit, the more accurately it will guess. If you have
+your own suspicions as to what the encoding might be, you can pass
+them in as a list::
 
  dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
  print(dammit.unicode_markup)
@@ -2823,16 +2825,6 @@ significantly faster using lxml than using html.parser or html5lib.
 You can speed up encoding detection significantly by installing the
 `cchardet <http://pypi.python.org/pypi/cchardet/>`_ library.
 
-Sometimes `Unicode, Dammit`_ can only detect the encoding of a file by
-doing a byte-by-byte examination of the file. This slows Beautiful
-Soup to a crawl. My tests indicate that this only happened on 2.x
-versions of Python, and that it happened most often with documents
-using Russian or Chinese encodings. If this is happening to you, you
-can fix it by installing cchardet, or by using Python 3 for your
-script. If you happen to know a document's encoding, you can pass
-it into the ``BeautifulSoup`` constructor as ``from_encoding``, and
-bypass encoding detection altogether.
-
 `Parsing only part of a document`_ won't save you much time parsing
 the document, but it can save a lot of memory, and it'll make
 `searching` the document much faster.
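
The documentation text touched by this patch says you can pass your own encoding suspicions to Unicode, Dammit as a list. As a rough illustration of that idea only, the sketch below tries a list of candidate encodings in order using nothing but the standard library. The helper name ``decode_with_candidates`` is hypothetical and is not part of Beautiful Soup; this is not how Unicode, Dammit is implemented, and it does none of the statistical detection that ``chardet`` or ``cchardet`` provide.

```python
def decode_with_candidates(data, candidates):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper for illustration; not a Beautiful Soup API.
    Returns (None, None) if no candidate works.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: Python does not recognise the encoding name.
            continue
    return None, None


# b"\xe9" followed by a space is not valid UTF-8, so the first guess
# fails and the second one is used.
text, encoding = decode_with_candidates(b"Sacr\xe9 bleu!", ["utf-8", "latin-1"])
print(text)      # Sacré bleu!
print(encoding)  # latin-1
```

Order matters in a sketch like this: ``latin-1`` accepts any byte sequence, so it only makes sense as a last resort after stricter guesses have been tried.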