author: Leonard Richardson <leonardr@segfault.org> 2012-07-03 17:59:25 -0400
committer: Leonard Richardson <leonardr@segfault.org> 2012-07-03 17:59:25 -0400
commit: 96eaf6e8f54d84b02e0c3c8c334e7cfd29ef343c
tree: 7896fbad9bee2fac1f8e89df3d9235cfd4945e40 /doc/source
parent: f0102682ece130382500f0ee58fbc3340f221d54
Mentioned cchardet in docs.
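
For context, the ``from_encoding`` route that the changed paragraph describes might look like the minimal sketch below. It assumes Beautiful Soup 4 is installed as ``bs4``; the sample markup, the choice of KOI8-R, and the helper name ``parse_known_encoding`` are illustrative and not part of this commit.

```python
# Sketch only: the sample markup and the helper name `parse_known_encoding`
# are illustrative assumptions, not part of Beautiful Soup or this commit.
try:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
except ImportError:  # keep the sketch importable when bs4 is missing
    BeautifulSoup = None

# Bytes in an encoding we already know (KOI8-R, a Russian encoding).
RAW = "<p>Привет</p>".encode("koi8-r")


def parse_known_encoding(markup, encoding):
    """Parse bytes whose encoding is already known.

    Passing from_encoding tells Beautiful Soup which encoding to use,
    so it does not have to guess one from the bytes.
    """
    if BeautifulSoup is None:
        raise RuntimeError("Beautiful Soup 4 is not installed")
    return BeautifulSoup(markup, "html.parser", from_encoding=encoding)


if BeautifulSoup is not None:
    soup = parse_known_encoding(RAW, "koi8-r")
    print(soup.original_encoding)  # the encoding Beautiful Soup settled on
    print(soup.p.string)           # the decoded text of the <p> tag
```

When the encoding is not known in advance, the commit's other suggestion applies instead: install cchardet so that detection itself runs faster.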
Diffstat (limited to 'doc/source')
-rw-r--r-- doc/source/index.rst | 10
1 files changed, 7 insertions, 3 deletions
diff --git a/doc/source/index.rst b/doc/source/index.rst
index e5e3fbc..7c4b847 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -2685,14 +2685,18 @@ you're not using lxml as the underlying parser, my advice is to
 :ref:`start <parser-installation>`. Beautiful Soup parses documents
 significantly faster using lxml than using html.parser or html5lib.
 
+You can speed up encoding detection significantly by installing the
+`cchardet <http://pypi.python.org/pypi/cchardet/>`_ library.
+
 Sometimes `Unicode, Dammit`_ can only detect the encoding of a file by
 doing a byte-by-byte examination of the file. This slows Beautiful
 Soup to a crawl. My tests indicate that this only happened on 2.x
 versions of Python, and that it happened most often with documents
 using Russian or Chinese encodings. If this is happening to you, you
-can fix it by using Python 3 for your script. Or, if you happen to
-know a document's encoding, you can pass it into the
-``BeautifulSoup`` constructor as ``from_encoding``.
+can fix it by installing cchardet, or by using Python 3 for your
+script. If you happen to know a document's encoding, you can pass
+it into the ``BeautifulSoup`` constructor as ``from_encoding``, and
+bypass encoding detection altogether.
 
 `Parsing only part of a document`_ won't save you much time parsing
 the document, but it can save a lot of memory, and it'll make