author | Leonard Richardson <leonard.richardson@canonical.com> | 2013-06-02 22:19:37 -0400 |
---|---|---|
committer | Leonard Richardson <leonard.richardson@canonical.com> | 2013-06-02 22:19:37 -0400 |
commit | 4a9444ac0b74fbf84cf86b9fcf6055c85e65f62a (patch) | |
tree | 570cbcb2c9ab9cf458edee87490afeffd8377560 /doc/source | |
parent | 11dad27424b319a2034f59f5a7f48286551102d0 (diff) | |
parent | 4f9a654766df9ddd05e3ef274b4715b42668724f (diff) |
Merged in big encoding-detection refactoring branch.
Diffstat (limited to 'doc/source')
-rw-r--r-- | doc/source/index.rst | 18 |
1 files changed, 5 insertions, 13 deletions
diff --git a/doc/source/index.rst b/doc/source/index.rst
index a91854c..1b38df7 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -2478,9 +2478,11 @@ become Unicode::
  dammit.original_encoding
  # 'utf-8'
 
-The more data you give Unicode, Dammit, the more accurately it will
-guess. If you have your own suspicions as to what the encoding might
-be, you can pass them in as a list::
+Unicode, Dammit's guesses will get a lot more accurate if you install
+the ``chardet`` or ``cchardet`` Python libraries. The more data you
+give Unicode, Dammit, the more accurately it will guess. If you have
+your own suspicions as to what the encoding might be, you can pass
+them in as a list::
 
  dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
  print(dammit.unicode_markup)
@@ -2823,16 +2825,6 @@ significantly faster using lxml than using html.parser or html5lib.
 You can speed up encoding detection significantly by installing the
 `cchardet <http://pypi.python.org/pypi/cchardet/>`_ library.
 
-Sometimes `Unicode, Dammit`_ can only detect the encoding of a file by
-doing a byte-by-byte examination of the file. This slows Beautiful
-Soup to a crawl. My tests indicate that this only happened on 2.x
-versions of Python, and that it happened most often with documents
-using Russian or Chinese encodings. If this is happening to you, you
-can fix it by installing cchardet, or by using Python 3 for your
-script. If you happen to know a document's encoding, you can pass
-it into the ``BeautifulSoup`` constructor as ``from_encoding``, and
-bypass encoding detection altogether.
-
 `Parsing only part of a document`_ won't save you much time parsing
 the document, but it can save a lot of memory, and it'll make
 `searching` the document much faster.