author | Leonard Richardson <leonardr@segfault.org> | 2015-06-27 09:55:40 -0400 |
---|---|---|
committer | Leonard Richardson <leonardr@segfault.org> | 2015-06-27 09:55:40 -0400 |
commit | feffc5a1146e2520c90682bc2c33f5fa7d3943f0 (patch) | |
tree | 6dce892919c201b629628647f86843382b29a60a /doc/source | |
parent | d728b9cbd6cd5954acf7c9c32fe2f1878809d6e8 (diff) |
Added an exclude_encodings argument to UnicodeDammit and to the
Beautiful Soup constructor, which lets you prohibit the detection of
an encoding that you know is wrong. [bug=1469408]
Diffstat (limited to 'doc/source')
-rw-r--r-- | doc/source/index.rst | 13 |
1 files changed, 13 insertions, 0 deletions
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 1b7b1e6..821dad4 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -2397,6 +2397,19 @@ We can fix this by passing in the correct ``from_encoding``::
  soup.original_encoding
  'iso8859-8'
 
+If you don't know what the correct encoding is, but you know that
+Unicode, Dammit is guessing wrong, you can pass the wrong guesses in
+as ``exclude_encodings``::
+
+ soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
+ soup.h1
+ <h1>םולש</h1>
+ soup.original_encoding
+ 'WINDOWS-1255'
+
+(This isn't 100% correct, but Windows-1255 is a compatible superset of
+ISO-8859-8, so it's close enough.)
+
 In rare cases (usually when a UTF-8 document contains text written in
 a completely different encoding), the only way to get Unicode may be
 to replace some characters with the special Unicode character
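
The effect of ``exclude_encodings`` can be pictured with a small sketch that uses only the standard library. It is not how Unicode, Dammit detects encodings: the helper ``decode_with_exclusions`` and the candidate order are hypothetical, chosen only to show a wrong first guess being skipped once the caller rules it out.

```python
import codecs


def decode_with_exclusions(data, candidates, exclude_encodings=()):
    """Decode ``data`` with the first candidate encoding that works.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    Candidates named in ``exclude_encodings`` are skipped entirely.
    """
    # Normalize names so "ISO-8859-7" and "iso8859_7" compare equal.
    excluded = {codecs.lookup(name).name for name in exclude_encodings}
    for candidate in candidates:
        canonical = codecs.lookup(candidate).name
        if canonical in excluded:
            continue  # the caller knows this guess is wrong
        try:
            return data.decode(canonical), canonical
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding; try the next
    return None, None


# A Hebrew heading written with escapes to avoid bidi display issues.
markup = "<h1>\u05e9\u05dc\u05d5\u05dd</h1>".encode("iso-8859-8")

# Assumed guess order, picked so that the first guess is the wrong one.
candidates = ["iso-8859-7", "windows-1255"]

wrong_text, wrong_encoding = decode_with_exclusions(markup, candidates)
text, encoding = decode_with_exclusions(
    markup, candidates, exclude_encodings=["ISO-8859-7"]
)
print(wrong_encoding, repr(wrong_text))
print(encoding, repr(text))
```

The same bytes decode without error as ISO-8859-7, which is why a detector can settle on it. Excluding that name lets the next candidate through, and Windows-1255 maps these bytes to the same Hebrew letters as ISO-8859-8.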