From feffc5a1146e2520c90682bc2c33f5fa7d3943f0 Mon Sep 17 00:00:00 2001
From: Leonard Richardson
Date: Sat, 27 Jun 2015 09:55:40 -0400
Subject: Added an exclude_encodings argument to UnicodeDammit and to the Beautiful Soup constructor, which lets you prohibit the detection of an encoding that you know is wrong. [bug=1469408]

---
 doc/source/index.rst | 13 +++++++++++++
 1 file changed, 13 insertions(+)

(limited to 'doc/source')

diff --git a/doc/source/index.rst b/doc/source/index.rst
index 1b7b1e6..821dad4 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -2397,6 +2397,19 @@ We can fix this by passing in the correct ``from_encoding``::
  soup.original_encoding
  'iso8859-8'
 
+If you don't know what the correct encoding is, but you know that
+Unicode, Dammit is guessing wrong, you can pass the wrong guesses in
+as ``exclude_encodings``::
+
+ soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
+ soup.h1
+ <h1>םולש</h1>
+ soup.original_encoding
+ 'WINDOWS-1255'
+
+(This isn't 100% correct, but Windows-1255 is a compatible superset of
+ISO-8859-8, so it's close enough.)
+
 In rare cases (usually when a UTF-8 document contains text written in a
 completely different encoding), the only way to get Unicode may be to
 replace some characters with the special Unicode character
-- 
cgit v1.2.3
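
The idea behind ``exclude_encodings`` in the patch above can be sketched with the standard library alone. This is a rough illustration, not Beautiful Soup's actual detection logic: ``guess_encoding`` and its candidate list are hypothetical names made up for this sketch, and the real detector may weigh candidates quite differently.

```python
import codecs


def guess_encoding(data, candidates, exclude_encodings=()):
    """Return (text, encoding) for the first candidate that decodes `data`.

    Hypothetical helper for illustration only; not part of Beautiful Soup.
    """
    # Normalize names so "ISO-8859-7" and "iso8859-7" compare equal.
    excluded = {codecs.lookup(name).name for name in exclude_encodings}
    for name in candidates:
        canonical = codecs.lookup(name).name
        if canonical in excluded:
            continue  # the caller has told us this guess is wrong
        try:
            return data.decode(canonical), canonical
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding; try the next
    return None, None


# The Hebrew heading from the patch, encoded as ISO-8859-8.
markup = "<h1>\u05dd\u05d5\u05dc\u05e9</h1>".encode("iso8859-8")
candidates = ["ascii", "iso8859-7", "iso8859-8"]

# Without exclusions the first encoding that decodes cleanly wins.
wrong_text, wrong_encoding = guess_encoding(markup, candidates)

# Excluding the known-wrong guess lets a later candidate be chosen.
text, encoding = guess_encoding(
    markup, candidates, exclude_encodings=["ISO-8859-7"]
)

print(wrong_encoding)
print(encoding)
```

In this sketch the same bytes are valid in both ISO-8859-7 and ISO-8859-8, so a decoder cannot tell them apart by validity alone. Ruling out the Greek guess is what lets the Hebrew reading through, which mirrors the situation the documentation change describes.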