From 4aff2ee4d6f077e06159c92ab05c0f2ea527c6fa Mon Sep 17 00:00:00 2001
From: Leonard Richardson
Date: Thu, 9 Feb 2012 16:15:56 -0500
Subject: As a last-ditch attempt to turn data into Unicode, use errors=replace instead of errors=strict.

---
 bs4/doc/source/index.rst | 9 +++++++++
 1 file changed, 9 insertions(+)

(limited to 'bs4/doc/source')

diff --git a/bs4/doc/source/index.rst b/bs4/doc/source/index.rst
index abea5c6..d28787b 100644
--- a/bs4/doc/source/index.rst
+++ b/bs4/doc/source/index.rst
@@ -2076,6 +2076,15 @@ We can fix this by passing in the correct ``from_encoding``::
  soup.original_encoding
  'iso8859-8'
 
+In rare cases (usually when a UTF-8 document contains text written in
+a completely different encoding), the only way to get Unicode may be
+to replace some characters with the special Unicode character
+"REPLACEMENT CHARACTER" (U+FFFD, �). If Unicode, Dammit needs to do
+this, it will set the ``.characters_were_replaced`` attribute to
+``True`` on the ``UnicodeDammit`` or ``BeautifulSoup`` object. This
+lets you know that the Unicode representation is not an exact
+representation of the original--some data was lost.
+
 Output encoding
 ---------------
 
--
cgit v1.2.3
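
Note on the behavior described in the subject line: the fallback from errors=strict to errors=replace can be sketched with the standard library alone. This is only an illustration, not the code in this commit. The helper name `decode_last_ditch` is hypothetical, and the boolean it returns merely plays the role that the added documentation assigns to ``.characters_were_replaced``.

```python
def decode_last_ditch(data, encoding="utf-8"):
    """Hypothetical helper (not part of Beautiful Soup).

    Try a strict decode first. If that fails, decode again with
    errors="replace", which substitutes U+FFFD for undecodable bytes.
    Returns (text, characters_were_replaced).
    """
    try:
        return data.decode(encoding, errors="strict"), False
    except UnicodeDecodeError:
        # Last-ditch attempt: lossy, but always yields a Unicode string.
        return data.decode(encoding, errors="replace"), True


# A UTF-8 document with two stray Windows-1252 quote bytes mixed in.
mixed = "caf\u00e9 ".encode("utf-8") + b"\x93quoted\x94"
text, replaced = decode_last_ditch(mixed)

print(repr(text))
print(replaced)
```

Returning the flag alongside the text keeps the data loss visible to the caller, which is the same reason the documentation above exposes an attribute instead of replacing characters silently.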