author | Leonard Richardson <leonard.richardson@canonical.com> | 2012-02-16 13:55:20 -0500 |
---|---|---|
committer | Leonard Richardson <leonard.richardson@canonical.com> | 2012-02-16 13:55:20 -0500 |
commit | 1a50d9623831990ae0a78ea3a7e66fa098fe92ac | |
tree | d31578ac86c753c6e3427f574408a1ad960d80ac /bs4/doc/source | |
parent | ffcebc274b84b85a0b8c93c2aca8756df4baa236 |
By default, turn unrecognized characters into numeric XML entity refs.
Diffstat (limited to 'bs4/doc/source')
-rw-r--r-- | bs4/doc/source/index.rst | 21 |
1 files changed, 21 insertions, 0 deletions
diff --git a/bs4/doc/source/index.rst b/bs4/doc/source/index.rst
index 200317a..0467c00 100644
--- a/bs4/doc/source/index.rst
+++ b/bs4/doc/source/index.rst
@@ -2160,6 +2160,27 @@ element in the soup, just as if it were a Python string::
  soup.p.encode("utf-8")
  # '<p>Sacr\xc3\xa9 bleu!</p>'
 
+Any characters that can't be represented in your chosen encoding will
+be converted into numeric XML entity references. For instance, here's
+a document that includes the Unicode character SNOWMAN::
+
+ markup = u"<b>\N{SNOWMAN}</b>"
+ snowman_soup = BeautifulSoup(markup)
+ tag = snowman_soup.b
+
+The SNOWMAN character can be part of a UTF-8 document (it looks like
+☃), but there's no representation for that character in ISO-Latin-1 or
+ASCII, so it's converted into "&#9731;" for those encodings::
+
+ print(tag.encode("utf-8"))
+ # <b>☃</b>
+
+ print tag.encode("latin-1")
+ # <b>&#9731;</b>
+
+ print tag.encode("ascii")
+ # <b>&#9731;</b>
+
 Unicode, Dammit
 ---------------
 
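
The conversion this documentation change describes can be reproduced with the Python standard library alone. The sketch below is an illustration, not the code from this commit: it assumes the behavior matches Python's built-in ``xmlcharrefreplace`` error handler, and ``encode_with_entities`` is a hypothetical helper name that is not part of Beautiful Soup.

```python
# Standard-library illustration only; Beautiful Soup is not imported here.
# Assumption: the documented behavior resembles the "xmlcharrefreplace"
# error handler that str.encode() already provides.

def encode_with_entities(text, encoding):
    """Encode text, replacing any character the target encoding cannot
    represent with a numeric XML entity reference such as &#9731;."""
    return text.encode(encoding, errors="xmlcharrefreplace")


markup = "<b>\N{SNOWMAN}</b>"  # U+2603, decimal code point 9731

# UTF-8 can represent SNOWMAN, so the character is encoded directly.
print(encode_with_entities(markup, "utf-8"))    # b'<b>\xe2\x98\x83</b>'

# Latin-1 and ASCII cannot, so the character becomes an entity reference.
print(encode_with_entities(markup, "latin-1"))  # b'<b>&#9731;</b>'
print(encode_with_entities(markup, "ascii"))    # b'<b>&#9731;</b>'
```

Characters the target encoding can represent pass through unchanged, so "é" in a Latin-1 document is still written as the single byte ``\xe9`` rather than as an entity.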