From 3793495c8ea91243f9689d9788d30b9c6e0740d7 Mon Sep 17 00:00:00 2001
From: Leonard Richardson
Date: Mon, 16 Apr 2012 10:35:13 -0400
Subject: Unicode, Dammit now has an option to turn MS smart quotes into ASCII characters.

---
 doc/source/index.rst | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

(limited to 'doc')

diff --git a/doc/source/index.rst b/doc/source/index.rst
index d4dabb1..a7757d6 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -2391,21 +2391,26 @@ Unicode, Dammit has one special feature that Beautiful Soup doesn't
 use. You can use it to convert Microsoft smart quotes to HTML or XML
 entities::
 
- markup = b"<p>I just \x93love\x94 Microsoft Word</p>"
+ markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
 
  UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
- # u'<p>I just &ldquo;love&rdquo; Microsoft Word</p>'
+ # u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'
 
  UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
- # u'<p>I just &#x201C;love&#x201D; Microsoft Word</p>'
+ # u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'
 
-You might find this feature useful, but Beautiful Soup doesn't use
-it. Beautiful Soup prefers the default behavior, which is to convert
-Microsoft smart quotes to Unicode characters along with everything
-else::
+You can also convert Microsoft smart quotes to ASCII quotes::
+
+ UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
+ # u'<p>I just "love" Microsoft Word\'s smart quotes</p>'
+
+Hopefully you'll find this feature useful, but Beautiful Soup doesn't
+use it. Beautiful Soup prefers the default behavior, which is to
+convert Microsoft smart quotes to Unicode characters along with
+everything else::
 
  UnicodeDammit(markup, ["windows-1252"]).unicode_markup
- # u'<p>I just \u201clove\u201d Microsoft Word</p>'
+ # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'
 
 Parsing only part of a document
 ===============================
-- 
cgit v1.2.3