Bare ampersands should be converted to HTML entities upon output.

It should also be possible, on output, to convert to HTML entities any Unicode characters found in htmlentitydefs.codepoint2name. (This algorithm would allow me to simplify Unicode, Dammit: convert everything to Unicode, and then convert to entities upon output, not treating smart quotes differently from any other Unicode character that can be represented as an entity.)

XML handling:

The elementtree XMLParser has a strip_cdata argument that, when set to False, should allow Beautiful Soup to preserve CDATA sections instead of treating them as text. (This argument is also present for HTMLParser, but does nothing.)

Later:

Currently, html5lib converts CDATA sections into comments. An as-yet-unreleased version of html5lib changes the parser's handling of CDATA sections to allow CDATA sections in tags like <svg> and <math>. The HTML5TreeBuilder will need to be updated to create CData objects instead of Comment objects in this situation.
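The entity conversion described at the top of these notes could look roughly like the sketch below. It is only an illustration, not Beautiful Soup's actual output code: the name substitute_entities and the regular expression for spotting a "bare" ampersand are assumptions made for this example. It is written for Python 3, where htmlentitydefs.codepoint2name is spelled html.entities.codepoint2name, and it restricts the table lookup to non-ASCII characters so that existing markup characters are left alone.

```python
import re
from html.entities import codepoint2name  # htmlentitydefs.codepoint2name on Python 2

# Assumption for this sketch: an ampersand is "bare" when it does not
# begin a named (&name;) or numeric (&#123; / &#x7B;) character reference.
BARE_AMPERSAND = re.compile(
    r"&(?!#\d+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)"
)


def substitute_entities(text):
    """Hypothetical helper: prepare an already-decoded Unicode string for output.

    1. Escape bare ampersands as &amp;.
    2. Replace each non-ASCII character that has a name in
       codepoint2name with the corresponding &name; entity.
    Characters without a named entity are passed through unchanged.
    """
    text = BARE_AMPERSAND.sub("&amp;", text)
    pieces = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name is not None:
            pieces.append("&" + name + ";")
        else:
            pieces.append(ch)
    return "".join(pieces)


if __name__ == "__main__":
    # Smart quotes are handled like any other character with a named entity.
    print(substitute_entities("AT&T &amp; caf\u00e9 \u201chi\u201d"))
    # prints: AT&amp;T &amp; caf&eacute; &ldquo;hi&rdquo;
```

Because smart quotes go through the same table lookup as every other character, no special case is needed for them once the input has been converted to Unicode.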