html5lib has its own Unicode, Dammit-like system. Converting the input
to Unicode should be up to the builder. The lxml builder would use
Unicode, Dammit, and the html5lib builder would be a no-op.

Bare ampersands should be converted to HTML entities upon output.

It should also be possible to convert certain Unicode characters to
HTML entities upon output.

---

Here are some unit tests that fail with HTMLParser.

    def testValidButBogusDeclarationFAILS(self):
        self.assertSoupEquals('a', 'a')

    def testIncompleteDeclarationAtEndFAILS(self):
        self.assertSoupEquals('a, baz""", ', baz')

        # SGMLParser generates bogus parse events when attribute values
        # contain embedded brackets, but at least Beautiful Soup fixes
        # it up a little.
        self.assertSoupEquals('', '">')
        self.assertSoupEquals(' and blah and blah""")

        invalidEntity = "foo&#bar;baz"
        soup = BeautifulStoneSoup\
               (invalidEntity, convertEntities=htmlEnt)
        self.assertEquals(str(soup), invalidEntity)

Tag names that contain Unicode characters crash the parser:

    def testUnicodeTagNamesFAILS(self):
        self.assertSoupEquals("<デダ芻デダtext>2PM")

Here's the implementation of NavigableString.__unicode__:

    def __unicode__(self):
        return unicode(str(self))

It converts the Unicode to a string, and then back to Unicode. I can't
find any other way of turning an element of a Unicode subclass into a
normal Unicode object. This is pretty bad and a better technique is
welcome.
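
One possible alternative to the round trip through str, offered as a
sketch rather than a tested fix: take a full slice using the base text
type's own __getitem__. As far as I can tell, CPython returns an exact
base-type object from that call, not an instance of the subclass, but
this is an assumption about CPython and should be checked on every
interpreter that matters. NavigableStringSketch, text_type and
to_plain_text are hypothetical names used only for illustration; they
are not part of Beautiful Soup.

```python
# Hypothetical stand-in for NavigableString; not Beautiful Soup's real class.
text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3


class NavigableStringSketch(text_type):
    """A text subclass, used only to illustrate the conversion."""

    def to_plain_text(self):
        # Call the base type's slicing directly so that an overridden
        # __getitem__ on the subclass cannot interfere. Assumption: on
        # CPython, a full slice of a subclass instance yields an object
        # of the exact base type, with no encode/decode step involved.
        return text_type.__getitem__(self, slice(None))


s = NavigableStringSketch(u"caf\u00e9")
plain = s.to_plain_text()

# Report whether the result is the exact base type and still equal
# to the original text.
print(type(plain) is text_type)
print(plain == s)
```

The point of this approach is that the text never passes through a
byte string, so non-ASCII characters cannot trip over the default
encoding along the way.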