Here are some unit tests that fail with HTMLParser.
    def testValidButBogusDeclarationFAILS(self):
        self.assertSoupEquals('a', 'a')

    def testIncompleteDeclarationAtEndFAILS(self):
        self.assertSoupEquals('a, baz""",
        ', baz')
        # SGMLParser generates bogus parse events when attribute values
        # contain embedded brackets, but at least Beautiful Soup fixes
        # it up a little.
        self.assertSoupEquals('', '">')
        self.assertSoupEquals(' and blah and blah""")
        invalidEntity = "foobar;baz"
        soup = BeautifulStoneSoup(invalidEntity,
                                  convertEntities=htmlEnt)
        self.assertEquals(str(soup), invalidEntity)
Tag names that contain Unicode characters crash the parser:
    def testUnicodeTagNamesFAILS(self):
        self.assertSoupEquals("<デダ芻デダtext>2PM</デダ芻デダtext>")
Here's the implementation of NavigableString.__unicode__:
    def __unicode__(self):
        return unicode(str(self))
It converts the Unicode object to a string, and then back to Unicode. I
can't find any other way of turning an instance of a Unicode subclass
into a plain Unicode object. This is pretty bad, and a better technique
is welcome.
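One candidate technique, offered as a sketch and not as what Beautiful Soup does: call the base text type's own `__getitem__` with a full slice, which skips any methods the subclass overrides. `FancyText` and `to_plain_text` are hypothetical names used only for illustration, and the approach assumes CPython's behaviour of returning an exact base-type object when a subclass instance is sliced this way.

```python
# Sketch only: FancyText is a hypothetical stand-in for a class such as
# NavigableString; it is not part of Beautiful Soup.
text_type = type(u"")  # unicode on Python 2, str on Python 3


class FancyText(text_type):
    """A text subclass that overrides __str__, as a subclass might do."""

    def __str__(self):
        return "<FancyText>"


def to_plain_text(value):
    """Return a copy of value whose type is exactly text_type."""
    # Calling the base type's __getitem__ directly bypasses any
    # __getitem__ or __str__ defined on the subclass. That the result
    # is an exact text_type object is assumed CPython behaviour.
    return text_type.__getitem__(value, slice(None))


fancy = FancyText(u"caf\xe9")
plain = to_plain_text(fancy)
```

Here `plain` compares equal to `fancy` but is no longer an instance of the subclass, and no encode/decode round trip is involved, so non-ASCII text is not at risk from the default codec.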