TODO


html5lib has its own system, similar to Unicode, Dammit, for detecting
the input encoding. Converting the input to Unicode should therefore be
up to the builder: the lxml builder would use Unicode, Dammit, and for
the html5lib builder the conversion step would be a no-op.
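
One way the per-builder conversion could look is sketched below. Every
name here (TreeBuilder, prepare_markup, DecodingBuilder,
PassThroughBuilder) is hypothetical and not part of any existing API,
and the simple try-each-encoding loop is only a stand-in for Unicode,
Dammit. The sketch is written for Python 3.

```python
class TreeBuilder:
    """Hypothetical base class: each builder decides how to decode input."""

    def prepare_markup(self, markup):
        raise NotImplementedError


class DecodingBuilder(TreeBuilder):
    """Stand-in for the lxml builder: converts bytes to Unicode itself."""

    def __init__(self, candidate_encodings=("utf-8", "windows-1252")):
        # Assumption: a fixed list of guesses replaces real detection.
        self.candidate_encodings = candidate_encodings

    def prepare_markup(self, markup):
        if isinstance(markup, str):
            return markup  # already Unicode
        for encoding in self.candidate_encodings:
            try:
                return markup.decode(encoding)
            except UnicodeDecodeError:
                continue
        # Last resort: decode lossily rather than fail.
        return markup.decode("utf-8", errors="replace")


class PassThroughBuilder(TreeBuilder):
    """Stand-in for the html5lib builder: the parser does its own decoding."""

    def prepare_markup(self, markup):
        return markup  # no-op
```

With this split, the code that feeds markup to a parser would call
builder.prepare_markup(markup) and never need to know which builder
does real work.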

Bare ampersands should be converted to HTML entities upon output.
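
A minimal sketch of what "bare" could mean in practice, assuming an
ampersand counts as bare when it does not start a well-formed named or
numeric character reference. The function name and the regular
expression are illustrative assumptions, not existing Beautiful Soup
code, and the sketch is written for Python 3.

```python
import re

# An ampersand that is NOT followed by a complete reference such as
# "amp;", "#38;" or "#x26;".
BARE_AMPERSAND = re.compile(
    r"&(?!#[0-9]+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)"
)


def escape_bare_ampersands(text):
    """Replace bare ampersands with &amp;, leaving existing references alone."""
    return BARE_AMPERSAND.sub("&amp;", text)
```

For example, escape_bare_ampersands("AT&T &amp; &#38;") leaves the two
existing references untouched and rewrites only the ampersand in "AT&T".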

It should also be possible to convert certain Unicode characters to
HTML entities upon output.
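
As a rough illustration, the substitution could be driven by the
standard library's table of named entities. The function name is
hypothetical, and the choice to convert only non-ASCII characters that
have a named entity is an assumption about what "certain" should mean.
The sketch is written for Python 3.

```python
from html.entities import codepoint2name


def substitute_named_entities(text):
    """Replace non-ASCII characters that have a named HTML entity."""
    pieces = []
    for char in text:
        name = codepoint2name.get(ord(char))
        if name is not None and ord(char) > 127:
            pieces.append("&%s;" % name)
        else:
            # ASCII characters and characters without a named entity
            # are passed through unchanged.
            pieces.append(char)
    return "".join(pieces)
```

Characters with no named entity, such as most CJK text, are left as
they are; a numeric reference would be the obvious fallback if one were
wanted.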

---

Here are some unit tests that fail with HTMLParser.

    def testValidButBogusDeclarationFAILS(self):
        self.assertSoupEquals('<! Foo >a', '<!Foo >a')

    def testIncompleteDeclarationAtEndFAILS(self):
        self.assertSoupEquals('a<!b')

    def testIncompleteEntityAtEndFAILS(self):
        self.assertSoupEquals('&lt;Hello&gt')

        # This is not what the original author had in mind, but it's
        # a legitimate interpretation of what they wrote.
        self.assertSoupEquals("""<a href="foo</a>, </a><a href="bar">baz</a>""",
        '<a href="foo&lt;/a&gt;, &lt;/a&gt;&lt;a href="></a>, <a href="bar">baz</a>')
        # SGMLParser generates bogus parse events when attribute values
        # contain embedded brackets, but at least Beautiful Soup fixes
        # it up a little.
        self.assertSoupEquals('<a b="<a>">', '<a b="&lt;a&gt;"></a><a>"></a>')
        self.assertSoupEquals('<a href="http://foo.com/<a> and blah and blah',
                              """<a href='"http://foo.com/'></a><a> and blah and blah</a>""")

        invalidEntity = "foo&#bar;baz"
        soup = BeautifulStoneSoup(invalidEntity,
                                  convertEntities=htmlEnt)
        self.assertEquals(str(soup), invalidEntity)


Tag names that contain Unicode characters crash the parser:

    def testUnicodeTagNamesFAILS(self):
        self.assertSoupEquals("<デダ芻デダtext>2PM</デダ芻デダtext>")

Here's the implementation of NavigableString.__unicode__:

    def __unicode__(self):
        return unicode(str(self))

It converts the Unicode object to a byte string, and then back to
Unicode. I can't find any other way of turning an instance of a unicode
subclass into a plain unicode object. The round trip is pretty bad, and
a better technique is welcome.
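
For comparison, here is a sketch of the same problem on Python 3, where
str is the Unicode type. Calling the base class's own __str__ directly
returns an exact str without any encode/decode round trip. The method
name plain() is hypothetical, and whether an equivalent call behaves the
same way on the Python 2 unicode type has not been verified here.

```python
class NavigableString(str):
    """Simplified stand-in for the real class, for illustration only."""

    def plain(self):
        # str.__str__ bypasses any override in the subclass and returns
        # an object whose type is exactly str.
        return str.__str__(self)
```

The result compares equal to the original string but is no longer an
instance of the subclass.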