html5lib has its own Unicode, Dammit-like system, so converting the
input to Unicode should be up to the builder. The lxml builder would
use Unicode, Dammit, and the html5lib builder's conversion step would
be a no-op.

Bare ampersands should be converted to HTML entities upon output.
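
A minimal sketch of what that output step could look like, assuming a
"bare" ampersand is one that does not begin a named, decimal, or
hexadecimal character reference. The names BARE_AMPERSAND and
escape_bare_ampersands are hypothetical and not part of Beautiful
Soup, and the sketch uses Python 3 syntax.

```python
import re

# An "&" counts as bare unless it starts something shaped like
# "&name;", "&#123;", or "&#x1F;". The name is not checked against
# the list of real HTML entities; that is a simplifying assumption.
BARE_AMPERSAND = re.compile(
    r"&(?!#\d+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)"
)


def escape_bare_ampersands(text):
    """Replace each bare ampersand in text with "&amp;"."""
    return BARE_AMPERSAND.sub("&amp;", text)


if __name__ == "__main__":
    print(escape_bare_ampersands("AT&T sells &lt;phones&gt; & plans"))
```

Anything that already looks like a character reference is left alone,
so running the function twice gives the same result as running it
once.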

It should also be possible to convert certain Unicode characters to
HTML entities upon output.
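
One possible shape for this, sketched with the standard library's
html.entities.codepoint2name table (Python 3). The function name
substitute_entities and its "only" parameter are hypothetical. The
sketch deliberately skips ASCII characters so that markup characters
such as "&" and "<" are not touched.

```python
from html.entities import codepoint2name


def substitute_entities(text, only=None):
    """Replace non-ASCII characters that have a named HTML entity.

    If only is given, it is a collection of characters to convert;
    every other character is passed through unchanged.
    """
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        wanted = only is None or ch in only
        if ord(ch) > 127 and name is not None and wanted:
            out.append("&" + name + ";")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(substitute_entities("caf\u00e9 \u00a9 2012"))
```

Characters with no named entity are left as they are; falling back to
numeric references for them would be a separate decision.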

XML handling:

The lxml.etree XMLParser has a strip_cdata argument that, when set to
False, should allow Beautiful Soup to preserve CDATA sections instead
of treating them as text. (This argument is also present for lxml's
HTMLParser, but does nothing.)

Later:

Currently, html5lib converts CDATA sections into comments. An
as-yet-unreleased version of html5lib changes the parser's handling of
CDATA sections to allow CDATA sections in tags like <svg> and
<math>. The HTML5TreeBuilder will need to be updated to create CData
objects instead of Comment objects in this situation.



---

Here are some unit tests that fail with HTMLParser.

    def testValidButBogusDeclarationFAILS(self):
        self.assertSoupEquals('<! Foo >a', '<!Foo >a')

    def testIncompleteDeclarationAtEndFAILS(self):
        self.assertSoupEquals('a<!b')

    def testIncompleteEntityAtEndFAILS(self):
        self.assertSoupEquals('&lt;Hello&gt')

        # This is not what the original author had in mind, but it's
        # a legitimate interpretation of what they wrote.
        self.assertSoupEquals("""<a href="foo</a>, </a><a href="bar">baz</a>""",
        '<a href="foo&lt;/a&gt;, &lt;/a&gt;&lt;a href="></a>, <a href="bar">baz</a>')
        # SGMLParser generates bogus parse events when attribute values
        # contain embedded brackets, but at least Beautiful Soup fixes
        # it up a little.
        self.assertSoupEquals('<a b="<a>">', '<a b="&lt;a&gt;"></a><a>"></a>')
        self.assertSoupEquals('<a href="http://foo.com/<a> and blah and blah',
                              """<a href='"http://foo.com/'></a><a> and blah and blah</a>""")

        invalidEntity = "foo&#bar;baz"
        soup = BeautifulStoneSoup(invalidEntity,
                                  convertEntities=htmlEnt)
        self.assertEquals(str(soup), invalidEntity)


Tag names that contain Unicode characters crash the parser:

    def testUnicodeTagNamesFAILS(self):
        self.assertSoupEquals("<デダ芻デダtext>2PM</デダ芻デダtext>")

Here's the implementation of NavigableString.__unicode__:

    def __unicode__(self):
        return unicode(str(self))

It converts the Unicode to a string, and then back to Unicode. I can't
find any other way of turning an element of a Unicode subclass into a
normal Unicode object. This is pretty bad and a better technique is
welcome.
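
For comparison, here is a sketch of one way to do this in Python 3,
where unicode is simply str. It is only an analogy: the code above
targets Python 2, and this has not been checked there. TextNode and
plain are hypothetical names standing in for NavigableString and its
conversion method.

```python
class TextNode(str):
    """Hypothetical stand-in for a text node that subclasses str."""

    def plain(self):
        # Calling the base class's __str__ directly bypasses any
        # override on the subclass. In CPython it returns a copy whose
        # type is exactly str, with no encode/decode round trip.
        return str.__str__(self)


if __name__ == "__main__":
    node = TextNode("2PM")
    print(type(node).__name__, type(node.plain()).__name__)
```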