Diffstat (limited to 'TODO')
-rw-r--r-- TODO 59
1 files changed, 6 insertions, 53 deletions
diff --git a/TODO b/TODO
index ea32bbb..a799bbb 100644
--- a/TODO
+++ b/TODO
@@ -1,11 +1,11 @@
-html5lib has its own Unicode, Dammit-like system. Converting the input
-to Unicode should be up to the builder. The lxml builder would use
-Unicode, Dammit, and the html5lib builder would be a no-op.
-
Bare ampersands should be converted to HTML entities upon output.
-It should also be possible to convert certain Unicode characters to
-HTML entities upon output.
+It should also be possible to, on output, convert to HTML entities any
+Unicode characters found in htmlentitydefs.codepoint2name. (This
+algorithm would allow me to simplify Unicode, Dammit--convert
+everything to Unicode, and then convert to entities upon output, not
+treating smart quotes differently from any other Unicode character
+that can be represented as an entity.)
XML handling:
@@ -21,50 +21,3 @@ as-yet-unreleased version of html5lib changes the parser's handling of
CDATA sections to allow CDATA sections in tags like <svg> and
<math>. The HTML5TreeBuilder will need to be updated to create CData
objects instead of Comment objects in this situation.
-
-
-
----
-
-Here are some unit tests that fail with HTMLParser.
-
-    def testValidButBogusDeclarationFAILS(self):
-        self.assertSoupEquals('<! Foo >a', '<!Foo >a')
-
-    def testIncompleteDeclarationAtEndFAILS(self):
-        self.assertSoupEquals('a<!b')
-
-    def testIncompleteEntityAtEndFAILS(self):
-        self.assertSoupEquals('&lt;Hello&gt')
-
-        # This is not what the original author had in mind, but it's
-        # a legitimate interpretation of what they wrote.
-        self.assertSoupEquals("""<a href="foo</a>, </a><a href="bar">baz</a>""",
-                              '<a href="foo&lt;/a&gt;, &lt;/a&gt;&lt;a href="></a>, <a href="bar">baz</a>')
-        # SGMLParser generates bogus parse events when attribute values
-        # contain embedded brackets, but at least Beautiful Soup fixes
-        # it up a little.
-        self.assertSoupEquals('<a b="<a>">', '<a b="&lt;a&gt;"></a><a>"></a>')
-        self.assertSoupEquals('<a href="http://foo.com/<a> and blah and blah',
-                              """<a href='"http://foo.com/'></a><a> and blah and blah</a>""")
-
-        invalidEntity = "foo&#bar;baz"
-        soup = BeautifulStoneSoup\
-               (invalidEntity,
-                convertEntities=htmlEnt)
-        self.assertEquals(str(soup), invalidEntity)
-
-
-Tag names that contain Unicode characters crash the parser:
-    def testUnicodeTagNamesFAILS(self):
-        self.assertSoupEquals("<デダ芻デダtext>2PM</デダ芻デダtext>")
-
-Here's the implementation of NavigableString.__unicode__:
-
-    def __unicode__(self):
-        return unicode(str(self))
-
-It converts the Unicode to a string, and then back to Unicode. I can't
-find any other way of turning an element of a Unicode subclass into a
-normal Unicode object. This is pretty bad and a better technique is
-welcome.
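
Note on the first hunk: the entity-conversion idea added to the TODO could be sketched roughly as follows. This is only an illustration, not code from this commit. It assumes Python 3, where `htmlentitydefs.codepoint2name` is available as `html.entities.codepoint2name`, and the function name `substitute_entities` is hypothetical.

```python
from html.entities import codepoint2name  # Python 3 home of htmlentitydefs.codepoint2name


def substitute_entities(text):
    """Replace each character that has a named HTML entity with "&name;".

    Hypothetical helper: it treats smart quotes like any other character
    in the table, as the TODO entry proposes. It is meant for text content
    only, because the table also covers '&', '<', '>' and '"', so markup
    passed through it would be escaped as well.
    """
    pieces = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is None:
            pieces.append(ch)  # no named entity: keep the character as-is
        else:
            pieces.append("&" + name + ";")
    return "".join(pieces)
```

Because `&` is itself in the table, the same pass would also cover the bare-ampersand item, provided it runs only over text that has already been decoded to Unicode and not over text that still contains entity references.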