Diffstat (limited to 'TODO')
-rw-r--r-- | TODO | 59 |
1 files changed, 6 insertions, 53 deletions
@@ -1,11 +1,11 @@
-html5lib has its own Unicode, Dammit-like system. Converting the input
-to Unicode should be up to the builder. The lxml builder would use
-Unicode, Dammit, and the html5lib builder would be a no-op.
-
 Bare ampersands should be converted to HTML entities upon output.
 
-It should also be possible to convert certain Unicode characters to
-HTML entities upon output.
+It should also be possible to, on output, convert to HTML entities any
+Unicode characters found in htmlentitydefs.codepoint2name. (This
+algorithm would allow me to simplify Unicode, Dammit--convert
+everything to Unicode, and then convert to entities upon output, not
+treating smart quotes differently from any other Unicode character
+that can be represented as an entity.)
 
 XML handling:
 
@@ -21,50 +21,3 @@ as-yet-unreleased version of html5lib changes the parser's handling of
 CDATA sections to allow CDATA sections in tags like <svg> and
 <math>. The HTML5TreeBuilder will need to be updated to create CData
 objects instead of Comment objects in this situation.
-
-
-
----
-
-Here are some unit tests that fail with HTMLParser.
-
-    def testValidButBogusDeclarationFAILS(self):
-        self.assertSoupEquals('<! Foo >a', '<!Foo >a')
-
-    def testIncompleteDeclarationAtEndFAILS(self):
-        self.assertSoupEquals('a<!b')
-
-    def testIncompleteEntityAtEndFAILS(self):
-        self.assertSoupEquals('<Hello>')
-
-        # This is not what the original author had in mind, but it's
-        # a legitimate interpretation of what they wrote.
-        self.assertSoupEquals("""<a href="foo</a>, </a><a href="bar">baz</a>""",
-                              '<a href="foo</a>, </a><a href="></a>, <a href="bar">baz</a>')
-        # SGMLParser generates bogus parse events when attribute values
-        # contain embedded brackets, but at least Beautiful Soup fixes
-        # it up a little.
-        self.assertSoupEquals('<a b="<a>">', '<a b="<a>"></a><a>"></a>')
-        self.assertSoupEquals('<a href="http://foo.com/<a> and blah and blah',
-                              """<a href='"http://foo.com/'></a><a> and blah and blah</a>""")
-
-        invalidEntity = "foo&#bar;baz"
-        soup = BeautifulStoneSoup\
-               (invalidEntity,
-                convertEntities=htmlEnt)
-        self.assertEquals(str(soup), invalidEntity)
-
-
-Tag names that contain Unicode characters crash the parser:
-    def testUnicodeTagNamesFAILS(self):
-        self.assertSoupEquals("<デダ芻デダtext>2PM</デダ芻デダtext>")
-
-Here's the implementation of NavigableString.__unicode__:
-
-    def __unicode__(self):
-        return unicode(str(self))
-
-It converts the Unicode to a string, and then back to Unicode. I can't
-find any other way of turning an element of a Unicode subclass into a
-normal Unicode object. This is pretty bad and a better technique is
-welcome.
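The added TODO item describes converting, on output, any Unicode character found in htmlentitydefs.codepoint2name into its named HTML entity. A minimal sketch of that idea follows; it is not the project's implementation. It assumes Python 3, where the htmlentitydefs module is named html.entities, and the function name to_entities is hypothetical.

```python
# Python 3 spelling of the Python 2 module htmlentitydefs.
from html.entities import codepoint2name


def to_entities(text):
    """Replace each character that has a named HTML entity with "&name;".

    Hypothetical helper: the input is assumed to be already-decoded text,
    so every "&" in it is a literal ampersand, not the start of an entity.
    """
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        # Characters with no named entity pass through unchanged.
        out.append("&%s;" % name if name else ch)
    return "".join(out)


print(to_entities("\u201cAT&T\u201d caf\u00e9"))
```

Because codepoint2name also maps "&", "<", ">" and the double quote, this treats smart quotes, bare ampersands, and accented letters uniformly, as the TODO item suggests. Running it on text that already contains entities would escape them a second time, so it only fits the "convert everything to Unicode first" flow.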