author | Leonard Richardson <leonardr@segfault.org> | 2013-08-12 11:49:10 -0400 |
---|---|---|
committer | Leonard Richardson <leonardr@segfault.org> | 2013-08-12 11:49:10 -0400 |
commit | b7fae1bd115492eb489359715ed74a742e664f46 (patch) | |
tree | e91713d1e2ff8ac8961d90c94f8869c2aeaa92db | |
parent | 2eef2140cb422406b68be05a913b88045ba34025 (diff) |
A little cleanup.
-rw-r--r-- | NEWS.txt | 15 |
-rw-r--r-- | bs4/__init__.py | 6 |
-rwxr-xr-x | convert-py3k | 2 |
-rw-r--r-- | doc/source/index.rst | 3 |
4 files changed, 19 insertions, 7 deletions
@@ -3,9 +3,18 @@
 * Instead of converting incoming data to Unicode and feeding it to the
   lxml tree builder, Beautiful Soup now makes successive guesses at
   the encoding of the incoming data, and tells lxml to parse the data
-  as that encoding. This improves performance and avoids an issue in
-  which lxml was refusing to parse strings because they were Unicode
-  strings.
+  as that encoding. Giving lxml more control over the parsing process
+  improves performance and avoids a number of bugs and issues with the
+  lxml parser which had previously required elaborate workarounds:
+
+  - An issue in which lxml refuses to parse Unicode strings.
+    [bug=1180527]
+
+  - A returning bug that truncated documents longer than a (very
+    small) size. [bug=963880]
+
+  - A returning bug in which extra spaces were added to a document if
+    the document defined a charset other than UTF-8. [bug=972466]
 
   This required a major overhaul of the tree builder architecture. If
   you wrote your own tree builder and didn't tell me, you'll need to
diff --git a/bs4/__init__.py b/bs4/__init__.py
index cd4a692..ace72f1 100644
--- a/bs4/__init__.py
+++ b/bs4/__init__.py
@@ -162,11 +162,11 @@ class BeautifulSoup(Tag):
         elif len(markup) <= 256:
             # Print out warnings for a couple beginner problems
             # involving passing non-markup to Beautiful Soup.
-            # Beautiful Soup will still parse the input as markup,
+            # Beautiful Soup will still parse the input as markup,
             # just in case that's what the user really wants.
             if os.path.exists(markup):
                 warnings.warn(
-                    '"%s" looks like a filename, not markup. You should probably open a filehandle and pass the filehandle into Beautiful Soup.' % markup)
+                    '"%s" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.' % markup)
             if markup[:5] == "http:" or markup[:6] == "https:":
                 # TODO: This is ugly but I couldn't get it to work in
                 # Python 3 otherwise.
@@ -182,7 +182,7 @@ class BeautifulSoup(Tag):
             try:
                 self._feed()
                 break
-            except ParserRejectedMarkup, e:
+            except ParserRejectedMarkup:
                 pass
 
         # Clear out the markup and remove the builder's circular
diff --git a/convert-py3k b/convert-py3k
index 4f79051..05fab53 100755
--- a/convert-py3k
+++ b/convert-py3k
@@ -9,7 +9,7 @@ echo "If you've got stuff in there, Ctrl-C out of this script or answer 'n'."
 mkdir -p py3k
 rm -rfI py3k/bs4
 cp -r bs4/ py3k/
-2to3-3.2 -w py3k
+2to3 -w py3k
 echo ""
 echo "OK, conversion is done."
 echo "Now running the unit tests."
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 1b38df7..9071d6b 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -26,6 +26,9 @@ developed, and that Beautiful Soup 4 is recommended for all new
 projects. If you want to learn about the differences between Beautiful
 Soup 3 and Beautiful Soup 4, see `Porting code to BS4`_.
 
+이 문서는 한국어 번역도 가능합니다. (`외부
+링크`<http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>)
+
 Getting help
 ------------
 
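
The NEWS.txt entry and the `except ParserRejectedMarkup:` hunk describe one pattern: try successive encoding guesses and move on whenever the parser rejects the markup. The following is a minimal Python 3 sketch of that pattern, not the Beautiful Soup implementation. `parse_as` and `feed_with_guesses` are hypothetical names, a strict `bytes.decode` stands in for the lxml parser, and `ParserRejectedMarkup` is redefined locally so the block runs on its own.

```python
class ParserRejectedMarkup(Exception):
    """Local stand-in for the exception named in the diff (assumption)."""


def parse_as(data: bytes, encoding: str) -> str:
    """Hypothetical parser: a strict decode plays the role of lxml here."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # Signal that this encoding guess did not work.
        raise ParserRejectedMarkup(encoding)


def feed_with_guesses(data: bytes, candidate_encodings):
    """Try each guess in order and stop at the first one that is accepted."""
    for encoding in candidate_encodings:
        try:
            return encoding, parse_as(data, encoding)
        except ParserRejectedMarkup:
            # The exception object is never used, so it is not bound to a
            # name. This spelling is valid in both Python 2 and Python 3,
            # unlike the old "except ParserRejectedMarkup, e:" form.
            pass
    raise ValueError("no candidate encoding was accepted")


# Bytes that are valid Latin-1 but neither ASCII nor UTF-8.
encoding, text = feed_with_guesses(
    "café".encode("latin-1"), ["ascii", "utf-8", "latin-1"]
)
```

Dropping the unused `, e` binding also means `2to3` has nothing to rewrite on that line when `convert-py3k` runs.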