author Leonard Richardson <leonardr@segfault.org> 2013-08-12 11:49:10 -0400
committer Leonard Richardson <leonardr@segfault.org> 2013-08-12 11:49:10 -0400
commit b7fae1bd115492eb489359715ed74a742e664f46
tree e91713d1e2ff8ac8961d90c94f8869c2aeaa92db
parent 2eef2140cb422406b68be05a913b88045ba34025
A little cleanup.
-rw-r--r-- NEWS.txt 15
-rw-r--r-- bs4/__init__.py 6
-rwxr-xr-x convert-py3k 2
-rw-r--r-- doc/source/index.rst 3
4 files changed, 19 insertions, 7 deletions
diff --git a/NEWS.txt b/NEWS.txt
index d24dfb1..248befb 100644
--- a/NEWS.txt
+++ b/NEWS.txt
@@ -3,9 +3,18 @@
* Instead of converting incoming data to Unicode and feeding it to the
lxml tree builder, Beautiful Soup now makes successive guesses at
the encoding of the incoming data, and tells lxml to parse the data
- as that encoding. This improves performance and avoids an issue in
- which lxml was refusing to parse strings because they were Unicode
- strings.
+ as that encoding. Giving lxml more control over the parsing process
+ improves performance and avoids a number of bugs and issues with the
+ lxml parser which had previously required elaborate workarounds:
+
+ - An issue in which lxml refuses to parse Unicode strings.
+ [bug=1180527]
+
+ - A returning bug that truncated documents longer than a (very
+ small) size. [bug=963880]
+
+ - A returning bug in which extra spaces were added to a document if
+ the document defined a charset other than UTF-8. [bug=972466]
This required a major overhaul of the tree builder architecture. If
you wrote your own tree builder and didn't tell me, you'll need to
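Reviewer note: the NEWS.txt entry above describes making successive guesses at the encoding and asking the parser to use each one in turn. The following is only a minimal sketch of that idea using the standard library, not the actual tree builder code. `parse_as` and `parse_with_guesses` are hypothetical names, strict decoding stands in for the real parser, and the `ParserRejectedMarkup` class here is a local stand-in for the exception of the same name that appears in the `bs4/__init__.py` hunk.

```python
class ParserRejectedMarkup(Exception):
    """Local stand-in for the exception raised when a parser rejects the input."""


def parse_as(data, encoding):
    """Hypothetical parser hook: decode strictly, or reject the markup."""
    try:
        return data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        raise ParserRejectedMarkup(encoding)


def parse_with_guesses(data, candidate_encodings):
    """Try each candidate encoding in order; return (encoding, text) for the first that works."""
    for encoding in candidate_encodings:
        try:
            return encoding, parse_as(data, encoding)
        except ParserRejectedMarkup:
            # This guess failed, so fall through to the next candidate.
            continue
    raise ParserRejectedMarkup("no candidate encoding worked")
```

Because the exception object is never inspected, the handler can use the bare `except ParserRejectedMarkup:` form, which is valid in both Python 2 and Python 3.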
diff --git a/bs4/__init__.py b/bs4/__init__.py
index cd4a692..ace72f1 100644
--- a/bs4/__init__.py
+++ b/bs4/__init__.py
@@ -162,11 +162,11 @@ class BeautifulSoup(Tag):
elif len(markup) <= 256:
# Print out warnings for a couple beginner problems
# involving passing non-markup to Beautiful Soup.
- # Beautiful Soup will still parse the input as markup,
+ # Beautiful Soup will still parse the input as markup,
# just in case that's what the user really wants.
if os.path.exists(markup):
warnings.warn(
- '"%s" looks like a filename, not markup. You should probably open a filehandle and pass the filehandle into Beautiful Soup.' % markup)
+ '"%s" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.' % markup)
if markup[:5] == "http:" or markup[:6] == "https:":
# TODO: This is ugly but I couldn't get it to work in
# Python 3 otherwise.
@@ -182,7 +182,7 @@ class BeautifulSoup(Tag):
try:
self._feed()
break
- except ParserRejectedMarkup, e:
+ except ParserRejectedMarkup:
pass
# Clear out the markup and remove the builder's circular
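Reviewer note: the first hunk above warns when a short string looks like a filename or URL rather than markup. A rough standalone sketch of that heuristic follows, assuming only the standard library. `warn_if_not_markup` is a hypothetical helper name, and the URL warning text is an assumption, since the hunk does not show it.

```python
import os
import warnings


def warn_if_not_markup(markup):
    """Warn when a short string looks like a filename or a URL instead of markup."""
    if len(markup) > 256:
        # Longer strings are assumed to be real markup and are not checked.
        return
    if os.path.exists(markup):
        warnings.warn(
            '"%s" looks like a filename, not markup. You should probably '
            'open this file and pass the filehandle into Beautiful Soup.' % markup)
    if markup[:5] == "http:" or markup[:6] == "https:":
        # Assumed wording; the diff shows only the condition, not the message.
        warnings.warn('"%s" looks like a URL, not markup.' % markup)
```

As in the diff, the function only warns. It does not stop the caller from parsing the string as markup, in case that is what the user really wants.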
diff --git a/convert-py3k b/convert-py3k
index 4f79051..05fab53 100755
--- a/convert-py3k
+++ b/convert-py3k
@@ -9,7 +9,7 @@ echo "If you've got stuff in there, Ctrl-C out of this script or answer 'n'."
mkdir -p py3k
rm -rfI py3k/bs4
cp -r bs4/ py3k/
-2to3-3.2 -w py3k
+2to3 -w py3k
echo ""
echo "OK, conversion is done."
echo "Now running the unit tests."
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 1b38df7..9071d6b 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -26,6 +26,9 @@ developed, and that Beautiful Soup 4 is recommended for all new
projects. If you want to learn about the differences between Beautiful
Soup 3 and Beautiful Soup 4, see `Porting code to BS4`_.
+This document is also available in a Korean translation. (`External
+link`<http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>)
+
Getting help
------------