path: root/bs4/dammit.py
Age | Commit message | Author
2015-09-28 | Add a __license__ statement to all source files. | Leonard Richardson
2015-07-03 | Unicode data cannot have a byte-order mark. Returning early stops a warning from happening. | Leonard Richardson
2015-06-27 | Added an exclude_encodings argument to UnicodeDammit and to the Beautiful Soup constructor, which lets you prohibit the detection of an encoding that you know is wrong. [bug=1469408] | Leonard Richardson
2015-06-26 | Added a sanity check helper method that makes sure all the elements of a tree are properly connected via .next_element and .previous_element. | Leonard Richardson
2015-06-25 | Fixed a crash in Unicode, Dammit's encoding detector when the name of the encoding itself contained invalid bytes. [bug=1360913] | Leonard Richardson
2013-10-02 | Fixed a bug that caused Unicode data put into UnicodeDammit to return None instead of the original data. [bug=1214983] | Leonard Richardson
2013-06-03 | Inlined some commonly called code to save a function call. | Leonard Richardson
2013-06-03 | Limit how much of the document is searched via regular expression for a declared encoding. | Leonard Richardson
2013-06-02 | Turns out we had two bits of code to strip byte-order marks. | Leonard Richardson
2013-06-02 | It turns out most of the untested code wasn't doing anything useful. | Leonard Richardson
2013-05-31 | Create a new lxml parser object for every new parsing strategy. | Leonard Richardson
2013-05-30 | Refactored code a bit. | Leonard Richardson
2013-05-30 | Split out the code that guesses at encodings from the code that tries to decode a bytestring based on those encodings. This is necessary because lxml wants to do the decoding itself. | Leonard Richardson
2013-05-20 | The default XML formatter will now replace ampersands even if they appear to be part of entities. That is, "&lt;" will become "&amp;lt;". [bug=1182183] | Leonard Richardson
2012-11-03 | Doc fixes. | Leonard Richardson
2012-08-17 | Fixed cchardet import. | Leonard Richardson
2012-07-03 | Mentioned cchardet in docs. | Leonard Richardson
2012-07-03 | When sniffing encodings, if the cchardet library is installed, use it instead of chardet. It's much faster. [bug=1020748] | Leonard Richardson
2012-07-03 | Use logging.warning() instead of warnings.warn() to notify the user that characters were replaced with REPLACEMENT CHARACTER. [bug=1013862] | Leonard Richardson
2012-05-24 | Comments, processing instructions, document type declarations, and markup declarations are now treated as preformatted strings, the way CData blocks are. [bug=1001025] Also in this commit: renamed detwingle method to detwingle(). | Leonard Richardson
2012-05-03 | Fixed the handling of &quot; with the built-in parser. [bug=993871] | Leonard Richardson
2012-04-27 | Added experimental support for fixing Windows-1252 characters embedded in UTF-8 documents. | Leonard Richardson
2012-04-26 | Fixed a bug in decoding data that contained a byte-order mark, such as data encoded in UTF-16LE. [bug=988980] | Leonard Richardson
2012-04-16 | Unicode, Dammit now has an option to turn MS smart quotes into ASCII characters. | Leonard Richardson
2012-04-16 | Attribute values are now run through the provided output formatter. Previously they were always run through the 'minimal' formatter. [bug=980237] | Leonard Richardson
2012-02-16 | Issue a warning if characters were replaced with REPLACEMENT CHARACTER during Unicode conversion. | Leonard Richardson
2012-02-09 | As a last-ditch attempt to turn data into Unicode, use errors=replace instead of errors=strict. | Leonard Richardson
2012-02-09 | Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags like <meta charset="utf-8" />. [bug=837268] | Leonard Richardson
2012-02-09 | Minor Unicode, Dammit cleanup. | Leonard Richardson
2012-02-09 | Improved Unicode, Dammit's behavior when you give it Unicode to begin with. | Leonard Richardson
2011-06-29 | Various changes so most tests pass on Python 3. | Thomas Kluyver
2011-05-21 | OK, figured that out. | Leonard Richardson
2011-05-21 | Changed dammit.py to require fewer changes to be Python 3 compatible. | Leonard Richardson
2011-03-05 | PEP8ifying | Aaron DeVore
2011-02-27 | Added a tree builder for the built-in HTMLParser, and tests. | Leonard Richardson
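
Several entries in this log describe one decoding flow: Unicode input is returned as-is, a byte-order mark is stripped before decoding, known-wrong encodings can be excluded, and the last-ditch attempt uses errors=replace and logs a warning when REPLACEMENT CHARACTER appears. The sketch below illustrates that flow with the Python standard library only. It is not the code in bs4/dammit.py: the names strip_bom and to_unicode, the default candidate list, and the return shape are all assumptions made for illustration.

```python
import codecs
import logging

# Hypothetical illustration only; not the implementation in bs4/dammit.py.
# UTF-32 BOMs are listed before UTF-16 because the UTF-32-LE BOM begins
# with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def strip_bom(data):
    """Return (data without a leading BOM, encoding implied by the BOM or None)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):], encoding
    return data, None


def to_unicode(data, candidates=("utf-8", "windows-1252"), exclude_encodings=()):
    """Return (text, encoding_used, contains_replacement_characters)."""
    # Unicode data cannot have a byte-order mark, so return it early, unchanged.
    if isinstance(data, str):
        return data, None, False

    excluded = {name.lower() for name in exclude_encodings}
    data, bom_encoding = strip_bom(data)

    # Try the BOM-implied encoding first, then the caller's candidates.
    ordered = ([bom_encoding] if bom_encoding else []) + list(candidates)
    tried = []
    for encoding in ordered:
        if encoding.lower() in excluded or encoding in tried:
            continue
        tried.append(encoding)
        try:
            return data.decode(encoding, "strict"), encoding, False
        except (UnicodeDecodeError, LookupError):
            continue

    # Last-ditch attempt: errors="replace" instead of errors="strict".
    for encoding in tried:
        try:
            text = data.decode(encoding, "replace")
        except LookupError:
            continue
        replaced = "\ufffd" in text
        if replaced:
            logging.warning(
                "Some characters could not be decoded and were "
                "replaced with REPLACEMENT CHARACTER."
            )
        return text, encoding, replaced

    return None, None, False
```

For example, under these assumptions to_unicode(b"caf\xe9") falls through strict UTF-8 and decodes as windows-1252, while passing exclude_encodings=["windows-1252"] forces the errors=replace path and triggers the logged warning.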