path: root/bs4/dammit.py
Age | Commit message | Author
2021-12-19 | Remove a huge list of HTML entities that was only necessary under Python 2. | Leonard Richardson
2021-12-19 | Removed support for the iconv_codec library, which doesn't seem to exist anymore and was never put up on PyPI. (The closest replacement on PyPI, iconv_codecs, is GPL-licensed, so we can't use it.) | Leonard Richardson
2021-12-19 | If the charset-normalizer Python module (https://pypi.org/project/charset-normalizer/) is installed, Beautiful Soup will use it to detect the character sets of incoming documents. This is also the module used by newer versions of the Requests library. For the sake of backwards compatibility, chardet and cchardet both take precedence if installed. [bug=1955346] | Leonard Richardson
2021-09-07 | Goodbye, Python 2. [bug=1942919] | Leonard Richardson
2021-05-31 | The html.parser tree builder can now handle named entities found in the HTML5 spec in much the same way that the html5lib tree builder does. Note that the lxml tree builder still handles named entities differently. [bug=1924908] | Leonard Richardson
2021-02-13 | Added a second way to specify encodings to UnicodeDammit and EncodingDetector, based on the order of precedence defined in the HTML5 spec, starting at: https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding Encodings in 'known_definite_encodings' are tried first, then byte-order-mark sniffing is run, then encodings in 'user_encodings' are tried. The old argument, 'override_encodings', is now a deprecated alias for 'known_definite_encodings'. This changes the default behavior of the html.parser and lxml tree builders, in a way that may slightly improve encoding detection but will probably have no effect. [bug=1889014] | Leonard Richardson
2020-05-17 | Switch entirely to Python 3-style print statements, even in Python 2. | Leonard Richardson
2019-12-24 | Fixed deprecation warning. [bug=1855301] | Leonard Richardson
2019-12-24 | Added docstrings to all public methods in dammit.py. | Leonard Richardson
2019-09-02 | Avoid a crash when trying to detect the declared encoding of a Unicode document. Raise an explanatory exception when the underlying parser completely rejects the incoming markup. [bug=1838877] | Leonard Richardson
2019-07-07 | &apos; (which is valid in XML and XHTML, but not HTML 4) is now recognized as a named entity and converted to a single quote. [bug=1818721] | Leonard Richardson
2018-12-24 | Clarified the software license. | Leonard Richardson
2018-07-14 | Fixed code that was causing deprecation warnings in recent Python 3 versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496] | Leonard Richardson
2016-12-19 | Indentation change contributed by Pranav Salunke. | Leonard Richardson
2016-07-17 | Use a dedicated logger instead of the root logger. [bug=1511661] | Leonard Richardson
2016-07-17 | Use a dedicated logger instead of the root logger. [bug=1511661] | Leonard Richardson
2016-07-16 | Removed imports to pdb, since pdb is not available in some environments. [bug=1491700] | Leonard Richardson
2016-07-16 | Rename COPYING.txt to LICENSE. Add a reference to LICENSE in every source file. | Leonard Richardson
2016-04-06 | Minor change. Extra indent for character so it looks nicer. | Pranav Salunke
2015-09-28 | Add a __license__ statement to all source files. | Leonard Richardson
2015-07-03 | Unicode data cannot have a byte-order mark. Returning early stops a warning from happening. | Leonard Richardson
2015-06-27 | Added an exclude_encodings argument to UnicodeDammit and to the Beautiful Soup constructor, which lets you prohibit the detection of an encoding that you know is wrong. [bug=1469408] | Leonard Richardson
2015-06-26 | Added a sanity check helper method that makes sure all the elements of a tree are properly connected via .next_element and .previous_element. | Leonard Richardson
2015-06-25 | Fixed a crash in Unicode, Dammit's encoding detector when the name of the encoding itself contained invalid bytes. [bug=1360913] | Leonard Richardson
2013-10-02 | Fixed a bug that caused Unicode data put into UnicodeDammit to return None instead of the original data. [bug=1214983] | Leonard Richardson
2013-06-03 | Inlined some commonly called code to save a function call. | Leonard Richardson
2013-06-03 | Limit how much of the document is searched via regular expression for a declared encoding. | Leonard Richardson
2013-06-02 | Turns out we had two bits of code to strip byte-order marks. | Leonard Richardson
2013-06-02 | It turns out most of the untested code wasn't doing anything useful. | Leonard Richardson
2013-05-31 | Create a new lxml parser object for every new parsing strategy. | Leonard Richardson
2013-05-30 | Refactored code a bit. | Leonard Richardson
2013-05-30 | Split out the code that guesses at encodings from the code that tries to decode a bytestring based on those encodings. This is necessary because lxml wants to do the decoding itself. | Leonard Richardson
2013-05-20 | The default XML formatter will now replace ampersands even if they appear to be part of entities. That is, "&lt;" will become "&amp;lt;". [bug=1182183] | Leonard Richardson
2012-11-03 | Doc fixes. | Leonard Richardson
2012-08-17 | Fixed cchardet import. | Leonard Richardson
2012-07-03 | Mentioned cchardet in docs. | Leonard Richardson
2012-07-03 | When sniffing encodings, if the cchardet library is installed, use it instead of chardet. It's much faster. [bug=1020748] | Leonard Richardson
2012-07-03 | Use logging.warning() instead of warnings.warn() to notify the user that characters were replaced with REPLACEMENT CHARACTER. [bug=1013862] | Leonard Richardson
2012-05-24 | Comments, processing instructions, document type declarations, and markup declarations are now treated as preformatted strings, the way CData blocks are. [bug=1001025] Also in this commit: renamed detwingle method to detwingle(). | Leonard Richardson
2012-05-03 | Fixed the handling of &quot; with the built-in parser. [bug=993871] | Leonard Richardson
2012-04-27 | Added experimental support for fixing Windows-1252 characters embedded in UTF-8 documents. | Leonard Richardson
2012-04-26 | Fixed a bug in decoding data that contained a byte-order mark, such as data encoded in UTF-16LE. [bug=988980] | Leonard Richardson
2012-04-16 | Unicode, Dammit now has an option to turn MS smart quotes into ASCII characters. | Leonard Richardson
2012-04-16 | Attribute values are now run through the provided output formatter. Previously they were always run through the 'minimal' formatter. [bug=980237] | Leonard Richardson
2012-02-16 | Issue a warning if characters were replaced with REPLACEMENT CHARACTER during Unicode conversion. | Leonard Richardson
2012-02-09 | As a last-ditch attempt to turn data into Unicode, use errors=replace instead of errors=strict. | Leonard Richardson
2012-02-09 | Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags like <meta charset="utf-8" />. [bug=837268] | Leonard Richardson
2012-02-09 | Minor Unicode, Dammit cleanup. | Leonard Richardson
2012-02-09 | Improved Unicode, Dammit's behavior when you give it Unicode to begin with. | Leonard Richardson
2011-06-29 | Various changes so most tests pass on Python 3. | Thomas Kluyver
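
The 2021-02-13 entry describes an order of precedence for candidate encodings: 'known_definite_encodings' first, then byte-order-mark sniffing, then 'user_encodings'. The 2015-06-27 entry adds 'exclude_encodings'. A minimal standard-library sketch of that ordering follows. The helper names `sniff_bom` and `candidate_encodings` are hypothetical and this is not the code in dammit.py; only the argument names come from the log.

```python
import codecs

# Byte-order marks, longest first so that UTF-32 is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading byte-order mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def candidate_encodings(data, known_definite_encodings=(),
                        user_encodings=(), exclude_encodings=()):
    """Yield encoding names in the order described by the 2021-02-13 entry."""
    excluded = {name.lower() for name in exclude_encodings}
    ordered = list(known_definite_encodings)   # 1. definite encodings
    bom_encoding = sniff_bom(data)             # 2. byte-order-mark sniffing
    if bom_encoding is not None:
        ordered.append(bom_encoding)
    ordered.extend(user_encodings)             # 3. user-suggested encodings
    seen = set()
    for name in ordered:
        key = name.lower()
        if key in excluded or key in seen:
            continue                           # skip prohibited and repeated names
        seen.add(key)
        yield key
```

Writing this as a generator means a caller can stop at the first encoding that decodes cleanly without computing the rest of the list.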
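
The 2012-04-26, 2013-06-02 and 2015-07-03 entries all concern byte-order marks: data such as UTF-16LE may begin with one, the mark should be stripped in one place, and data that is already Unicode cannot have one. A hedged sketch of that behavior using only the `codecs` module; `strip_bom` is a hypothetical name, not a function from dammit.py.

```python
import codecs

# Longest marks first: the UTF-32LE mark begins with the UTF-16LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def strip_bom(data):
    """Return (data_without_bom, encoding_or_None)."""
    if isinstance(data, str):
        # Unicode data cannot have a byte-order mark, so return early.
        return data, None
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):], encoding
    return data, None


raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
body, encoding = strip_bom(raw)
text = body.decode(encoding)  # decode only what follows the mark
```

One caveat of this simple ordering: UTF-16LE data whose first character is U+0000 would be read as UTF-32LE, which a production detector would need to handle separately.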
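
The 2013-06-03 entry limits how much of the document is searched by regular expression for a declared encoding, and the 2012-02-09 entry adds HTML 5-style `<meta charset>` detection. The sketch below shows one way to combine the two. The pattern, the `search_limit` default and the function name are assumptions made for illustration, not values taken from Beautiful Soup.

```python
import re

# Matches <meta charset="..."> as well as the older
# <meta ... content="text/html; charset=..."> form.
META_CHARSET_RE = re.compile(
    rb"<\s*meta[^>]*?charset\s*=\s*[\"']?([^\"'\s>/;]+)", re.I
)


def declared_html_encoding(data, search_limit=1024):
    """Look for a declared encoding in the first `search_limit` bytes only."""
    # The third argument (endpos) stops the scan without copying the data.
    match = META_CHARSET_RE.search(data, 0, search_limit)
    if match is None:
        return None
    # The name itself may contain invalid bytes, so decode it defensively.
    return match.group(1).decode("ascii", errors="replace").lower()
```

Bounding the search keeps the cost constant for very large documents, at the price of missing a declaration that appears unusually late.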
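
The 2012-04-27 entry mentions fixing Windows-1252 characters embedded in UTF-8 documents. One way to illustrate the idea with the standard library is a custom decoding error handler that reinterprets any byte UTF-8 rejects as Windows-1252. This is a swapped-in technique for illustration, not the approach Beautiful Soup takes; the handler name `cp1252_fallback` and the function `decode_mixed` are hypothetical.

```python
import codecs


def _cp1252_fallback(exc):
    """Decode bytes that are invalid UTF-8 as Windows-1252 instead."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad_bytes = exc.object[exc.start:exc.end]
    # Bytes undefined in Windows-1252 become U+FFFD rather than raising.
    return bad_bytes.decode("cp1252", errors="replace"), exc.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def decode_mixed(data):
    """Decode UTF-8 data that may contain stray Windows-1252 bytes."""
    return data.decode("utf-8", errors="cp1252_fallback")


# A UTF-8 snowman followed by Windows-1252 smart quotes around "Hi".
mixed = "\N{SNOWMAN}".encode("utf-8") + "\u201cHi\u201d".encode("cp1252")
text = decode_mixed(mixed)
```

The sketch only catches bytes that are invalid as UTF-8. A run of Windows-1252 bytes that happens to form a valid UTF-8 sequence passes through undetected.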
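
Several entries describe the decoding fallback: try strict decoding first, use errors=replace as a last-ditch attempt (2012-02-09), and report replaced characters through a dedicated logger rather than the root logger (2012-02-16, 2012-07-03, 2016-07-17). A minimal sketch under those assumptions; `to_unicode` and its return shape are hypothetical and do not mirror the UnicodeDammit API.

```python
import logging

# A dedicated, module-level logger instead of the root logger.
logger = logging.getLogger(__name__)


def to_unicode(data, encodings):
    """Return (text, encoding, had_replacements) for the first usable encoding."""
    # First pass: accept only encodings that decode without any errors.
    for encoding in encodings:
        try:
            return data.decode(encoding, errors="strict"), encoding, False
        except (UnicodeDecodeError, LookupError):
            continue
    # Last-ditch pass: substitute U+FFFD for bytes that cannot be decoded.
    for encoding in encodings:
        try:
            text = data.decode(encoding, errors="replace")
        except LookupError:
            continue  # unknown encoding name
        # Approximation: a legitimate U+FFFD in the input would also match.
        replaced = "\ufffd" in text
        if replaced:
            logger.warning(
                "Some characters could not be decoded, and were "
                "replaced with REPLACEMENT CHARACTER."
            )
        return text, encoding, replaced
    return None, None, False
```

Returning the replacement flag alongside the text lets the caller decide whether lossy output is acceptable instead of relying on the log message alone.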