Age | Commit message | Author | |
---|---|---|---|
2021-12-19 | Remove a huge list of HTML entities that was only necessary under Python 2. | Leonard Richardson | |
2021-12-19 | Removed support for the iconv_codec library, which doesn't seem to exist anymore and was never put up on PyPI. (The closest replacement on PyPI, iconv_codecs, is GPL-licensed, so we can't use it.) | Leonard Richardson | |
2021-12-19 | If the charset-normalizer Python module (https://pypi.org/project/charset-normalizer/) is installed, Beautiful Soup will use it to detect the character sets of incoming documents. This is also the module used by newer versions of the Requests library. For the sake of backwards compatibility, chardet and cchardet both take precedence if installed. [bug=1955346] | Leonard Richardson | |
2021-09-07 | Goodbye, Python 2. [bug=1942919] | Leonard Richardson | |
2021-05-31 | The html.parser tree builder can now handle named entities found in the HTML5 spec in much the same way that the html5lib tree builder does. Note that the lxml tree builder still handles named entities differently. [bug=1924908] | Leonard Richardson | |
2021-02-13 | Added a second way to specify encodings to UnicodeDammit and EncodingDetector, based on the order of precedence defined in the HTML5 spec, starting at: https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding Encodings in 'known_definite_encodings' are tried first, then byte-order-mark sniffing is run, then encodings in 'user_encodings' are tried. The old argument, 'override_encodings', is now a deprecated alias for 'known_definite_encodings'. This changes the default behavior of the html.parser and lxml tree builders, in a way that may slightly improve encoding detection but will probably have no effect. [bug=1889014] | Leonard Richardson | |
2020-05-17 | Switch entirely to Python 3-style print statements, even in Python 2. | Leonard Richardson | |
2019-12-24 | Fixed deprecation warning. [bug=1855301] | Leonard Richardson | |
2019-12-24 | Added docstrings to all public methods in dammit.py. | Leonard Richardson | |
2019-09-02 | Avoid a crash when trying to detect the declared encoding of a Unicode document. Raise an explanatory exception when the underlying parser completely rejects the incoming markup. [bug=1838877] | Leonard Richardson | |
2019-07-07 | &apos; (which is valid in XML and XHTML, but not HTML 4) is now recognized as a named entity and converted to a single quote. [bug=1818721] | Leonard Richardson | |
2018-12-24 | Clarified the software license. | Leonard Richardson | |
2018-07-14 | Fixed code that was causing deprecation warnings in recent Python 3 versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496] | Leonard Richardson | |
2016-12-19 | Indentation change contributed by Pranav Salunke. | Leonard Richardson | |
2016-07-17 | Use a dedicated logger instead of the root logger. [bug=1511661] | Leonard Richardson | |
2016-07-16 | Removed imports to pdb, since pdb is not available in some environments. [bug=1491700] | Leonard Richardson | |
2016-07-16 | Rename COPYING.txt to LICENSE. Add a reference to LICENSE in every source file. | Leonard Richardson | |
2016-04-06 | Minor change. Extra indent for character so it looks nicer. | Pranav Salunke | |
2015-09-28 | Add a __license__ statement to all source files. | Leonard Richardson | |
2015-07-03 | Unicode data cannot have a byte-order mark. Returning early stops a warning from happening. | Leonard Richardson | |
2015-06-27 | Added an exclude_encodings argument to UnicodeDammit and to the Beautiful Soup constructor, which lets you prohibit the detection of an encoding that you know is wrong. [bug=1469408] | Leonard Richardson | |
2015-06-26 | Added a sanity check helper method that makes sure all the elements of a tree are properly connected via .next_element and .previous_element. | Leonard Richardson | |
2015-06-25 | Fixed a crash in Unicode, Dammit's encoding detector when the name of the encoding itself contained invalid bytes. [bug=1360913] | Leonard Richardson | |
2013-10-02 | Fixed a bug that caused Unicode data put into UnicodeDammit to return None instead of the original data. [bug=1214983] | Leonard Richardson | |
2013-06-03 | Inlined some commonly called code to save a function call. | Leonard Richardson | |
2013-06-03 | Limit how much of the document is searched via regular expression for a declared encoding. | Leonard Richardson | |
2013-06-02 | Turns out we had two bits of code to strip byte-order marks. | Leonard Richardson | |
2013-06-02 | It turns out most of the untested code wasn't doing anything useful. | Leonard Richardson | |
2013-05-31 | Create a new lxml parser object for every new parsing strategy. | Leonard Richardson | |
2013-05-30 | Refactored code a bit. | Leonard Richardson | |
2013-05-30 | Split out the code that guesses at encodings from the code that tries to decode a bytestring based on those encodings. This is necessary because lxml wants to do the decoding itself. | Leonard Richardson | |
2013-05-20 | The default XML formatter will now replace ampersands even if they appear to be part of entities. That is, "&lt;" will become "&amp;lt;". [bug=1182183] | Leonard Richardson | |
2012-11-03 | Doc fixes. | Leonard Richardson | |
2012-08-17 | Fixed cchardet import. | Leonard Richardson | |
2012-07-03 | Mentioned cchardet in docs. | Leonard Richardson | |
2012-07-03 | When sniffing encodings, if the cchardet library is installed, use it instead of chardet. It's much faster. [bug=1020748] | Leonard Richardson | |
2012-07-03 | Use logging.warning() instead of warnings.warn() to notify the user that characters were replaced with REPLACEMENT CHARACTER. [bug=1013862] | Leonard Richardson | |
2012-05-24 | Comments, processing instructions, document type declarations, and markup declarations are now treated as preformatted strings, the way CData blocks are. [bug=1001025] Also in this commit: renamed detwingle method to detwingle(). | Leonard Richardson | |
2012-05-03 | Fixed the handling of &quot; with the built-in parser. [bug=993871] | Leonard Richardson | |
2012-04-27 | Added experimental support for fixing Windows-1252 characters embedded in UTF-8 documents. | Leonard Richardson | |
2012-04-26 | Fixed a bug in decoding data that contained a byte-order mark, such as data encoded in UTF-16LE. [bug=988980] | Leonard Richardson | |
2012-04-16 | Unicode, Dammit now has an option to turn MS smart quotes into ASCII characters. | Leonard Richardson | |
2012-04-16 | Attribute values are now run through the provided output formatter. Previously they were always run through the 'minimal' formatter. [bug=980237] | Leonard Richardson | |
2012-02-16 | Issue a warning if characters were replaced with REPLACEMENT CHARACTER during Unicode conversion. | Leonard Richardson | |
2012-02-09 | As a last-ditch attempt to turn data into Unicode, use errors=replace instead of errors=strict. | Leonard Richardson | |
2012-02-09 | Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags like <meta charset="utf-8" />. [bug=837268] | Leonard Richardson | |
2012-02-09 | Minor Unicode, Dammit cleanup. | Leonard Richardson | |
2012-02-09 | Improved Unicode, Dammit's behavior when you give it Unicode to begin with. | Leonard Richardson | |
2011-06-29 | Various changes so most tests pass on Python 3. | Thomas Kluyver | |
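
The 2021-02-13 entry describes an order of precedence for candidate encodings: 'known_definite_encodings' first, then byte-order-mark sniffing, then 'user_encodings'. The 2015-06-27 and 2012-02-09 entries add an exclusion list and a last-ditch errors=replace fallback. The sketch below illustrates that ordering using only the standard library. It is not Beautiful Soup's implementation: the helper names `sniff_bom`, `candidate_encodings`, and `decode` are hypothetical, and the choice of UTF-8 for the final fallback is an assumption made for the example.

```python
import codecs

# Byte-order marks, longest first, so a UTF-32 BOM is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, data_without_bom), or (None, data) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


def candidate_encodings(data, known_definite_encodings=(),
                        user_encodings=(), exclude_encodings=()):
    """Yield encodings to try, in the order the commit message describes."""
    excluded = {name.lower() for name in exclude_encodings}
    bom_encoding, _ = sniff_bom(data)

    ordered = list(known_definite_encodings)   # 1. tried first
    if bom_encoding is not None:
        ordered.append(bom_encoding)           # 2. byte-order-mark sniffing
    ordered.extend(user_encodings)             # 3. tried last

    seen = set()
    for name in ordered:
        key = name.lower()
        if key in excluded or key in seen:
            continue                           # skip prohibited or repeated names
        seen.add(key)
        yield name


def decode(data, **kwargs):
    """Try each candidate strictly, then fall back to errors='replace'."""
    _, body = sniff_bom(data)
    for name in candidate_encodings(data, **kwargs):
        try:
            return body.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue
    # Last-ditch attempt (assumed here to be UTF-8): undecodable bytes
    # become U+FFFD REPLACEMENT CHARACTER instead of raising.
    return body.decode("utf-8", errors="replace"), "utf-8"
```

For example, `decode(codecs.BOM_UTF8 + "café".encode("utf-8"), user_encodings=["ascii"])` picks up UTF-8 from the byte-order mark before the user-supplied ASCII guess is ever tried.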