Age | Commit message | Author | |
---|---|---|---|
2021-09-07 | Goodbye, Python 2. [bug=1942919] | Leonard Richardson | |
2021-05-31 | The html.parser tree builder can now handle named entities found in the HTML5 spec in much the same way that the html5lib tree builder does. Note that the lxml tree builder still handles named entities differently. [bug=1924908] | Leonard Richardson | |
2021-02-13 | Added a second way to specify encodings to UnicodeDammit and EncodingDetector, based on the order of precedence defined in the HTML5 spec (starting at https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding). Encodings in 'known_definite_encodings' are tried first, then byte-order-mark sniffing is run, then encodings in 'user_encodings' are tried. The old argument, 'override_encodings', is now a deprecated alias for 'known_definite_encodings'. This changes the default behavior of the html.parser and lxml tree builders, in a way that may slightly improve encoding detection but will probably have no effect. [bug=1889014] | Leonard Richardson | |
2020-05-17 | Switch entirely to Python 3-style print statements, even in Python 2. | Leonard Richardson | |
2019-12-24 | Fixed deprecation warning. [bug=1855301] | Leonard Richardson | |
2019-12-24 | Added docstrings to all public methods in dammit.py. | Leonard Richardson | |
2019-09-02 | Avoid a crash when trying to detect the declared encoding of a Unicode document. Raise an explanatory exception when the underlying parser completely rejects the incoming markup. [bug=1838877] | Leonard Richardson | |
2019-07-07 | `&apos;` (which is valid in XML and XHTML, but not HTML 4) is now recognized as a named entity and converted to a single quote. [bug=1818721] | Leonard Richardson | |
2018-12-24 | Clarified the software license. | Leonard Richardson | |
2018-07-14 | Fixed code that was causing deprecation warnings in recent Python 3 versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496] | Leonard Richardson | |
2016-12-19 | Indentation change contributed by Pranav Salunke. | Leonard Richardson | |
2016-07-17 | Use a dedicated logger instead of the root logger. [bug=1511661] | Leonard Richardson | |
2016-07-16 | Removed imports of pdb, since pdb is not available in some environments. [bug=1491700] | Leonard Richardson | |
2016-07-16 | Rename COPYING.txt to LICENSE. Add a reference to LICENSE in every source file. | Leonard Richardson | |
2016-04-06 | Minor change. Extra indent for character so it looks nicer. | Pranav Salunke | |
2015-09-28 | Add a __license__ statement to all source files. | Leonard Richardson | |
2015-07-03 | Unicode data cannot have a byte-order mark. Returning early stops a warning from happening. | Leonard Richardson | |
2015-06-27 | Added an exclude_encodings argument to UnicodeDammit and to the Beautiful Soup constructor, which lets you prohibit the detection of an encoding that you know is wrong. [bug=1469408] | Leonard Richardson | |
2015-06-26 | Added a sanity check helper method that makes sure all the elements of a tree are properly connected via .next_element and .previous_element. | Leonard Richardson | |
2015-06-25 | Fixed a crash in Unicode, Dammit's encoding detector when the name of the encoding itself contained invalid bytes. [bug=1360913] | Leonard Richardson | |
2013-10-02 | Fixed a bug that caused Unicode data put into UnicodeDammit to return None instead of the original data. [bug=1214983] | Leonard Richardson | |
2013-06-03 | Inlined some commonly called code to save a function call. | Leonard Richardson | |
2013-06-03 | Limit how much of the document is searched via regular expression for a declared encoding. | Leonard Richardson | |
2013-06-02 | Turns out we had two bits of code to strip byte-order marks. | Leonard Richardson | |
2013-06-02 | It turns out most of the untested code wasn't doing anything useful. | Leonard Richardson | |
2013-05-31 | Create a new lxml parser object for every new parsing strategy. | Leonard Richardson | |
2013-05-30 | Refactored code a bit. | Leonard Richardson | |
2013-05-30 | Split out the code that guesses at encodings from the code that tries to decode a bytestring based on those encodings. This is necessary because lxml wants to do the decoding itself. | Leonard Richardson | |
2013-05-20 | The default XML formatter will now replace ampersands even if they appear to be part of entities. That is, `&lt;` will become `&amp;lt;`. [bug=1182183] | Leonard Richardson | |
2012-11-03 | Doc fixes. | Leonard Richardson | |
2012-08-17 | Fixed cchardet import. | Leonard Richardson | |
2012-07-03 | Mentioned cchardet in docs. | Leonard Richardson | |
2012-07-03 | When sniffing encodings, if the cchardet library is installed, use it instead of chardet. It's much faster. [bug=1020748] | Leonard Richardson | |
2012-07-03 | Use logging.warning() instead of warnings.warn() to notify the user that characters were replaced with REPLACEMENT CHARACTER. [bug=1013862] | Leonard Richardson | |
2012-05-24 | Comments, processing instructions, document type declarations, and markup declarations are now treated as preformatted strings, the way CData blocks are. [bug=1001025] Also in this commit: renamed detwingle method to detwingle(). | Leonard Richardson | |
2012-05-03 | Fixed the handling of `&quot;` with the built-in parser. [bug=993871] | Leonard Richardson | |
2012-04-27 | Added experimental support for fixing Windows-1252 characters embedded in UTF-8 documents. | Leonard Richardson | |
2012-04-26 | Fixed a bug in decoding data that contained a byte-order mark, such as data encoded in UTF-16LE. [bug=988980] | Leonard Richardson | |
2012-04-16 | Unicode, Dammit now has an option to turn MS smart quotes into ASCII characters. | Leonard Richardson | |
2012-04-16 | Attribute values are now run through the provided output formatter. Previously they were always run through the 'minimal' formatter. [bug=980237] | Leonard Richardson | |
2012-02-16 | Issue a warning if characters were replaced with REPLACEMENT CHARACTER during Unicode conversion. | Leonard Richardson | |
2012-02-09 | As a last-ditch attempt to turn data into Unicode, use errors=replace instead of errors=strict. | Leonard Richardson | |
2012-02-09 | Unicode, Dammit now detects the encoding in HTML 5-style `<meta>` tags like `<meta charset="utf-8" />`. [bug=837268] | Leonard Richardson | |
2012-02-09 | Minor Unicode, Dammit cleanup. | Leonard Richardson | |
2012-02-09 | Improved Unicode, Dammit's behavior when you give it Unicode to begin with. | Leonard Richardson | |
2011-06-29 | Various changes so most tests pass on Python 3. | Thomas Kluyver | |
2011-05-21 | OK, figured that out. | Leonard Richardson | |
2011-05-21 | Changed dammit.py to require fewer changes to be Python 3 compatible. | Leonard Richardson | |
2011-03-05 | PEP8ifying | Aaron DeVore | |
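
Several entries above describe the order in which candidate encodings are tried: 'known_definite_encodings' first, then byte-order-mark sniffing, then 'user_encodings', with 'exclude_encodings' filtering out encodings known to be wrong and errors=replace as a last resort. The following is a minimal standard-library sketch of that precedence only. It is not Beautiful Soup's implementation: the function names `sniff_bom` and `decode_with_precedence` are hypothetical, and the UTF-8 fallback is an assumption made for illustration.

```python
import codecs

# Byte-order marks, longest first so UTF-32-LE is not mistaken for UTF-16-LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, data_without_bom), or (None, data) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


def decode_with_precedence(data, known_definite_encodings=(),
                           user_encodings=(), exclude_encodings=()):
    """Hypothetical helper: decode bytes using the precedence described in the log.

    Returns a (text, encoding_used) tuple.
    """
    excluded = {name.lower() for name in exclude_encodings}
    bom_encoding, body = sniff_bom(data)

    # 1. known definite encodings, 2. BOM-sniffed encoding, 3. user encodings.
    candidates = list(known_definite_encodings)
    if bom_encoding is not None:
        candidates.append(bom_encoding)
    candidates.extend(user_encodings)

    tried = set()
    for name in candidates:
        key = name.lower()
        if key in excluded or key in tried:
            continue
        tried.add(key)
        try:
            return body.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding; try the next candidate

    # Last-ditch attempt (assumed to be UTF-8 here): replace undecodable bytes.
    return body.decode("utf-8", errors="replace"), "utf-8"


if __name__ == "__main__":
    data = codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le")
    print(decode_with_precedence(data, user_encodings=["latin-1"]))
```

In this sketch a BOM outranks anything in `user_encodings` but not a definite encoding that decodes cleanly, which mirrors the ordering stated in the 2021-02-13 entry.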