path: root/bs4/builder/_htmlparser.py
Age | Commit message | Author
2023-06-04 | Fixed a case found by Mengyuhan where html.parser giving up on markup would result in an AssertionError instead of a ParserRejectedMarkup exception. | Leonard Richardson
2023-02-15 | When the html.parser parser decides it can't parse a document, Beautiful Soup now consistently propagates this fact by raising a ParserRejectedMarkup error. [bug=2007343] | Leonard Richardson
2023-01-27 | Got rid of some more warnings by removing code that's not relevant anymore, now that the minimum supported Python version is 3.6. | Leonard Richardson
2023-01-27 | Warnings now do their best to provide an appropriate stacklevel, improving the usefulness of the message. [bug=1978744] | Leonard Richardson
2021-10-24 | Issue a warning when an HTML parser is used to parse a document that looks like XML but not XHTML. [bug=1939121] | Leonard Richardson
2021-09-07 | Goodbye, Python 2. [bug=1942919] | Leonard Richardson
2021-05-31 | The html.parser tree builder can now handle named entities found in the HTML5 spec in much the same way that the html5lib tree builder does. Note that the lxml tree builder still handles named entities differently. [bug=1924908] | Leonard Richardson
2021-02-13 | Added a second way to specify encodings to UnicodeDammit and EncodingDetector, based on the order of precedence defined in the HTML5 spec, starting at: https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding Encodings in 'known_definite_encodings' are tried first, then byte-order-mark sniffing is run, then encodings in 'user_encodings' are tried. The old argument, 'override_encodings', is now a deprecated alias for 'known_definite_encodings'. This changes the default behavior of the html.parser and lxml tree builders, in a way that may slightly improve encoding detection but will probably have no effect. [bug=1889014] | Leonard Richardson
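The order of precedence described in the 2021-02-13 entry can be sketched with the standard library alone. This is a minimal illustration, not Beautiful Soup's EncodingDetector: the function name `candidate_encodings` and the BOM table are assumptions made for the example. Only the argument names `known_definite_encodings` and `user_encodings` come from the commit message.

```python
import codecs

# BOM table for illustration only. UTF-32 is listed before UTF-16 because
# the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def candidate_encodings(data, known_definite_encodings=(), user_encodings=()):
    """Yield encodings to try, in the order the commit message describes.

    Hypothetical helper: definite encodings first, then any encoding
    implied by a byte-order mark, then the user-suggested encodings.
    """
    seen = set()

    def fresh(name):
        # Skip an encoding that was already yielded at a higher priority.
        key = name.lower()
        if key in seen:
            return False
        seen.add(key)
        return True

    for name in known_definite_encodings:
        if fresh(name):
            yield name

    for bom, name in _BOMS:
        if data.startswith(bom):
            if fresh(name):
                yield name
            break

    for name in user_encodings:
        if fresh(name):
            yield name
```

A caller would try each candidate in turn and stop at the first one that decodes the data cleanly.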
2020-05-30 | Remove explicit reference to the module name within the module, replacing it with __name__. | Leonard Richardson
2020-05-17 | Switch entirely to Python 3-style print statements, even in Python 2. | Leonard Richardson
2020-05-17 | Documented some recently added customization features. | Leonard Richardson
2020-05-17 | Added a keyword argument on_duplicate_attribute to the BeautifulSoupHTMLParser constructor (used by the html.parser tree builder) which lets you customize the handling of markup that contains the same attribute more than once, as in: <a href="url1" href="url2"> [bug=1878209] | Leonard Richardson
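To show why a duplicate-attribute policy is needed at all, here is a minimal sketch built directly on the standard library's html.parser, which reports repeated attributes as separate (name, value) pairs. The class `DuplicateAttrParser`, the function `parse_attrs`, and the policy names "replace" and "ignore" are assumptions made for this example; this is not the BeautifulSoupHTMLParser implementation.

```python
from html.parser import HTMLParser


class DuplicateAttrParser(HTMLParser):
    """Hypothetical helper that merges attributes under a chosen policy."""

    def __init__(self, on_duplicate="replace"):
        super().__init__()
        self.on_duplicate = on_duplicate
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; duplicates are preserved.
        merged = {}
        for key, value in attrs:
            if key in merged and self.on_duplicate == "ignore":
                # Keep the first value seen and drop later ones.
                continue
            # "replace": a later value overwrites an earlier one.
            merged[key] = value
        self.tags.append((tag, merged))


def parse_attrs(markup, on_duplicate="replace"):
    """Return a list of (tag, attribute dict) pairs for the start tags."""
    parser = DuplicateAttrParser(on_duplicate)
    parser.feed(markup)
    parser.close()
    return parser.tags
```

With the markup from the commit message, `parse_attrs('<a href="url1" href="url2">')` keeps the last value, while passing `on_duplicate="ignore"` keeps the first.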
2019-12-24 | Added docstrings for some but not all tree builders. | Leonard Richardson
2019-11-11 | Simplified code. | Leonard Richardson
2019-11-11 | The html.parser tree builder now correctly handles DOCTYPEs that are not uppercase. [bug=1848401] | Leonard Richardson
2019-07-21 | Implemented line number tracking for html5lib. | Leonard Richardson
2019-07-21 | Adapt Chris Mayo's code to track line number and position when using html.parser. | Leonard Richardson
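Line and position tracking with html.parser rests on the standard library's `HTMLParser.getpos()`, which returns the current line number and offset. The sketch below records that position for each start tag; the class `PositionParser` and the function `tag_positions` are hypothetical names for this example and do not reproduce the adapted code mentioned in the commit.

```python
from html.parser import HTMLParser


class PositionParser(HTMLParser):
    """Hypothetical helper that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns (line number, offset): the line number is
        # 1-based and the offset within the line is 0-based.
        line, offset = self.getpos()
        self.positions.append((tag, line, offset))


def tag_positions(markup):
    """Return (tag, line, offset) for every start tag in the markup."""
    parser = PositionParser()
    parser.feed(markup)
    parser.close()
    return parser.positions
```

A tree builder could copy these two numbers onto each element it creates, which is the kind of bookkeeping the entry describes.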
2019-07-07 | It's now possible to override a TreeBuilder's cdata_list_attributes dictionary by passing in a replacement. None will disable the feature altogether. [bug=1832978] | Leonard Richardson
2018-12-24 | Clarified the software license. | Leonard Richardson
2018-07-28 | Correctly handle invalid HTML numeric character entities like &#147; which reference code points that are not Unicode code points. Note that this is only fixed when Beautiful Soup is used with the html.parser parser -- html5lib already worked and I couldn't fix it with lxml. [bug=1782933] | Leonard Richardson
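One plausible way to handle a reference such as &#147; is to treat numbers in the range 0x80 to 0x9F as windows-1252 bytes, which is how such references are commonly interpreted in HTML. The sketch below assumes that strategy; the commit message does not say how the fix works, and `decode_charref` is a hypothetical name, not a Beautiful Soup function.

```python
REPLACEMENT = "\N{REPLACEMENT CHARACTER}"


def decode_charref(number):
    """Turn a numeric character reference into text (illustrative only)."""
    if 0x80 <= number <= 0x9F:
        # Assumption: interpret the number as a windows-1252 byte.
        try:
            return bytes([number]).decode("windows-1252")
        except UnicodeDecodeError:
            # A few bytes in this range have no windows-1252 mapping.
            return REPLACEMENT
    try:
        return chr(number)
    except (ValueError, OverflowError):
        # The number is outside the range of Unicode code points.
        return REPLACEMENT
```

Under this assumption, `decode_charref(147)` yields a left double quotation mark rather than a control character, and a number beyond the Unicode range falls back to the replacement character instead of raising.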
2018-07-21 | Fixed a problem where the html.parser tree builder interpreted a string like '&foo ' as the character entity '&foo;' [bug=1728706] | Leonard Richardson
2018-07-15 | Stop data loss when encountering an empty numeric entity, and possibly in other cases. Thanks to tos.kamiya for the fix. [bug=1698503] | Leonard Richardson
2018-07-14 | Stopped HTMLParser from raising an exception in very rare cases of bad markup. [bug=1708831] | Leonard Richardson
2017-05-06 | Improved the handling of empty-element tags like <br> when using the html.parser parser. [bug=1676935] | Leonard Richardson
2016-07-16 | Removed imports to pdb, since pdb is not available in some environments. [bug=1491700] | Leonard Richardson
2016-07-16 | Added a separate class for XML processing instructions, which have a slightly different format from SGML processing instructions. [bug=1504383] | Leonard Richardson
2016-07-16 | Rename COPYING.txt to LICENSE. Add a reference to LICENSE in every source file. | Leonard Richardson
2015-06-28 | It's now possible to pickle a BeautifulSoup object no matter which tree builder was used to create it. However, the only tree builder that survives the pickling process is the HTMLParserTreeBuilder ('html.parser'). If you unpickle a BeautifulSoup object created with some other tree builder, soup.builder will be None. [bug=1231545] | Leonard Richardson
2015-06-27 | Added an exclude_encodings argument to UnicodeDammit and to the Beautiful Soup constructor, which lets you prohibit the detection of an encoding that you know is wrong. [bug=1469408] | Leonard Richardson
2015-06-24 | Fixed an import error in Python 3.5 caused by the removal of the | Leonard Richardson
2015-06-24 | Made double sure that we don't use the 'strict' constructor argument when it's deprecated. [bug=1341055] | Leonard Richardson
2014-12-11 | Improved the lxml tree builder's handling of processing instructions. [bug=1294645] | Leonard Richardson
2014-12-07 | In Python 3.4 and above, set the new convert_charrefs argument to the html.parser constructor to avoid a warning and future failures. Patch by Stefano Revera. [bug=1375721] | Leonard Richardson
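The effect of `convert_charrefs` can be seen with the standard library's HTMLParser alone. The sketch below passes `convert_charrefs=False`, so entity and character references arrive through their own callbacks instead of being folded into the text. The value False is an assumption for the illustration (the commit message does not state which value is set), and `RefCollector` and `collect_events` are hypothetical names.

```python
from html.parser import HTMLParser


class RefCollector(HTMLParser):
    """Hypothetical helper that logs the parser callbacks it receives."""

    def __init__(self):
        # Passing the argument explicitly avoids relying on the default,
        # which differs between Python versions.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Called for named references such as &amp; (name is "amp").
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Called for numeric references such as &#38; (name is "38").
        self.events.append(("charref", name))


def collect_events(markup):
    """Return the list of (kind, value) callbacks for the markup."""
    parser = RefCollector()
    parser.feed(markup)
    parser.close()
    return parser.events
```

A tree builder that wants to apply its own entity handling needs these separate callbacks, which is one reason to set the argument explicitly rather than accept the default.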
2014-12-07 | Issue a warning if the BeautifulSoup constructor arguments do not explicitly name a parser. | Leonard Richardson
2013-10-01 | Fixed a bug in which short Unicode input was improperly encoded to ASCII when checking whether or not it was a file on disk. [bug=1227016] | Leonard Richardson
2013-06-02 | Merged in big encoding-detection refactoring branch. | Leonard Richardson
2013-05-31 | The html.parser treebuilder can now handle numeric attributes in text when the hexadecimal name of the attribute starts with a capital X. | Leonard Richardson
2013-05-31 | Create a new lxml parser object for every new parsing strategy. | Leonard Richardson
2013-05-07 | Now that lxml's segfault on invalid doctype has been fixed, fix a corresponding problem on the Beautiful Soup end that was previously invisible. [bug=984936] | Leonard Richardson
2012-04-18 | Changed wording slightly. | Leonard Richardson
2012-04-18 | Print a warning on HTMLParseErrors to let people know they should install an external parser. | Leonard Richardson
2012-04-18 | Fixed a bug that made the HTMLParser treebuilder generate XML definitions ending with two question marks instead of one. [bug=984258] | Leonard Richardson
2012-02-21 | Added nsprefix argument to the tag class. | Leonard Richardson
2012-02-21 | Merged from trunk. | Leonard Richardson
2012-02-20 | It's now possible to copy a BeautifulSoup object created with the html.parser treebuilder. | Leonard Richardson
2012-02-20 | Changed the class structure so that the default parser test class uses html.parser. | Leonard Richardson
2012-02-16 | It's a start, at least. | Leonard Richardson
2012-02-09 | As a last-ditch attempt to turn data into Unicode, use errors=replace instead of errors=strict. | Leonard Richardson
2012-02-09 | Minor Unicode, Dammit cleanup. | Leonard Richardson
2012-02-06 | Monkeypatch Python 3.2 versions prior to 3.2.3 to solve some major HTMLParser bugs. | Leonard Richardson