summaryrefslogtreecommitdiff
path: root/bs4/builder
AgeCommit message (Collapse)Author
2024-08-21* Changes to make tests work whether tests are run under soupsieve 2.6Leonard Richardson
or an earlier version. Based on a patch by Stefano Rivera. * Removed the strip_cdata argument to lxml's HTMLParser constructor, which never did anything and is deprecated as of lxml 5.3.0. Patch by Stefano Rivera. [bug=2076897]
2024-02-12Applied patch from Marc Müller to add a stacklevel to a warning that was ↵Leonard Richardson
missing it.
2024-01-17Added the correct stacklevel to instances of the XMLParsedAsHTMLWarning.Leonard Richardson
[bug=2034451]
2023-06-04Fixed a case found by Mengyuhan where html.parser giving up onLeonard Richardson
markup would result in an AssertionError instead of a ParserRejectedMarkup exception.
2023-02-15When the html.parser parser decides it can't parse a document, BeautifulLeonard Richardson
Soup now consistently propagates this fact by raising a ParserRejectedMarkup error. [bug=2007343]
2023-01-27Got rid of some more warnings by removing code that's not relevant anymore, ↵Leonard Richardson
now that the minimum supported Python version is 3.6.
2023-01-27Warnings now do their best to provide an appropriate stacklevel,Leonard Richardson
improving the usefulness of the message. [bug=1978744]
2022-04-10Fixed another crash when overriding multi_valued_attributes and using theLeonard Richardson
html5lib parser. [bug=1948488]
2021-11-29Do a better job of keeping track of namespaces as an XML document isLeonard Richardson
parsed, so that CSS selectors that use namespaces will do the right thing more often. [bug=1946243]
2021-10-24Issue a warning when an HTML parser is used to parse a document thatLeonard Richardson
looks like XML but not XHTML. [bug=1939121]
2021-10-23Added a workaround for an lxml bug ↵Leonard Richardson
(https://bugs.launchpad.net/lxml/+bug/1948551) that caused problems when parsing a Unicode string beginning with BYTE ORDER MARK. [bug=1947768]
2021-10-23Fixed a crash when overriding multi_valued_attributes and using theLeonard Richardson
html5lib parser. [bug=1948488]
2021-10-11Added special string classes, RubyParenthesisString and RubyTextString,Leonard Richardson
to make it possible to treat ruby text specially in get_text() calls. [bug=1941980]
2021-09-07Goodbye, Python 2. [bug=1942919]Leonard Richardson
2021-05-31The html.parser tree builder can now handles named entitiesLeonard Richardson
found in the HTML5 spec in much the same way that the html5lib tree builder does. Note that the lxml tree builder still handles named entities differently. [bug=1924908]
2021-02-13Added a second way to pass specify encodings to UnicodeDammit andLeonard Richardson
EncodingDetector, based on the order of precedence defined in the HTML5 spec, starting at: https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding Encodings in 'known_definite_encodings' are tried first, then byte-order-mark sniffing is run, then encodings in 'user_encodings' are tried. The old argument, 'override_encodings', is now a deprecated alias for 'known_definite_encodings'. This changes the default behavior of the html.parser and lxml tree builders, in a way that may slightly improve encoding detection but will probably have no effect. [bug=1889014]
2020-05-30Remove explicit reference to the module name within the module, replacing it ↵Leonard Richardson
with __name__.
2020-05-17Switch entirely to Python 3-style print statements, even in Python 2.Leonard Richardson
2020-05-17Documented some recently added customization features.Leonard Richardson
2020-05-17Added a keyword argument on_duplicate_attribute to theLeonard Richardson
BeautifulSoupHTMLParser constructor (used by the html.parser tree builder) which lets you customize the handling of markup that contains the same attribute more than once, as in: <a href="url1" href="url2"> [bug=1878209]
2020-04-05Embedded CSS and Javascript is now stored in distinct Stylesheet andLeonard Richardson
Script tags, which are ignored by methods like get_text(). This feature is not supported by the html5lib treebuilder. [bug=1868861]
2019-12-24Added docstrings for some but not all tree buidlers.Leonard Richardson
2019-11-11Simplified code.Leonard Richardson
2019-11-11The html.parser tree builder now correctly handles DOCTYPEs that areLeonard Richardson
not uppercase. [bug=1848401]
2019-11-11Added a Brazilian Portuguese translation by Cezar Peixeiro.Leonard Richardson
2019-09-02Avoid a crash when trying to detect the declared encoding of aLeonard Richardson
Unicode document. Raise an explanatory exception when the underlying parser completely rejects the incoming markup. [bug=1838877]
2019-07-21Implemented line number tracking for html5lib.Leonard Richardson
2019-07-21Adapt Chris Mayo's code to track line number and position when using ↵Leonard Richardson
html.parser.
2019-07-14Give the Formatter class more control over formatting decisions.Leonard Richardson
2019-07-07Renamed the cdata_list_attributes argument to multi_valued_attributes since ↵Leonard Richardson
it's facing the end-user and that's a more easily understandable name.
2019-07-07It's now possible to override a TreeBuilder's cdata_list_attributes ↵Leonard Richardson
dictionary by passing in a replacement. None will disable the feature altogether. [bug=1832978]
2019-01-06Don't track un-prefixed namespacesIsaac Muse
2018-12-30Fixed a problem with multi-valued attributes where the valueLeonard Richardson
contained whitespace. Thanks to Jens Svalgaard for the fix. [bug=1787453]
2018-12-24Clarified the software license.Leonard Richardson
2018-12-24Keep track of the namespace abbreviations found while parsing the document. ↵Leonard Richardson
This makes select() work most of the time without requiring a value for 'namespaces'.
2018-12-22Fix next and previous linkage issues. Fixes issues #1806598 and #1782928.Isaac Muse
2018-08-12Converted README to Markdown format.Leonard Richardson
2018-07-28Correctly handle invalid HTML numeric character entities like &#147;Leonard Richardson
which reference code points that are not Unicode code points. Note that this is only fixed when Beautiful Soup is used with the html.parser parser -- html5lib already worked and I couldn't fix it with lxml. [bug=1782933]
2018-07-21Fixed a problem where the html.parser tree builder interpretedLeonard Richardson
a string like '&foo ' as the character entity '&foo;' [bug=1728706]
2018-07-18Preserve XML namespaces when they are introduced inside an XMLLeonard Richardson
document, not just the ones introduced at the top level. [bug=1718787]
2018-07-15Introduced the Formatter system. [bug=1716272].Leonard Richardson
2018-07-15It's possible for a TreeBuilder subclass to specify that voidLeonard Richardson
elements should be represented as <element> rather than <element/>, by setting TreeBuilder.void_element_close_prefix to the empty string. [bug=1716272]
2018-07-15Stop data loss when encountering an empty numeric entity, andLeonard Richardson
possibly in other cases. Thanks to tos.kamiya for the fix. [bug=1698503]
2018-07-14Stopped HTMLParser from raising an exception in very rare cases ofLeonard Richardson
bad markup. [bug=1708831]
2017-05-06 Improved the handling of empty-element tags like <br> when using theLeonard Richardson
html.parser parser. [bug=1676935]
2017-05-06HTML parsers treat all HTML4 and HTML5 empty element tags (aka void element ↵Leonard Richardson
tags) correctly. [bug=1656909]
2016-12-19Fixed foster parenting when html5lib is the tree builder. Thanks to Geoffrey ↵Leonard Richardson
Sneddon for a patch and test.
2016-12-19Fixed yet another problem that caused the html5lib tree builder toLeonard Richardson
2016-07-30Explained why we test both unicode and bytestring processing instructions.Leonard Richardson
2016-07-26Fixed a reported (but not duplicated) bug involving processing instructions ↵Leonard Richardson
fed into the lxml HTML parser.