summaryrefslogtreecommitdiff
path: root/bs4/builder/_lxml.py
AgeCommit message (Collapse)Author
2021-11-29Do a better job of keeping track of namespaces as an XML document isLeonard Richardson
parsed, so that CSS selectors that use namespaces will do the right thing more often. [bug=1946243]
2021-10-24Issue a warning when an HTML parser is used to parse a document thatLeonard Richardson
looks like XML but not XHTML. [bug=1939121]
2021-10-23Added a workaround for an lxml bug ↵Leonard Richardson
(https://bugs.launchpad.net/lxml/+bug/1948551) that caused problems when parsing a Unicode string beginning with BYTE ORDER MARK. [bug=1947768]
2021-09-07Goodbye, Python 2. [bug=1942919]Leonard Richardson
2021-02-13Added a second way to pass specify encodings to UnicodeDammit andLeonard Richardson
EncodingDetector, based on the order of precedence defined in the HTML5 spec, starting at: https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding Encodings in 'known_definite_encodings' are tried first, then byte-order-mark sniffing is run, then encodings in 'user_encodings' are tried. The old argument, 'override_encodings', is now a deprecated alias for 'known_definite_encodings'. This changes the default behavior of the html.parser and lxml tree builders, in a way that may slightly improve encoding detection but will probably have no effect. [bug=1889014]
2019-12-24Added docstrings for some but not all tree buidlers.Leonard Richardson
2019-11-11Added a Brazilian Portuguese translation by Cezar Peixeiro.Leonard Richardson
2019-09-02Avoid a crash when trying to detect the declared encoding of aLeonard Richardson
Unicode document. Raise an explanatory exception when the underlying parser completely rejects the incoming markup. [bug=1838877]
2019-07-21Implemented line number tracking for html5lib.Leonard Richardson
2019-07-07It's now possible to override a TreeBuilder's cdata_list_attributes ↵Leonard Richardson
dictionary by passing in a replacement. None will disable the feature altogether. [bug=1832978]
2019-01-06Don't track un-prefixed namespacesIsaac Muse
2018-12-24Clarified the software license.Leonard Richardson
2018-12-24Keep track of the namespace abbreviations found while parsing the document. ↵Leonard Richardson
This makes select() work most of the time without requiring a value for 'namespaces'.
2018-07-18Preserve XML namespaces when they are introduced inside an XMLLeonard Richardson
document, not just the ones introduced at the top level. [bug=1718787]
2018-07-14Stopped HTMLParser from raising an exception in very rare cases ofLeonard Richardson
bad markup. [bug=1708831]
2016-07-30Explained why we test both unicode and bytestring processing instructions.Leonard Richardson
2016-07-26Fixed a reported (but not duplicated) bug involving processing instructions ↵Leonard Richardson
fed into the lxml HTML parser.
2016-07-16Removed imports to pdb, since pdb is not available in some environments. ↵Leonard Richardson
[bug=1491700]
2016-07-16Added a separate class for XML processing instructions, which have a ↵Leonard Richardson
slightly different format from SGML processing instructions. [bug=1504383]
2016-07-16Rename COPYING.txt to LICENSE. Add a reference to LICENSE in every source file.Leonard Richardson
2015-06-28Accept 'xml' as an unambiguous identifier for the lxml XML parser, since ↵Leonard Richardson
it's the only XML parser supported at the moment.
2015-06-27Added an exclude_encodings argument to UnicodeDammit and to theLeonard Richardson
Beautiful Soup constructor, which lets you prohibit the detection of an encoding that you know is wrong. [bug=1469408]
2014-12-11Improved the lxml tree builder's handling of processingLeonard Richardson
instructions. [bug=1294645]
2014-12-07Tweaked the parser warning.Leonard Richardson
2014-12-07Issue a warning if the BeautifulSoup constructor arguments do not explicitly ↵Leonard Richardson
name a parser.
2013-06-02Turns out we had two bits of code to strip byte-order marks.Leonard Richardson
2013-06-02It turns out most of the untested code wasn't doing anything useful.Leonard Richardson
2013-06-02Treat an lxml ParserError as a ParserRejectedMarkup.Leonard Richardson
2013-05-31Create a new lxml parser object for every new parsing strategy.Leonard Richardson
2013-05-09Changed lxml.feed() to handle the eventuality that it may be given a bytestring.Leonard Richardson
2013-05-09Added a diagnostic function for randomly generating a simple, invalid HTML ↵Leonard Richardson
document.
2012-10-11Fix a bug in the lxml treebuilder which crashed when a tag includedLeonard Richardson
an attribute from the predefined xml: namespace. [bug=1065617]
2012-09-28Fixed package name.Leonard Richardson
2012-08-16Use namespace prefixes for namespaced attribute names, instead ofLeonard Richardson
the fully-qualified names given by the lxml parser. [bug=1037597]
2012-05-29Removed breakpoints.Leonard Richardson
2012-05-29Prep for release.Leonard Richardson
2012-05-24Fixed a bug with the lxml treebuilder that prevented the user from adding ↵Leonard Richardson
attributes to a tag that didn't originally have any. [bug=1002378] Thanks to Oliver Beattie for the patch.
2012-04-03Got rid of the 4.0.2 workaround for HTML documents--it was unnecessary and ↵Leonard Richardson
the workaround was triggering a (possibly different, but related) bug in lxml. [bug=972466]
2012-04-03Don't split up the markup into chunks when using the lxml HTML parser, which ↵Leonard Richardson
doesn't have the problems of the XML parser.
2012-03-24Pass data into XMLParser.feed() in chunks. [bug=963880]Leonard Richardson
2012-02-28Fixed the generated XML declaration.Leonard Richardson
2012-02-23Fixed handling of the closing of namespaced tags.Leonard Richardson
2012-02-23Merge from trunk and added tests.Leonard Richardson
2012-02-22Added comments.Leonard Richardson
2012-02-22Treat a new namespace mapping as a set of attributes on the tag that defines ↵Leonard Richardson
it, so we don't lose the mappings.
2012-02-21Have lxml invert namespace maps as they come in and set each tag's prefix ↵Leonard Richardson
appropriately.
2012-02-21Added nsprefix argument to the tag class.Leonard Richardson
2012-02-16It's a start, at least.Leonard Richardson
2012-02-09As a last-ditch attempt to turn data into Unicode, use errors=replace ↵Leonard Richardson
instead of errors=strict.
2012-02-09Minor Unicode, Dammit cleanup.Leonard Richardson