path: root/bs4
Age | Commit message | Author
2021-02-14 | The 'html5' formatter now treats attributes whose values are the empty string as HTML boolean attributes. Previously (and in other formatters), an attribute value must be set as None to be treated as a boolean attribute. In a future release, I plan to also give this behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424] | Leonard Richardson
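The effect of the 'html5' formatter change can be sketched as follows. This is a minimal illustration that assumes Beautiful Soup 4.10.0 or later and the built-in html.parser tree builder; the markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# An attribute whose value is the empty string.
soup = BeautifulSoup('<option selected=""></option>', "html.parser")
option = soup.option

# With the 'html5' formatter the empty value is dropped, so the
# attribute is written as an HTML boolean attribute.
as_html5 = option.decode(formatter="html5")

# The default 'minimal' formatter still writes the empty value out.
as_minimal = option.decode(formatter="minimal")

print(as_html5)
print(as_minimal)
```

On older releases both calls would be expected to print the same markup, since only a value of None was treated as boolean there.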
2021-02-13 | The behavior of methods like .get_text() and .strings now differs depending on the type of tag. The change is visible with HTML tags like <script>, <style>, and <template>. Starting in 4.9.0, methods like get_text() returned no results on such tags, because the contents of those tags are not considered 'text' within the document as a whole. But a user who calls script.get_text() is working from a different definition of 'text' than a user who calls div.get_text()--otherwise there would be no need to call script.get_text() at all. In 4.10.0, the contents of (e.g.) a <script> tag are considered 'text' during a get_text() call on the tag itself, but not considered 'text' during a get_text() call on the tag's parent. Because of this change, calling get_text() on each child of a tag may now return a different result than calling get_text() on the tag itself. That's because different tags now have different understandings of what counts as 'text'. [bug=1906226] [bug=1868861] | Leonard Richardson
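The difference between calling get_text() on a <script> tag and on its parent can be sketched like this. The example assumes Beautiful Soup 4.10.0 or later with the html.parser tree builder; the markup is invented for illustration.

```python
from bs4 import BeautifulSoup

markup = "<div><p>Visible text</p><script>var x = 1;</script></div>"
soup = BeautifulSoup(markup, "html.parser")

# From the parent's point of view, the script body is not 'text'.
parent_text = soup.div.get_text()

# Called on the <script> tag itself, its contents do count as 'text'.
script_text = soup.script.get_text()

# Joining the per-child results therefore differs from parent_text.
joined_children = "".join(child.get_text() for child in soup.div.children)

print(repr(parent_text))
print(repr(script_text))
print(repr(joined_children))
```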
2021-02-13 | Corrected the use of special string container classes in cases when a single tag may contain strings with different containers; such as the <template> tag, which may contain both TemplateString objects and Comment objects. [bug=1913406] | Leonard Richardson
2021-02-13 | Added a second way to specify encodings to UnicodeDammit and EncodingDetector, based on the order of precedence defined in the HTML5 spec, starting at: https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding Encodings in 'known_definite_encodings' are tried first, then byte-order-mark sniffing is run, then encodings in 'user_encodings' are tried. The old argument, 'override_encodings', is now a deprecated alias for 'known_definite_encodings'. This changes the default behavior of the html.parser and lxml tree builders, in a way that may slightly improve encoding detection but will probably have no effect. [bug=1889014] | Leonard Richardson
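The order of precedence described in this commit can be sketched with the standard library alone. This is not Beautiful Soup's implementation: `sniff_bom` and `candidate_encodings` are hypothetical helpers written for this illustration, and only the three stages named in the commit message are covered.

```python
import codecs

# Byte-order marks, longest first so UTF-32 is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def candidate_encodings(data, known_definite_encodings=(), user_encodings=()):
    """Yield encodings to try, in the order the commit message describes."""
    seen = set()

    def fresh(name):
        # Skip encodings that were already offered (case-insensitive).
        key = name.lower()
        if key in seen:
            return False
        seen.add(key)
        return True

    # 1. Encodings the caller is certain about.
    for name in known_definite_encodings:
        if fresh(name):
            yield name
    # 2. Byte-order-mark sniffing.
    sniffed = sniff_bom(data)
    if sniffed and fresh(sniffed):
        yield sniffed
    # 3. Encodings the caller merely suggests.
    for name in user_encodings:
        if fresh(name):
            yield name


with_bom = list(candidate_encodings(
    codecs.BOM_UTF8 + b"hello", ["iso-8859-2"], ["windows-1251"]))
without_bom = list(candidate_encodings(
    b"hello", ["iso-8859-2"], ["windows-1251"]))
print(with_bom)
print(without_bom)
```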
2021-02-13 | Performance improvement when processing tags that speeds up overall tree construction by 2%. Patch by Morotti. [bug=1899358] | Leonard Richardson
2021-02-13 | Improve the warning issued when a directory name (as opposed to the name of a regular file) is passed as markup into the BeautifulSoup constructor. [bug=1913628] | Leonard Richardson
2021-02-13 | Corrected output when the namespace prefix associated with a namespaced attribute is the empty string, as opposed to None. [bug=1915583] | Leonard Richardson
2020-10-03 | Prepare for release. | Leonard Richardson
2020-10-02 | Implemented a significant performance optimization to the process of searching the parse tree. Patch by Morotti. [bug=1898212] | Leonard Richardson
2020-09-26 | Increment version number. | Leonard Richardson
2020-09-26 | Fixed a bug that inconsistently moved elements over when passing a Tag, rather than a list, into Tag.extend(). [bug=1885710] | Leonard Richardson
2020-09-26 | Change the signatures for BeautifulSoup.insert_before and insert_after (which are not implemented) to match PageElement.insert_before and insert_after, quieting warnings in some IDEs. [bug=1897120] | Leonard Richardson
2020-05-30 | Fixed a bug that caused too many tags to be popped from the tag stack during tree building, when encountering a closing tag that had no matching opening tag. [bug=1880420] | Leonard Richardson
2020-05-30 | Remove explicit reference to the module name within the module, replacing it with __name__. | Leonard Richardson
2020-05-17 | Switch entirely to Python 3-style print statements, even in Python 2. | Leonard Richardson
2020-05-17 | Documented some recently added customization features. | Leonard Richardson
2020-05-17 | Added docstring for BeautifulSoup.new_tag. | Leonard Richardson
2020-05-17 | Added a keyword argument on_duplicate_attribute to the BeautifulSoupHTMLParser constructor (used by the html.parser tree builder) which lets you customize the handling of markup that contains the same attribute more than once, as in: <a href="url1" href="url2"> [bug=1878209] | Leonard Richardson
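A short sketch of on_duplicate_attribute in use, assuming a Beautiful Soup release that includes this commit and that the keyword is passed through the BeautifulSoup constructor to the html.parser tree builder, as the project documentation describes. The `accumulate` helper is an illustrative name.

```python
from bs4 import BeautifulSoup

markup = '<a href="url1" href="url2"></a>'

# Default behavior: the last value seen replaces earlier ones.
last_wins = BeautifulSoup(markup, "html.parser").a["href"]

# 'ignore' keeps the first value and discards later duplicates.
first_wins = BeautifulSoup(
    markup, "html.parser", on_duplicate_attribute="ignore"
).a["href"]


def accumulate(attributes_so_far, key, value):
    """Collect every value of a repeated attribute into a list."""
    if not isinstance(attributes_so_far[key], list):
        attributes_so_far[key] = [attributes_so_far[key]]
    attributes_so_far[key].append(value)


# A callable receives the attribute dict built so far, plus the
# duplicate key and its new value, and decides what to store.
all_values = BeautifulSoup(
    markup, "html.parser", on_duplicate_attribute=accumulate
).a["href"]

print(last_wins, first_wins, all_values)
```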
2020-04-24 | If you encode a document with a Python-specific encoding like 'unicode_escape', that encoding is no longer mentioned in the final XML or HTML document. Instead, encoding information is omitted or left blank. [bug=1874955] | Leonard Richardson
2020-04-21 | Added two distinct UserWarning subclasses for warnings issued from the BeautifulSoup constructor which a caller may want to filter out. [bug=1873787] | Leonard Richardson
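Filtering one of these warnings might look like the sketch below. The commit message does not name the classes; in released versions of Beautiful Soup (4.9.1 and later) they are importable as GuessedAtParserWarning and MarkupResemblesLocatorWarning, and the example assumes such a release is installed.

```python
import warnings

from bs4 import BeautifulSoup, GuessedAtParserWarning

with warnings.catch_warnings(record=True) as caught:
    # Record everything, then silence only the parser-guessing warning.
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=GuessedAtParserWarning)

    # No parser is named here, which would normally trigger the warning.
    soup = BeautifulSoup("<p>hello</p>")

guessed = [w for w in caught if issubclass(w.category, GuessedAtParserWarning)]
print(len(guessed))
```

Because the classes derive from UserWarning, a broader filter on UserWarning would also silence them, at the cost of hiding unrelated warnings.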
2020-04-12 | Fixed test failures when run against soupselect 2.0. Patch by Tomáš Chvátal. [bug=1872279] | Leonard Richardson
2020-04-07 | Add Script, Stylesheet, and TemplateString to the 'bs4' namespace. | Leonard Richardson
2020-04-05 | Embedded CSS and Javascript is now stored in distinct Stylesheet and Script tags, which are ignored by methods like get_text(). This feature is not supported by the html5lib treebuilder. [bug=1868861] | Leonard Richardson
2020-04-04 | Use an :rtype: reported to work in pycharm. | Leonard Richardson
2020-04-04 | select() always returns a Tag, so be more specific about its return type. | Leonard Richardson
2020-03-10 | Fixed a bug that happened when passing a Unicode filename containing non-ASCII characters as markup into Beautiful Soup, on a system that allows Unicode filenames. [bug=1866717] | Leonard Richardson
2020-03-09 | Make find() methods return a union type of the two most common PageElements, rather than PageElement itself. | Leonard Richardson
2020-03-05 | Added a performance optimization to PageElement.extract(). Patch by Arthur Darcet. | Leonard Richardson
2020-01-01 | API CHANGE - Added PageElement.decomposed, a new property which lets you check whether you've already called decompose() on a Tag or NavigableString. | Leonard Richardson
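A minimal sketch of the decomposed property, assuming a Beautiful Soup release that includes this commit; the markup is illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
first = soup.p

# Before decompose(), the element is still part of the tree.
before = first.decomposed

# decompose() removes the element from the tree and destroys it.
first.decompose()
after = first.decomposed

print(before, after)
print(soup)
```

Checking the property before touching an element can help code that holds long-lived references avoid working with objects that have already been destroyed.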
2019-12-29 | Fixed an unhandled exception when formatting a Tag that had been decomposed. [bug=1857767] | Leonard Richardson
2019-12-24 | Bumped version number. | Leonard Richardson
2019-12-24 | Minor changes to docstrings. | Leonard Richardson
2019-12-24 | Added :rtype: to the find method docstrings. | Leonard Richardson
2019-12-24 | Added docstrings for some but not all tree builders. | Leonard Richardson
2019-12-24 | Added docstrings to diagnose.py. | Leonard Richardson
2019-12-24 | Wrote docstrings for formatter.py. | Leonard Richardson
2019-12-24 | Fixed deprecation warning. [bug=1855301] | Leonard Richardson
2019-12-24 | Added docstrings to all public methods in dammit.py. | Leonard Richardson
2019-12-20 | Added docstrings to all methods in __init__.py | Leonard Richardson
2019-12-18 | Added Python docstrings to all public methods in element.py. | Leonard Richardson
2019-11-11 | Simplified code. | Leonard Richardson
2019-11-11 | The html.parser tree builder now correctly handles DOCTYPEs that are not uppercase. [bug=1848401] | Leonard Richardson
2019-11-11 | Fixed a deprecation warning on Python 3.7. Patch by Colin Watson. [bug=1847592] | Leonard Richardson
2019-11-11 | Added a Brazilian Portuguese translation by Cezar Peixeiro. | Leonard Richardson
2019-11-10 | Fix deprecation warning with Python >= 3.7. Python >= 3.7 issues a deprecation warning when using collections.Callable rather than collections.abc.Callable. Most of Beautiful Soup deals with this by using a conditional import, but the automatic Python 3 conversion apparently translates `callable(obj)` to `isinstance(obj, collections.Callable)` which trips this deprecation warning. `isinstance(obj, Callable)` works fine in Python 2 as well as 3, so just use it directly. | Colin Watson
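The conditional import and the direct isinstance() check described in this commit can be sketched as follows. The `is_callable` helper is a hypothetical name used only for this illustration, not a function from Beautiful Soup.

```python
try:
    # Python 3.3+: the abstract base classes live in collections.abc.
    from collections.abc import Callable
except ImportError:
    # Python 2 fallback.
    from collections import Callable


def is_callable(obj):
    """Check callability without going through callable(obj)."""
    return isinstance(obj, Callable)


print(is_callable(len))
print(is_callable(lambda: None))
print(is_callable("not a function"))
```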
2019-10-06 | Added section on Python 2 sunsetting. | Leonard Richardson
2019-10-05 | Avoid a crash when unpickling certain parse trees generated using html5lib on Python 3. [bug=1843545] | Leonard Richardson
2019-09-02 | Avoid a crash when trying to detect the declared encoding of a Unicode document. Raise an explanatory exception when the underlying parser completely rejects the incoming markup. [bug=1838877] | Leonard Richardson
2019-08-26 | It's now possible to override any of the element classes. | Leonard Richardson
2019-08-26 | Fixed the definition of the default XML namespace when using lxml 4.4. Patch by Isaac Muse. [bug=1840141] | Leonard Richardson