Age  Commit message  Author
2021-02-14  The 'html5' formatter now treats attributes whose values are the empty string as HTML boolean attributes. Previously (and in other formatters), an attribute value must be set as None to be treated as a boolean attribute. In a future release, I plan to also give this behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424]  (Leonard Richardson)
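As a rough illustration of the entry above, the sketch below renders the same tag with the 'html5' formatter and with the default 'minimal' formatter. The markup and variable names are invented for the example, and it assumes Beautiful Soup 4.10.0 or later is installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup: an attribute whose value is the empty string.
soup = BeautifulSoup('<option selected="">x</option>', "html.parser")
tag = soup.option

# With the 'html5' formatter, the empty-string value is treated as boolean,
# so only the attribute name should be written out.
html5_out = tag.decode(formatter="html5")

# The default 'minimal' formatter keeps the explicit empty value.
minimal_out = tag.decode(formatter="minimal")

print(html5_out)
print(minimal_out)
```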
2021-02-13  The behavior of methods like .get_text() and .strings now differs depending on the type of tag. The change is visible with HTML tags like <script>, <style>, and <template>. Starting in 4.9.0, methods like get_text() returned no results on such tags, because the contents of those tags are not considered 'text' within the document as a whole. But a user who calls script.get_text() is working from a different definition of 'text' than a user who calls div.get_text()--otherwise there would be no need to call script.get_text() at all. In 4.10.0, the contents of (e.g.) a <script> tag are considered 'text' during a get_text() call on the tag itself, but not considered 'text' during a get_text() call on the tag's parent. Because of this change, calling get_text() on each child of a tag may now return a different result than calling get_text() on the tag itself. That's because different tags now have different understandings of what counts as 'text'. [bug=1906226] [bug=1868861]  (Leonard Richardson)
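The difference described above can be seen in a small sketch. The markup is made up for illustration, the html.parser tree builder is used, and the behavior shown assumes Beautiful Soup 4.10.0 or later.

```python
from bs4 import BeautifulSoup

# Illustrative markup: a <div> holding ordinary text and a <script>.
markup = "<div><p>Hello</p><script>var x = 1;</script></div>"
soup = BeautifulSoup(markup, "html.parser")

# Called on the parent, the script contents are not counted as text.
parent_text = soup.div.get_text()

# Called on the <script> tag itself, its contents are counted as text.
script_text = soup.script.get_text()

# Collecting text child by child can therefore differ from the parent's view.
child_texts = [child.get_text() for child in soup.div.children]

print(repr(parent_text))
print(repr(script_text))
print(child_texts)
```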
2021-02-13  Corrected the use of special string container classes in cases when a single tag may contain strings with different containers, such as the <template> tag, which may contain both TemplateString objects and Comment objects. [bug=1913406]  (Leonard Richardson)
2021-02-13  Added a second way to specify encodings to UnicodeDammit and EncodingDetector, based on the order of precedence defined in the HTML5 spec, starting at: https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding Encodings in 'known_definite_encodings' are tried first, then byte-order-mark sniffing is run, then encodings in 'user_encodings' are tried. The old argument, 'override_encodings', is now a deprecated alias for 'known_definite_encodings'. This changes the default behavior of the html.parser and lxml tree builders, in a way that may slightly improve encoding detection but will probably have no effect. [bug=1889014]  (Leonard Richardson)
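A minimal sketch of the two arguments named above, assuming Beautiful Soup 4.10.0 or later. The sample bytes and variable names are illustrative only; the input has no byte-order mark and no declared encoding, so the supplied encoding is the first candidate that succeeds in both calls.

```python
from bs4.dammit import UnicodeDammit

# Illustrative input: Latin-1 bytes that are not valid UTF-8.
raw = "caf\u00e9".encode("latin-1")

# Encodings in known_definite_encodings are tried before anything else.
definite = UnicodeDammit(raw, known_definite_encodings=["latin-1"])
text = definite.unicode_markup
used = definite.original_encoding

# Encodings in user_encodings are tried after byte-order-mark sniffing.
fallback = UnicodeDammit(raw, user_encodings=["latin-1"])
fallback_text = fallback.unicode_markup

print(text, used, fallback_text)
```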
2021-02-13  Performance improvement when processing tags that speeds up overall tree construction by 2%. Patch by Morotti. [bug=1899358]  (Leonard Richardson)
2021-02-13  Improve the warning issued when a directory name (as opposed to the name of a regular file) is passed as markup into the BeautifulSoup constructor. [bug=1913628]  (Leonard Richardson)
2021-02-13  Corrected output when the namespace prefix associated with a namespaced attribute is the empty string, as opposed to None. [bug=1915583]  (Leonard Richardson)
2020-10-24  Exclude more tests from the package. Patch by Ville Skyttä.  (Leonard Richardson)
2020-10-24  Fix tests install exclusion  (Ville Skyttä)
2020-10-03  I always forget to bump the version number in the doc.  (Leonard Richardson)
2020-10-03  Prepare for release.  (Leonard Richardson)
2020-10-02  Implemented a significant performance optimization to the process of searching the parse tree. Patch by Morotti. [bug=1898212]  (Leonard Richardson)
2020-09-26  Changed version number of development Python in use.  (Leonard Richardson)
2020-09-26  Incremented version number in the documentation.  (Leonard Richardson)
2020-09-26  Increment version number.  (Leonard Richardson)
2020-09-26  Fixed a bug that inconsistently moved elements over when passing a Tag, rather than a list, into Tag.extend(). [bug=1885710]  (Leonard Richardson)
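To make the Tag.extend() case concrete, here is a small sketch with invented markup and ids. It assumes a release that includes this fix, in which passing a Tag moves all of that tag's children into the target.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one populated container and one empty container.
markup = '<div id="a"><b>1</b><i>2</i><u>3</u></div><div id="b"></div>'
soup = BeautifulSoup(markup, "html.parser")
source = soup.find(id="a")
target = soup.find(id="b")

# Passing a Tag (rather than a list) moves every child of that Tag.
target.extend(source)

moved_names = [child.name for child in target.contents]
remaining = len(source.contents)

print(moved_names, remaining)
```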
2020-09-26  Change the signatures for BeautifulSoup.insert_before and insert_after (which are not implemented) to match PageElement.insert_before and insert_after, quieting warnings in some IDEs. [bug=1897120]  (Leonard Richardson)
2020-08-31  Specify the soupsieve dependency in a way that complies with PEP 508. Patch by Mike Nerone. [bug=1893696]  (Leonard Richardson)
2020-08-31  Correct PyPI dep metadata (PEP 508 env markers instead of a condition in setup.py)  (Mike Nerone)
2020-07-29  Ran through all of the documentation code examples using Python 3, corrected discrepancies and errors, and updated representations.  (Leonard Richardson)
2020-07-24  Added a paragraph to the documentation about the fact that bs4 Tag implements __hash__ and bs3 Tag doesn't.  (Leonard Richardson)
2020-06-11  Converted the sample code in README.md to Python 3.  (Leonard Richardson)
2020-05-31  Make the doc a little less defensive.  (Leonard Richardson)
2020-05-31  Added to the troubleshooting section a bit to catch searches for the AttributeError that happens if you treat a string like a tag.  (Leonard Richardson)
2020-05-30  Fixed a bug that caused too many tags to be popped from the tag stack during tree building, when encountering a closing tag that had no matching opening tag. [bug=1880420]  (Leonard Richardson)
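The kind of markup involved can be sketched as follows. The document is invented for illustration and contains a stray </span> with no matching opening tag; with the html.parser tree builder in a release that includes this fix, the stray closing tag should leave the surrounding structure intact.

```python
from bs4 import BeautifulSoup

# Illustrative markup: </span> closes a tag that was never opened.
markup = "<div><p>text</span></p><b>after</b></div>"
soup = BeautifulSoup(markup, "html.parser")

# Both <p> and <b> should still be children of the <div>.
p_parent = soup.p.parent.name
b_parent = soup.b.parent.name
stray = soup.find("span")

print(soup)
print(p_parent, b_parent, stray)
```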
2020-05-30  Remove explicit reference to the module name within the module, replacing it with __name__.  (Leonard Richardson)
2020-05-17  Prep for release.  (Leonard Richardson)
2020-05-17  Switch entirely to Python 3-style print statements, even in Python 2.  (Leonard Richardson)
2020-05-17  Documented some recently added customization features.  (Leonard Richardson)
2020-05-17  Added docstring for BeautifulSoup.new_tag.  (Leonard Richardson)
2020-05-17  Added a keyword argument on_duplicate_attribute to the BeautifulSoupHTMLParser constructor (used by the html.parser tree builder) which lets you customize the handling of markup that contains the same attribute more than once, as in: <a href="url1" href="url2"> [bug=1878209]  (Leonard Richardson)
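A short sketch of how the argument can be used with the markup from the entry above. The keyword is passed through the BeautifulSoup constructor to the html.parser tree builder; the `accumulate` helper is a hypothetical name for a user-supplied callable that collects every value into a list.

```python
from bs4 import BeautifulSoup

markup = '<a href="url1" href="url2">'

# Default handling: the last value seen replaces earlier ones.
default_href = BeautifulSoup(markup, "html.parser").a["href"]

# 'ignore' keeps the first value and discards later duplicates.
ignored = BeautifulSoup(markup, "html.parser", on_duplicate_attribute="ignore")
first_href = ignored.a["href"]


def accumulate(attributes_so_far, key, value):
    """Hypothetical handler: gather all duplicate values into a list."""
    if not isinstance(attributes_so_far[key], list):
        attributes_so_far[key] = [attributes_so_far[key]]
    attributes_so_far[key].append(value)


collected = BeautifulSoup(markup, "html.parser", on_duplicate_attribute=accumulate)
all_hrefs = collected.a["href"]

print(default_href, first_href, all_hrefs)
```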
2020-04-25  Try to clarify the docs further that get_text now returns human-readable text.  (Leonard Richardson)
2020-04-24  If you encode a document with a Python-specific encoding like 'unicode_escape', that encoding is no longer mentioned in the final XML or HTML document. Instead, encoding information is omitted or left blank. [bug=1874955]  (Leonard Richardson)
2020-04-21  Fixed typo.  (Leonard Richardson)
2020-04-21  Added two distinct UserWarning subclasses for warnings issued from the BeautifulSoup constructor which a caller may want to filter out. [bug=1873787]  (Leonard Richardson)
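The entry above does not name the two classes; in released versions they are importable from bs4 as GuessedAtParserWarning and MarkupResemblesLocatorWarning. The sketch below filters the first one using the standard warnings module. The markup is illustrative, and the constructor is deliberately called without naming a parser so that the warning is triggered.

```python
import warnings

from bs4 import BeautifulSoup, GuessedAtParserWarning

# Without a filter, omitting the parser name triggers the warning.
with warnings.catch_warnings(record=True) as caught_default:
    warnings.simplefilter("always")
    BeautifulSoup("<p>hi</p>")

# With a category filter in place, the same call stays quiet.
with warnings.catch_warnings(record=True) as caught_filtered:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=GuessedAtParserWarning)
    BeautifulSoup("<p>hi</p>")

print(len(caught_default), len(caught_filtered))
```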
2020-04-12  Fixed test failures when run against soupsieve 2.0. Patch by Tomáš Chvátal. [bug=1872279]  (Leonard Richardson)
2020-04-07  Add Script, Stylesheet, and TemplateString to the 'bs4' namespace.  (Leonard Richardson)
2020-04-07  Added a notice about the new behavior of .text to the documentation.  (Leonard Richardson)
2020-04-05  Set up a different soupsieve dependency for Python 2.  (Leonard Richardson)
2020-04-05  Embedded CSS and JavaScript are now stored in distinct Stylesheet and Script objects, which are ignored by methods like get_text(). This feature is not supported by the html5lib treebuilder. [bug=1868861]  (Leonard Richardson)
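A small sketch of the string classes involved, using the html.parser tree builder (the entry notes that html5lib does not support the feature). The markup is invented for the example; Script and Stylesheet are imported from bs4.element.

```python
from bs4 import BeautifulSoup
from bs4.element import Script, Stylesheet

# Illustrative markup with embedded CSS, embedded JavaScript, and body text.
markup = "<style>p { color: red; }</style><script>run();</script><p>Visible</p>"
soup = BeautifulSoup(markup, "html.parser")

# The embedded CSS and JavaScript are held in specialized string classes.
style_string = soup.style.string
script_string = soup.script.string

# get_text() on the whole document skips both of them.
page_text = soup.get_text()

print(type(style_string).__name__, type(script_string).__name__, repr(page_text))
```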
2020-04-04  Use an :rtype: reported to work in PyCharm.  (Leonard Richardson)
2020-04-04  select() always returns a Tag, so be more specific about its return type.  (Leonard Richardson)
2020-04-04  Added a Russian translation by 'authoress' to the repository.  (Leonard Richardson)
2020-04-04  Corrected error in Chinese translation, found by "One J".  (Leonard Richardson)
2020-03-10  Fixed a bug that happened when passing a Unicode filename containing non-ASCII characters as markup into Beautiful Soup, on a system that allows Unicode filenames. [bug=1866717]  (Leonard Richardson)
2020-03-09  Make find() methods return a union type of the two most common PageElements, rather than PageElement itself.  (Leonard Richardson)
2020-03-06  Added a paragraph about the fact that prettify() adds whitespace to a document.  (Leonard Richardson)
2020-03-05  Added a performance optimization to PageElement.extract(). Patch by Arthur Darcet.  (Leonard Richardson)
2020-01-22  Merging in request 377978  (Leonard Richardson)
2020-01-23  Fix a confusing typo in the description of formatter="html5".  (Colin Watson)