Age | Commit message | Author |
|
looks like XML but not XHTML. [bug=1939121]
|
|
commit to demonstrate that the renaming doesn't break anything. [bug=1947038]
|
|
(https://bugs.launchpad.net/lxml/+bug/1948551) that caused
problems when parsing a Unicode string beginning with BYTE ORDER MARK.
[bug=1947768]
|
|
html5lib parser. [bug=1948488]
|
|
to make it possible to treat ruby text specially in get_text() calls.
[bug=1941980]
|
|
and can be used to replace a single element with a sequence of elements.
Patch by Bill Chandos.
|
|
found in the HTML5 spec in much the same way that the html5lib
tree builder does. Note that the lxml tree builder still handles
named entities differently. [bug=1924908]
|
|
method, as well as the properties .strings and
.stripped_strings. These methods will either return the string
itself, or nothing, so the only reason to use this is when iterating
over a list of mixed Tag and NavigableString objects. [bug=1904309]
|
|
empty string as HTML boolean attributes. Previously (and in other
formatters), an attribute value must be set as None to be treated as
a boolean attribute. In a future release, I plan to also give this
behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424]
|
|
depending on the type of tag. The change is visible with HTML tags
like <script>, <style>, and <template>. Starting in 4.9.0, methods
like get_text() returned no results on such tags, because the
contents of those tags are not considered 'text' within the document
as a whole.
But a user who calls script.get_text() is working from a different
definition of 'text' than a user who calls div.get_text()--otherwise
there would be no need to call script.get_text() at all. In 4.10.0,
the contents of (e.g.) a <script> tag are considered 'text' during a
get_text() call on the tag itself, but not considered 'text' during
a get_text() call on the tag's parent.
Because of this change, calling get_text() on each child of a tag
may now return a different result than calling get_text() on the tag
itself. That's because different tags now have different
understandings of what counts as 'text'. [bug=1906226] [bug=1868861]
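The rule described in this entry can be illustrated with a small toy model. This is only a sketch of the idea, not Beautiful Soup's code: the Node class, its OPAQUE set, and the _strings helper are hypothetical names invented for the illustration.

```python
class Node:
    """Toy stand-in for a parsed tag; not Beautiful Soup's Tag class."""

    # Tags whose strings do not count as 'text' from an ancestor's point of view.
    OPAQUE = {"script", "style", "template"}

    def __init__(self, name, *children):
        self.name = name
        self.children = children  # each child is a Node or a plain str

    def get_text(self):
        # The node get_text() is called on decides what counts as 'text'.
        return "".join(self._strings(is_root=True))

    def _strings(self, is_root):
        # An opaque tag contributes its strings only when it is the tag
        # that get_text() was called on.
        if self.name in self.OPAQUE and not is_root:
            return
        for child in self.children:
            if isinstance(child, Node):
                yield from child._strings(is_root=False)
            else:
                yield child


script = Node("script", "var x = 1;")
div = Node("div", "Hello", script)

on_script = script.get_text()   # the script's own contents count as text here
on_parent = div.get_text()      # the script's contents are skipped here
per_child = "".join(
    c.get_text() if isinstance(c, Node) else c for c in div.children
)
```

In this model, per_child differs from on_parent, which mirrors the caveat above about calling get_text() on each child versus on the tag itself.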
|
|
single tag may contain strings with different containers, such as
the <template> tag, which may contain both TemplateString objects
and Comment objects. [bug=1913406]
|
|
EncodingDetector, based on the order of precedence defined in the
HTML5 spec, starting at:
https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding
Encodings in 'known_definite_encodings' are tried first, then
byte-order-mark sniffing is run, then encodings in 'user_encodings'
are tried. The old argument, 'override_encodings', is now a
deprecated alias for 'known_definite_encodings'.
This changes the default behavior of the html.parser and lxml tree
builders, in a way that may slightly improve encoding
detection but will probably have no effect. [bug=1889014]
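The order of precedence described in this entry can be sketched in a few lines. This is a standalone illustration using only the standard library, not the EncodingDetector implementation: sniff_bom and candidate_encodings are hypothetical helpers, and the sketch covers only the three steps named above, not whatever the real detector tries afterwards.

```python
import codecs

# Check the longer UTF-32 marks first: the UTF-32-LE mark begins with
# the same two bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading byte-order mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def candidate_encodings(data, known_definite_encodings=(), user_encodings=()):
    """Yield encodings to try: definite ones, then the BOM, then user guesses."""
    ordered = list(known_definite_encodings)
    bom_encoding = sniff_bom(data)
    if bom_encoding is not None:
        ordered.append(bom_encoding)
    ordered.extend(user_encodings)

    seen = set()
    for name in ordered:
        key = name.lower()
        if key not in seen:  # never offer the same encoding twice
            seen.add(key)
            yield name


data = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
order = list(candidate_encodings(data, ["windows-1252"], ["latin-1", "utf-8"]))
```

Here the definite encoding is offered first, the BOM-derived encoding second, and the user-supplied guesses last, with the repeated utf-8 dropped.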
|
|
tree construction by 2%. Patch by Morotti. [bug=1899358]
|
|
the name of a regular file) is passed as markup into the BeautifulSoup
constructor. [bug=1913628]
|
|
namespaced attribute is the empty string, as opposed to
None. [bug=1915583]
|
|
searching the parse tree. Patch by Morotti. [bug=1898212]
|
|
a Tag, rather than a list, into Tag.extend(). [bug=1885710]
|
|
(which are not implemented) to match PageElement.insert_before and
insert_after, quieting warnings in some IDEs. [bug=1897120]
|
|
PEP 508. Patch by Mike Nerone. [bug=1893696]
|
|
stack during tree building, when encountering a closing tag that had
no matching opening tag. [bug=1880420]
|
|
BeautifulSoupHTMLParser constructor (used by the html.parser tree
builder) which lets you customize the handling of markup that
contains the same attribute more than once, as in:
<a href="url1" href="url2"> [bug=1878209]
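To see why a policy is needed, note that the standard library's html.parser hands every attribute to handle_starttag as a list of (name, value) pairs, duplicates included. The sketch below is not Beautiful Soup's code: the class name, the parse_attrs helper, and the three policy names ("first", "last", "collect") are assumptions made up for the illustration.

```python
from html.parser import HTMLParser


class DuplicateAttributeDemo(HTMLParser):
    """Collects start-tag attributes under a chosen duplicate policy."""

    def __init__(self, policy="last"):
        super().__init__()
        if policy not in ("first", "last", "collect"):
            raise ValueError("unknown policy: %r" % (policy,))
        self.policy = policy
        self.tags = []

    def handle_starttag(self, tag, attrs):
        merged = {}
        for name, value in attrs:
            if name not in merged:
                merged[name] = value
            elif self.policy == "last":
                # A later value overwrites the earlier one.
                merged[name] = value
            elif self.policy == "collect":
                # Keep every value, in document order, as a list.
                if not isinstance(merged[name], list):
                    merged[name] = [merged[name]]
                merged[name].append(value)
            # policy == "first": keep the value already stored.
        self.tags.append((tag, merged))


def parse_attrs(markup, policy="last"):
    """Return the attribute dict of the first start tag in markup."""
    parser = DuplicateAttributeDemo(policy)
    parser.feed(markup)
    parser.close()
    return parser.tags[0][1]


markup = '<a href="url1" href="url2">'
last = parse_attrs(markup, "last")
first = parse_attrs(markup, "first")
collected = parse_attrs(markup, "collect")
```

Each policy gives a different answer for the same markup, which is the choice the entry above leaves to the caller.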
|
|
'unicode_escape', that encoding is no longer mentioned in the final
XML or HTML document. Instead, encoding information is omitted or
left blank. [bug=1874955]
|
|
BeautifulSoup constructor which a caller may want to filter out. [bug=1873787]
|
|
Chvátal. [bug=1872279]
|
|
Script tags, which are ignored by methods like get_text(). This
feature is not supported by the html5lib treebuilder. [bug=1868861]
|
|
non-ASCII characters as markup into Beautiful Soup, on a system that
allows Unicode filenames. [bug=1866717]
|
|
Darcet.
|
|
check whether you've already called decompose() on a Tag or
NavigableString.
|
|
decomposed. [bug=1857767]
|
|
not uppercase. [bug=1848401]
|
|
Watson. [bug=1847592]
|
|
on Python 3. [bug=1843545]
|
|
Unicode document. Raise an explanatory exception when the underlying parser
completely rejects the incoming markup. [bug=1838877]
|
|
lxml 4.4. Patch by Isaac Muse. [bug=1840141]
|
|
provide replacement classes to be instantiated for every tag ('tag_class')
or string ('string_class') encountered during parsing, rather than
using the default Tag and NavigableString objects.
|