The behavior of methods like .get_text() and .strings now differs depending on the type of tag. The change is visible with HTML tags like <script>, <style>, and <template>.

Starting in 4.9.0, methods like get_text() returned no results on such tags, because the contents of those tags are not considered 'text' within the document as a whole. But a user who calls script.get_text() is working from a different definition of 'text' than a user who calls div.get_text()--otherwise there would be no need to call script.get_text() at all.

In 4.10.0, the contents of (e.g.) a <script> tag are considered 'text' during a get_text() call on the tag itself, but not considered 'text' during a get_text() call on the tag's parent. Because of this change, calling get_text() on each child of a tag may now return a different result than calling get_text() on the tag itself. That's because different tags now have different understandings of what counts as 'text'. [bug=1906226] [bug=1868861]
author: Leonard Richardson <leonardr@segfault.org> 2021-02-13 16:43:34 -0500
committer: Leonard Richardson <leonardr@segfault.org> 2021-02-13 16:43:34 -0500
commit: c876fbf402f15d924b7c0d9a9be5ba80769444a3 (patch)
tree: d2589d7db86200d17cb05e949f7fe09a439e53b2 /bs4/tests/test_tree.py
parent: 185ec704743ffa0dfd95b7a29e2f5d38a25433b5 (diff)
1 files changed, 34 insertions, 0 deletions
diff --git a/bs4/tests/test_tree.py b/bs4/tests/test_tree.py
index 1bd1577..9267a8f 100644
--- a/bs4/tests/test_tree.py
+++ b/bs4/tests/test_tree.py
@@ -1440,6 +1440,40 @@ class TestElementObjects(SoupTest):
         soup = self.soup("foo<style>CSS</style><script>Javascript</script>bar")
         self.assertEqual(['foo', 'bar'], list(soup.strings))
 
+    def test_string_methods_inside_special_string_container_tags(self):
+        # Strings inside tags like <script> are generally ignored by
+        # methods like get_text, because they're not what humans
+        # consider 'text'. But if you call get_text on the <script>
+        # tag itself, those strings _are_ considered to be 'text',
+        # because there's nothing else you might be looking for.
+        
+        style = self.soup("<div>a<style>Some CSS</style></div>")
+        template = self.soup("<div>a<template><p>Templated <b>text</b>.</p><!--With a comment.--></template></div>")
+        script = self.soup("<div>a<script><!--a comment-->Some text</script></div>")
+        
+        self.assertEqual(style.div.get_text(), "a")
+        self.assertEqual(list(style.div.strings), ["a"])
+        self.assertEqual(style.div.style.get_text(), "Some CSS")
+        self.assertEqual(list(style.div.style.strings),
+                         ['Some CSS'])
+        
+        # The comment is not picked up here. That's because it was
+        # parsed into a Comment object, which is not considered
+        # interesting by template.strings.
+        self.assertEqual(template.div.get_text(), "a")
+        self.assertEqual(list(template.div.strings), ["a"])
+        self.assertEqual(template.div.template.get_text(), "Templated text.")
+        self.assertEqual(list(template.div.template.strings),
+                         ["Templated ", "text", "."])
+
+        # The comment is included here, because it didn't get parsed
+        # into a Comment object--it's part of the Script string.
+        self.assertEqual(script.div.get_text(), "a")
+        self.assertEqual(list(script.div.strings), ["a"])
+        self.assertEqual(script.div.script.get_text(),
+                         "<!--a comment-->Some text")
+        self.assertEqual(list(script.div.script.strings),
+                         ['<!--a comment-->Some text'])
 
 class TestCDAtaListAttributes(SoupTest):
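
The parent-versus-child difference described in the commit message can be seen outside the test suite with a short script. This is a minimal sketch, not part of the patch: it assumes Beautiful Soup 4.10.0 or later and Python's built-in html.parser, and the markup and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: Beautiful Soup 4.10.0 or later. The markup is illustrative only.
# html.parser ships with Python, so no third-party parser is required.
soup = BeautifulSoup(
    "<div>a<script>var x = 1;</script><p>b</p></div>", "html.parser"
)
div = soup.div

# Called on the parent, the contents of <script> do not count as 'text'.
parent_text = div.get_text()

# Called on the <script> tag itself, its contents do count as 'text'.
script_text = div.script.get_text()

# Calling get_text() on each direct child tag gives results that no longer
# add up to the parent's result, because <script> answers for itself here.
child_texts = [child.get_text() for child in div.find_all(recursive=False)]

print(repr(parent_text))   # 'ab'
print(repr(script_text))   # 'var x = 1;'
print(child_texts)         # ['var x = 1;', 'b']
```

On 4.9.x the second call would be expected to return an empty string instead, per the commit message above.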