Moved the formatter to its own class and updated its documentation.

author: Leonard Richardson <leonardr@segfault.org> 2019-07-15 15:22:32 -0400
committer: Leonard Richardson <leonardr@segfault.org> 2019-07-15 15:22:32 -0400
commit: 07d84a4e9af51863459a1e3d988f2806835fc110 (patch)
tree: aeb1f3dfd752d67b18b9e24f46590b23120e78cb /doc/source
parent: 81e86c866490ba9a5ce7109023b657d09d39dae1 (diff)
1 files changed, 28 insertions, 30 deletions
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 8376549..0c09964 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -2264,16 +2264,17 @@ to Beautiful Soup generating invalid HTML/XML, as in these examples::
  print(link_soup.a.encode(formatter=None))
  # <a href="http://example.com/?foo=val1&bar=val2">A link</a>
 
-Finally, if you pass in a function for ``formatter``, Beautiful Soup
-will call that function once for every string and attribute value in
-the document. You can do whatever you want in this function. Here's a
-formatter that converts strings to uppercase and does absolutely
-nothing else::
+If you need more sophisticated control over your output, you can
+use Beautiful Soup's ``Formatter`` class. Here's a formatter that
+converts strings to uppercase, whether they occur in a text node or in an
+attribute value::
 
+ from bs4.formatter import HTMLFormatter
  def uppercase(str):
      return str.upper()
+ formatter = HTMLFormatter(uppercase)
 
- print(soup.prettify(formatter=uppercase))
+ print(soup.prettify(formatter=formatter))
  # <html>
  #  <body>
  #   <p>
@@ -2282,34 +2283,31 @@ nothing else::
  #  </body>
  # </html>
 
- print(link_soup.a.prettify(formatter=uppercase))
+ print(link_soup.a.prettify(formatter=formatter))
  # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
  #  A LINK
  # </a>
 
-If you're writing your own function, you should know about the
-``EntitySubstitution`` class in the ``bs4.dammit`` module. This class
-implements Beautiful Soup's standard formatters as class methods: the
-"html" formatter is ``EntitySubstitution.substitute_html``, and the
-"minimal" formatter is ``EntitySubstitution.substitute_xml``. You can
-use these functions to simulate ``formatter=html`` or
-``formatter==minimal``, but then do something extra.
-
-Here's an example that replaces Unicode characters with HTML entities
-whenever possible, but `also` converts all strings to uppercase::
-
- from bs4.dammit import EntitySubstitution
- def uppercase_and_substitute_html_entities(str):
-     return EntitySubstitution.substitute_html(str.upper())
-
- print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
- # <html>
- #  <body>
- #   <p>
- #    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt;
- #   </p>
- #  </body>
- # </html>
+Subclassing ``HTMLFormatter`` or ``XMLFormatter`` will give you even
+more control over the output. For example, Beautiful Soup sorts the
+attributes in every tag by default::
+
+ attr_soup = BeautifulSoup('<p z="1" m="2" a="3"></p>')
+ print(attr_soup.p.encode())
+ # <p a="3" m="2" z="1"></p>
+
+To turn this off, you can override the ``Formatter.attributes()``
+method, which controls which attributes are output and in what
+order. This implementation also filters out one of the attributes::
+
+ class UnsortedAttributes(HTMLFormatter):
+     def attributes(self, tag):
+         for k, v in tag.attrs.items():
+             if k == 'm':
+                 continue
+             yield k, v
+ print(attr_soup.p.encode(formatter=UnsortedAttributes()))
+ # <p z="1" a="3"></p>
 
 One last caveat: if you create a ``CData`` object, the text inside
 that object is always presented `exactly as it appears, with no
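
For readers who want to see the pattern this patch documents without installing Beautiful Soup, here is a minimal standalone sketch. It is not Beautiful Soup's implementation: ``MiniFormatter``, ``UnsortedMiniFormatter`` and ``render()`` are hypothetical names invented for illustration. The sketch only models the two ideas described in the new documentation: a formatter object wraps a function that is applied to every string and attribute value, and an overridable ``attributes()`` method decides which attributes are output and in what order.

```python
class MiniFormatter:
    """Toy stand-in for a formatter object (hypothetical, not bs4's class)."""

    def __init__(self, substitute=None):
        # Callable applied to every text string and attribute value.
        # With no callable given, strings pass through unchanged.
        self.substitute = substitute or (lambda s: s)

    def attributes(self, attrs):
        # Default modelled on the documented behaviour: sort by name.
        return sorted(attrs.items())

    def render(self, name, attrs, text):
        # Build a single tag, consulting attributes() for order and content.
        parts = [name]
        for key, value in self.attributes(attrs):
            parts.append('%s="%s"' % (key, self.substitute(value)))
        return "<%s>%s</%s>" % (" ".join(parts), self.substitute(text), name)


class UnsortedMiniFormatter(MiniFormatter):
    def attributes(self, attrs):
        # Keep the original (insertion) order and drop the attribute 'm'.
        for key, value in attrs.items():
            if key == "m":
                continue
            yield key, value


attrs = {"z": "one", "m": "two", "a": "three"}

default_output = MiniFormatter().render("p", attrs, "hi")
upper_output = MiniFormatter(str.upper).render("p", attrs, "hi")
unsorted_output = UnsortedMiniFormatter().render("p", attrs, "hi")

print(default_output)   # <p a="three" m="two" z="one">hi</p>
print(upper_output)     # <p a="THREE" m="TWO" z="ONE">HI</p>
print(unsorted_output)  # <p z="one" a="three">hi</p>
```

The point of the design is that string substitution and attribute selection live on one object, so a subclass can change either behaviour without touching the code that walks the document.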