-rw-r--r-- doc.html/index.jp.html 2822
-rw-r--r-- doc.html/index.kr.html 2476
-rw-r--r-- doc.ptbr/source/6.1.jpg bin 0 -> 22619 bytes
-rw-r--r-- doc.ptbr/source/conf.py 256
-rw-r--r-- doc.ptbr/source/index.rst (renamed from doc/source/index.ptbr.rst) 0
-rw-r--r-- doc.zh/source/6.1.jpg bin 0 -> 22619 bytes
-rw-r--r-- doc.zh/source/index.rst 2739
-rw-r--r-- doc.zh/source/index.zh.html 2398
-rw-r--r-- doc/source/README 23
-rw-r--r-- doc/source/index.rst 21
10 files changed, 8337 insertions, 2398 deletions
diff --git a/doc.html/index.jp.html b/doc.html/index.jp.html
new file mode 100644
index 0000000..7f5d8e6
--- /dev/null
+++ b/doc.html/index.jp.html
@@ -0,0 +1,2822 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+
+
+<html xmlns="http://www.w3.org/1999/xhtml">
+ <head>
+ <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
+
+ <title>kondou.com - Beautiful Soup 4.2.0 Doc. Japanese translation (last updated 2013-11-19)</title>
+
+ <link rel="stylesheet" href="_static/default.css" type="text/css" />
+ <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
+
+ <script type="text/javascript">
+ var DOCUMENTATION_OPTIONS = {
+ URL_ROOT: './',
+ VERSION: '4.2.0',
+ COLLAPSE_INDEX: false,
+ FILE_SUFFIX: '.html',
+ HAS_SOURCE: true
+ };
+ </script>
+ <script type="text/javascript" src="_static/jquery.js"></script>
+ <script type="text/javascript" src="_static/underscore.js"></script>
+ <script type="text/javascript" src="_static/doctools.js"></script>
+ <link rel="top" title="Beautiful Soup 4.2.0 Doc. Japanese translation (last updated 2013-11-19)" href="#" />
+ </head>
+ <body>
+ <div class="related">
+ <h3>Navigation</h3>
+ <ul>
+ <li class="right" style="margin-right: 10px">
+ <a href="genindex.html" title="General Index"
+ accesskey="I">index</a></li>
+ <li><a href="#">Beautiful Soup 4.2.0 Doc. Japanese translation (last updated 2013-11-19)</a> &raquo;</li>
+ </ul>
+ </div>
+
+ <div class="document">
+ <div class="documentwrapper">
+ <div class="bodywrapper">
+ <div class="body">
+
+ <div class="section" id="beautiful-soup">
+<h1>Beautiful Soup<a class="headerlink" href="#beautiful-soup" title="Permalink to this headline">¶</a></h1>
+<img alt="&quot;The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself.&quot;" class="align-right" src="6.1.jpg" />
+<p><a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to let you navigate, search, and modify the parse tree.
+This can save programmers a great deal of working time.</p>
+<div class="section" id="id2">
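+<h2>(Translator&#8217;s note) You can&#8217;t eat soap<a class="headerlink" href="#id2" title="Permalink to this headline">¶</a></h2>
+<p>This document is a Japanese translation of the <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup 4.2.0 Documentation</a>. I translated it for a junior colleague who is not good at English, has a slightly dirty mind, and reads &#8220;Beautiful Soup&#8221; as &#8220;Beautiful Soap&#8221;.</p>
+<p>I began translating this document on October 29, 2013. As of November 1 the translation is not yet complete, but the main parts used for scraping are done, so I am publishing it as it stands. I plan to keep translating at a relaxed pace, aiming to finish by the end of the year, and to improve the quality along the way. For now, please note that everything from <a class="reference internal" href="#id38">Modifying the parse tree</a> onward is a rough translation and contains many odd expressions.</p>
+<p>If you find mistranslations or passages that are hard to follow, or if you have any comments, please contact 近藤茂徳 (<img src="sgm.png"/>). This is my first time doing a translation like this, so corrections are very welcome. Thank you.</p>
+<p>As of October 2013, Japanese web pages about Beautiful Soup mix information on Beautiful Soup 3 and Beautiful Soup 4 (hereafter BS3 and BS4). In particular, if you Google &#8220;Beautiful Soup&#8221; restricted to Japanese pages, 9 of the first 10 results describe BS3, so beginners tend to start with BS3 and get confused. Please be careful.</p>
+<p><strong>What beginners should know to avoid confusion</strong></p>
+<ul class="simple">
+<li>Development of BS3 ended in May 2012, and BS4 is now the recommended version</li>
+<li>BS3 does not support Python 3</li>
+<li>However, most BS3 scripts work with BS4 after changing only the import statement</li>
+<li>For that reason, information about BS3 is still useful for solving problems</li>
+<li>For details, read <a class="reference internal" href="#beautiful-soup-3">Beautiful Soup 3</a></li>
+<li>Reading <a class="reference internal" href="#id7">Quick Start</a> and <a class="reference internal" href="#find-all">find_all()</a> in this document should cover most needs</li>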
+</ul>
+</div>
+<div class="section" id="id3">
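+<h2>About this document<a class="headerlink" href="#id3" title="Permalink to this headline">¶</a></h2>
+<p>This document illustrates the main features of Beautiful Soup 4 (translator&#8217;s note: hereafter BS4) with examples. It shows what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it behaves unexpectedly.</p>
+<p>The examples in this document work the same way in Python 2.7 and Python 3.2.</p>
+<p>You might be looking for the <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">documentation for Beautiful Soup 3 (translator&#8217;s note: hereafter BS3)</a>. If so, you should know that BS3 is no longer being developed and that BS4 is recommended for all projects. To learn about the differences between BS3 and BS4, see <a class="reference internal" href="#bs4">Porting code to BS4</a>.</p>
+<p>This document has also been translated into other languages by its users.</p>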
+<ul class="simple">
+<li>이 문서는 한국어 번역도 가능합니다.(<a class="reference external" href="http://coreapython.hosting.paran.com/etc/beautifulsoup4.html">외부 링크</a>)</li>
+</ul>
+</div>
+<div class="section" id="id5">
+<h2>Getting help<a class="headerlink" href="#id5" title="Permalink to this headline">¶</a></h2>
+<p>If you have questions about Beautiful Soup, or run into problems, <a class="reference external" href="https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup">send mail to the discussion group</a>. If your problem involves parsing an HTML document, be sure to include <a class="reference internal" href="#diagnose"><em>the output of the diagnose() function</em></a> for that document.</p>
+</div>
+</div>
+<div class="section" id="id7">
+<h1>Quick Start<a class="headerlink" href="#id7" title="Permalink to this headline">¶</a></h1>
+<p>The following HTML document is used as an example many times below. It is an excerpt from <strong>Alice in Wonderland</strong>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">html_doc</span> <span class="o">=</span> <span class="s">&quot;&quot;&quot;</span>
+<span class="s">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;</span>
+<span class="s">&lt;body&gt;</span>
+<span class="s">&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;</span>
+
+<span class="s">&lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were</span>
+<span class="s">&lt;a href=&quot;http://example.com/elsie&quot; class=&quot;sister&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="s">&lt;a href=&quot;http://example.com/lacie&quot; class=&quot;sister&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and</span>
+<span class="s">&lt;a href=&quot;http://example.com/tillie&quot; class=&quot;sister&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;</span>
+<span class="s">and they lived at the bottom of a well.&lt;/p&gt;</span>
+
+<span class="s">&lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
+<span class="s">&quot;&quot;&quot;</span>
+</pre></div>
+</div>
+<p>Running the &#8220;three sisters&#8221; document through Beautiful Soup gives us a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object, which represents the document as a nested data structure:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;head&gt;</span>
+<span class="c"># &lt;title&gt;</span>
+<span class="c"># The Dormouse&#39;s story</span>
+<span class="c"># &lt;/title&gt;</span>
+<span class="c"># &lt;/head&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p class=&quot;title&quot;&gt;</span>
+<span class="c"># &lt;b&gt;</span>
+<span class="c"># The Dormouse&#39;s story</span>
+<span class="c"># &lt;/b&gt;</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;p class=&quot;story&quot;&gt;</span>
+<span class="c"># Once upon a time there were three little sisters; and their names were</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;</span>
+<span class="c"># Elsie</span>
+<span class="c"># &lt;/a&gt;</span>
+<span class="c"># ,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;</span>
+<span class="c"># Lacie</span>
+<span class="c"># &lt;/a&gt;</span>
+<span class="c"># and</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;</span>
+<span class="c"># Tillie</span>
+<span class="c"># &lt;/a&gt;</span>
+<span class="c"># ; and they lived at the bottom of a well.</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;p class=&quot;story&quot;&gt;</span>
+<span class="c"># ...</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>Here are some simple ways to navigate that data structure:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">title</span>
+<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">name</span>
+<span class="c"># u&#39;title&#39;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">string</span>
+<span class="c"># u&#39;The Dormouse&#39;s story&#39;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span>
+<span class="c"># u&#39;head&#39;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">p</span>
+<span class="c"># &lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
+<span class="c"># [u&#39;title&#39;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&#39;a&#39;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">&quot;link3&quot;</span><span class="p">)</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;</span>
+</pre></div>
+</div>
+<p>One common task is extracting all the URLs found within a page&#8217;s &lt;a&gt; tags:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&#39;a&#39;</span><span class="p">):</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">link</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">&#39;href&#39;</span><span class="p">))</span>
+<span class="c"># http://example.com/elsie</span>
+<span class="c"># http://example.com/lacie</span>
+<span class="c"># http://example.com/tillie</span>
+</pre></div>
+</div>
+<p>Another common task is stripping the tags from a page and extracting all of its text:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">())</span>
+<span class="c"># The Dormouse&#39;s story</span>
+<span class="c">#</span>
+<span class="c"># The Dormouse&#39;s story</span>
+<span class="c">#</span>
+<span class="c"># Once upon a time there were three little sisters; and their names were</span>
+<span class="c"># Elsie,</span>
+<span class="c"># Lacie and</span>
+<span class="c"># Tillie;</span>
+<span class="c"># and they lived at the bottom of a well.</span>
+<span class="c">#</span>
+<span class="c"># ...</span>
+</pre></div>
+</div>
+<p>Does this look like what you need? If so, read on.</p>
+</div>
+<div class="section" id="id8">
+<h1>Installation<a class="headerlink" href="#id8" title="Permalink to this headline">¶</a></h1>
+<p>If you are using a recent version of Debian or Ubuntu, you can install Beautiful Soup with the system package manager:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">python-bs4</span></tt></p>
+<p>Beautiful Soup 4 is published through PyPI, so if you cannot install it with the system package manager, you can install it with <tt class="docutils literal"><span class="pre">easy_install</span></tt> or <tt class="docutils literal"><span class="pre">pip</span></tt>.</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">beautifulsoup4</span></tt></p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">beautifulsoup4</span></tt></p>
+<p>(The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> package is probably <strong>not</strong> what you want. That is the previous major release, <a class="reference internal" href="#beautiful-soup-3">Beautiful Soup 3</a>. Lots of software uses BS3, so it is still available. But if you are writing new code, you should install <tt class="docutils literal"><span class="pre">beautifulsoup4</span></tt>.)</p>
+<p>If you do not have <tt class="docutils literal"><span class="pre">easy_install</span></tt> or <tt class="docutils literal"><span class="pre">pip</span></tt> installed, you can <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/download/4.x/">download the Beautiful Soup 4 source tarball</a> and install it with <tt class="docutils literal"><span class="pre">setup.py</span></tt>.</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">python</span> <span class="pre">setup.py</span> <span class="pre">install</span></tt></p>
+<p>If all else fails, you can package the entire library with your application; the Beautiful Soup license allows this. Download the .tar.gz archive and copy its <tt class="docutils literal"><span class="pre">bs4</span></tt> directory into your application&#8217;s source tree. You can then use Beautiful Soup without installing it at all.</p>
+<p>I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions.</p>
+<div class="section" id="id9">
+<h2>Problems after installation<a class="headerlink" href="#id9" title="Permalink to this headline">¶</a></h2>
+<p>Beautiful Soup is packaged as Python 2 code.
+When you install it for use with Python 3, it is automatically converted to Python 3 code.
+If you do not install the package, the code is not converted.
+There have also been reports of the wrong version being installed on Windows machines.</p>
+<p>If you get the <tt class="docutils literal"><span class="pre">ImportError</span></tt> &#8220;No module named HTMLParser&#8221;, you are running the Python 2 version of the code under Python 3.</p>
+<p>If you get the <tt class="docutils literal"><span class="pre">ImportError</span></tt> &#8220;No module named html.parser&#8221;, you are running the Python 3 version of the code under Python 2.</p>
+<p>In both cases, the best fix is to completely remove the Beautiful Soup installation (including any directory created when you unpacked the tarball)
+and install it again.</p>
+<p>If you get the <tt class="docutils literal"><span class="pre">SyntaxError</span></tt> &#8220;Invalid syntax&#8221; on the line <tt class="docutils literal"><span class="pre">ROOT_TAG_NAME</span> <span class="pre">=</span> <span class="pre">u'[document]'</span></tt>,
+you need to convert the Python 2 code of Beautiful Soup to Python 3.</p>
+<p>You can do this either by installing the package:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">python3</span> <span class="pre">setup.py</span> <span class="pre">install</span></tt></p>
+<p>or by manually running the <tt class="docutils literal"><span class="pre">2to3</span></tt> conversion script on the <tt class="docutils literal"><span class="pre">bs4</span></tt> directory:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">2to3-3.2</span> <span class="pre">-w</span> <span class="pre">bs4</span></tt></p>
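+<p>To tell which of the two <tt class="docutils literal"><span class="pre">ImportError</span></tt> cases above applies, it can help to check which standard-library parser module the running interpreter expects. The following is a minimal sketch and not part of Beautiful Soup; the helper names <tt class="docutils literal"><span class="pre">stdlib_parser_module</span></tt> and <tt class="docutils literal"><span class="pre">can_import</span></tt> are hypothetical.</p>

```python
import sys


def stdlib_parser_module():
    """Name of the standard-library HTML parser module for this interpreter.

    Hypothetical helper for illustration; Beautiful Soup does not provide it.
    """
    # Python 2 ships the parser as "HTMLParser"; Python 3 moved it to "html.parser".
    return "html.parser" if sys.version_info[0] >= 3 else "HTMLParser"


def can_import(name):
    """Return True if the named module can be imported by this interpreter."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


expected = stdlib_parser_module()
print("Python %d expects %s; importable: %s"
      % (sys.version_info[0], expected, can_import(expected)))
```

+<p>If the module named in the error message differs from the one printed here, the installed copy of Beautiful Soup was probably built for the other major version of Python.</p>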
+</div>
+<div class="section" id="parser-installation">
+<span id="id10"></span><h2>Installing a parser<a class="headerlink" href="#parser-installation" title="Permalink to this headline">¶</a></h2>
+<p>Beautiful Soup supports the HTML parser included in Python&#8217;s standard library, and it also supports a number of third-party Python parsers. One is the <a class="reference external" href="http://lxml.de/">lxml parser</a>. Depending on your setup, you might install lxml with one of these commands:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">python-lxml</span></tt></p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">lxml</span></tt></p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">lxml</span></tt></p>
+<p>Another alternative is the pure-Python <a class="reference external" href="http://code.google.com/p/html5lib/">html5lib parser</a>, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">python-html5lib</span></tt></p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">html5lib</span></tt></p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">html5lib</span></tt></p>
+<p>This table summarizes the advantages and disadvantages of each parser library:</p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="17%" />
+<col width="33%" />
+<col width="25%" />
+<col width="25%" />
+</colgroup>
+<tbody valign="top">
+<tr class="row-odd"><td>Parser</td>
+<td>Typical usage</td>
+<td>Advantages</td>
+<td>Disadvantages</td>
+</tr>
+<tr class="row-even"><td>Python&#8217;s html.parser</td>
+<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">&quot;html.parser&quot;)</span></tt></td>
+<td><ul class="first last simple">
+<li>Part of the standard library</li>
+<li>Decent speed</li>
+<li>Lenient (as of Python 2.7.3 and 3.2.2)</li>
+</ul>
+</td>
+<td><ul class="first last simple">
+<li>Not very lenient (before Python 2.7.3 or 3.2.2)</li>
+</ul>
+</td>
+</tr>
+<tr class="row-odd"><td>lxml&#8217;s HTML parser</td>
+<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">&quot;lxml&quot;)</span></tt></td>
+<td><ul class="first last simple">
+<li>Very fast</li>
+<li>Lenient</li>
+</ul>
+</td>
+<td><ul class="first last simple">
+<li>External C dependency</li>
+</ul>
+</td>
+</tr>
+<tr class="row-even"><td>lxml&#8217;s XML parser</td>
+<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">[&quot;lxml&quot;,</span> <span class="pre">&quot;xml&quot;])</span></tt>
+<tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">&quot;xml&quot;)</span></tt></td>
+<td><ul class="first last simple">
+<li>Very fast</li>
+<li>The only supported XML parser</li>
+</ul>
+</td>
+<td><ul class="first last simple">
+<li>External C dependency</li>
+</ul>
+</td>
+</tr>
+<tr class="row-odd"><td>html5lib</td>
+<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">&quot;html5lib&quot;)</span></tt></td>
+<td><ul class="first last simple">
+<li>Extremely lenient</li>
+<li>Parses pages the same way a web browser does</li>
+<li>Creates valid HTML5</li>
+</ul>
+</td>
+<td><ul class="first last simple">
+<li>Very slow</li>
+<li>External Python dependency</li>
+</ul>
+</td>
+</tr>
+</tbody>
+</table>
+<p>If you can, I recommend installing and using lxml for speed.
+If you are using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it is <strong>essential</strong> that you install lxml or html5lib,
+because Python&#8217;s built-in HTML parser does not work very well in those older versions.</p>
+<p>Note that if a document is invalid, different parsers will generate different parse trees for it.
+See <a class="reference internal" href="#id47">Differences between parsers</a> for details.</p>
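+<p>One way to act on the recommendation above is to choose a parser name based on what is installed. The following is a minimal sketch, assuming Python 3; <tt class="docutils literal"><span class="pre">pick_parser</span></tt> is a hypothetical helper, not part of Beautiful Soup, and the preference order (lxml, then html5lib, then the standard library) is only an assumption drawn from the table.</p>

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first parser name from `preferred` whose module is installed.

    Hypothetical helper: falls back to the standard library's "html.parser"
    when none of the preferred third-party parsers can be found.
    """
    for name in preferred:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


print(pick_parser())
```

+<p>The returned name can be passed as the second argument of the constructor, as in <tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">pick_parser())</span></tt>. Because different parsers can build different trees from invalid markup, fixing the parser explicitly is usually safer when results must be reproducible.</p>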
+</div>
+</div>
+<div class="section" id="id11">
+<h1>Making the soup<a class="headerlink" href="#id11" title="Permalink to this headline">¶</a></h1>
+<p>To parse a document,
+pass it into the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor.
+You can pass in a string or an open filehandle:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
+
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="s">&quot;index.html&quot;</span><span class="p">))</span>
+
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;html&gt;data&lt;/html&gt;&quot;</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:</p>
+<div class="highlight-python"><pre>BeautifulSoup("Sacr&amp;eacute; bleu!")
+&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;Sacré bleu!&lt;/body&gt;&lt;/html&gt;</pre>
+</div>
+<p>Beautiful Soup then parses the document using the best available parser.
+It uses an HTML parser unless you specifically tell it to use an XML parser. (See <a class="reference internal" href="#xml">Parsing XML</a>.)</p>
+</div>
+<div class="section" id="id12">
+<h1>Four kinds of objects<a class="headerlink" href="#id12" title="Permalink to this headline">¶</a></h1>
+<p>Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
+But you only ever have to deal with <strong>four kinds of objects</strong>:
+<tt class="docutils literal"><span class="pre">Tag</span></tt>, <tt class="docutils literal"><span class="pre">NavigableString</span></tt>, <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt>, and <tt class="docutils literal"><span class="pre">Comment</span></tt>.</p>
+<div class="section" id="tag-obj">
+<span id="tag"></span><h2>Tag obj.<a class="headerlink" href="#tag-obj" title="Permalink to this headline">¶</a></h2>
+<p>A <tt class="docutils literal"><span class="pre">Tag</span></tt> object corresponds to an XML or HTML tag in the original document:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;b class=&quot;boldest&quot;&gt;Extremely bold&lt;/b&gt;&#39;</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
+<span class="nb">type</span><span class="p">(</span><span class="n">tag</span><span class="p">)</span>
+<span class="c"># &lt;class &#39;bs4.element.Tag&#39;&gt;</span>
+</pre></div>
+</div>
+<p><tt class="docutils literal"><span class="pre">Tag</span></tt> objects have many attributes and methods. Most of them are covered in <a class="reference internal" href="#id16">Navigating the parse tree</a> and <a class="reference internal" href="#id25">Searching the parse tree</a>. This section covers the most important features of a <tt class="docutils literal"><span class="pre">Tag</span></tt> object: its name and its attributes.</p>
+<div class="section" id="id13">
+<h3>Name<a class="headerlink" href="#id13" title="Permalink to this headline">¶</a></h3>
+<p>Every tag has a name, accessible as <tt class="docutils literal"><span class="pre">.name</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">name</span>
+<span class="c"># u&#39;b&#39;</span>
+</pre></div>
+</div>
+<p>If you change a tag&#8217;s name, the change is reflected in any markup generated by Beautiful Soup:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">&quot;blockquote&quot;</span>
+<span class="n">tag</span>
+<span class="c"># &lt;blockquote class=&quot;boldest&quot;&gt;Extremely bold&lt;/blockquote&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="id14">
+<h3>Attributes<a class="headerlink" href="#id14" title="Permalink to this headline">¶</a></h3>
+<p>A tag may have any number of attributes.
+The tag &lt;b class=&quot;boldest&quot;&gt; has an attribute &#8220;class&#8221; whose value is &#8220;boldest&#8221;.
+You can access a tag&#8217;s attributes by treating the <tt class="docutils literal"><span class="pre">Tag</span></tt> object like a dictionary:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
+<span class="c"># [u&#39;boldest&#39;]</span>
+</pre></div>
+</div>
+<p>You can access that dictionary directly as <tt class="docutils literal"><span class="pre">.attrs</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">attrs</span>
+<span class="c"># {u&#39;class&#39;: [u&#39;boldest&#39;]}</span>
+</pre></div>
+</div>
+<p>You can also add, remove, and modify a tag&#8217;s attributes. Again, this is done by treating the <tt class="docutils literal"><span class="pre">Tag</span></tt> object like a dictionary:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#39;verybold&#39;</span>
+<span class="n">tag</span><span class="p">[</span><span class="s">&#39;id&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
+<span class="n">tag</span>
+<span class="c"># &lt;blockquote class=&quot;verybold&quot; id=&quot;1&quot;&gt;Extremely bold&lt;/blockquote&gt;</span>
+
+<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
+<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">&#39;id&#39;</span><span class="p">]</span>
+<span class="n">tag</span>
+<span class="c"># &lt;blockquote&gt;Extremely bold&lt;/blockquote&gt;</span>
+
+<span class="n">tag</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
+<span class="c"># KeyError: &#39;class&#39;</span>
+<span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">&#39;class&#39;</span><span class="p">))</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
+<span class="target" id="multivalue"></span><div class="section" id="id15">
+<h4>Multi-valued attributes<a class="headerlink" href="#id15" title="Permalink to this headline">¶</a></h4>
+<p>HTML 4 defines a few attributes that can have multiple values.
+HTML 5 removes a couple of them, but defines a few more.
+The most common multi-valued attribute is <tt class="docutils literal"><span class="pre">class</span></tt> (that is, a tag can have more than one CSS class).
+Others include <tt class="docutils literal"><span class="pre">rel</span></tt>, <tt class="docutils literal"><span class="pre">rev</span></tt>, <tt class="docutils literal"><span class="pre">accept-charset</span></tt>, <tt class="docutils literal"><span class="pre">headers</span></tt>, and <tt class="docutils literal"><span class="pre">accesskey</span></tt>.
+Beautiful Soup presents the value or values of a multi-valued attribute as a list:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">css_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;p class=&quot;body strikeout&quot;&gt;&lt;/p&gt;&#39;</span><span class="p">)</span>
+<span class="n">css_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
+<span class="c"># [&quot;body&quot;, &quot;strikeout&quot;]</span>
+
+<span class="n">css_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;p class=&quot;body&quot;&gt;&lt;/p&gt;&#39;</span><span class="p">)</span>
+<span class="n">css_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
+<span class="c"># [&quot;body&quot;]</span>
+</pre></div>
+</div>
+<p>If an attribute looks like it has more than one value, but it is not defined as multi-valued by the HTML standard, Beautiful Soup treats the attribute as a single value:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">id_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;p id=&quot;my id&quot;&gt;&lt;/p&gt;&#39;</span><span class="p">)</span>
+<span class="n">id_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">&#39;id&#39;</span><span class="p">]</span>
+<span class="c"># &#39;my id&#39;</span>
+</pre></div>
+</div>
+<p>When you turn a tag back into a string, the multiple values of such an attribute are joined into one:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">rel_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;p&gt;Back to the &lt;a rel=&quot;index&quot;&gt;homepage&lt;/a&gt;&lt;/p&gt;&#39;</span><span class="p">)</span>
+<span class="n">rel_soup</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s">&#39;rel&#39;</span><span class="p">]</span>
+<span class="c"># [&#39;index&#39;]</span>
+<span class="n">rel_soup</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s">&#39;rel&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39;index&#39;</span><span class="p">,</span> <span class="s">&#39;contents&#39;</span><span class="p">]</span>
+<span class="k">print</span><span class="p">(</span><span class="n">rel_soup</span><span class="o">.</span><span class="n">p</span><span class="p">)</span>
+<span class="c"># &lt;p&gt;Back to the &lt;a rel=&quot;index contents&quot;&gt;homepage&lt;/a&gt;&lt;/p&gt;</span>
+</pre></div>
+</div>
+<p>If you parse a document as XML, there are no multi-valued attributes:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">xml_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;p class=&quot;body strikeout&quot;&gt;&lt;/p&gt;&#39;</span><span class="p">,</span> <span class="s">&#39;xml&#39;</span><span class="p">)</span>
+<span class="n">xml_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
+<span class="c"># u&#39;body strikeout&#39;</span>
+</pre></div>
+</div>
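+<p>The splitting and joining behavior described in this section can be sketched in plain Python. This is an illustration only, not Beautiful Soup&#8217;s implementation: the names <tt class="docutils literal"><span class="pre">MULTI_VALUED</span></tt>, <tt class="docutils literal"><span class="pre">split_attr</span></tt>, and <tt class="docutils literal"><span class="pre">join_attr</span></tt> are hypothetical, and the set contains only the attribute names mentioned above.</p>

```python
# Attribute names that this section lists as multi-valued (assumed complete
# only for the purpose of this sketch).
MULTI_VALUED = {"class", "rel", "rev", "accept-charset", "headers", "accesskey"}


def split_attr(name, value):
    """Split a raw attribute string into a list if the attribute is multi-valued."""
    if name in MULTI_VALUED:
        return value.split()  # whitespace-separated values become list items
    return value              # anything else is left alone as a single string


def join_attr(value):
    """Join a list of values back into one string, as happens on output."""
    if isinstance(value, list):
        return " ".join(value)
    return value


print(split_attr("class", "body strikeout"))
print(split_attr("id", "my id"))
print(join_attr(["index", "contents"]))
```

+<p>Under this sketch, <tt class="docutils literal"><span class="pre">class</span></tt> is split into a list while <tt class="docutils literal"><span class="pre">id</span></tt> stays a single string, which mirrors the examples in this section.</p>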
+</div>
+</div>
+</div>
+<div class="section" id="navigablestring-obj">
+<h2>NavigableString obj.<a class="headerlink" href="#navigablestring-obj" title="Permalink to this headline">¶</a></h2>
+<p>A string corresponds to a bit of text within a tag (the body text of the document).
+Beautiful Soup uses the <tt class="docutils literal"><span class="pre">NavigableString</span></tt> class to represent these bits of text:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">string</span>
+<span class="c"># u&#39;Extremely bold&#39;</span>
+<span class="nb">type</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
+<span class="c"># &lt;class &#39;bs4.element.NavigableString&#39;&gt;</span>
+</pre></div>
+</div>
+<p>A <tt class="docutils literal"><span class="pre">NavigableString</span></tt> object behaves like a Python Unicode string.
+It also supports some of the features described in <a class="reference internal" href="#id16">Navigating the tree</a> and <a class="reference internal" href="#id25">Searching the tree</a>.
+You can convert a <tt class="docutils literal"><span class="pre">NavigableString</span></tt> object to a Unicode string with <tt class="docutils literal"><span class="pre">unicode()</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">unicode_string</span> <span class="o">=</span> <span class="nb">unicode</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
+<span class="n">unicode_string</span>
+<span class="c"># u&#39;Extremely bold&#39;</span>
+<span class="nb">type</span><span class="p">(</span><span class="n">unicode_string</span><span class="p">)</span>
+<span class="c"># &lt;type &#39;unicode&#39;&gt;</span>
+</pre></div>
+</div>
+<p>You cannot edit a <tt class="docutils literal"><span class="pre">NavigableString</span></tt> in place, but you can replace it with another string using <a class="reference internal" href="#replace-with"><em>replace_with()</em></a>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="s">&quot;No longer bold&quot;</span><span class="p">)</span>
+<span class="n">tag</span>
+<span class="c"># &lt;blockquote&gt;No longer bold&lt;/blockquote&gt;</span>
+</pre></div>
+</div>
+<p><tt class="docutils literal"><span class="pre">NavigableString</span></tt> supports most of the features described in <a class="reference internal" href="#id16">Navigating the tree</a> and <a class="reference internal" href="#id25">Searching the tree</a>,
+but not all of them.
+In particular, a <tt class="docutils literal"><span class="pre">Tag</span></tt> object may contain a string or another <tt class="docutils literal"><span class="pre">Tag</span></tt>, whereas a string cannot contain anything, so strings do not support the <tt class="docutils literal"><span class="pre">.contents</span></tt> or <tt class="docutils literal"><span class="pre">.string</span></tt> attributes, or the <tt class="docutils literal"><span class="pre">find()</span></tt> method.</p>
+<p>If you want to use a <tt class="docutils literal"><span class="pre">NavigableString</span></tt> outside of Beautiful Soup, you should call <tt class="docutils literal"><span class="pre">unicode()</span></tt> on it to turn it into a normal Python Unicode string. Otherwise the string keeps a reference to the entire Beautiful Soup parse tree even after you are done with Beautiful Soup, which wastes a lot of memory.</p>
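+<p>The conversion described above can be sketched as follows. This is a minimal illustration, not part of the original documentation: it assumes Python 3, where <tt class="docutils literal"><span class="pre">str()</span></tt> plays the role of <tt class="docutils literal"><span class="pre">unicode()</span></tt>, and it names the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> explicitly so that no extra parser is required.</p>

```python
from bs4 import BeautifulSoup

# Assumption: Python 3 and the standard-library "html.parser" backend.
soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

nav_string = soup.b.string      # NavigableString, still linked to the parse tree
plain_text = str(nav_string)    # plain copy with no link back to the tree

print(type(nav_string).__name__)
print(type(plain_text).__name__)
print(hasattr(nav_string, "parent"), hasattr(plain_text, "parent"))
```

+<p>Keeping only <tt class="docutils literal"><span class="pre">plain_text</span></tt> lets the parse tree be garbage-collected once nothing else refers to it.</p>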
+</div>
+<div class="section" id="beautifulsoup-obj">
+<h2>BeautifulSoup obj.<a class="headerlink" href="#beautifulsoup-obj" title="Permalink to this headline">¶</a></h2>
+<p>The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself represents the document as a whole.
+For most purposes, you can treat it as a <a class="reference internal" href="#tag"><em>Tag obj.</em></a>
+That is, it supports most of the methods described in <a class="reference internal" href="#id16">Navigating the tree</a> and <a class="reference internal" href="#id25">Searching the tree</a>.</p>
+<p>Since the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object does not correspond to an actual HTML or XML tag, it has no name and no attributes.
+But it is sometimes useful to look at its <tt class="docutils literal"><span class="pre">.name</span></tt>, so it has been given the special <tt class="docutils literal"><span class="pre">.name</span></tt> &#8220;[document]&#8221;:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">name</span>
+<span class="c"># u&#39;[document]&#39;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="comments-obj">
+<h2>Comments obj. and other special strings<a class="headerlink" href="#comments-obj" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">Tag</span></tt>, <tt class="docutils literal"><span class="pre">NavigableString</span></tt>, and <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> cover almost everything you will see in an HTML or XML file, but a few leftover bits remain. The one you are most likely to meet is the comment:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&quot;&lt;b&gt;&lt;!--Hey, buddy. Want to buy a used parser?--&gt;&lt;/b&gt;&quot;</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">comment</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span>
+<span class="nb">type</span><span class="p">(</span><span class="n">comment</span><span class="p">)</span>
+<span class="c"># &lt;class &#39;bs4.element.Comment&#39;&gt;</span>
+</pre></div>
+</div>
+<p>The <tt class="docutils literal"><span class="pre">Comment</span></tt> object is just a special type of <tt class="docutils literal"><span class="pre">NavigableString</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">comment</span>
+<span class="c"># u&#39;Hey, buddy. Want to buy a used parser?&#39;</span>
+</pre></div>
+</div>
+<p>But when it appears as part of an HTML document, a <tt class="docutils literal"><span class="pre">Comment</span></tt> is displayed with special formatting:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;b&gt;</span>
+<span class="c"># &lt;!--Hey, buddy. Want to buy a used parser?--&gt;</span>
+<span class="c"># &lt;/b&gt;</span>
+</pre></div>
+</div>
+<p>Beautiful Soup defines classes for everything else that might show up in an XML document:
+<tt class="docutils literal"><span class="pre">CData</span></tt>, <tt class="docutils literal"><span class="pre">ProcessingInstruction</span></tt>, <tt class="docutils literal"><span class="pre">Declaration</span></tt>, and <tt class="docutils literal"><span class="pre">Doctype</span></tt>.
+Just like <tt class="docutils literal"><span class="pre">Comment</span></tt>, these classes are subclasses of <tt class="docutils literal"><span class="pre">NavigableString</span></tt> that add something extra to the string.
+Here is an example that replaces the comment with a CDATA block:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">CData</span>
+<span class="n">cdata</span> <span class="o">=</span> <span class="n">CData</span><span class="p">(</span><span class="s">&quot;A CDATA block&quot;</span><span class="p">)</span>
+<span class="n">comment</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="n">cdata</span><span class="p">)</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;b&gt;</span>
+<span class="c"># &lt;![CDATA[A CDATA block]]&gt;</span>
+<span class="c"># &lt;/b&gt;</span>
+</pre></div>
+</div>
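+<p>Because <tt class="docutils literal"><span class="pre">Comment</span></tt> is a subclass of <tt class="docutils literal"><span class="pre">NavigableString</span></tt>, an <tt class="docutils literal"><span class="pre">isinstance()</span></tt> check is one way to separate comments from ordinary text. The sketch below is illustrative only; the markup, the variable names, and the explicit <tt class="docutils literal"><span class="pre">html.parser</span></tt> argument are assumptions made for the example.</p>

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Illustrative markup: one comment and one ordinary string.
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b><i>plain text</i>"
soup = BeautifulSoup(markup, "html.parser")

# Comment objects are NavigableString subclasses, so test for Comment first.
comments = [node for node in soup.descendants if isinstance(node, Comment)]
visible = [node for node in soup.descendants
           if isinstance(node, NavigableString) and not isinstance(node, Comment)]

print(comments)
print(visible)
```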
+</div>
+</div>
+<div class="section" id="id16">
+<h1>Navigating the tree<a class="headerlink" href="#id16" title="Permalink to this headline">¶</a></h1>
+<p>Here is the &#8220;Three sisters&#8221; HTML document again:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">html_doc</span> <span class="o">=</span> <span class="s">&quot;&quot;&quot;</span>
+<span class="s">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;</span>
+
+<span class="s">&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;</span>
+
+<span class="s">&lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were</span>
+<span class="s">&lt;a href=&quot;http://example.com/elsie&quot; class=&quot;sister&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="s">&lt;a href=&quot;http://example.com/lacie&quot; class=&quot;sister&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and</span>
+<span class="s">&lt;a href=&quot;http://example.com/tillie&quot; class=&quot;sister&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;</span>
+<span class="s">and they lived at the bottom of a well.&lt;/p&gt;</span>
+
+<span class="s">&lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
+<span class="s">&quot;&quot;&quot;</span>
+
+<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>This document is used as the running example to show how to move from one part of a document to another.</p>
+<div class="section" id="id17">
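+<h2>Going down to children<a class="headerlink" href="#id17" title="Permalink to this headline">¶</a></h2>
+<p>Tags may contain strings (document body text) and other tags. These elements are the tag's <cite>children</cite>. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.</p>
+<p>Beautiful Soup strings do not support any of these attributes, because a string cannot have children.</p>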
+<div class="section" id="id18">
+<h3>Navigating using tag names<a class="headerlink" href="#id18" title="Permalink to this headline">¶</a></h3>
+<p>The simplest way to navigate the parse tree is to use the name of the tag you want.
+If you want the &lt;head&gt; tag, just write <tt class="docutils literal"><span class="pre">soup.head</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">head</span>
+<span class="c"># &lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">title</span>
+<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
+</pre></div>
+</div>
+<p>You can repeat this trick to zoom in, step by step, on a particular part of the parse tree.
+This code gets the first &lt;b&gt; tag beneath the &lt;body&gt; tag:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">body</span><span class="o">.</span><span class="n">b</span>
+<span class="c"># &lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;</span>
+</pre></div>
+</div>
+<p>Using a tag name as an attribute gives you only the <cite>first</cite> tag with that name:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
+</pre></div>
+</div>
+<p>If you need <cite>all</cite> the &lt;a&gt; tags, or the second or a later tag with a given name, you need to use one of the methods described in <a class="reference internal" href="#id25">Searching the tree</a>, such as <cite>find_all()</cite>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&#39;a&#39;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
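+<p>The difference between the attribute shorthand and <tt class="docutils literal"><span class="pre">find_all()</span></tt> can be sketched like this. The shortened markup and the variable names are assumptions for illustration, and <tt class="docutils literal"><span class="pre">html.parser</span></tt> is passed explicitly so the example does not depend on an optional parser.</p>

```python
from bs4 import BeautifulSoup

# Shortened version of the "three sisters" paragraph (illustrative).
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a               # attribute shorthand: first <a> only
all_links = soup.find_all("a")    # every <a>, in document order
second_link = all_links[1]        # pick a later tag by index

print(first_link["id"])
print([link["id"] for link in all_links])
print(second_link["id"])
```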
+</div>
+<div class="section" id="contents-children">
+<h3><tt class="docutils literal"><span class="pre">.contents</span></tt> / <tt class="docutils literal"><span class="pre">.children</span></tt><a class="headerlink" href="#contents-children" title="Permalink to this headline">¶</a></h3>
+<p>A tag's children are available as a list through <tt class="docutils literal"><span class="pre">.contents</span></tt>:</p>
+<div class="highlight-python"><pre>head_tag = soup.head
+head_tag
+# &lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;
+
+head_tag.contents
+# [&lt;title&gt;The Dormouse's story&lt;/title&gt;]
+
+title_tag = head_tag.contents[0]
+title_tag
+# &lt;title&gt;The Dormouse's story&lt;/title&gt;
+title_tag.contents
+# [u'The Dormouse's story']</pre>
+</div>
+<p>The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself has children. In this case, the &lt;html&gt; tag is the child of the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="nb">len</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">contents</span><span class="p">)</span>
+<span class="c"># 1</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">contents</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">name</span>
+<span class="c"># u&#39;html&#39;</span>
+</pre></div>
+</div>
+<p>A string does not have <tt class="docutils literal"><span class="pre">.contents</span></tt>, because it cannot contain anything:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">text</span> <span class="o">=</span> <span class="n">title_tag</span><span class="o">.</span><span class="n">contents</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
+<span class="n">text</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># AttributeError: &#39;NavigableString&#39; object has no attribute &#39;contents&#39;</span>
+</pre></div>
+</div>
+<p>Instead of getting them as a list, you can iterate over a tag's children using the <cite>.children</cite> generator:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">title_tag</span><span class="o">.</span><span class="n">children</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">child</span><span class="p">)</span>
+<span class="c"># The Dormouse&#39;s story</span>
+</pre></div>
+</div>
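+<p>Since a tag's children can be a mix of strings and tags, it is often useful to filter them while iterating. The following is a small sketch under assumed markup and variable names; it uses <tt class="docutils literal"><span class="pre">isinstance()</span></tt> with the <tt class="docutils literal"><span class="pre">Tag</span></tt> class to keep only the tag children.</p>

```python
from bs4 import BeautifulSoup, Tag

# Illustrative paragraph whose children alternate between text and tags.
markup = '<p>Once <a id="link1">Elsie</a>, <a id="link2">Lacie</a> and more</p>'
soup = BeautifulSoup(markup, "html.parser")
paragraph = soup.p

# .contents is a list: three strings and two <a> tags.
print(len(paragraph.contents))

# .children yields the same elements lazily; keep only the Tag objects.
tag_children = [child for child in paragraph.children if isinstance(child, Tag)]
print([child["id"] for child in tag_children])
```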
+</div>
+<div class="section" id="descendants">
+<h3><tt class="docutils literal"><span class="pre">.descendants</span></tt><a class="headerlink" href="#descendants" title="Permalink to this headline">¶</a></h3>
+<p>The <tt class="docutils literal"><span class="pre">.contents</span></tt> and <tt class="docutils literal"><span class="pre">.children</span></tt> attributes only consider a tag's <cite>direct</cite> children.
+For instance, the &lt;head&gt; tag has a single direct child, the &lt;title&gt; tag:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">head_tag</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
+</pre></div>
+</div>
+<p>But the &lt;title&gt; tag itself has a child: the string &#8220;The Dormouse&#8217;s story&#8221;.
+In a sense, that string is also a child of the &lt;head&gt; tag.
+The <tt class="docutils literal"><span class="pre">.descendants</span></tt> attribute lets you iterate over <strong>all</strong> of a tag's children recursively:
+its direct children, the children of its direct children, and so on.</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">head_tag</span><span class="o">.</span><span class="n">descendants</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">child</span><span class="p">)</span>
+<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
+<span class="c"># The Dormouse&#39;s story</span>
+</pre></div>
+</div>
+<p>The &lt;head&gt; tag in this document has only one child,
+but it has two descendants: the &lt;title&gt; tag and the &lt;title&gt; tag's child.
+Likewise, the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object for this document
+has only one direct child (the &lt;html&gt; tag), but it has many descendants:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">children</span><span class="p">))</span>
+<span class="c"># 1</span>
+<span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">descendants</span><span class="p">))</span>
+<span class="c"># 25</span>
+</pre></div>
+</div>
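+<p>The difference between direct children and descendants can be checked on a smaller fragment. This is only a sketch; the one-line markup and the variable names are assumptions chosen to keep the counts easy to verify by hand.</p>

```python
from bs4 import BeautifulSoup

# A <head> with a single <title>, which in turn holds a single string.
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

direct = list(head_tag.children)       # only the <title> tag
nested = list(head_tag.descendants)    # the <title> tag, then its string

print(len(direct), len(nested))
for node in nested:
    print(repr(node))
```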
+</div>
+<div class="section" id="string">
+<span id="id19"></span><h3><tt class="docutils literal"><span class="pre">.string</span></tt><a class="headerlink" href="#string" title="Permalink to this headline">¶</a></h3>
+<p>If a <tt class="docutils literal"><span class="pre">Tag</span></tt> object has only one child, and that child is a <tt class="docutils literal"><span class="pre">NavigableString</span></tt> object, the child is available as <tt class="docutils literal"><span class="pre">.string</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">title_tag</span><span class="o">.</span><span class="n">string</span>
+<span class="c"># u&#39;The Dormouse&#39;s story&#39;</span>
+</pre></div>
+</div>
+<p>If a <tt class="docutils literal"><span class="pre">Tag</span></tt> object's only child is another <tt class="docutils literal"><span class="pre">Tag</span></tt> object, and that child has a <tt class="docutils literal"><span class="pre">.string</span></tt>, then the parent <tt class="docutils literal"><span class="pre">Tag</span></tt> is considered to have the same <tt class="docutils literal"><span class="pre">.string</span></tt> as its child:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">head_tag</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
+
+<span class="n">head_tag</span><span class="o">.</span><span class="n">string</span>
+<span class="c"># u&#39;The Dormouse&#39;s story&#39;</span>
+</pre></div>
+</div>
+<p>If a <tt class="docutils literal"><span class="pre">Tag</span></tt> object contains more than one child, it is not clear which child <tt class="docutils literal"><span class="pre">.string</span></tt> should refer to, so <tt class="docutils literal"><span class="pre">.string</span></tt> is defined to be <tt class="docutils literal"><span class="pre">None</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
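+<p>The three cases for <tt class="docutils literal"><span class="pre">.string</span></tt> can be compared side by side. The markup and names below are hypothetical and exist only to illustrate the rules stated above.</p>

```python
from bs4 import BeautifulSoup

# First <p>: a single nested <b>. Second <p>: two children.
markup = "<div><p><b>only text</b></p><p><b>one</b><i>two</i></p></div>"
soup = BeautifulSoup(markup, "html.parser")
single, mixed = soup.find_all("p")

print(single.b.string)      # the only child of <b> is a string
print(single.string)        # inherited from the only child, <b>
print(mixed.string)         # ambiguous: more than one child
print(list(mixed.strings))  # every string under the tag is still reachable
```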
+</div>
+<div class="section" id="strings-stripped-strings">
+<span id="string-generators"></span><h3><tt class="docutils literal"><span class="pre">.strings</span></tt> / <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt><a class="headerlink" href="#strings-stripped-strings" title="Permalink to this headline">¶</a></h3>
+<p>If a tag contains more than one piece of text, you can still look at just the strings.
+Use the <tt class="docutils literal"><span class="pre">.strings</span></tt> generator:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">string</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">strings</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">string</span><span class="p">))</span>
+<span class="c"># u&quot;The Dormouse&#39;s story&quot;</span>
+<span class="c"># u&#39;\n\n&#39;</span>
+<span class="c"># u&quot;The Dormouse&#39;s story&quot;</span>
+<span class="c"># u&#39;\n\n&#39;</span>
+<span class="c"># u&#39;Once upon a time there were three little sisters; and their names were\n&#39;</span>
+<span class="c"># u&#39;Elsie&#39;</span>
+<span class="c"># u&#39;,\n&#39;</span>
+<span class="c"># u&#39;Lacie&#39;</span>
+<span class="c"># u&#39; and\n&#39;</span>
+<span class="c"># u&#39;Tillie&#39;</span>
+<span class="c"># u&#39;;\nand they lived at the bottom of a well.&#39;</span>
+<span class="c"># u&#39;\n\n&#39;</span>
+<span class="c"># u&#39;...&#39;</span>
+<span class="c"># u&#39;\n&#39;</span>
+</pre></div>
+</div>
+<p>These strings tend to carry a lot of extra whitespace.
+You can remove it by using the <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt> generator instead:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">string</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">stripped_strings</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">string</span><span class="p">))</span>
+<span class="c"># u&quot;The Dormouse&#39;s story&quot;</span>
+<span class="c"># u&quot;The Dormouse&#39;s story&quot;</span>
+<span class="c"># u&#39;Once upon a time there were three little sisters; and their names were&#39;</span>
+<span class="c"># u&#39;Elsie&#39;</span>
+<span class="c"># u&#39;,&#39;</span>
+<span class="c"># u&#39;Lacie&#39;</span>
+<span class="c"># u&#39;and&#39;</span>
+<span class="c"># u&#39;Tillie&#39;</span>
+<span class="c"># u&#39;;\nand they lived at the bottom of a well.&#39;</span>
+<span class="c"># u&#39;...&#39;</span>
+</pre></div>
+</div>
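+<p>Here, whitespace inside a string is left as it is, whitespace at the beginning and end of each string is removed, and strings consisting entirely of whitespace are skipped.</p>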
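+<p>A common use of <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt> is to collapse a tag's text into a single line. The sketch below assumes a small made-up fragment and joins the pieces with single spaces; the choice of separator is an assumption, not something Beautiful Soup prescribes.</p>

```python
from bs4 import BeautifulSoup

# Illustrative fragment with leading, trailing and embedded whitespace.
soup = BeautifulSoup("<p>  Hello  <b> big </b>\n world </p>", "html.parser")

raw_pieces = [str(s) for s in soup.strings]
clean_pieces = list(soup.stripped_strings)

# Join the stripped pieces with a single space.
text = " ".join(clean_pieces)

print(raw_pieces)
print(clean_pieces)
print(text)
```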
+</div>
+</div>
+<div class="section" id="id20">
+<h2>Going up to parents<a class="headerlink" href="#id20" title="Permalink to this headline">¶</a></h2>
+<p>Continuing the &#8220;family tree&#8221; analogy, every tag and every string has exactly one parent: the tag that contains it.</p>
+<div class="section" id="parent">
+<span id="id21"></span><h3><tt class="docutils literal"><span class="pre">.parent</span></tt><a class="headerlink" href="#parent" title="Permalink to this headline">¶</a></h3>
+<p>You can access an element's parent with the <tt class="docutils literal"><span class="pre">.parent</span></tt> attribute.
+For example, in the &#8220;three sisters&#8221; document, the &lt;head&gt; tag is the parent of the &lt;title&gt; tag:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">title_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">title</span>
+<span class="n">title_tag</span>
+<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
+<span class="n">title_tag</span><span class="o">.</span><span class="n">parent</span>
+<span class="c"># &lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;</span>
+</pre></div>
+</div>
+<p>The title string itself has a parent: the &lt;title&gt; tag that contains it:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">title_tag</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">parent</span>
+<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
+</pre></div>
+</div>
+<p>The parent of a top-level tag like &lt;html&gt; is the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">html_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">html</span>
+<span class="nb">type</span><span class="p">(</span><span class="n">html_tag</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
+<span class="c"># &lt;class &#39;bs4.BeautifulSoup&#39;&gt;</span>
+</pre></div>
+</div>
+<p>And the <tt class="docutils literal"><span class="pre">.parent</span></tt> of a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object is defined as None:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="parents">
+<span id="id22"></span><h3><tt class="docutils literal"><span class="pre">.parents</span></tt><a class="headerlink" href="#parents" title="Permalink to this headline">¶</a></h3>
+<p>You can iterate over all of an element's ancestors with <tt class="docutils literal"><span class="pre">.parents</span></tt>.
+The following example starts from an &lt;a&gt; tag buried deep within the HTML document and travels up to the very top:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="n">link</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
+<span class="k">for</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">link</span><span class="o">.</span><span class="n">parents</span><span class="p">:</span>
+ <span class="k">if</span> <span class="n">parent</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">parent</span><span class="p">)</span>
+ <span class="k">else</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
+<span class="c"># p</span>
+<span class="c"># body</span>
+<span class="c"># html</span>
+<span class="c"># [document]</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
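+<p>The walk up the tree can also be collected into a list of ancestor names. This is a sketch with assumed markup and variable names; the <tt class="docutils literal"><span class="pre">is</span> <span class="pre">not</span> <span class="pre">None</span></tt> guard mirrors the check in the loop above and is harmless if the generator never yields <tt class="docutils literal"><span class="pre">None</span></tt>.</p>

```python
from bs4 import BeautifulSoup

# Minimal document with one deeply nested link (illustrative).
markup = "<html><body><p><a id='link1'>Elsie</a></p></body></html>"
soup = BeautifulSoup(markup, "html.parser")
link = soup.a

# Collect ancestor names from the nearest parent up to the document itself.
ancestor_names = [parent.name for parent in link.parents if parent is not None]

print(ancestor_names)
```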
+</div>
+</div>
+<div class="section" id="id23">
+<h2>Going sideways to siblings<a class="headerlink" href="#id23" title="Permalink to this headline">¶</a></h2>
+<p>Consider a simple HTML document like this:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">sibling_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;&lt;b&gt;text1&lt;/b&gt;&lt;c&gt;text2&lt;/c&gt;&lt;/b&gt;&lt;/a&gt;&quot;</span><span class="p">)</span>
+<span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;a&gt;</span>
+<span class="c"># &lt;b&gt;</span>
+<span class="c"># text1</span>
+<span class="c"># &lt;/b&gt;</span>
+<span class="c"># &lt;c&gt;</span>
+<span class="c"># text2</span>
+<span class="c"># &lt;/c&gt;</span>
+<span class="c"># &lt;/a&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
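+<p>The &lt;b&gt; tag and the &lt;c&gt; tag are at the same level: they are both direct children of the same tag.
+Tags in this relationship are called <cite>siblings</cite>.
+When an HTML document is pretty-printed, siblings appear at the same indentation level.
+You can also use this relationship in your code.</p>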
+<div class="section" id="next-sibling-previous-sibling">
+<h3><tt class="docutils literal"><span class="pre">.next_sibling</span></tt> / <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt><a class="headerlink" href="#next-sibling-previous-sibling" title="Permalink to this headline">¶</a></h3>
+<p>You can use <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> and <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt> to move between elements that are on the same level of the parse tree:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">next_sibling</span>
+<span class="c"># &lt;c&gt;text2&lt;/c&gt;</span>
+
+<span class="n">sibling_soup</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">previous_sibling</span>
+<span class="c"># &lt;b&gt;text1&lt;/b&gt;</span>
+</pre></div>
+</div>
+<p>The &lt;b&gt; tag has a <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> but no <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt>,
+because there is nothing before the &lt;b&gt; tag on the same level of the parse tree.
+For the same reason, the &lt;c&gt; tag has a <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt> but no <tt class="docutils literal"><span class="pre">.next_sibling</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">previous_sibling</span><span class="p">)</span>
+<span class="c"># None</span>
+<span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">next_sibling</span><span class="p">)</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
+<p>The strings &#8220;text1&#8221; and &#8220;text2&#8221; are not siblings, because they do not have the same parent:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span>
+<span class="c"># u&#39;text1&#39;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">next_sibling</span><span class="p">)</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
+<p>In real HTML documents, the <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> or <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt> of a tag is usually a string containing whitespace.
+Consider the &#8220;three sisters&#8221; document again:</p>
+<div class="highlight-python"><pre>&lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/a&gt;
+&lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt;
+&lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;</pre>
+</div>
+<p>You might expect the <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> of the first &lt;a&gt; tag to be the second &lt;a&gt; tag, but it is not.
+It is a string: the comma and newline that separate the first &lt;a&gt; tag from the second:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="n">link</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
+
+<span class="n">link</span><span class="o">.</span><span class="n">next_sibling</span>
+<span class="c"># u&#39;,\n&#39;</span>
+</pre></div>
+</div>
+<p>The second &lt;a&gt; tag is the <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> of that comma-and-newline string:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">link</span><span class="o">.</span><span class="n">next_sibling</span><span class="o">.</span><span class="n">next_sibling</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;</span>
+</pre></div>
+</div>
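+<p>When only the next sibling <em>tag</em> matters, chaining <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> past the intervening strings is fragile. One alternative is Beautiful Soup's <tt class="docutils literal"><span class="pre">find_next_sibling()</span></tt> method, which belongs to the search API rather than to the navigation attributes covered here. The sketch below uses assumed markup and names.</p>

```python
from bs4 import BeautifulSoup

# Two links separated by a comma and a newline (illustrative).
markup = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")
link = soup.a

raw_next = link.next_sibling             # the string between the two tags
next_link = link.find_next_sibling("a")  # skips strings, returns the next <a>

print(repr(raw_next))
print(next_link["id"])
```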
+</div>
+<div class="section" id="next-siblings-previous-siblings">
+<span id="sibling-generators"></span><h3><tt class="docutils literal"><span class="pre">.next_siblings</span></tt> / <tt class="docutils literal"><span class="pre">.previous_siblings</span></tt><a class="headerlink" href="#next-siblings-previous-siblings" title="Permalink to this headline">¶</a></h3>
+<p>You can iterate over a tag&#8217;s siblings with <tt class="docutils literal"><span class="pre">.next_siblings</span></tt> or <tt class="docutils literal"><span class="pre">.previous_siblings</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">sibling</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">next_siblings</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">sibling</span><span class="p">))</span>
+<span class="c"># u&#39;,\n&#39;</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;</span>
+<span class="c"># u&#39; and\n&#39;</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;</span>
+<span class="c"># u&#39;; and they lived at the bottom of a well.&#39;</span>
+<span class="c"># None</span>
+
+<span class="k">for</span> <span class="n">sibling</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">&quot;link3&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">previous_siblings</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">sibling</span><span class="p">))</span>
+<span class="c"># &#39; and\n&#39;</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;</span>
+<span class="c"># u&#39;,\n&#39;</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
+<span class="c"># u&#39;Once upon a time there were three little sisters; and their names were\n&#39;</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
+</div>
+</div>
+<div class="section" id="id24">
+<h2>Going back and forth<a class="headerlink" href="#id24" title="Permalink to this headline">¶</a></h2>
+<p>Take a look at the beginning of the &#8220;three sisters&#8221; document:</p>
+<div class="highlight-python"><pre>&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;
+&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</pre>
+</div>
+<p>An HTML parser takes this string of characters and turns it into a series of events: &#8220;open an &lt;html&gt; tag&#8221;, &#8220;open a &lt;head&gt; tag&#8221;, &#8220;open a &lt;title&gt; tag&#8221;, &#8220;add a string&#8221;, &#8220;close the &lt;title&gt; tag&#8221;, &#8220;open a &lt;p&gt; tag&#8221;, and so on. Beautiful Soup offers tools for reconstructing this initial parse of the document.</p>
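+<p>The event stream can be made visible with a small sketch. It uses Python&#8217;s standard-library <tt class="docutils literal"><span class="pre">html.parser.HTMLParser</span></tt> rather than Beautiful Soup&#8217;s own machinery, and the <tt class="docutils literal"><span class="pre">EventRecorder</span></tt> class is a hypothetical helper written for this illustration only:</p>

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record parser events in the order they occur (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called when an opening tag such as <html> is read.
        self.events.append(("open", tag))

    def handle_data(self, data):
        # Called for text found between tags.
        self.events.append(("string", data))

    def handle_endtag(self, tag):
        # Called when a closing tag such as </title> is read.
        self.events.append(("close", tag))


recorder = EventRecorder()
recorder.feed("<html><head><title>The Dormouse's story</title></head>")
recorder.close()
events = recorder.events

for kind, value in events:
    print(kind, repr(value))
```

+<p>The order of the recorded events is the order that <tt class="docutils literal"><span class="pre">.next_element</span></tt> and <tt class="docutils literal"><span class="pre">.previous_element</span></tt> walk through, which is why it can differ from sibling order.</p>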
+<div class="section" id="next-element-previous-element">
+<span id="element-generators"></span><h3><tt class="docutils literal"><span class="pre">.next_element</span></tt> / <tt class="docutils literal"><span class="pre">.previous_element</span></tt><a class="headerlink" href="#next-element-previous-element" title="Permalink to this headline">¶</a></h3>
+<p>The <tt class="docutils literal"><span class="pre">.next_element</span></tt> attribute of a string or tag points to whatever was parsed immediately afterwards.
+It might be the same as <tt class="docutils literal"><span class="pre">.next_sibling</span></tt>, but it is usually drastically different.</p>
+<p>Consider the final &lt;a&gt; tag in the &#8220;three sisters&#8221; document.
+Its <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> is a string: the conclusion of the sentence that was interrupted by the start of the &lt;a&gt; tag:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">last_a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="nb">id</span><span class="o">=</span><span class="s">&quot;link3&quot;</span><span class="p">)</span>
+<span class="n">last_a_tag</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;</span>
+
+<span class="n">last_a_tag</span><span class="o">.</span><span class="n">next_sibling</span>
+<span class="c"># &#39;; and they lived at the bottom of a well.&#39;</span>
+</pre></div>
+</div>
+<p>The <tt class="docutils literal"><span class="pre">.next_element</span></tt> of that &lt;a&gt; tag, however, is the thing that was parsed immediately after the opening &lt;a&gt; tag: the word &#8220;Tillie&#8221;, not the rest of the sentence:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">last_a_tag</span><span class="o">.</span><span class="n">next_element</span>
+<span class="c"># u&#39;Tillie&#39;</span>
+</pre></div>
+</div>
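+<p>That is because in the original markup, the word &#8220;Tillie&#8221; appears before the semicolon.
+The parser encountered the &lt;a&gt; tag, then the word &#8220;Tillie&#8221;, then the closing &lt;/a&gt; tag,
+and only then the semicolon and the rest of the sentence.
+The semicolon is on the same level as the &lt;a&gt; tag, but the word &#8220;Tillie&#8221; was encountered first.</p>
+<p>The <tt class="docutils literal"><span class="pre">.previous_element</span></tt> attribute is the exact opposite of <tt class="docutils literal"><span class="pre">.next_element</span></tt>.
+It points to whatever element was parsed immediately before this one:</p>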
+<div class="highlight-python"><div class="highlight"><pre><span class="n">last_a_tag</span><span class="o">.</span><span class="n">previous_element</span>
+<span class="c"># u&#39; and\n&#39;</span>
+<span class="n">last_a_tag</span><span class="o">.</span><span class="n">previous_element</span><span class="o">.</span><span class="n">next_element</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="next-elements-previous-elements">
+<h3><tt class="docutils literal"><span class="pre">.next_elements</span></tt> / <tt class="docutils literal"><span class="pre">.previous_elements</span></tt><a class="headerlink" href="#next-elements-previous-elements" title="Permalink to this headline">¶</a></h3>
+<p>You can also use these iterators to move forward or backward through the document in the order it was parsed:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">element</span> <span class="ow">in</span> <span class="n">last_a_tag</span><span class="o">.</span><span class="n">next_elements</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">element</span><span class="p">))</span>
+<span class="c"># u&#39;Tillie&#39;</span>
+<span class="c"># u&#39;;\nand they lived at the bottom of a well.&#39;</span>
+<span class="c"># u&#39;\n\n&#39;</span>
+<span class="c"># &lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
+<span class="c"># u&#39;...&#39;</span>
+<span class="c"># u&#39;\n&#39;</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
+</div>
+</div>
+</div>
+<div class="section" id="id25">
+<h1>Searching the tree<a class="headerlink" href="#id25" title="Permalink to this headline">¶</a></h1>
+<p>Beautiful Soup defines many methods for searching the parse tree,
+but they are all very similar.
+This chapter spends most of its space on the two most popular methods, <tt class="docutils literal"><span class="pre">find()</span></tt> and <tt class="docutils literal"><span class="pre">find_all()</span></tt>.
+The other methods take almost exactly the same arguments, so they are covered only briefly.</p>
+<p>Once again, the &#8220;three sisters&#8221; document serves as the example:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">html_doc</span> <span class="o">=</span> <span class="s">&quot;&quot;&quot;</span>
+<span class="s">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;</span>
+
+<span class="s">&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;</span>
+
+<span class="s">&lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were</span>
+<span class="s">&lt;a href=&quot;http://example.com/elsie&quot; class=&quot;sister&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="s">&lt;a href=&quot;http://example.com/lacie&quot; class=&quot;sister&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and</span>
+<span class="s">&lt;a href=&quot;http://example.com/tillie&quot; class=&quot;sister&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;</span>
+<span class="s">and they lived at the bottom of a well.&lt;/p&gt;</span>
+
+<span class="s">&lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
+<span class="s">&quot;&quot;&quot;</span>
+
+<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>By passing a filter to a method like <tt class="docutils literal"><span class="pre">find_all()</span></tt>,
+you can zoom in on the parts of the document you are interested in.</p>
+<div class="section" id="id26">
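+<h2>Kinds of filters<a class="headerlink" href="#id26" title="Permalink to this headline">¶</a></h2>
+<p>Before going into detail about <tt class="docutils literal"><span class="pre">find_all()</span></tt> and similar methods, here are examples of the filters you can pass to them.
+These filters show up again and again throughout the search API.
+You can use them to filter on a tag&#8217;s name, on its attributes, on the text of a string, or on some combination of these.</p>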
+<span class="target" id="a-string"></span><div class="section" id="id27">
+<h3>A string<a class="headerlink" href="#id27" title="Permalink to this headline">¶</a></h3>
+<p>The simplest filter is a string.
+Pass a string to a search method and Beautiful Soup performs a match against that exact string.
+This code finds all the &lt;b&gt; tags in the document:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&#39;b&#39;</span><span class="p">)</span>
+<span class="c"># [&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;]</span>
+</pre></div>
+</div>
+<p>If you pass in a byte string, Beautiful Soup assumes the string is encoded as UTF-8.
+You can avoid this by passing in a Unicode string instead.</p>
+<span class="target" id="a-regular-expression"></span></div>
+<div class="section" id="id28">
+<h3>A regular expression<a class="headerlink" href="#id28" title="Permalink to this headline">¶</a></h3>
+<p>If you pass in a regular expression object, Beautiful Soup filters against it using its <tt class="docutils literal"><span class="pre">search()</span></tt> method, so the pattern may match anywhere in the name unless it is anchored.
+This code finds all the tags whose names start with the letter &#8220;b&#8221;.
+In the &#8220;three sisters&#8221; document, these are the &lt;body&gt; tag and the &lt;b&gt; tag:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">import</span> <span class="nn">re</span>
+<span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">&quot;^b&quot;</span><span class="p">)):</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
+<span class="c"># body</span>
+<span class="c"># b</span>
+</pre></div>
+</div>
+<p>This code finds all the tags whose names contain the letter &#8220;t&#8221;:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">&quot;t&quot;</span><span class="p">)):</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
+<span class="c"># html</span>
+<span class="c"># title</span>
+</pre></div>
+</div>
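+<p>The two outputs above correspond to search-style matching: an unanchored pattern can hit anywhere in the tag name, while <tt class="docutils literal"><span class="pre">^</span></tt> pins it to the start. A small sketch with the <tt class="docutils literal"><span class="pre">re</span></tt> module alone shows the difference; the list of tag names is written out by hand for illustration, so Beautiful Soup itself is not involved:</p>

```python
import re

# Tag names from the "three sisters" document, listed by hand.
tag_names = ["html", "head", "title", "body", "p", "b", "a"]

contains_t = re.compile("t")
starts_with_b = re.compile("^b")

# search() finds the pattern anywhere in the name.
found_by_search = [n for n in tag_names if contains_t.search(n)]

# match() only succeeds when the pattern matches at the start of the name.
found_by_match = [n for n in tag_names if contains_t.match(n)]

# An anchored pattern behaves the same way under search().
anchored = [n for n in tag_names if starts_with_b.search(n)]

print(found_by_search)
print(found_by_match)
print(anchored)
```

+<p>If you want to match only at the beginning of a tag name, anchoring the pattern with <tt class="docutils literal"><span class="pre">^</span></tt> makes that intent explicit.</p>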
+<span class="target" id="a-list"></span></div>
+<div class="section" id="id29">
+<h3>A list<a class="headerlink" href="#id29" title="Permalink to this headline">¶</a></h3>
+<p>If you pass in a list, Beautiful Soup returns the elements that match any item in that list.
+This code finds all the &lt;a&gt; tags and all the &lt;b&gt; tags:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">([</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="s">&quot;b&quot;</span><span class="p">])</span>
+<span class="c"># [&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="true">
+<span id="the-value-true"></span><h3>True<a class="headerlink" href="#true" title="Permalink to this headline">¶</a></h3>
+<p>The value <tt class="docutils literal"><span class="pre">True</span></tt> matches everything it can.
+This code finds <strong>all</strong> the tags in the document,
+but none of the text strings:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="bp">True</span><span class="p">):</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
+<span class="c"># html</span>
+<span class="c"># head</span>
+<span class="c"># title</span>
+<span class="c"># body</span>
+<span class="c"># p</span>
+<span class="c"># b</span>
+<span class="c"># p</span>
+<span class="c"># a</span>
+<span class="c"># a</span>
+<span class="c"># a</span>
+<span class="c"># p</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="id30">
+<h3>A function<a class="headerlink" href="#id30" title="Permalink to this headline">¶</a></h3>
+<p>If none of the other filters does what you need, you can define a function that takes an element as its only argument.
+The function should return <tt class="docutils literal"><span class="pre">True</span></tt> if the argument matches, and <tt class="docutils literal"><span class="pre">False</span></tt> otherwise.</p>
+<p>Here is a function that returns <tt class="docutils literal"><span class="pre">True</span></tt> if a tag defines the &#8220;class&#8221; attribute but does not define the &#8220;id&#8221; attribute:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">def</span> <span class="nf">has_class_but_no_id</span><span class="p">(</span><span class="n">tag</span><span class="p">):</span>
+ <span class="k">return</span> <span class="n">tag</span><span class="o">.</span><span class="n">has_attr</span><span class="p">(</span><span class="s">&#39;class&#39;</span><span class="p">)</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">tag</span><span class="o">.</span><span class="n">has_attr</span><span class="p">(</span><span class="s">&#39;id&#39;</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>Pass this function to <tt class="docutils literal"><span class="pre">find_all()</span></tt> and you get all the &lt;p&gt; tags in the &#8220;three sisters&#8221; document:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">has_class_but_no_id</span><span class="p">)</span>
+<span class="c"># [&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;,</span>
+<span class="c"># &lt;p class=&quot;story&quot;&gt;Once upon a time there were...&lt;/p&gt;,</span>
+<span class="c"># &lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;]</span>
+</pre></div>
+</div>
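+<p>This function picks up only the &lt;p&gt; tags.
+It does not pick up the &lt;a&gt; tags, because they define both &#8220;class&#8221; and &#8220;id&#8221;.
+It does not pick up tags like &lt;html&gt; and &lt;title&gt; either, because they do not define &#8220;class&#8221;.</p>
+<p>Here is a function that returns <tt class="docutils literal"><span class="pre">True</span></tt> if a tag is surrounded by string objects, that is, if both its <tt class="docutils literal"><span class="pre">.next_element</span></tt> and its <tt class="docutils literal"><span class="pre">.previous_element</span></tt> are strings:</p>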
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">NavigableString</span>
+<span class="k">def</span> <span class="nf">surrounded_by_strings</span><span class="p">(</span><span class="n">tag</span><span class="p">):</span>
+ <span class="k">return</span> <span class="p">(</span><span class="nb">isinstance</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">next_element</span><span class="p">,</span> <span class="n">NavigableString</span><span class="p">)</span>
+ <span class="ow">and</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">previous_element</span><span class="p">,</span> <span class="n">NavigableString</span><span class="p">))</span>
+
+<span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">surrounded_by_strings</span><span class="p">):</span>
+    <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
+<span class="c"># p</span>
+<span class="c"># a</span>
+<span class="c"># a</span>
+<span class="c"># a</span>
+<span class="c"># p</span>
+</pre></div>
+</div>
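+<p>As a further sketch of the function filter, the snippet below combines a name check with an attribute check. It assumes the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> backend; the markup and the <tt class="docutils literal"><span class="pre">is_link_with_href</span></tt> helper are hypothetical and only serve the illustration:</p>

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the third <a> tag has no href and no id.
markup = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<a class="sister">No link</a>'
    '</p>'
)
soup = BeautifulSoup(markup, "html.parser")


def is_link_with_href(tag):
    """Return True for <a> tags that carry an href attribute."""
    return tag.name == "a" and tag.has_attr("href")


# The function is called once per tag; tags for which it returns True are kept.
linked_ids = [tag["id"] for tag in soup.find_all(is_link_with_href)]

# The same idea as has_class_but_no_id, written inline as a lambda.
class_only = [
    tag.name
    for tag in soup.find_all(lambda t: t.has_attr("class") and not t.has_attr("id"))
]

print(linked_ids)
print(class_only)
```

+<p>A named function tends to be easier to reuse and test than a lambda, but both are passed to <tt class="docutils literal"><span class="pre">find_all()</span></tt> in the same way.</p>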
+<p>Now we are ready to look at the search methods in detail.</p>
+</div>
+</div>
+<div class="section" id="find-all">
+<h2><tt class="docutils literal"><span class="pre">find_all()</span></tt><a class="headerlink" href="#find-all" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find_all(<a class="reference internal" href="#name"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#recursive"><em>recursive</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#limit"><em>limit</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>The <tt class="docutils literal"><span class="pre">find_all()</span></tt> method looks through the descendants of a <cite>Tag</cite> object and retrieves <strong>all</strong> descendants that match your filters.
+Several examples appeared in <a class="reference internal" href="#id26">Kinds of filters</a>; here are a few more:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">,</span> <span class="s">&quot;title&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">&quot;link2&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;]</span>
+
+<span class="kn">import</span> <span class="nn">re</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">&quot;sisters&quot;</span><span class="p">))</span>
+<span class="c"># u&#39;Once upon a time there were three little sisters; and their names were\n&#39;</span>
+</pre></div>
+</div>
+<p>Some of these should look familiar, but others are new.
+What does it mean to pass in a value for <tt class="docutils literal"><span class="pre">text</span></tt> or <tt class="docutils literal"><span class="pre">id</span></tt>?
+Why does <tt class="docutils literal"><span class="pre">find_all(&quot;p&quot;,</span> <span class="pre">&quot;title&quot;)</span></tt> find a &lt;p&gt; tag with the CSS class &#8220;title&#8221;?
+Let us look at the arguments to <tt class="docutils literal"><span class="pre">find_all()</span></tt>.</p>
+<span class="target" id="name"></span><div class="section" id="id31">
+<h3>The name argument<a class="headerlink" href="#id31" title="Permalink to this headline">¶</a></h3>
+<p>Pass a value for the <tt class="docutils literal"><span class="pre">name</span></tt> argument of <tt class="docutils literal"><span class="pre">find_all()</span></tt> and Beautiful Soup considers only tags with matching names.
+Text strings are ignored, as are tags whose names do not match.</p>
+<p>This is the simplest usage:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
+</pre></div>
+</div>
+<p>Recall from <a class="reference internal" href="#id26">Kinds of filters</a> that the value of <tt class="docutils literal"><span class="pre">name</span></tt> can be a string, a regular expression, a list, a function, or the value True.</p>
+<span class="target" id="kwargs"></span></div>
+<div class="section" id="id32">
+<h3>The keyword arguments<a class="headerlink" href="#id32" title="Permalink to this headline">¶</a></h3>
+<p>Any argument that is not recognized is turned into a filter on one of a tag&#8217;s attributes.
+If you pass in a value for the keyword argument <tt class="docutils literal"><span class="pre">id</span></tt>, Beautiful Soup filters against each tag&#8217;s &#8216;id&#8217; attribute:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">&#39;link2&#39;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>If you pass in a value for the keyword argument <tt class="docutils literal"><span class="pre">href</span></tt>, Beautiful Soup filters against each tag&#8217;s &#8216;href&#8217; attribute:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">&quot;elsie&quot;</span><span class="p">))</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>The value of a keyword argument can likewise be <a class="reference internal" href="#id27">a string</a>, <a class="reference internal" href="#id28">a regular expression</a>, <a class="reference internal" href="#id29">a list</a>, <a class="reference internal" href="#id30">a function</a>, or <a class="reference internal" href="#true">the value True</a>.</p>
+<p>This code finds all tags whose <tt class="docutils literal"><span class="pre">id</span></tt> attribute has a value, regardless of what that value is:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>You can filter on multiple attributes at once by passing in more than one keyword argument:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">&quot;elsie&quot;</span><span class="p">),</span> <span class="nb">id</span><span class="o">=</span><span class="s">&#39;link1&#39;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>Some attributes, such as the &#8216;data-*&#8217; attributes in HTML5, have names that cannot be used as the names of keyword arguments:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">data_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;div data-foo=&quot;value&quot;&gt;foo!&lt;/div&gt;&#39;</span><span class="p">)</span>
+<span class="n">data_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">data</span><span class="o">-</span><span class="n">foo</span><span class="o">=</span><span class="s">&quot;value&quot;</span><span class="p">)</span>
+<span class="c"># SyntaxError: keyword can&#39;t be an expression</span>
+</pre></div>
+</div>
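+<p>You can still search on these attributes by putting them into a dictionary and passing that dictionary to <tt class="docutils literal"><span class="pre">find_all()</span></tt> as the <tt class="docutils literal"><span class="pre">attrs</span></tt> argument:</p>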
+<div class="highlight-python"><div class="highlight"><pre><span class="n">data_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">&quot;data-foo&quot;</span><span class="p">:</span> <span class="s">&quot;value&quot;</span><span class="p">})</span>
+<span class="c"># [&lt;div data-foo=&quot;value&quot;&gt;foo!&lt;/div&gt;]</span>
+</pre></div>
+</div>
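+<p>The values in the <tt class="docutils literal"><span class="pre">attrs</span></tt> dictionary accept the same kinds of filters as keyword arguments. The following is a sketch under the assumption of the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> backend; the three-&lt;div&gt; markup is made up for the illustration:</p>

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: two tags carry data-foo, one does not.
markup = (
    '<div data-foo="value">foo!</div>'
    '<div data-foo="other">bar!</div>'
    '<div>plain</div>'
)
data_soup = BeautifulSoup(markup, "html.parser")

# Exact string match on the attribute value.
exact = [t.get_text() for t in data_soup.find_all(attrs={"data-foo": "value"})]

# True matches any tag that has the attribute at all.
any_value = [t.get_text() for t in data_soup.find_all(attrs={"data-foo": True})]

# A compiled regular expression is matched against the attribute value.
by_regex = [
    t.get_text()
    for t in data_soup.find_all(attrs={"data-foo": re.compile("^oth")})
]

print(exact)
print(any_value)
print(by_regex)
```

+<p>Because the attribute name is an ordinary dictionary key here, hyphens and other characters that are invalid in Python identifiers cause no syntax problems.</p>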
+<span class="target" id="attrs"></span></div>
+<div class="section" id="css">
+<h3>Searching by CSS class<a class="headerlink" href="#css" title="Permalink to this headline">¶</a></h3>
+<p>It is very useful to search for a tag that has a certain CSS class.
+However, &#8220;class&#8221; is a reserved word in Python, so using <tt class="docutils literal"><span class="pre">class</span></tt> as a keyword argument is a syntax error.
+As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument <tt class="docutils literal"><span class="pre">class_</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">&quot;sister&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>As with any keyword argument, you can pass <tt class="docutils literal"><span class="pre">class_</span></tt> a string, a regular expression, a function, or the value True:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">class_</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">&quot;itl&quot;</span><span class="p">))</span>
+<span class="c"># [&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;]</span>
+
+<span class="k">def</span> <span class="nf">has_six_characters</span><span class="p">(</span><span class="n">css_class</span><span class="p">):</span>
+ <span class="k">return</span> <span class="n">css_class</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">css_class</span><span class="p">)</span> <span class="o">==</span> <span class="mi">6</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">class_</span><span class="o">=</span><span class="n">has_six_characters</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
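+<p>Remember that a single attribute of a <cite>Tag</cite> object can have <a class="reference internal" href="#id15">multiple values</a>, and &#8220;class&#8221; is one such attribute.
+When you search for a tag that matches a certain CSS class, you are matching against <em>any</em> of its CSS classes:</p>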
+<div class="highlight-python"><div class="highlight"><pre><span class="n">css_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;p class=&quot;body strikeout&quot;&gt;&lt;/p&gt;&#39;</span><span class="p">)</span>
+<span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">&quot;strikeout&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;p class=&quot;body strikeout&quot;&gt;&lt;/p&gt;]</span>
+
+<span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">&quot;body&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;p class=&quot;body strikeout&quot;&gt;&lt;/p&gt;]</span>
+</pre></div>
+</div>
+<p>You can also search for the exact string value of the <tt class="docutils literal"><span class="pre">class</span></tt> attribute:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">&quot;body strikeout&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;p class=&quot;body strikeout&quot;&gt;&lt;/p&gt;]</span>
+</pre></div>
+</div>
+<p>But searching for variants of that string value, such as the same classes in a different order, does not work:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">&quot;strikeout body&quot;</span><span class="p">)</span>
+<span class="c"># []</span>
+</pre></div>
+</div>
+<p>If you want to match tags on two or more CSS classes, use a CSS selector instead:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">css_soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;p.strikeout.body&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;p class=&quot;body strikeout&quot;&gt;&lt;/p&gt;]</span>
+</pre></div>
+</div>
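+<p>If <tt class="docutils literal"><span class="pre">.select()</span></tt> is not an option, a filter function can match several classes regardless of their order. The following is only a sketch: the helper name <tt class="docutils literal"><span class="pre">has_classes</span></tt> is hypothetical, and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> is assumed so that no extra parser needs to be installed.</p>

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")


def has_classes(*wanted):
    """Build a filter for <p> tags carrying every class in `wanted` (hypothetical helper)."""
    wanted = set(wanted)

    def matcher(tag):
        # The class attribute is multi-valued, so tag.get("class") is a list of names.
        return tag.name == "p" and wanted <= set(tag.get("class", []))

    return matcher


# The order in which the classes are listed does not matter to the set comparison.
both = css_soup.find_all(has_classes("strikeout", "body"))
neither = css_soup.find_all(has_classes("strikeout", "missing"))
```

+<p>The function receives each tag in turn and returns True only when all the requested classes are present, so it behaves like the selector <tt class="docutils literal"><span class="pre">p.strikeout.body</span></tt> for this simple case.</p>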
+<p>In older versions of Beautiful Soup, the <tt class="docutils literal"><span class="pre">class_</span></tt> argument is not available.
+In that case you can use the <tt class="docutils literal"><span class="pre">attrs</span></tt> trick shown below:
+pass a dictionary with the key &#8220;class&#8221; as the <tt class="docutils literal"><span class="pre">attrs</span></tt> argument.
+The dictionary value can be a string, a regular expression, and so on:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">&quot;class&quot;</span><span class="p">:</span> <span class="s">&quot;sister&quot;</span><span class="p">})</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<span class="target" id="text"></span></div>
+<div class="section" id="id33">
+<h3>The text argument<a class="headerlink" href="#id33" title="Permalink to this headline">¶</a></h3>
+<p>With the <tt class="docutils literal"><span class="pre">text</span></tt> argument you can search for strings instead of tags.
+As with the <tt class="docutils literal"><span class="pre">name</span></tt> argument and the keyword arguments, you can pass in <a class="reference internal" href="#id27">a string</a>, <a class="reference internal" href="#id28">a regular expression</a>, <a class="reference internal" href="#id29">a list</a>, <a class="reference internal" href="#id30">a function</a>, or <a class="reference internal" href="#true">the value True</a>.
+Here are some examples:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="s">&quot;Elsie&quot;</span><span class="p">)</span>
+<span class="c"># [u&#39;Elsie&#39;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="p">[</span><span class="s">&quot;Tillie&quot;</span><span class="p">,</span> <span class="s">&quot;Elsie&quot;</span><span class="p">,</span> <span class="s">&quot;Lacie&quot;</span><span class="p">])</span>
+<span class="c"># [u&#39;Elsie&#39;, u&#39;Lacie&#39;, u&#39;Tillie&#39;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">&quot;Dormouse&quot;</span><span class="p">))</span>
+<span class="c"># [u&quot;The Dormouse&#39;s story&quot;, u&quot;The Dormouse&#39;s story&quot;]</span>
+
+<span class="k">def</span> <span class="nf">is_the_only_string_within_a_tag</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
+ <span class="sd">&quot;&quot;&quot;Return True if this string is the only child of its parent tag.&quot;&quot;&quot;</span>
+ <span class="k">return</span> <span class="p">(</span><span class="n">s</span> <span class="o">==</span> <span class="n">s</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">is_the_only_string_within_a_tag</span><span class="p">)</span>
+<span class="c"># [u&quot;The Dormouse&#39;s story&quot;, u&quot;The Dormouse&#39;s story&quot;, u&#39;Elsie&#39;, u&#39;Lacie&#39;, u&#39;Tillie&#39;, u&#39;...&#39;]</span>
+</pre></div>
+</div>
+<p>Although <tt class="docutils literal"><span class="pre">text</span></tt> is for finding strings, you can combine it with arguments that find tags.
+Beautiful Soup then finds all tags whose <tt class="docutils literal"><span class="pre">.string</span></tt> matches the value given for <tt class="docutils literal"><span class="pre">text</span></tt>.
+This code finds the &lt;a&gt; tags whose <tt class="docutils literal"><span class="pre">.string</span></tt> is &#8220;Elsie&#8221;:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="s">&quot;Elsie&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a href=&quot;http://example.com/elsie&quot; class=&quot;sister&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="limit">
+<span id="id34"></span><h3>The limit argument<a class="headerlink" href="#limit" title="Permalink to this headline">¶</a></h3>
+<p><tt class="docutils literal"><span class="pre">find_all()</span></tt> returns all the tags and strings that match your filters.
+This can take a while if the document is large.
+If you don't need <strong>all</strong> the results, you can pass in a number for <tt class="docutils literal"><span class="pre">limit</span></tt>; Beautiful Soup stops gathering results once it has found that many.</p>
+<p>There are three links in the &#8220;three sisters&#8221; document, but this code finds only the first two:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="recursive">
+<span id="id35"></span><h3>The recursive argument<a class="headerlink" href="#recursive" title="Permalink to this headline">¶</a></h3>
+<p>If you call <tt class="docutils literal"><span class="pre">mytag.find_all()</span></tt>, Beautiful Soup examines all the descendants of <tt class="docutils literal"><span class="pre">mytag</span></tt>:
+its children, its children's children, and so on.
+If you only want Beautiful Soup to consider direct children, pass in <tt class="docutils literal"><span class="pre">recursive=False</span></tt>.
+Compare the two calls here:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">,</span> <span class="n">recursive</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
+<span class="c"># []</span>
+</pre></div>
+</div>
+<p>Here is that part of the document:</p>
+<div class="highlight-python"><pre>&lt;html&gt;
+ &lt;head&gt;
+ &lt;title&gt;
+ The Dormouse's story
+ &lt;/title&gt;
+ &lt;/head&gt;
+...</pre>
+</div>
+<p>In this document the &lt;title&gt; tag is beneath the &lt;html&gt; tag, but it is not <cite>directly</cite> beneath it: the &lt;head&gt; tag is in the way.
+Beautiful Soup finds the &lt;title&gt; tag only when it is allowed to look at all the descendants of the &lt;html&gt; tag.
+When <tt class="docutils literal"><span class="pre">recursive=False</span></tt> restricts <tt class="docutils literal"><span class="pre">find_all()</span></tt> to the direct children of the &lt;html&gt; tag, it finds nothing.</p>
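+<p>The difference can be checked on a cut-down document. This is a sketch rather than part of the original example: the one-line <tt class="docutils literal"><span class="pre">html_doc</span></tt> string and the variable names are made up for illustration, and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> is assumed.</p>

```python
from bs4 import BeautifulSoup

# A cut-down stand-in for the example document (assumption, not the full text).
html_doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Default behaviour: every descendant of <html> is examined.
deep = soup.html.find_all("title")

# recursive=False: only the direct children of <html> are examined.
direct_of_html = soup.html.find_all("title", recursive=False)

# <title> is a direct child of <head>, so the restricted search succeeds there.
direct_of_head = soup.head.find_all("title", recursive=False)

# Names of the tags that recursive=False actually looks at under <html>.
children_of_html = [child.name for child in soup.html.find_all(recursive=False)]
```

+<p>Here <tt class="docutils literal"><span class="pre">direct_of_html</span></tt> is empty because the only direct children of the html tag are head and body, while <tt class="docutils literal"><span class="pre">direct_of_head</span></tt> contains the title tag.</p>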
+<p>Beautiful Soup offers many methods for searching the parse tree,
+and most of them take the same arguments.
+The <tt class="docutils literal"><span class="pre">name</span></tt>, <tt class="docutils literal"><span class="pre">attrs</span></tt>, <tt class="docutils literal"><span class="pre">text</span></tt>, <tt class="docutils literal"><span class="pre">limit</span></tt>, and keyword arguments of <tt class="docutils literal"><span class="pre">find_all()</span></tt> are supported by most of the other methods.
+The <tt class="docutils literal"><span class="pre">recursive</span></tt> argument, however, is supported only by <tt class="docutils literal"><span class="pre">find_all()</span></tt> and <tt class="docutils literal"><span class="pre">find()</span></tt>.
+Passing <tt class="docutils literal"><span class="pre">recursive=False</span></tt> into a method like <tt class="docutils literal"><span class="pre">find_parents()</span></tt> would not be useful.</p>
+</div>
+<div class="section" id="id36">
+<h3>Shortcuts<a class="headerlink" href="#id36" title="Permalink to this headline">¶</a></h3>
+<p>Because <tt class="docutils literal"><span class="pre">find_all()</span></tt> is the most popular method in the Beautiful Soup search API, there is a shortcut for it.
+If you treat a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object or a <tt class="docutils literal"><span class="pre">Tag</span></tt> object as though it were a function, it is the same as calling <tt class="docutils literal"><span class="pre">find_all()</span></tt> on that object.
+These two lines of code are equivalent:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">)</span>
+<span class="n">soup</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>These two lines are also equivalent:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
+</pre></div>
+</div>
+</div>
+</div>
+<div class="section" id="find">
+<h2><tt class="docutils literal"><span class="pre">find()</span></tt><a class="headerlink" href="#find" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find(<a class="reference internal" href="#name"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#recursive"><em>recursive</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p><tt class="docutils literal"><span class="pre">find_all()</span></tt> scans the entire document looking for results,
+but sometimes you only want one result.
+If you know a document has only one &lt;body&gt; tag, scanning the whole document for more is a waste of time.
+Rather than passing <tt class="docutils literal"><span class="pre">limit=1</span></tt> every time you call <tt class="docutils literal"><span class="pre">find_all()</span></tt>, you can use the <tt class="docutils literal"><span class="pre">find()</span></tt> method.
+These two lines of code are nearly equivalent:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&#39;title&#39;</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&#39;title&#39;</span><span class="p">)</span>
+<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
+</pre></div>
+</div>
+<p>The only difference is that <tt class="docutils literal"><span class="pre">find_all()</span></tt> returns a list containing the single result, while <tt class="docutils literal"><span class="pre">find()</span></tt> returns the result itself.</p>
+<p>If <tt class="docutils literal"><span class="pre">find_all()</span></tt> finds nothing, it returns an empty list.
+If <tt class="docutils literal"><span class="pre">find()</span></tt> finds nothing, it returns <tt class="docutils literal"><span class="pre">None</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;nosuchtag&quot;</span><span class="p">))</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
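+<p>Because <tt class="docutils literal"><span class="pre">find()</span></tt> may return <tt class="docutils literal"><span class="pre">None</span></tt>, code that goes on to use the result usually checks for it first. A minimal sketch, assuming the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>; the helper <tt class="docutils literal"><span class="pre">text_of_first</span></tt> is a hypothetical name, not part of Beautiful Soup:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='title'><b>The Dormouse's story</b></p>", "html.parser")


def text_of_first(soup, tag_name, default=""):
    """Return the text of the first matching tag, or `default` if there is none."""
    tag = soup.find(tag_name)
    if tag is None:
        # find() found nothing; calling a method on None would raise AttributeError.
        return default
    return tag.get_text()


found = text_of_first(soup, "b")
missing = text_of_first(soup, "nosuchtag", default="(not found)")
```

+<p>With <tt class="docutils literal"><span class="pre">find_all()</span></tt> the same check is unnecessary, since looping over an empty list simply does nothing.</p>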
+<p>Remember the <tt class="docutils literal"><span class="pre">soup.head.title</span></tt> trick from <a class="reference internal" href="#id18">Navigating using tag names</a>? That trick works by repeatedly calling <tt class="docutils literal"><span class="pre">find()</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">head</span><span class="o">.</span><span class="n">title</span>
+<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;head&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">)</span>
+<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="find-parents-find-parent">
+<h2><tt class="docutils literal"><span class="pre">find_parents()</span></tt> / <tt class="docutils literal"><span class="pre">find_parent()</span></tt><a class="headerlink" href="#find-parents-find-parent" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find_parents(<a class="reference internal" href="#name"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#limit"><em>limit</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>Signature: find_parent(<a class="reference internal" href="#name"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>So far we have covered <tt class="docutils literal"><span class="pre">find_all()</span></tt> and <tt class="docutils literal"><span class="pre">find()</span></tt>.
+The Beautiful Soup API defines ten other methods for searching the tree,
+but don't be afraid.
+Five of them are basically the same as <tt class="docutils literal"><span class="pre">find_all()</span></tt>,
+and the other five are basically the same as <tt class="docutils literal"><span class="pre">find()</span></tt>.
+The only differences are in which parts of the tree they search.</p>
+<p>First, consider <tt class="docutils literal"><span class="pre">find_parents()</span></tt> and <tt class="docutils literal"><span class="pre">find_parent()</span></tt>.
+Remember that <tt class="docutils literal"><span class="pre">find_all()</span></tt> and <tt class="docutils literal"><span class="pre">find()</span></tt> work their way down the tree, looking at a tag's descendants.
+<tt class="docutils literal"><span class="pre">find_parents()</span></tt> and <tt class="docutils literal"><span class="pre">find_parent()</span></tt> do the opposite:
+they work their way &#8216;up&#8217; the tree, looking at the parents of a tag or a string.
+Try them out in the &#8220;three sisters&#8221; document, starting from a string buried deep inside it:</p>
+<div class="highlight-python"><pre>a_string = soup.find(text="Lacie")
+a_string
+# u'Lacie'
+
+a_string.find_parents("a")
+# [&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;]
+
+a_string.find_parent("p")
+# &lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were
+# &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,
+# &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt; and
+# &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;;
+# and they lived at the bottom of a well.&lt;/p&gt;
+
+a_string.find_parents("p", class_="title")
+# []</pre>
+</div>
+<p>One of the three &lt;a&gt; tags is the direct parent of the string in question, so the search finds it.
+One of the three &lt;p&gt; tags is an indirect parent of the string, and the search finds that as well.
+There is a &lt;p&gt; tag with the CSS class &#8220;title&#8221; somewhere in the &#8220;three sisters&#8221; document, but it is not one of this string's parents, so <tt class="docutils literal"><span class="pre">find_parents()</span></tt> cannot find it.</p>
+<p>You may have noticed a connection between <tt class="docutils literal"><span class="pre">find_parent()</span></tt> and <tt class="docutils literal"><span class="pre">find_parents()</span></tt>
+and the <a class="reference internal" href="#parent">.parent</a> and <a class="reference internal" href="#parents">.parents</a> attributes covered earlier.
+The connection is very strong.
+These search methods actually use <tt class="docutils literal"><span class="pre">.parents</span></tt> to iterate over all the parents,
+checking each one against the provided filter to see whether it matches.</p>
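+<p>The relationship with <tt class="docutils literal"><span class="pre">.parents</span></tt> can be sketched by hand. The function below is an illustration only, not the library's actual implementation: it handles nothing but a tag-name filter, the name <tt class="docutils literal"><span class="pre">manual_find_parents</span></tt> is hypothetical, and a shortened document with the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> is assumed.</p>

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the example document (assumption).
html_doc = '<p class="story">Once upon a time <a class="sister" id="link2">Lacie</a> lived in a well.</p>'
soup = BeautifulSoup(html_doc, "html.parser")

# The string inside the <a> tag is the starting point of the upward search.
a_string = soup.a.string


def manual_find_parents(element, name):
    """Walk up through .parents and keep the ancestors whose tag name matches."""
    return [parent for parent in element.parents if parent.name == name]


by_method = a_string.find_parents("a")
by_loop = manual_find_parents(a_string, "a")
paragraphs = manual_find_parents(a_string, "p")
```

+<p>Both approaches visit the same chain of ancestors, nearest first, so <tt class="docutils literal"><span class="pre">by_method</span></tt> and <tt class="docutils literal"><span class="pre">by_loop</span></tt> hold the same single tag.</p>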
+</div>
+<div class="section" id="find-next-siblings-find-next-sibling">
+<h2><tt class="docutils literal"><span class="pre">find_next_siblings()</span></tt> / <tt class="docutils literal"><span class="pre">find_next_sibling()</span></tt><a class="headerlink" href="#find-next-siblings-find-next-sibling" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find_next_siblings(<a class="reference internal" href="#name"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#limit"><em>limit</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>Signature: find_next_sibling(<a class="reference internal" href="#name"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>These methods use <a class="reference internal" href="#sibling-generators"><em>.next_siblings</em></a> to iterate over the rest of an element's siblings in the tree.
+<tt class="docutils literal"><span class="pre">find_next_siblings()</span></tt> returns all the siblings that match, and <tt class="docutils literal"><span class="pre">find_next_sibling()</span></tt> returns only the first one:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">first_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="n">first_link</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
+
+<span class="n">first_link</span><span class="o">.</span><span class="n">find_next_siblings</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">first_story_paragraph</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">,</span> <span class="s">&quot;story&quot;</span><span class="p">)</span>
+<span class="n">first_story_paragraph</span><span class="o">.</span><span class="n">find_next_sibling</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">)</span>
+<span class="c"># &lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="find-previous-siblings-find-previous-sibling">
+<h2><tt class="docutils literal"><span class="pre">find_previous_siblings()</span></tt> / <tt class="docutils literal"><span class="pre">find_previous_sibling()</span></tt><a class="headerlink" href="#find-previous-siblings-find-previous-sibling" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find_previous_siblings(<a class="reference internal" href="#name"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#limit"><em>limit</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>Signature: find_previous_sibling(<a class="reference internal" href="#name"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>These methods use <a class="reference internal" href="#sibling-generators"><em>.previous_siblings</em></a> to iterate over the siblings that precede an element in the tree.
+<tt class="docutils literal"><span class="pre">find_previous_siblings()</span></tt> returns all the siblings that match, and <tt class="docutils literal"><span class="pre">find_previous_sibling()</span></tt> returns only the first one:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">last_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="nb">id</span><span class="o">=</span><span class="s">&quot;link3&quot;</span><span class="p">)</span>
+<span class="n">last_link</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;</span>
+
+<span class="n">last_link</span><span class="o">.</span><span class="n">find_previous_siblings</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
+
+<span class="n">first_story_paragraph</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">,</span> <span class="s">&quot;story&quot;</span><span class="p">)</span>
+<span class="n">first_story_paragraph</span><span class="o">.</span><span class="n">find_previous_sibling</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">)</span>
+<span class="c"># &lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="find-all-next-find-next">
+<h2><tt class="docutils literal"><span class="pre">find_all_next()</span></tt> / <tt class="docutils literal"><span class="pre">find_next()</span></tt><a class="headerlink" href="#find-all-next-find-next" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find_all_next(<a class="reference internal" href="#name"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#limit"><em>limit</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>Signature: find_next(<a class="reference internal" href="#name"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>These methods use <tt class="docutils literal"><span class="pre">.next_elements</span></tt> to iterate over whatever tags and strings come after an element in the document.
+<tt class="docutils literal"><span class="pre">find_all_next()</span></tt> returns all matches, and <tt class="docutils literal"><span class="pre">find_next()</span></tt> returns only the first match:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">first_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="n">first_link</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
+
+<span class="n">first_link</span><span class="o">.</span><span class="n">find_all_next</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
+<span class="c"># [u&#39;Elsie&#39;, u&#39;,\n&#39;, u&#39;Lacie&#39;, u&#39; and\n&#39;, u&#39;Tillie&#39;,</span>
+<span class="c"># u&#39;;\nand they lived at the bottom of a well.&#39;, u&#39;\n\n&#39;, u&#39;...&#39;, u&#39;\n&#39;]</span>
+
+<span class="n">first_link</span><span class="o">.</span><span class="n">find_next</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">)</span>
+<span class="c"># &lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
+</pre></div>
+</div>
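+<p>In the first example, the string &#8220;Elsie&#8221; showed up, even though it was contained within the &lt;a&gt; tag we started from.
+In the second example, the last &lt;p&gt; tag in the document showed up, even though it is not in the same part of the tree as the &lt;a&gt; tag we started from.
+For these methods, all that matters is that an element matches the filter and appears later in the document than the starting element.</p>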
+</div>
+<div class="section" id="find-all-previous-find-previous">
+<h2><tt class="docutils literal"><span class="pre">find_all_previous()</span></tt> / <tt class="docutils literal"><span class="pre">find_previous()</span></tt><a class="headerlink" href="#find-all-previous-find-previous" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find_all_previous(<a class="reference internal" href="#name"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#limit"><em>limit</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>Signature: find_previous(<a class="reference internal" href="#name"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>These methods use <a class="reference internal" href="#element-generators"><em>.previous_elements</em></a> to iterate over the tags and strings that come before an element in the document:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">first_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="n">first_link</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
+
+<span class="n">first_link</span><span class="o">.</span><span class="n">find_all_previous</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; ...&lt;/p&gt;,</span>
+<span class="c"># &lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;]</span>
+
+<span class="n">first_link</span><span class="o">.</span><span class="n">find_previous</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">)</span>
+<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
+</pre></div>
+</div>
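+<p>The call to <tt class="docutils literal"><span class="pre">find_all_previous(&quot;p&quot;)</span></tt> found the first paragraph of the &#8220;three sisters&#8221; document (the one with class=&#8220;title&#8221;),
+but it also found the second paragraph, the &lt;p&gt; tag that contains the &lt;a&gt; tag we started from.
+This should not be too surprising:
+we are looking at all the tags that appear earlier in the document than the starting tag. A &lt;p&gt; tag that contains an &lt;a&gt; tag must have started before that &lt;a&gt; tag.</p>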
+</div>
+<div class="section" id="id37">
+<h2>CSS selectors<a class="headerlink" href="#id37" title="Permalink to this headline">¶</a></h2>
+<p>Beautiful Soup supports most of the commonly used CSS selectors.
+Just pass a string into the <tt class="docutils literal"><span class="pre">.select()</span></tt> method of a <tt class="docutils literal"><span class="pre">Tag</span></tt> object or the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself.</p>
+<p>You can find tags by name:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;p:nth-of-type(3)&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;]</span>
+</pre></div>
+</div>
+<p>Find tags beneath other tags:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;body a&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;html head title&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
+</pre></div>
+</div>
+<p>Find tags directly beneath other tags:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;head &gt; title&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;p &gt; a&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;p &gt; a:nth-of-type(2)&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;p &gt; #link1&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;body &gt; a&quot;</span><span class="p">)</span>
+<span class="c"># []</span>
+</pre></div>
+</div>
+<p>Find the siblings of tags:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;#link1 ~ .sister&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;#link1 + .sister&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>Find tags by CSS class:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;.sister&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;[class~=sister]&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>Find tags by ID:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;#link1&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;a#link2&quot;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>Test for the existence of an attribute:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&#39;a[href]&#39;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>Find tags by attribute value:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&#39;a[href=&quot;http://example.com/elsie&quot;]&#39;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&#39;a[href^=&quot;http://example.com/&quot;]&#39;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&#39;a[href$=&quot;tillie&quot;]&#39;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&#39;a[href*=&quot;.com/el&quot;]&#39;</span><span class="p">)</span>
+<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>Match language codes:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">multilingual_markup</span> <span class="o">=</span> <span class="s">&quot;&quot;&quot;</span>
+<span class="s"> &lt;p lang=&quot;en&quot;&gt;Hello&lt;/p&gt;</span>
+<span class="s"> &lt;p lang=&quot;en-us&quot;&gt;Howdy, y&#39;all&lt;/p&gt;</span>
+<span class="s"> &lt;p lang=&quot;en-gb&quot;&gt;Pip-pip, old fruit&lt;/p&gt;</span>
+<span class="s"> &lt;p lang=&quot;fr&quot;&gt;Bonjour mes amis&lt;/p&gt;</span>
+<span class="s">&quot;&quot;&quot;</span>
+<span class="n">multilingual_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">multilingual_markup</span><span class="p">)</span>
+<span class="n">multilingual_soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&#39;p[lang|=en]&#39;</span><span class="p">)</span>
+<span class="c"># [&lt;p lang=&quot;en&quot;&gt;Hello&lt;/p&gt;,</span>
+<span class="c"># &lt;p lang=&quot;en-us&quot;&gt;Howdy, y&#39;all&lt;/p&gt;,</span>
+<span class="c"># &lt;p lang=&quot;en-gb&quot;&gt;Pip-pip, old fruit&lt;/p&gt;]</span>
+</pre></div>
+</div>
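+<p>The attribute operators in the selectors above follow simple string rules. The following is a minimal sketch of those rules in plain Python, not Beautiful Soup's actual implementation; the helper name <tt class="docutils literal"><span class="pre">attr_matches</span></tt> is hypothetical and exists only for illustration:</p>

```python
def attr_matches(value, op, expected):
    """Return True if an attribute value satisfies a CSS attribute operator.

    Hypothetical helper for illustration only; it is not part of bs4.
    """
    if value is None:
        # The attribute is absent, so no operator can match.
        return False
    if op == "=":
        return value == expected
    if op == "^=":
        # Prefix match; an empty expected string matches nothing in CSS.
        return bool(expected) and value.startswith(expected)
    if op == "$=":
        # Suffix match.
        return bool(expected) and value.endswith(expected)
    if op == "*=":
        # Substring match.
        return bool(expected) and expected in value
    if op == "~=":
        # Match one word in a whitespace-separated list (e.g. class names).
        return expected in value.split()
    if op == "|=":
        # Exact match, or the expected value followed by a hyphen (language codes).
        return value == expected or value.startswith(expected + "-")
    raise ValueError("unsupported operator: %r" % (op,))


langs = ["en", "en-us", "en-gb", "fr"]
matched = [lang for lang in langs if attr_matches(lang, "|=", "en")]
print(matched)
# ['en', 'en-us', 'en-gb']
```

+<p>Note that <tt class="docutils literal"><span class="pre">|=</span></tt> is not a plain prefix test: under this rule a value such as <tt class="docutils literal"><span class="pre">english</span></tt> does not match <tt class="docutils literal"><span class="pre">en</span></tt>.</p>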
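+<p>This is a convenience for users who already know CSS selector syntax.
+Everything shown here can also be done with the rest of the Beautiful Soup API.
+If CSS selectors are all you need, consider using lxml directly:
+it is much faster and supports more CSS selectors.
+The advantage of this method is that it lets you combine simple CSS selectors with the Beautiful Soup API.</p>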
+</div>
+</div>
+<div class="section" id="id38">
+<h1>Modifying the tree<a class="headerlink" href="#id38" title="Permalink to this headline">¶</a></h1>
+<p>Beautiful Soup's main strength is searching the parse tree.
+It can also modify the tree and write the changed tree out as a new HTML or XML document.</p>
+<div class="section" id="id39">
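+<h2>Changing tag names and attributes<a class="headerlink" href="#id39" title="Permalink to this headline">¶</a></h2>
+<p>As described in the <a class="reference internal" href="#id14">Attributes</a> section, you can rename a tag, change the values of its attributes, add new attributes, and delete attributes:</p>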
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;b class=&quot;boldest&quot;&gt;Extremely bold&lt;/b&gt;&#39;</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
+
+<span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">&quot;blockquote&quot;</span>
+<span class="n">tag</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#39;verybold&#39;</span>
+<span class="n">tag</span><span class="p">[</span><span class="s">&#39;id&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
+<span class="n">tag</span>
+<span class="c"># &lt;blockquote class=&quot;verybold&quot; id=&quot;1&quot;&gt;Extremely bold&lt;/blockquote&gt;</span>
+
+<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
+<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">&#39;id&#39;</span><span class="p">]</span>
+<span class="n">tag</span>
+<span class="c"># &lt;blockquote&gt;Extremely bold&lt;/blockquote&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="id40">
+<h2>Modifying <tt class="docutils literal"><span class="pre">.string</span></tt><a class="headerlink" href="#id40" title="Permalink to this headline">¶</a></h2>
+<p>If you set the <tt class="docutils literal"><span class="pre">.string</span></tt> attribute of a <tt class="docutils literal"><span class="pre">Tag</span></tt> object, the tag's contents are replaced with the string you give:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">&quot;New link text.&quot;</span>
+<span class="n">tag</span>
+<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;New link text.&lt;/a&gt;</span>
+</pre></div>
+</div>
+<p>Be careful: if the tag contained other tags, they and all their contents are destroyed.</p>
+</div>
+<div class="section" id="append">
+<h2><tt class="docutils literal"><span class="pre">append()</span></tt><a class="headerlink" href="#append" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">Tag.append()</span></tt> adds to a tag's contents.
+It works just like calling <tt class="docutils literal"><span class="pre">.append()</span></tt> on a Python list:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;Foo&lt;/a&gt;&quot;</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">&quot;Bar&quot;</span><span class="p">)</span>
+
+<span class="n">soup</span>
+<span class="c"># &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;a&gt;FooBar&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># [u&#39;Foo&#39;, u&#39;Bar&#39;]</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="beautifulsoup-new-string-new-tag">
+<h2><tt class="docutils literal"><span class="pre">BeautifulSoup.new_string()</span></tt> / <tt class="docutils literal"><span class="pre">.new_tag()</span></tt><a class="headerlink" href="#beautifulsoup-new-string-new-tag" title="Permalink to this headline">¶</a></h2>
+<p>To add a string to a document, pass a Python string to <tt class="docutils literal"><span class="pre">append()</span></tt>,
+or call the <a class="reference external" href="http://ja.wikipedia.org/wiki/Factory_Method_%E3%83%91%E3%82%BF%E3%83%BC%E3%83%B3">factory method</a> <tt class="docutils literal"><span class="pre">BeautifulSoup.new_string()</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;b&gt;&lt;/b&gt;&quot;</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">&quot;Hello&quot;</span><span class="p">)</span>
+<span class="n">new_string</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_string</span><span class="p">(</span><span class="s">&quot; there&quot;</span><span class="p">)</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_string</span><span class="p">)</span>
+<span class="n">tag</span>
+<span class="c"># &lt;b&gt;Hello there&lt;/b&gt;</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># [u&#39;Hello&#39;, u&#39; there&#39;]</span>
+</pre></div>
+</div>
+<p>To create a comment or some other subclass of <tt class="docutils literal"><span class="pre">NavigableString</span></tt>, pass that class as the second argument to <tt class="docutils literal"><span class="pre">new_string()</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">Comment</span>
+<span class="n">new_comment</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_string</span><span class="p">(</span><span class="s">&quot;Nice to see you.&quot;</span><span class="p">,</span> <span class="n">Comment</span><span class="p">)</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_comment</span><span class="p">)</span>
+<span class="n">tag</span>
+<span class="c"># &lt;b&gt;Hello there&lt;!--Nice to see you.--&gt;&lt;/b&gt;</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># [u&#39;Hello&#39;, u&#39; there&#39;, u&#39;Nice to see you.&#39;]</span>
+</pre></div>
+</div>
+<p>(This is a new feature in Beautiful Soup 4.2.1.)</p>
+<p>To create an entirely new tag, call the factory method <tt class="docutils literal"><span class="pre">BeautifulSoup.new_tag()</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;b&gt;&lt;/b&gt;&quot;</span><span class="p">)</span>
+<span class="n">original_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
+
+<span class="n">new_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="n">href</span><span class="o">=</span><span class="s">&quot;http://www.example.com&quot;</span><span class="p">)</span>
+<span class="n">original_tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_tag</span><span class="p">)</span>
+<span class="n">original_tag</span>
+<span class="c"># &lt;b&gt;&lt;a href=&quot;http://www.example.com&quot;&gt;&lt;/a&gt;&lt;/b&gt;</span>
+
+<span class="n">new_tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">&quot;Link text.&quot;</span>
+<span class="n">original_tag</span>
+<span class="c"># &lt;b&gt;&lt;a href=&quot;http://www.example.com&quot;&gt;Link text.&lt;/a&gt;&lt;/b&gt;</span>
+</pre></div>
+</div>
+<p>Only the first argument, the tag name, is required.</p>
+</div>
+<div class="section" id="insert">
+<h2><tt class="docutils literal"><span class="pre">insert()</span></tt><a class="headerlink" href="#insert" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">Tag.insert()</span></tt> is similar to <tt class="docutils literal"><span class="pre">Tag.append()</span></tt>.
+The difference is that the new element does not have to go at the end of the tag's <tt class="docutils literal"><span class="pre">.contents</span></tt>; it is inserted at the numeric position you specify:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+
+<span class="n">tag</span><span class="o">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">&quot;but did not endorse &quot;</span><span class="p">)</span>
+<span class="n">tag</span>
+<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;I linked to but did not endorse &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># [u&#39;I linked to &#39;, u&#39;but did not endorse &#39;, &lt;i&gt;example.com&lt;/i&gt;]</span>
+</pre></div>
+</div>
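+<p>The position argument behaves like the index passed to <tt class="docutils literal"><span class="pre">list.insert()</span></tt>. As a rough sketch, a plain Python list can stand in for <tt class="docutils literal"><span class="pre">.contents</span></tt>; this is an assumption for illustration, since the real <tt class="docutils literal"><span class="pre">.contents</span></tt> holds <tt class="docutils literal"><span class="pre">NavigableString</span></tt> and <tt class="docutils literal"><span class="pre">Tag</span></tt> objects rather than plain strings:</p>

```python
# A plain list standing in for tag.contents (illustration only; real
# .contents holds NavigableString and Tag objects, not str).
contents = ["I linked to ", "<i>example.com</i>"]

# Position 1 places the new item between the two existing children.
contents.insert(1, "but did not endorse ")
print(contents)
# ['I linked to ', 'but did not endorse ', '<i>example.com</i>']

# Position 0 places the new item before every existing child.
contents.insert(0, "Note: ")
print(contents[0])
# Note: 
```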
+</div>
+<div class="section" id="insert-before-insert-after">
+<h2><tt class="docutils literal"><span class="pre">insert_before()</span></tt> / <tt class="docutils literal"><span class="pre">insert_after()</span></tt><a class="headerlink" href="#insert-before-insert-after" title="Permalink to this headline">¶</a></h2>
+<p>The <tt class="docutils literal"><span class="pre">insert_before()</span></tt> method inserts a tag or string immediately before another tag or string in the parse tree:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;b&gt;stop&lt;/b&gt;&quot;</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">&quot;i&quot;</span><span class="p">)</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">&quot;Don&#39;t&quot;</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">insert_before</span><span class="p">(</span><span class="n">tag</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">b</span>
+<span class="c"># &lt;b&gt;&lt;i&gt;Don&#39;t&lt;/i&gt;stop&lt;/b&gt;</span>
+</pre></div>
+</div>
+<p>The <tt class="docutils literal"><span class="pre">insert_after()</span></tt> method inserts a tag or string immediately after another tag or string in the parse tree:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">insert_after</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_string</span><span class="p">(</span><span class="s">&quot; ever &quot;</span><span class="p">))</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">b</span>
+<span class="c"># &lt;b&gt;&lt;i&gt;Don&#39;t&lt;/i&gt; ever stop&lt;/b&gt;</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># [&lt;i&gt;Don&#39;t&lt;/i&gt;, u&#39; ever &#39;, u&#39;stop&#39;]</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="clear">
+<h2><tt class="docutils literal"><span class="pre">clear()</span></tt><a class="headerlink" href="#clear" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">Tag.clear()</span></tt> removes the contents of a tag:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+
+<span class="n">tag</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
+<span class="n">tag</span>
+<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;&lt;/a&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="extract">
+<h2><tt class="docutils literal"><span class="pre">extract()</span></tt><a class="headerlink" href="#extract" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">PageElement.extract()</span></tt> removes a tag or string from the tree.
+It returns the tag or string that was extracted:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+
+<span class="n">i_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
+
+<span class="n">a_tag</span>
+<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;/a&gt;</span>
+
+<span class="n">i_tag</span>
+<span class="c"># &lt;i&gt;example.com&lt;/i&gt;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">i_tag</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
+<p>At this point you effectively have two parse trees: one rooted at the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call <tt class="docutils literal"><span class="pre">extract</span></tt> on a child of the element you extracted:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">my_string</span> <span class="o">=</span> <span class="n">i_tag</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
+<span class="n">my_string</span>
+<span class="c"># u&#39;example.com&#39;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">my_string</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
+<span class="c"># None</span>
+<span class="n">i_tag</span>
+<span class="c"># &lt;i&gt;&lt;/i&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="decompose">
+<h2><tt class="docutils literal"><span class="pre">decompose()</span></tt><a class="headerlink" href="#decompose" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">Tag.decompose()</span></tt> removes a tag from the tree, then
+<strong>completely destroys the tag and its contents</strong>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">decompose</span><span class="p">()</span>
+
+<span class="n">a_tag</span>
+<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;/a&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="replace-with">
+<span id="id41"></span><h2><tt class="docutils literal"><span class="pre">replace_with()</span></tt><a class="headerlink" href="#replace-with" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">PageElement.replace_with()</span></tt> removes a tag or string from the tree
+and replaces it with the tag or string passed as the argument:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+
+<span class="n">new_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">&quot;b&quot;</span><span class="p">)</span>
+<span class="n">new_tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">&quot;example.net&quot;</span>
+<span class="n">a_tag</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="n">new_tag</span><span class="p">)</span>
+
+<span class="n">a_tag</span>
+<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;b&gt;example.net&lt;/b&gt;&lt;/a&gt;</span>
+</pre></div>
+</div>
+<p><tt class="docutils literal"><span class="pre">replace_with()</span></tt> returns the tag or string that was replaced,
+so you can examine it or add it back to another part of the tree.</p>
+</div>
+<div class="section" id="wrap">
+<h2><tt class="docutils literal"><span class="pre">wrap()</span></tt><a class="headerlink" href="#wrap" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">PageElement.wrap()</span></tt> wraps an element in the tag you specify.
+It returns the new wrapper:</p>
+<div class="highlight-python"><pre>soup = BeautifulSoup("&lt;p&gt;I wish I was bold.&lt;/p&gt;")
+soup.p.string.wrap(soup.new_tag("b"))
+# &lt;b&gt;I wish I was bold.&lt;/b&gt;
+
+soup.p.wrap(soup.new_tag("div"))
+# &lt;div&gt;&lt;p&gt;&lt;b&gt;I wish I was bold.&lt;/b&gt;&lt;/p&gt;&lt;/div&gt;</pre>
+</div>
+<p>This method is new in Beautiful Soup 4.0.5.</p>
+</div>
+<div class="section" id="unwrap">
+<h2><tt class="docutils literal"><span class="pre">unwrap()</span></tt><a class="headerlink" href="#unwrap" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">Tag.unwrap()</span></tt> is the opposite of <tt class="docutils literal"><span class="pre">wrap()</span></tt>.
+It replaces a tag with whatever is inside that tag.
+It is useful for stripping out markup:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+
+<span class="n">a_tag</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">unwrap</span><span class="p">()</span>
+<span class="n">a_tag</span>
+<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;I linked to example.com&lt;/a&gt;</span>
+</pre></div>
+</div>
+<p>Like <tt class="docutils literal"><span class="pre">replace_with()</span></tt>, <tt class="docutils literal"><span class="pre">unwrap()</span></tt> returns the tag that was replaced.</p>
+</div>
+</div>
+<div class="section" id="id42">
+<h1>Output<a class="headerlink" href="#id42" title="Permalink to this headline">¶</a></h1>
+<span class="target" id="prettyprinting"></span><div class="section" id="id43">
+<h2>Pretty-printing<a class="headerlink" href="#id43" title="Permalink to this headline">¶</a></h2>
+<p>The <tt class="docutils literal"><span class="pre">prettify()</span></tt> method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with each tag on its own line:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">()</span>
+<span class="c"># &#39;&lt;html&gt;\n &lt;head&gt;\n &lt;/head&gt;\n &lt;body&gt;\n  &lt;a href=&quot;http://example.com/&quot;&gt;\n...&#39;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c">#  &lt;head&gt;</span>
+<span class="c">#  &lt;/head&gt;</span>
+<span class="c">#  &lt;body&gt;</span>
+<span class="c">#   &lt;a href=&quot;http://example.com/&quot;&gt;</span>
+<span class="c">#    I linked to</span>
+<span class="c">#    &lt;i&gt;</span>
+<span class="c">#     example.com</span>
+<span class="c">#    &lt;/i&gt;</span>
+<span class="c">#   &lt;/a&gt;</span>
+<span class="c">#  &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>You can call <tt class="docutils literal"><span class="pre">prettify()</span></tt> on the top-level <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object or on any of its <tt class="docutils literal"><span class="pre">Tag</span></tt> objects:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;</span>
+<span class="c">#  I linked to</span>
+<span class="c">#  &lt;i&gt;</span>
+<span class="c">#   example.com</span>
+<span class="c">#  &lt;/i&gt;</span>
+<span class="c"># &lt;/a&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="id44">
+<h2>Non-pretty printing<a class="headerlink" href="#id44" title="Permalink to this headline">¶</a></h2>
+<p>If you just want a string with no special formatting, call <tt class="docutils literal"><span class="pre">unicode()</span></tt> or <tt class="docutils literal"><span class="pre">str()</span></tt> on a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object or on a <tt class="docutils literal"><span class="pre">Tag</span></tt> within it:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
+<span class="c"># &#39;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;&#39;</span>
+
+<span class="nb">unicode</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="p">)</span>
+<span class="c"># u&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
+</pre></div>
+</div>
+<p>The <tt class="docutils literal"><span class="pre">str()</span></tt> function returns a string encoded in UTF-8.
+See <a class="reference internal" href="#id48">Encodings</a> for other options.</p>
+<p>You can also call <tt class="docutils literal"><span class="pre">encode()</span></tt> to get a bytestring,
+and <tt class="docutils literal"><span class="pre">decode()</span></tt> to get Unicode.</p>
+<span class="target" id="output-formatters"></span></div>
+<div class="section" id="id45">
+<h2>Output formatters<a class="headerlink" href="#id45" title="Permalink to this headline">¶</a></h2>
+<p>If you give Beautiful Soup a document that contains HTML entities like &#8220;&amp;ldquo;&#8221;, they will be converted to Unicode characters:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&amp;ldquo;Dammit!&amp;rdquo; he said.&quot;</span><span class="p">)</span>
+<span class="nb">unicode</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
+<span class="c"># u&#39;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;\u201cDammit!\u201d he said.&lt;/body&gt;&lt;/html&gt;&#39;</span>
+</pre></div>
+</div>
+<p>If you then convert the document to a string, the Unicode characters will be encoded as UTF-8.
+You won&#8217;t get the HTML entities back:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
+<span class="c"># &#39;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;\xe2\x80\x9cDammit!\xe2\x80\x9d he said.&lt;/body&gt;&lt;/html&gt;&#39;</span>
+</pre></div>
+</div>
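+<p>By default, the only characters that are escaped upon output are bare ampersands and angle brackets.
+These get turned into &#8220;&amp;amp;&#8221;, &#8220;&amp;lt;&#8221;, and &#8220;&amp;gt;&#8221;,
+so that Beautiful Soup doesn&#8217;t inadvertently generate invalid HTML or XML:</p>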
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;p&gt;The law firm of Dewey, Cheatem, &amp; Howe&lt;/p&gt;&quot;</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">p</span>
+<span class="c"># &lt;p&gt;The law firm of Dewey, Cheatem, &amp;amp; Howe&lt;/p&gt;</span>
+
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;a href=&quot;http://example.com/?foo=val1&amp;bar=val2&quot;&gt;A link&lt;/a&gt;&#39;</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="c"># &lt;a href=&quot;http://example.com/?foo=val1&amp;amp;bar=val2&quot;&gt;A link&lt;/a&gt;</span>
+</pre></div>
+</div>
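+<p>As a rough illustration of the default escaping described above, the following Python 3 sketch uses only the standard library, not Beautiful Soup itself. <tt class="docutils literal"><span class="pre">html.escape()</span></tt> with <tt class="docutils literal"><span class="pre">quote=False</span></tt> rewrites ampersands and angle brackets in the same spirit. It escapes every ampersand, not only bare ones, so treat it as an approximation and not as the library&#8217;s own implementation. The helper name <tt class="docutils literal"><span class="pre">escape_minimal</span></tt> is made up for this example.</p>

```python
import html


def escape_minimal(text):
    """Escape &, < and > so the text can be embedded in HTML or XML.

    Hypothetical helper for illustration; it is not part of Beautiful Soup.
    """
    return html.escape(text, quote=False)


law_firm = escape_minimal("The law firm of Dewey, Cheatem, & Howe")
comparison = escape_minimal("one < three > two")

print(law_firm)
print(comparison)
```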
+<p>You can change this behavior by providing a value for the <tt class="docutils literal"><span class="pre">formatter</span></tt> argument to <tt class="docutils literal"><span class="pre">prettify()</span></tt>, <tt class="docutils literal"><span class="pre">encode()</span></tt>, or <tt class="docutils literal"><span class="pre">decode()</span></tt>.
+Beautiful Soup recognizes four possible values for <tt class="docutils literal"><span class="pre">formatter</span></tt>.</p>
+<p>The default is <tt class="docutils literal"><span class="pre">formatter=&quot;minimal&quot;</span></tt>.
+Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">french</span> <span class="o">=</span> <span class="s">&quot;&lt;p&gt;Il a dit &amp;lt;&amp;lt;Sacr&amp;eacute; bleu!&amp;gt;&amp;gt;&lt;/p&gt;&quot;</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">french</span><span class="p">)</span>
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="s">&quot;minimal&quot;</span><span class="p">))</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p&gt;</span>
+<span class="c"># Il a dit &amp;lt;&amp;lt;Sacré bleu!&amp;gt;&amp;gt;</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>If you pass in <tt class="docutils literal"><span class="pre">formatter=&quot;html&quot;</span></tt>, Beautiful Soup will convert Unicode characters to HTML entities whenever possible:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="s">&quot;html&quot;</span><span class="p">))</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p&gt;</span>
+<span class="c"># Il a dit &amp;lt;&amp;lt;Sacr&amp;eacute; bleu!&amp;gt;&amp;gt;</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>If you pass in <tt class="docutils literal"><span class="pre">formatter=None</span></tt>, Beautiful Soup will not modify strings at all on output.
+This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML,
+as in these examples:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="bp">None</span><span class="p">))</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p&gt;</span>
+<span class="c"># Il a dit &lt;&lt;Sacré bleu!&gt;&gt;</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+
+<span class="n">link_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;a href=&quot;http://example.com/?foo=val1&amp;bar=val2&quot;&gt;A link&lt;/a&gt;&#39;</span><span class="p">)</span>
+<span class="k">print</span><span class="p">(</span><span class="n">link_soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="bp">None</span><span class="p">))</span>
+<span class="c"># &lt;a href=&quot;http://example.com/?foo=val1&amp;bar=val2&quot;&gt;A link&lt;/a&gt;</span>
+</pre></div>
+</div>
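+<p>Finally, if you pass in a function for <tt class="docutils literal"><span class="pre">formatter</span></tt>, Beautiful Soup will call that function once for every string and attribute value in the document.
+You can do whatever you want in this function.
+Here is a formatter that converts strings to uppercase and does nothing else:</p>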
+<div class="highlight-python"><div class="highlight"><pre><span class="k">def</span> <span class="nf">uppercase</span><span class="p">(</span><span class="nb">str</span><span class="p">):</span>
+ <span class="k">return</span> <span class="nb">str</span><span class="o">.</span><span class="n">upper</span><span class="p">()</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="n">uppercase</span><span class="p">))</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p&gt;</span>
+<span class="c"># IL A DIT &lt;&lt;SACRÉ BLEU!&gt;&gt;</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">link_soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="n">uppercase</span><span class="p">))</span>
+<span class="c"># &lt;a href=&quot;HTTP://EXAMPLE.COM/?FOO=VAL1&amp;BAR=VAL2&quot;&gt;</span>
+<span class="c"># A LINK</span>
+<span class="c"># &lt;/a&gt;</span>
+</pre></div>
+</div>
+<p>If you are writing your own function, you should know about the <tt class="docutils literal"><span class="pre">EntitySubstitution</span></tt> class in the <tt class="docutils literal"><span class="pre">bs4.dammit</span></tt> module.
+This class implements Beautiful Soup&#8217;s standard formatters as class methods:
+the &#8220;html&#8221; formatter is <tt class="docutils literal"><span class="pre">EntitySubstitution.substitute_html</span></tt>,
+and the &#8220;minimal&#8221; formatter is <tt class="docutils literal"><span class="pre">EntitySubstitution.substitute_xml</span></tt>.
+You can use these functions to simulate <tt class="docutils literal"><span class="pre">formatter=&quot;html&quot;</span></tt> or <tt class="docutils literal"><span class="pre">formatter=&quot;minimal&quot;</span></tt>,
+and then do something extra on top.</p>
+<p>Here is an example that replaces Unicode characters with HTML entities whenever possible,
+but <strong>also</strong> converts all strings to uppercase:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4.dammit</span> <span class="kn">import</span> <span class="n">EntitySubstitution</span>
+<span class="k">def</span> <span class="nf">uppercase_and_substitute_html_entities</span><span class="p">(</span><span class="nb">str</span><span class="p">):</span>
+ <span class="k">return</span> <span class="n">EntitySubstitution</span><span class="o">.</span><span class="n">substitute_html</span><span class="p">(</span><span class="nb">str</span><span class="o">.</span><span class="n">upper</span><span class="p">())</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="n">uppercase_and_substitute_html_entities</span><span class="p">))</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p&gt;</span>
+<span class="c"># IL A DIT &amp;lt;&amp;lt;SACR&amp;Eacute; BLEU!&amp;gt;&amp;gt;</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
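+<p>One last caveat: if you create a <tt class="docutils literal"><span class="pre">CData</span></tt> object, the text inside that object is always presented <strong>exactly as it appears, with no formatting</strong>.
+Beautiful Soup will still call the formatter method, just in case you have written a custom method that counts all the strings in the document or something similar, but it will ignore the return value:</p>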
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4.element</span> <span class="kn">import</span> <span class="n">CData</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;&lt;/a&gt;&quot;</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="n">CData</span><span class="p">(</span><span class="s">&quot;one &lt; three&quot;</span><span class="p">)</span>
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="s">&quot;xml&quot;</span><span class="p">))</span>
+<span class="c"># &lt;a&gt;</span>
+<span class="c"># &lt;![CDATA[one &lt; three]]&gt;</span>
+<span class="c"># &lt;/a&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="get-text">
+<h2><tt class="docutils literal"><span class="pre">get_text()</span></tt><a class="headerlink" href="#get-text" title="Permalink to this headline">¶</a></h2>
+<p>If you only want the text part of a document or tag, you can use the <tt class="docutils literal"><span class="pre">get_text()</span></tt> method.
+It returns all the text in a document or beneath a tag, as a single Unicode string:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;</span><span class="se">\n</span><span class="s">I linked to &lt;i&gt;example.com&lt;/i&gt;</span><span class="se">\n</span><span class="s">&lt;/a&gt;&#39;</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span>
+<span class="c"># u&#39;\nI linked to example.com\n&#39;</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span>
+<span class="c"># u&#39;example.com&#39;</span>
+</pre></div>
+</div>
+<p>You can specify a string to be used to join the bits of text together:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">(</span><span class="s">&quot;|&quot;</span><span class="p">)</span>
+<span class="c"># u&#39;\nI linked to |example.com|\n&#39;</span>
+</pre></div>
+</div>
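+<p>You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">(</span><span class="s">&quot;|&quot;</span><span class="p">,</span> <span class="n">strip</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
+<span class="c"># u&#39;I linked to|example.com&#39;</span>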
+</pre></div>
+</div>
+<p>To strip whitespace you can also use the <a class="reference internal" href="#string-generators"><em>stripped_strings</em></a> generator and process the text yourself:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="p">[</span><span class="n">text</span> <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">stripped_strings</span><span class="p">]</span>
+<span class="c"># [u&#39;I linked to&#39;, u&#39;example.com&#39;]</span>
+</pre></div>
+</div>
+</div>
+</div>
+<div class="section" id="id46">
+<h1>Specifying the parser to use<a class="headerlink" href="#id46" title="Permalink to this headline">¶</a></h1>
+<p>If you just need to parse some HTML, you can dump the markup into the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor.
+It will probably work fine.
+Beautiful Soup will pick a parser for you and parse the data.
+But there are a few additional arguments you can pass to the constructor to change which parser is used.</p>
+<p>The first argument to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor is a string or an open filehandle containing the markup you want parsed.
+The second argument is <em>how</em> you would like the markup parsed.</p>
+<p>If you don&#8217;t specify anything, you will get the best HTML parser that is installed.
+Beautiful Soup ranks lxml&#8217;s parser as the best, then html5lib&#8217;s, then Python&#8217;s built-in parser.
+You can override this by specifying one of the following:</p>
+<ul class="simple">
+<li>The type of markup you want to parse. Currently supported are &#8220;html&#8221;, &#8220;xml&#8221;, and &#8220;html5&#8221;.</li>
+<li>The name of the parser library you want to use. Currently supported options are &#8220;lxml&#8221;, &#8220;html5lib&#8221;, and &#8220;html.parser&#8221; (Python&#8217;s built-in HTML parser).</li>
+</ul>
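+<p>The section <a class="reference internal" href="#id10">Installing a parser</a> contrasts the supported parsers.</p>
+<p>If you don&#8217;t have an appropriate parser installed, Beautiful Soup will ignore your request and pick a different parser.
+Right now, the only supported XML parser is lxml.
+If you don&#8217;t have lxml installed, asking for an XML parser won&#8217;t give you one, and asking for &#8220;lxml&#8221; won&#8217;t work either.</p>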
+<div class="section" id="id47">
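+<h2>Differences between parsers<a class="headerlink" href="#id47" title="Permalink to this headline">¶</a></h2>
+<p>Beautiful Soup presents the same interface to a number of different parsers,
+but each parser is different.
+Different parsers will create different parse trees from the same document.
+The biggest differences are between the HTML parsers and the XML parsers.
+Here is a short document, parsed as HTML:</p>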
+<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;&lt;b /&gt;&lt;/a&gt;&quot;</span><span class="p">)</span>
+<span class="c"># &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;a&gt;&lt;b&gt;&lt;/b&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;</span>
+</pre></div>
+</div>
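+<p>Since an empty &lt;b /&gt; tag is not valid HTML, the parser turns it into a &lt;b&gt;&lt;/b&gt; tag pair.</p>
+<p>Here is the same document parsed as XML
+(running this requires that you have lxml installed).
+Note that the empty &lt;b /&gt; tag is left alone, and that the document is given an XML declaration instead of being put into an &lt;html&gt; tag:</p>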
+<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;&lt;b /&gt;&lt;/a&gt;&quot;</span><span class="p">,</span> <span class="s">&quot;xml&quot;</span><span class="p">)</span>
+<span class="c"># &lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;</span>
+<span class="c"># &lt;a&gt;&lt;b/&gt;&lt;/a&gt;</span>
+</pre></div>
+</div>
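+<p>There are also differences between HTML parsers.
+If you give Beautiful Soup a perfectly-formed HTML document, these differences won&#8217;t matter.
+One parser may be faster than another,
+but they will all give you a data structure that looks exactly like the original HTML document.</p>
+<p>But if the document is not perfectly-formed, different parsers will give different results.
+Here is a short, invalid document parsed using lxml&#8217;s HTML parser.
+Note that the dangling &lt;/p&gt; tag is simply ignored:</p>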
+<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;&lt;/p&gt;&quot;</span><span class="p">,</span> <span class="s">&quot;lxml&quot;</span><span class="p">)</span>
+<span class="c"># &lt;html&gt;&lt;body&gt;&lt;a&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>Here is the same document parsed using html5lib:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;&lt;/p&gt;&quot;</span><span class="p">,</span> <span class="s">&quot;html5lib&quot;</span><span class="p">)</span>
+<span class="c"># &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;a&gt;&lt;p&gt;&lt;/p&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>Instead of ignoring the dangling &lt;/p&gt; tag, html5lib pairs it with an opening &lt;p&gt; tag.
+This parser also adds an empty &lt;head&gt; tag to the document.</p>
+<p>Here is the same document parsed with Python&#8217;s built-in HTML parser:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;&lt;/p&gt;&quot;</span><span class="p">,</span> <span class="s">&quot;html.parser&quot;</span><span class="p">)</span>
+<span class="c"># &lt;a&gt;&lt;/a&gt;</span>
+</pre></div>
+</div>
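+<p>Like lxml, this parser ignores the closing &lt;/p&gt; tag.
+Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a &lt;body&gt; tag.
+Unlike lxml, it doesn&#8217;t even bother to add an &lt;html&gt; tag.</p>
+<p>Since the document &#8220;&lt;a&gt;&lt;/p&gt;&#8221; is invalid, none of these techniques is the &#8220;correct&#8221; way to handle it.
+The html5lib parser uses techniques that are part of the HTML5 standard,
+so it has the best claim on being the &#8220;correct&#8221; way, but all three techniques are legitimate.</p>
+<p>Differences between parsers can affect your script.
+If you are planning on distributing your script to other people, or running it on multiple machines, you should specify a parser in the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor.
+That will reduce the chances that your users parse a document differently from the way you parse it.</p>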
+</div>
+</div>
+<div class="section" id="id48">
+<h1>Encodings<a class="headerlink" href="#id48" title="Permalink to this headline">¶</a></h1>
+<p>Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.
+But when you load that document into Beautiful Soup, you will discover it has been converted to Unicode:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&quot;&lt;h1&gt;Sacr</span><span class="se">\xc3\xa9</span><span class="s"> bleu!&lt;/h1&gt;&quot;</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">h1</span>
+<span class="c"># &lt;h1&gt;Sacré bleu!&lt;/h1&gt;</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">h1</span><span class="o">.</span><span class="n">string</span>
+<span class="c"># u&#39;Sacr\xe9 bleu!&#39;</span>
+</pre></div>
+</div>
+<p>It&#8217;s not magic. Beautiful Soup uses a sub-library called <a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> to detect a document&#8217;s encoding and convert it to Unicode.
+The autodetected encoding is available as the <tt class="docutils literal"><span class="pre">.original_encoding</span></tt> attribute of the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">original_encoding</span>
+<span class="c"># &#39;utf-8&#39;</span>
+</pre></div>
+</div>
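+<p>Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes.
+Sometimes it guesses correctly, but only after a byte-by-byte search of the document that takes a very long time.
+If you happen to know a document&#8217;s encoding ahead of time, you can avoid mistakes and delays by passing it to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor as <tt class="docutils literal"><span class="pre">from_encoding</span></tt>.</p>
+<p>Here is a document written in ISO-8859-8 (translator&#8217;s note: an encoding for Hebrew).
+The document is so short that Unicode, Dammit can&#8217;t get a good lock on it, and misidentifies it as ISO-8859-7 (translator&#8217;s note: an encoding for Greek):</p>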
+<div class="highlight-python"><pre>markup = b"&lt;h1&gt;\xed\xe5\xec\xf9&lt;/h1&gt;"
+soup = BeautifulSoup(markup)
+soup.h1
+# &lt;h1&gt;νεμω&lt;/h1&gt;
+soup.original_encoding
+# 'ISO-8859-7'</pre>
+</div>
+<p>We can fix this by passing in the correct <tt class="docutils literal"><span class="pre">from_encoding</span></tt>:</p>
+<div class="highlight-python"><pre>soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
+soup.h1
+# &lt;h1&gt;םולש&lt;/h1&gt;
+soup.original_encoding
+# 'iso8859-8'</pre>
+</div>
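+<p>The misidentification above can be reproduced without Beautiful Soup. This Python 3 sketch, which uses only standard-library codecs, decodes the same four bytes as ISO-8859-7 and as ISO-8859-8. Both decodings succeed, which is why a detector has so little to go on with such a short input. The variable names are chosen for this example only.</p>

```python
import unicodedata

# The same four bytes as in the example above, without the surrounding tags.
raw = b"\xed\xe5\xec\xf9"

# Decoding with the wrong codec still succeeds, but yields Greek letters.
as_greek = raw.decode("iso-8859-7")

# Decoding with the intended codec yields Hebrew letters.
as_hebrew = raw.decode("iso-8859-8")

# Print the Unicode names so the difference is visible on any terminal.
print([unicodedata.name(ch) for ch in as_greek])
print([unicodedata.name(ch) for ch in as_hebrew])
```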
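+<p>In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character &#8220;REPLACEMENT CHARACTER&#8221; (U+FFFD, �).
+If Unicode, Dammit needs to do this, it will set the <tt class="docutils literal"><span class="pre">.contains_replacement_characters</span></tt> attribute to <tt class="docutils literal"><span class="pre">True</span></tt> on the <tt class="docutils literal"><span class="pre">UnicodeDammit</span></tt> or <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object.
+This lets you know that the Unicode representation is not an exact representation of the original, and that some data was lost.
+If a document contains �, but <tt class="docutils literal"><span class="pre">.contains_replacement_characters</span></tt> is <tt class="docutils literal"><span class="pre">False</span></tt>, you know that the � was there originally (as it is in this paragraph) and doesn&#8217;t stand in for missing data.</p>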
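+<p>To show what a replacement character looks like in practice, here is a minimal Python 3 sketch that uses the standard library&#8217;s <tt class="docutils literal"><span class="pre">errors=&quot;replace&quot;</span></tt> decoding mode, not Unicode, Dammit. The <tt class="docutils literal"><span class="pre">data_was_lost</span></tt> check is only a stand-in chosen for this example: unlike <tt class="docutils literal"><span class="pre">.contains_replacement_characters</span></tt>, it cannot tell a substituted U+FFFD from one that was already in the source.</p>

```python
REPLACEMENT_CHARACTER = "\ufffd"

# Latin-1 bytes for "Sacré bleu!"; the lone byte 0xE9 is not valid UTF-8.
latin1_bytes = b"Sacr\xe9 bleu!"

# errors="replace" substitutes U+FFFD for the undecodable byte.
lossy_text = latin1_bytes.decode("utf-8", errors="replace")

# Crude check: True means at least one replacement character is present.
data_was_lost = REPLACEMENT_CHARACTER in lossy_text

print(ascii(lossy_text), data_was_lost)
```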
+<div class="section" id="id49">
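+<h2>Output encoding<a class="headerlink" href="#id49" title="Permalink to this headline">¶</a></h2>
+<p>When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn&#8217;t in UTF-8 to begin with.
+Here is a document written in the Latin-1 encoding:</p>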
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="n">b</span><span class="s">&#39;&#39;&#39;</span>
+<span class="s"> &lt;html&gt;</span>
+<span class="s"> &lt;head&gt;</span>
+<span class="s"> &lt;meta content=&quot;text/html; charset=ISO-Latin-1&quot; http-equiv=&quot;Content-type&quot; /&gt;</span>
+<span class="s"> &lt;/head&gt;</span>
+<span class="s"> &lt;body&gt;</span>
+<span class="s"> &lt;p&gt;Sacr</span><span class="se">\xe9</span><span class="s"> bleu!&lt;/p&gt;</span>
+<span class="s"> &lt;/body&gt;</span>
+<span class="s"> &lt;/html&gt;</span>
+<span class="s">&#39;&#39;&#39;</span>
+
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;head&gt;</span>
+<span class="c"># &lt;meta content=&quot;text/html; charset=utf-8&quot; http-equiv=&quot;Content-type&quot; /&gt;</span>
+<span class="c"># &lt;/head&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p&gt;</span>
+<span class="c"># Sacré bleu!</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>Note that the &lt;meta&gt; tag has been rewritten to reflect the fact that the document is now in UTF-8.</p>
+<p>If you don&#8217;t want UTF-8, you can pass an encoding into <tt class="docutils literal"><span class="pre">prettify()</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="s">&quot;latin-1&quot;</span><span class="p">))</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;head&gt;</span>
+<span class="c"># &lt;meta content=&quot;text/html; charset=latin-1&quot; http-equiv=&quot;Content-type&quot; /&gt;</span>
+<span class="c"># ...</span>
+</pre></div>
+</div>
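+<p>You can also call <tt class="docutils literal"><span class="pre">encode()</span></tt> on the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object, or on any element in the soup, just as if it were a Python string:</p>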
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;latin-1&quot;</span><span class="p">)</span>
+<span class="c"># &#39;&lt;p&gt;Sacr\xe9 bleu!&lt;/p&gt;&#39;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;utf-8&quot;</span><span class="p">)</span>
+<span class="c"># &#39;&lt;p&gt;Sacr\xc3\xa9 bleu!&lt;/p&gt;&#39;</span>
+</pre></div>
+</div>
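+<p>Any characters that can&#8217;t be represented in your chosen encoding will be converted into numeric XML entity references.
+Here is a document that includes the Unicode character SNOWMAN:</p>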
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">u&quot;&lt;b&gt;</span><span class="se">\N{SNOWMAN}</span><span class="s">&lt;/b&gt;&quot;</span>
+<span class="n">snowman_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">snowman_soup</span><span class="o">.</span><span class="n">b</span>
+</pre></div>
+</div>
+<p>The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but there is no representation for that character in ISO-Latin-1 or ASCII, so it is converted into &#8220;&amp;#9731;&#8221; for those encodings:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;utf-8&quot;</span><span class="p">))</span>
+<span class="c"># &lt;b&gt;☃&lt;/b&gt;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;latin-1&quot;</span><span class="p">))</span>
+<span class="c"># &lt;b&gt;&amp;#9731;&lt;/b&gt;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;ascii&quot;</span><span class="p">))</span>
+<span class="c"># &lt;b&gt;&amp;#9731;&lt;/b&gt;</span>
+</pre></div>
+</div>
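+<p>The same kind of fallback is available in plain Python 3 through the standard <tt class="docutils literal"><span class="pre">xmlcharrefreplace</span></tt> error handler. The sketch below is an illustration of numeric character references in general and makes no claim about how Beautiful Soup implements its own conversion.</p>

```python
snowman = "\N{SNOWMAN}"

# UTF-8 can represent the character directly, as three bytes.
utf8_bytes = snowman.encode("utf-8")

# ASCII and Latin-1 cannot, so fall back to a numeric character reference.
ascii_bytes = snowman.encode("ascii", errors="xmlcharrefreplace")
latin1_bytes = snowman.encode("latin-1", errors="xmlcharrefreplace")

print(utf8_bytes, ascii_bytes, latin1_bytes)
```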
+</div>
+<div class="section" id="unicode-dammit">
+<h2>Unicode, Dammit<a class="headerlink" href="#unicode-dammit" title="Permalink to this headline">¶</a></h2>
+<p>You can use Unicode, Dammit without using Beautiful Soup.
+It is useful whenever you have data in an unknown encoding and you just want it to become Unicode:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">UnicodeDammit</span>
+<span class="n">dammit</span> <span class="o">=</span> <span class="n">UnicodeDammit</span><span class="p">(</span><span class="s">&quot;Sacr</span><span class="se">\xc3\xa9</span><span class="s"> bleu!&quot;</span><span class="p">)</span>
+<span class="k">print</span><span class="p">(</span><span class="n">dammit</span><span class="o">.</span><span class="n">unicode_markup</span><span class="p">)</span>
+<span class="c"># Sacré bleu!</span>
+<span class="n">dammit</span><span class="o">.</span><span class="n">original_encoding</span>
+<span class="c"># &#39;utf-8&#39;</span>
+</pre></div>
+</div>
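+<p>Unicode, Dammit&#8217;s guesses get more accurate if you install the <tt class="docutils literal"><span class="pre">chardet</span></tt> or <tt class="docutils literal"><span class="pre">cchardet</span></tt> Python libraries. If you have your own suspicions about what the encoding might be, you can pass them in as a list:</p>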
+<div class="highlight-python"><div class="highlight"><pre><span class="n">dammit</span> <span class="o">=</span> <span class="n">UnicodeDammit</span><span class="p">(</span><span class="s">&quot;Sacr</span><span class="se">\xe9</span><span class="s"> bleu!&quot;</span><span class="p">,</span> <span class="p">[</span><span class="s">&quot;latin-1&quot;</span><span class="p">,</span> <span class="s">&quot;iso-8859-1&quot;</span><span class="p">])</span>
+<span class="k">print</span><span class="p">(</span><span class="n">dammit</span><span class="o">.</span><span class="n">unicode_markup</span><span class="p">)</span>
+<span class="c"># Sacré bleu!</span>
+<span class="n">dammit</span><span class="o">.</span><span class="n">original_encoding</span>
+<span class="c"># &#39;latin-1&#39;</span>
+</pre></div>
+</div>
+<p>Unicode, Dammit has two special features that Beautiful Soup doesn&#8217;t use.</p>
+<div class="section" id="id50">
+<h3>Smart quotes<a class="headerlink" href="#id50" title="Permalink to this headline">¶</a></h3>
+<p>(Translator&#8217;s note: smart quotes are quotation marks, such as &#8217;, that distinguish left from right (opening from closing).
+The quotation marks in ASCII and Shift JIS make no such distinction.
+[ <a class="reference external" href="http://www.nikkeibp.co.jp/article/nba/20100517/226165/">reference link</a> ])</p>
+<p>You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="n">b</span><span class="s">&quot;&lt;p&gt;I just </span><span class="se">\x93</span><span class="s">love</span><span class="se">\x94</span><span class="s"> Microsoft Word</span><span class="se">\x92</span><span class="s">s smart quotes&lt;/p&gt;&quot;</span>
+
+<span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">&quot;windows-1252&quot;</span><span class="p">],</span> <span class="n">smart_quotes_to</span><span class="o">=</span><span class="s">&quot;html&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">unicode_markup</span>
+<span class="c"># u&#39;&lt;p&gt;I just &amp;ldquo;love&amp;rdquo; Microsoft Word&amp;rsquo;s smart quotes&lt;/p&gt;&#39;</span>
+
+<span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">&quot;windows-1252&quot;</span><span class="p">],</span> <span class="n">smart_quotes_to</span><span class="o">=</span><span class="s">&quot;xml&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">unicode_markup</span>
+<span class="c"># u&#39;&lt;p&gt;I just &amp;#x201C;love&amp;#x201D; Microsoft Word&amp;#x2019;s smart quotes&lt;/p&gt;&#39;</span>
+</pre></div>
+</div>
+<p>You can also convert Microsoft smart quotes to ASCII quotes:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">&quot;windows-1252&quot;</span><span class="p">],</span> <span class="n">smart_quotes_to</span><span class="o">=</span><span class="s">&quot;ascii&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">unicode_markup</span>
+<span class="c"># u&#39;&lt;p&gt;I just &quot;love&quot; Microsoft Word\&#39;s smart quotes&lt;/p&gt;&#39;</span>
+</pre></div>
+</div>
+<p>Hopefully you&#8217;ll find this feature useful, but Beautiful Soup doesn&#8217;t use it.
+Beautiful Soup prefers the default behavior, which is to convert Microsoft smart quotes to Unicode characters along with everything else:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">&quot;windows-1252&quot;</span><span class="p">])</span><span class="o">.</span><span class="n">unicode_markup</span>
+<span class="c"># u&#39;&lt;p&gt;I just \u201clove\u201d Microsoft Word\u2019s smart quotes&lt;/p&gt;&#39;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="id52">
+<h3>Inconsistent encodings<a class="headerlink" href="#id52" title="Permalink to this headline">¶</a></h3>
+<p>Sometimes a document is mostly in UTF-8 but contains some Windows-1252 characters, such as Microsoft smart quotes:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">snowmen</span> <span class="o">=</span> <span class="p">(</span><span class="s">u&quot;</span><span class="se">\N{SNOWMAN}</span><span class="s">&quot;</span> <span class="o">*</span> <span class="mi">3</span><span class="p">)</span>
+<span class="n">quote</span> <span class="o">=</span> <span class="p">(</span><span class="s">u&quot;</span><span class="se">\N{LEFT DOUBLE QUOTATION MARK}</span><span class="s">I like snowmen!</span><span class="se">\N{RIGHT DOUBLE QUOTATION MARK}</span><span class="s">&quot;</span><span class="p">)</span>
+<span class="n">doc</span> <span class="o">=</span> <span class="n">snowmen</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;utf8&quot;</span><span class="p">)</span> <span class="o">+</span> <span class="n">quote</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;windows_1252&quot;</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>This document is a mess.
+The snowmen are in UTF-8 and the quotes are in Windows-1252.
+You can display the snowmen or the quotes, but not both:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span>
+<span class="c"># ☃☃☃�I like snowmen!�</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">doc</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s">&quot;windows-1252&quot;</span><span class="p">))</span>
+<span class="c"># â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”</span>
+</pre></div>
+</div>
+<p>Decoding the document as UTF-8 raises a <tt class="docutils literal"><span class="pre">UnicodeDecodeError</span></tt>, and decoding it as Windows-1252 gives you gibberish.
+Fortunately, <tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> converts the byte string to pure UTF-8, so you can decode it to Unicode and display the snowmen and the quotation marks together:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">new_doc</span> <span class="o">=</span> <span class="n">UnicodeDammit</span><span class="o">.</span><span class="n">detwingle</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span>
+<span class="k">print</span><span class="p">(</span><span class="n">new_doc</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s">&quot;utf8&quot;</span><span class="p">))</span>
+<span class="c"># ☃☃☃“I like snowmen!”</span>
+</pre></div>
+</div>
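+<p>The failure mode described above can be reproduced with the standard library alone. The following is a minimal sketch, not part of Beautiful Soup; the names <tt class="docutils literal"><span class="pre">utf8_failed</span></tt> and <tt class="docutils literal"><span class="pre">as_windows_1252</span></tt> are made up for this illustration.</p>

```python
# Build the same mixed-encoding byte string as in the example above.
snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# Strict UTF-8 decoding rejects the Windows-1252 quote bytes (0x93, 0x94).
try:
    doc.decode("utf8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Windows-1252 decoding succeeds, but every snowman becomes three
# unrelated characters; only the quotation marks come out right.
as_windows_1252 = doc.decode("windows-1252")
```

+<p>Neither codec alone recovers the whole text, which is why the bytes need to be normalized to a single encoding before they are decoded.</p>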
+<p><tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> only knows how to handle Windows-1252 embedded in UTF-8 (or vice versa), but this is the most common case.</p>
+<p>Note that you must call <tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> on your data before passing it into <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> or the <tt class="docutils literal"><span class="pre">UnicodeDammit</span></tt> constructor.
+Beautiful Soup assumes that a document has a single encoding, whatever it might be.
+If you pass it a document that contains both UTF-8 and Windows-1252, it is likely to decide that the whole document is Windows-1252, and the output will look like <tt class="docutils literal"><span class="pre">â˜ƒâ˜ƒâ˜ƒ“I</span> <span class="pre">like</span> <span class="pre">snowmen!”</span></tt>.</p>
+<p><tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> is new in Beautiful Soup 4.1.0.</p>
+</div>
+</div>
+</div>
+<div class="section" id="id53">
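+<h1>Parsing only part of a document<a class="headerlink" href="#id53" title="Permalink to this headline">¶</a></h1>
+<p>Let&#8217;s say you want to use Beautiful Soup to look at a document&#8217;s &lt;a&gt; tags. It&#8217;s a waste of time and memory to parse the entire document and then go over it again looking for &lt;a&gt; tags.
+It would be much faster to ignore everything that isn&#8217;t an &lt;a&gt; tag in the first place.
+The <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> class lets you choose which parts of an incoming document are parsed.
+You just create a <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> and pass it in to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor as the <tt class="docutils literal"><span class="pre">parse_only</span></tt> argument.</p>
+<p>(Note that this feature won&#8217;t work if you&#8217;re using the html5lib parser.
+If you use html5lib, the whole document will be parsed, no matter what.
+This is because html5lib constantly rearranges the parse tree as it works,
+and if some part of the document didn&#8217;t actually make it into the parse tree, it will crash.
+To avoid confusion, the examples below force Beautiful Soup to use Python&#8217;s built-in parser.)</p>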
+<div class="section" id="soupstrainer">
+<h2><tt class="docutils literal"><span class="pre">SoupStrainer</span></tt><a class="headerlink" href="#soupstrainer" title="Permalink to this headline">¶</a></h2>
+<p>The <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> class takes the same arguments as a typical method from <a class="reference internal" href="#id25">Searching the tree</a>: <a class="reference internal" href="#name"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, and <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>.
+Here are three <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> objects:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">SoupStrainer</span>
+
+<span class="n">only_a_tags</span> <span class="o">=</span> <span class="n">SoupStrainer</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">)</span>
+
+<span class="n">only_tags_with_id_link2</span> <span class="o">=</span> <span class="n">SoupStrainer</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">&quot;link2&quot;</span><span class="p">)</span>
+
+<span class="k">def</span> <span class="nf">is_short_string</span><span class="p">(</span><span class="n">string</span><span class="p">):</span>
+ <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">string</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">10</span>
+
+<span class="n">only_short_strings</span> <span class="o">=</span> <span class="n">SoupStrainer</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">is_short_string</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>Here is the &#8220;three sisters&#8221; document one more time.
+Let&#8217;s see what it looks like when it is parsed with each of these three <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> objects:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">html_doc</span> <span class="o">=</span> <span class="s">&quot;&quot;&quot;</span>
+<span class="s">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;</span>
+
+<span class="s">&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;</span>
+
+<span class="s">&lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were</span>
+<span class="s">&lt;a href=&quot;http://example.com/elsie&quot; class=&quot;sister&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
+<span class="s">&lt;a href=&quot;http://example.com/lacie&quot; class=&quot;sister&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and</span>
+<span class="s">&lt;a href=&quot;http://example.com/tillie&quot; class=&quot;sister&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;</span>
+<span class="s">and they lived at the bottom of a well.&lt;/p&gt;</span>
+
+<span class="s">&lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
+<span class="s">&quot;&quot;&quot;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">,</span> <span class="s">&quot;html.parser&quot;</span><span class="p">,</span> <span class="n">parse_only</span><span class="o">=</span><span class="n">only_a_tags</span><span class="p">)</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;</span>
+<span class="c"># Elsie</span>
+<span class="c"># &lt;/a&gt;</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;</span>
+<span class="c"># Lacie</span>
+<span class="c"># &lt;/a&gt;</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;</span>
+<span class="c"># Tillie</span>
+<span class="c"># &lt;/a&gt;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">,</span> <span class="s">&quot;html.parser&quot;</span><span class="p">,</span> <span class="n">parse_only</span><span class="o">=</span><span class="n">only_tags_with_id_link2</span><span class="p">)</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;</span>
+<span class="c"># Lacie</span>
+<span class="c"># &lt;/a&gt;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">,</span> <span class="s">&quot;html.parser&quot;</span><span class="p">,</span> <span class="n">parse_only</span><span class="o">=</span><span class="n">only_short_strings</span><span class="p">)</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># Elsie</span>
+<span class="c"># ,</span>
+<span class="c"># Lacie</span>
+<span class="c"># and</span>
+<span class="c"># Tillie</span>
+<span class="c"># ...</span>
+<span class="c">#</span>
+</pre></div>
+</div>
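+<p>You can also pass a <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> into any of the methods covered in <a class="reference internal" href="#id25">Searching the tree</a>.
+This probably isn&#8217;t terribly useful, but it is worth a brief mention:</p>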
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">only_short_strings</span><span class="p">)</span>
+<span class="c"># [u&#39;\n\n&#39;, u&#39;\n\n&#39;, u&#39;Elsie&#39;, u&#39;,\n&#39;, u&#39;Lacie&#39;, u&#39; and\n&#39;, u&#39;Tillie&#39;,</span>
+<span class="c"># u&#39;\n\n&#39;, u&#39;...&#39;, u&#39;\n&#39;]</span>
+</pre></div>
+</div>
+</div>
+</div>
+<div class="section" id="id54">
+<h1>Troubleshooting<a class="headerlink" href="#id54" title="Permalink to this headline">¶</a></h1>
+<div class="section" id="diagnose">
+<span id="id55"></span><h2><tt class="docutils literal"><span class="pre">diagnose()</span></tt><a class="headerlink" href="#diagnose" title="Permalink to this headline">¶</a></h2>
+<p>If you&#8217;re having trouble understanding what Beautiful Soup does to a document, pass the document into the <tt class="docutils literal"><span class="pre">diagnose()</span></tt> function. (New in Beautiful Soup 4.2.0.)
+Beautiful Soup will print out a report showing how different parsers handle the document, and tell you if you&#8217;re missing a parser that Beautiful Soup could be using:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4.diagnose</span> <span class="kn">import</span> <span class="n">diagnose</span>
+<span class="n">data</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">&quot;bad.html&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
+<span class="n">diagnose</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
+
+<span class="c"># Diagnostic running on Beautiful Soup 4.2.0</span>
+<span class="c"># Python version 2.7.3 (default, Aug 1 2012, 05:16:07)</span>
+<span class="c"># I noticed that html5lib is not installed. Installing it may help.</span>
+<span class="c"># Found lxml version 2.3.2.0</span>
+<span class="c">#</span>
+<span class="c"># Trying to parse your data with html.parser</span>
+<span class="c"># Here&#39;s what html.parser did with the document:</span>
+<span class="c"># ...</span>
+</pre></div>
+</div>
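+<p>Just looking at the output of <tt class="docutils literal"><span class="pre">diagnose()</span></tt> may show you how to solve the problem. Even if it doesn&#8217;t, you can paste the output of <tt class="docutils literal"><span class="pre">diagnose()</span></tt> when asking for help.</p>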
+</div>
+<div class="section" id="id56">
+<h2>Errors when parsing a document<a class="headerlink" href="#id56" title="Permalink to this headline">¶</a></h2>
+<p>There are two different kinds of parse errors.
+The first is a crash: you feed a document to Beautiful Soup and it raises an exception, usually an <tt class="docutils literal"><span class="pre">HTMLParser.HTMLParseError</span></tt>.
+The second is unexpected behavior: the Beautiful Soup parse tree looks very different from the document used to create it.</p>
+<p>Almost none of these problems turn out to be problems with Beautiful Soup. This is not because Beautiful Soup is an amazingly well-written piece of software, but because Beautiful Soup doesn&#8217;t include any parsing code.
+Instead, it relies on external parsers. If one parser isn&#8217;t working on a certain document, the best solution is to try a different parser.
+See <a class="reference internal" href="#id10">Installing a parser</a> for details and a parser comparison.</p>
+<p>The most common parse errors are <tt class="docutils literal"><span class="pre">HTMLParser.HTMLParseError:</span> <span class="pre">malformed</span> <span class="pre">start</span> <span class="pre">tag</span></tt> and
+<tt class="docutils literal"><span class="pre">HTMLParser.HTMLParseError:</span> <span class="pre">bad</span> <span class="pre">end</span> <span class="pre">tag</span></tt>.
+Both are raised by Python&#8217;s built-in HTML parser library.
+The solution is to <a class="reference internal" href="#parser-installation"><em>install lxml or html5lib</em></a>.</p>
+<p>The most common type of unexpected behavior is that you can&#8217;t find a tag that you know is in the document.
+You saw it going in, but <tt class="docutils literal"><span class="pre">find_all()</span></tt> returns <tt class="docutils literal"><span class="pre">[]</span></tt> or <tt class="docutils literal"><span class="pre">find()</span></tt> returns <tt class="docutils literal"><span class="pre">None</span></tt>.
+This is another common problem with Python&#8217;s built-in HTML parser. Again, the solution is to <a class="reference internal" href="#parser-installation"><em>install lxml or html5lib</em></a>.</p>
+</div>
+<div class="section" id="id57">
+<h2>Version mismatch problems<a class="headerlink" href="#id57" title="Permalink to this headline">¶</a></h2>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">SyntaxError:</span> <span class="pre">Invalid</span> <span class="pre">syntax</span></tt> (on the line <tt class="docutils literal"><span class="pre">ROOT_TAG_NAME</span> <span class="pre">=</span>
+<span class="pre">u'[document]'</span></tt>): Caused by running the Python 2 version of Beautiful Soup under Python 3 without converting the code.</li>
+</ul>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">HTMLParser</span></tt> - Caused by running the Python 2 version of Beautiful Soup under Python 3.</li>
+</ul>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">html.parser</span></tt> - Caused by running the Python 3 version of Beautiful Soup under Python 2.</li>
+</ul>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">BeautifulSoup</span></tt> - Caused by running Beautiful Soup 3 code on a system that doesn&#8217;t have BS3 installed, or by writing Beautiful Soup 4 code without changing the package name to <tt class="docutils literal"><span class="pre">bs4</span></tt>.</li>
+</ul>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">bs4</span></tt> - Caused by running Beautiful Soup 4 code on a system that doesn&#8217;t have BS4 installed.</li>
+</ul>
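+<p>The two <tt class="docutils literal"><span class="pre">ImportError</span></tt> messages about <tt class="docutils literal"><span class="pre">HTMLParser</span></tt> and <tt class="docutils literal"><span class="pre">html.parser</span></tt> come from the same rename: the Python 2 module <tt class="docutils literal"><span class="pre">HTMLParser</span></tt> is called <tt class="docutils literal"><span class="pre">html.parser</span></tt> in Python 3. The sketch below only illustrates that rename with an import that works on either version; it is not a description of how Beautiful Soup itself is packaged, and <tt class="docutils literal"><span class="pre">parser_module</span></tt> is a name made up for the example.</p>

```python
try:
    # Python 3 location of the built-in HTML parser.
    from html.parser import HTMLParser
except ImportError:
    # Python 2 location of the same class.
    from HTMLParser import HTMLParser

# Records which module name was actually importable on this interpreter.
parser_module = HTMLParser.__module__
```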
+<span class="target" id="parsing-xml"></span></div>
+<div class="section" id="xml">
+<h2>Parsing XML<a class="headerlink" href="#xml" title="Permalink to this headline">¶</a></h2>
+<p>By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass in &#8220;xml&#8221; as the second argument to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="s">&quot;xml&quot;</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>For this to work, you need to <a class="reference internal" href="#parser-installation"><em>have lxml installed</em></a>.</p>
+</div>
+<div class="section" id="id58">
+<h2>Other parser problems<a class="headerlink" href="#id58" title="Permalink to this headline">¶</a></h2>
+<ul class="simple">
+<li>If your script works on one computer but not another, it&#8217;s probably
+because the two computers have different parser libraries
+available. For example, you may have developed the script on a
+computer that has lxml installed, and then tried to run it on a
+computer that only has html5lib installed. See <a class="reference internal" href="#id47">Differences between parsers</a>
+for why this matters, and fix the problem by mentioning a
+specific parser library in the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor.</li>
+<li>Because <a class="reference external" href="http://www.w3.org/TR/html5/syntax.html#syntax">HTML tags and attributes are case-insensitive</a>, all three HTML
+parsers convert tag and attribute names to lowercase. That is, the
+markup &lt;TAG&gt;&lt;/TAG&gt; is converted to &lt;tag&gt;&lt;/tag&gt;. If you want to
+preserve mixed-case or uppercase tags and attributes, you&#8217;ll need to
+<a class="reference internal" href="#parsing-xml"><em>parse the document as XML.</em></a></li>
+</ul>
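+<p>The lowercasing described in the list above can be observed directly in Python&#8217;s built-in parser, without Beautiful Soup. This is a minimal sketch for Python 3; the class name <tt class="docutils literal"><span class="pre">TagNameCollector</span></tt> and the sample markup are hypothetical.</p>

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Record each start tag's name and attribute names as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names already lowercased.
        self.names.append((tag, [name for name, _value in attrs]))


collector = TagNameCollector()
collector.feed('<TAG Data-Value="1"></TAG>')
collector.close()
seen = collector.names
```

+<p>Here <tt class="docutils literal"><span class="pre">seen</span></tt> contains the names <tt class="docutils literal"><span class="pre">tag</span></tt> and <tt class="docutils literal"><span class="pre">data-value</span></tt>: the original capitalization is gone before any tree is built.</p>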
+</div>
+<div class="section" id="misc">
+<span id="id59"></span><h2>Miscellaneous<a class="headerlink" href="#misc" title="Permalink to this headline">¶</a></h2>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">UnicodeEncodeError:</span> <span class="pre">'charmap'</span> <span class="pre">codec</span> <span class="pre">can't</span> <span class="pre">encode</span> <span class="pre">character</span>
+<span class="pre">u'\xfoo'</span> <span class="pre">in</span> <span class="pre">position</span> <span class="pre">bar</span></tt> (or just about any other
+<tt class="docutils literal"><span class="pre">UnicodeEncodeError</span></tt>) - This is not a problem with Beautiful Soup.
+This problem shows up in two main situations. First, when you try to
+print a Unicode character that your console doesn&#8217;t know how to
+display. (See <a class="reference external" href="http://wiki.python.org/moin/PrintFails">this page on the Python wiki</a> for help.) Second, when
+you&#8217;re writing to a file and you pass in a Unicode character that&#8217;s
+not supported by your default encoding. In this case, the simplest
+solution is to explicitly encode the Unicode string into UTF-8 with
+<tt class="docutils literal"><span class="pre">u.encode(&quot;utf8&quot;)</span></tt>.</li>
+<li><tt class="docutils literal"><span class="pre">KeyError:</span> <span class="pre">[attr]</span></tt> - Caused by accessing <tt class="docutils literal"><span class="pre">tag['attr']</span></tt> when the
+tag in question doesn&#8217;t define the <tt class="docutils literal"><span class="pre">attr</span></tt> attribute. The most
+common errors are <tt class="docutils literal"><span class="pre">KeyError:</span> <span class="pre">'href'</span></tt> and <tt class="docutils literal"><span class="pre">KeyError:</span>
+<span class="pre">'class'</span></tt>. Use <tt class="docutils literal"><span class="pre">tag.get('attr')</span></tt> if you&#8217;re not sure <tt class="docutils literal"><span class="pre">attr</span></tt> is
+defined, just as you would with a Python dictionary.</li>
+<li><tt class="docutils literal"><span class="pre">AttributeError:</span> <span class="pre">'ResultSet'</span> <span class="pre">object</span> <span class="pre">has</span> <span class="pre">no</span> <span class="pre">attribute</span> <span class="pre">'foo'</span></tt> - This
+usually happens because you expected <tt class="docutils literal"><span class="pre">find_all()</span></tt> to return a
+single tag or string. But <tt class="docutils literal"><span class="pre">find_all()</span></tt> returns a <em>list</em> of tags
+and strings&#8211;a <tt class="docutils literal"><span class="pre">ResultSet</span></tt> object. You need to iterate over the
+list and look at the <tt class="docutils literal"><span class="pre">.foo</span></tt> of each one. Or, if you really only
+want one result, you need to use <tt class="docutils literal"><span class="pre">find()</span></tt> instead of
+<tt class="docutils literal"><span class="pre">find_all()</span></tt>.</li>
+<li><tt class="docutils literal"><span class="pre">AttributeError:</span> <span class="pre">'NoneType'</span> <span class="pre">object</span> <span class="pre">has</span> <span class="pre">no</span> <span class="pre">attribute</span> <span class="pre">'foo'</span></tt> - This
+usually happens because you called <tt class="docutils literal"><span class="pre">find()</span></tt> and then tried to
+access the <tt class="docutils literal"><span class="pre">.foo</span></tt> attribute of the result. But in your case,
+<tt class="docutils literal"><span class="pre">find()</span></tt> didn&#8217;t find anything, so it returned <tt class="docutils literal"><span class="pre">None</span></tt>, instead of
+returning a tag or a string. You need to figure out why your
+<tt class="docutils literal"><span class="pre">find()</span></tt> call isn&#8217;t returning anything.</li>
+</ul>
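+<p>Two of the fixes in the list above can be tried without Beautiful Soup: encoding explicitly to UTF-8, and using <tt class="docutils literal"><span class="pre">.get()</span></tt> instead of indexing. In this minimal sketch a plain dictionary stands in for a tag&#8217;s attributes, which is an assumption made purely for illustration, and all variable names are hypothetical.</p>

```python
text = u"Sacr\xe9 bleu! \u2603"

# A codec that lacks these characters raises UnicodeEncodeError...
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ...while UTF-8 can represent every Unicode character.
utf8_bytes = text.encode("utf8")

# A plain dict standing in for a tag's attributes (illustration only).
# attributes["class"] would raise KeyError; .get() returns None instead.
attributes = {"href": "http://example.com/elsie"}
missing_class = attributes.get("class")
```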
+</div>
+<div class="section" id="id60">
+<h2>Improving performance<a class="headerlink" href="#id60" title="Permalink to this headline">¶</a></h2>
+<p>Beautiful Soup will never be as fast as the parsers it sits on top
+of. If response time is critical, if you&#8217;re paying for computer time
+by the hour, or if there&#8217;s any other reason why computer time is more
+valuable than programmer time, you should forget about Beautiful Soup
+and work directly atop <a class="reference external" href="http://lxml.de/">lxml</a>.</p>
+<p>That said, there are things you can do to speed up Beautiful Soup. If
+you&#8217;re not using lxml as the underlying parser, my advice is to
+<a class="reference internal" href="#parser-installation"><em>start</em></a>. Beautiful Soup parses documents
+significantly faster using lxml than using html.parser or html5lib.</p>
+<p>You can speed up encoding detection significantly by installing the
+<a class="reference external" href="http://pypi.python.org/pypi/cchardet/">cchardet</a> library.</p>
+<p><a class="reference internal" href="#id53">Parsing only part of a document</a> won&#8217;t save you much time parsing
+the document, but it can save a lot of memory, and it&#8217;ll make
+<em>searching</em> the document much faster.</p>
+</div>
+</div>
+<div class="section" id="beautiful-soup-3">
+<h1>Beautiful Soup 3<a class="headerlink" href="#beautiful-soup-3" title="Permalink to this headline">¶</a></h1>
+<p>Beautiful Soup 3 is the previous release series, and is no longer being actively developed.
+It is still packaged with all major Linux distributions:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">python-beautifulsoup</span></tt></p>
+<p>It is also published through PyPI as <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt>:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">BeautifulSoup</span></tt></p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">BeautifulSoup</span></tt></p>
+<p>You can also download a <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz">tarball of Beautiful Soup 3.2.0</a>.</p>
+<p>If you ran <tt class="docutils literal"><span class="pre">easy_install</span> <span class="pre">beautifulsoup</span></tt> or <tt class="docutils literal"><span class="pre">easy_install</span> <span class="pre">BeautifulSoup</span></tt> and your code doesn&#8217;t work, you installed Beautiful Soup 3 by mistake. You need to run <tt class="docutils literal"><span class="pre">easy_install</span> <span class="pre">beautifulsoup4</span></tt> instead.</p>
+<p><a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">The documentation for Beautiful Soup 3 is archived online.</a></p>
+<p>A Japanese translation is available here: <a class="reference external" href="http://tdoc.info/beautifulsoup/">Beautiful Soup documentation</a>.
+Reading these documents will help you understand what changed in Beautiful Soup 4.</p>
+<div class="section" id="bs4">
+<h2>Porting code to BS4<a class="headerlink" href="#bs4" title="Permalink to this headline">¶</a></h2>
+<p>Most code written against Beautiful Soup 3 will work against Beautiful Soup 4 with one simple change: change the package name from <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> to <tt class="docutils literal"><span class="pre">bs4</span></tt>. So this:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">BeautifulSoup</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
+</pre></div>
+</div>
+<p>becomes this:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
+</pre></div>
+</div>
+<ul class="simple">
+<li>If you get the <tt class="docutils literal"><span class="pre">ImportError</span></tt> &#8220;No module named BeautifulSoup&#8221;, your problem is that you&#8217;re trying to run Beautiful Soup 3 code, but you only have Beautiful Soup 4 installed.</li>
+</ul>
+<ul class="simple">
+<li>If you get the <tt class="docutils literal"><span class="pre">ImportError</span></tt> &#8220;No module named bs4&#8221;, your problem is that you&#8217;re trying to run Beautiful Soup 4 code, but you only have Beautiful Soup 3 installed.</li>
+</ul>
+<p>Although BS4 is mostly backwards-compatible with BS3, most of BS3&#8217;s methods have been deprecated and given new names for <a class="reference external" href="http://www.python.org/dev/peps/pep-0008/">PEP 8 compliance</a>. There are numerous other renames and changes, and a few of them break backwards compatibility.</p>
+<p>Here&#8217;s what you&#8217;ll need to know to convert your BS3 code to BS4:</p>
+<div class="section" id="id63">
+<h3>Parsers<a class="headerlink" href="#id63" title="Permalink to this headline">¶</a></h3>
+<p>Beautiful Soup 3 used Python&#8217;s <tt class="docutils literal"><span class="pre">SGMLParser</span></tt>, a module that was
+deprecated and removed in Python 3.0. Beautiful Soup 4 uses
+<tt class="docutils literal"><span class="pre">html.parser</span></tt> by default, but you can plug in lxml or html5lib and
+use that instead. See <a class="reference internal" href="#id10">Installing a parser</a> for a comparison.</p>
+<p>Since <tt class="docutils literal"><span class="pre">html.parser</span></tt> is not the same parser as <tt class="docutils literal"><span class="pre">SGMLParser</span></tt>, it
+will treat invalid markup differently. Usually the &#8220;difference&#8221; is
+that <tt class="docutils literal"><span class="pre">html.parser</span></tt> crashes. In that case, you&#8217;ll need to install
+another parser. But sometimes <tt class="docutils literal"><span class="pre">html.parser</span></tt> just creates a different
+parse tree than <tt class="docutils literal"><span class="pre">SGMLParser</span></tt> would. If this happens, you may need to
+update your BS3 scraping code to deal with the new tree.</p>
+</div>
+<div class="section" id="id64">
+<h3>Method names<a class="headerlink" href="#id64" title="Permalink to this headline">¶</a></h3>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">renderContents</span></tt> -&gt; <tt class="docutils literal"><span class="pre">encode_contents</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">replaceWith</span></tt> -&gt; <tt class="docutils literal"><span class="pre">replace_with</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">replaceWithChildren</span></tt> -&gt; <tt class="docutils literal"><span class="pre">unwrap</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findAll</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_all</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findAllNext</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_all_next</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findAllPrevious</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_all_previous</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findNext</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_next</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findNextSibling</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_next_sibling</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findNextSiblings</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_next_siblings</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findParent</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_parent</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findParents</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_parents</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findPrevious</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_previous</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findPreviousSibling</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_previous_sibling</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findPreviousSiblings</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_previous_siblings</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">nextSibling</span></tt> -&gt; <tt class="docutils literal"><span class="pre">next_sibling</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">previousSibling</span></tt> -&gt; <tt class="docutils literal"><span class="pre">previous_sibling</span></tt></li>
+</ul>
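+<p>As a rough sketch of the renamed methods in use (the markup and variable names are invented for illustration, and <tt class="docutils literal"><span class="pre">html.parser</span></tt> is only an assumed parser choice):</p>

```python
from bs4 import BeautifulSoup

# Hypothetical three-paragraph document, used only for this sketch.
html = '<p id="first">One</p><p id="second">Two</p><p id="third">Three</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")

all_paragraphs = soup.find_all("p")             # BS3 spelling: findAll
later_siblings = first.find_next_siblings("p")  # BS3 spelling: findNextSiblings
neighbor = first.next_sibling                   # BS3 spelling: nextSibling

print(len(all_paragraphs))
print([p["id"] for p in later_siblings])
print(neighbor["id"])
```

+<p>The snake_case spellings are the ones used throughout the rest of this document.</p>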
+<p>Some arguments to the Beautiful Soup constructor were renamed for the
+same reasons:</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">BeautifulSoup(parseOnlyThese=...)</span></tt> -&gt; <tt class="docutils literal"><span class="pre">BeautifulSoup(parse_only=...)</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">BeautifulSoup(fromEncoding=...)</span></tt> -&gt; <tt class="docutils literal"><span class="pre">BeautifulSoup(from_encoding=...)</span></tt></li>
+</ul>
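+<p>A minimal sketch of the two renamed constructor arguments, assuming the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> and some made-up markup:</p>

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup for illustration.
markup = '<p>Intro</p><a href="/one">One</a><a href="/two">Two</a>'

# BS3 spelling: BeautifulSoup(markup, parseOnlyThese=...)
only_links = SoupStrainer("a")
soup = BeautifulSoup(markup, "html.parser", parse_only=only_links)
hrefs = [a["href"] for a in soup.find_all("a")]

# BS3 spelling: BeautifulSoup(data, fromEncoding=...)
data = "<p>café</p>".encode("latin-1")
latin_soup = BeautifulSoup(data, "html.parser", from_encoding="latin-1")
word = latin_soup.p.string

print(hrefs)
print(word)
```

+<p>Because only the &lt;a&gt; tags were parsed, the &lt;p&gt; tag never makes it into the first tree.</p>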
+<p>I renamed one method for compatibility with Python 3:</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">Tag.has_key()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">Tag.has_attr()</span></tt></li>
+</ul>
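+<p>For example, an attribute check might look like this (a sketch with an invented tag, assuming <tt class="docutils literal"><span class="pre">html.parser</span></tt>):</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/one" class="sister">One</a>', "html.parser")
link = soup.a

# BS3 spelling: link.has_key("href")
has_href = link.has_attr("href")
has_id = link.has_attr("id")

print(has_href, has_id)
```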
+<p>I renamed one attribute to use more accurate terminology:</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">Tag.isSelfClosing</span></tt> -&gt; <tt class="docutils literal"><span class="pre">Tag.is_empty_element</span></tt></li>
+</ul>
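+<p>A small sketch of the renamed attribute. It assumes <tt class="docutils literal"><span class="pre">html.parser</span></tt>, which treats &lt;br&gt; as an empty-element tag; the markup is invented:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Line one<br/>Line two</p>", "html.parser")

# BS3 spelling: tag.isSelfClosing
br_is_empty = soup.br.is_empty_element
p_is_empty = soup.p.is_empty_element

print(br_is_empty, p_is_empty)
```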
+<p>I renamed three attributes to avoid using words that have special
+meaning to Python. Unlike the others, these changes are <em>not backwards
+compatible.</em> If you used these attributes in BS3, your code will break
+on BS4 until you change them.</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">UnicodeDammit.unicode</span></tt> -&gt; <tt class="docutils literal"><span class="pre">UnicodeDammit.unicode_markup</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">Tag.next</span></tt> -&gt; <tt class="docutils literal"><span class="pre">Tag.next_element</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">Tag.previous</span></tt> -&gt; <tt class="docutils literal"><span class="pre">Tag.previous_element</span></tt></li>
+</ul>
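+<p>Code that used the three old names needs edits along these lines. This is only a sketch: the markup is invented, and the explicit <tt class="docutils literal"><span class="pre">["utf-8"]</span></tt> hint is an assumption made here so that the result does not depend on encoding detection:</p>

```python
from bs4 import BeautifulSoup, UnicodeDammit

soup = BeautifulSoup("<p><b>bold</b>plain</p>", "html.parser")
b = soup.b

# BS3 spelling: b.next
inside = b.next_element       # the string inside <b>
# BS3 spelling: b.previous
before = b.previous_element   # whatever was parsed just before <b>: the <p> tag

# BS3 spelling: dammit.unicode
dammit = UnicodeDammit(b"caf\xc3\xa9", ["utf-8"])
text = dammit.unicode_markup

print(inside, before.name, text)
```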
+</div>
+<div class="section" id="id65">
+<h3>Generators<a class="headerlink" href="#id65" title="Permalink to this headline">¶</a></h3>
+<p>I gave the generators PEP 8-compliant names, and transformed them into
+properties:</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">childGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">children</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">nextGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">next_elements</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">nextSiblingGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">next_siblings</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">previousGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">previous_elements</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">previousSiblingGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">previous_siblings</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">recursiveChildGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">descendants</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">parentGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">parents</span></tt></li>
+</ul>
+<p>So instead of this:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">tag</span><span class="o">.</span><span class="n">parentGenerator</span><span class="p">():</span>
+ <span class="o">...</span>
+</pre></div>
+</div>
+<p>You can write this:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">tag</span><span class="o">.</span><span class="n">parents</span><span class="p">:</span>
+ <span class="o">...</span>
+</pre></div>
+</div>
+<p>(But the old code will still work.)</p>
+<p>Some of the generators used to yield <tt class="docutils literal"><span class="pre">None</span></tt> after they were done, and
+then stop. That was a bug. Now the generators just stop.</p>
+<p>There are two new generators, <a class="reference internal" href="#string-generators"><em>.strings and
+.stripped_strings</em></a>. <tt class="docutils literal"><span class="pre">.strings</span></tt> yields
+NavigableString objects, and <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt> yields Python
+strings that have had whitespace stripped.</p>
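+<p>A short sketch of the generator properties, using invented markup and assuming <tt class="docutils literal"><span class="pre">html.parser</span></tt>:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p> Hello </p><p>\n</p><p>world</p></div>", "html.parser"
)

# .strings yields every string under the tag, whitespace included.
raw = list(soup.div.strings)

# .stripped_strings strips each string and skips whitespace-only ones.
clean = list(soup.div.stripped_strings)

# .parents is a property, so there are no parentheses after it.
parent_names = [parent.name for parent in soup.p.parents]

print(raw)
print(clean)
print(parent_names)
```

+<p>Each property is a generator, so wrap it in <tt class="docutils literal"><span class="pre">list()</span></tt> when you need to index into the results or iterate more than once.</p>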
+</div>
+<div class="section" id="id66">
+<h3>XML<a class="headerlink" href="#id66" title="Permalink to this headline">¶</a></h3>
+<p>There is no longer a <tt class="docutils literal"><span class="pre">BeautifulStoneSoup</span></tt> class for parsing XML. To
+parse XML you pass in &#8220;xml&#8221; as the second argument to the
+<tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor. For the same reason, the
+<tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor no longer recognizes the <tt class="docutils literal"><span class="pre">isHTML</span></tt>
+argument.</p>
+<p>Beautiful Soup&#8217;s handling of empty-element XML tags has been
+improved. Previously when you parsed XML you had to explicitly say
+which tags were considered empty-element tags. The <tt class="docutils literal"><span class="pre">selfClosingTags</span></tt>
+argument to the constructor is no longer recognized. Instead,
+Beautiful Soup considers any empty tag to be an empty-element tag. If
+you add a child to an empty-element tag, it stops being an
+empty-element tag.</p>
+</div>
+<div class="section" id="id67">
+<h3>Entities<a class="headerlink" href="#id67" title="Permalink to this headline">¶</a></h3>
+<p>An incoming HTML or XML entity is always converted into the
+corresponding Unicode character. Beautiful Soup 3 had a number of
+overlapping ways of dealing with entities, which have been
+removed. The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor no longer recognizes the
+<tt class="docutils literal"><span class="pre">smartQuotesTo</span></tt> or <tt class="docutils literal"><span class="pre">convertEntities</span></tt> arguments. (<a class="reference internal" href="#unicode-dammit">Unicode,
+Dammit</a> still has <tt class="docutils literal"><span class="pre">smart_quotes_to</span></tt>, but its default is now to turn
+smart quotes into Unicode.) The constants <tt class="docutils literal"><span class="pre">HTML_ENTITIES</span></tt>,
+<tt class="docutils literal"><span class="pre">XML_ENTITIES</span></tt>, and <tt class="docutils literal"><span class="pre">XHTML_ENTITIES</span></tt> have been removed, since they
+configure a feature (transforming some but not all entities into
+Unicode characters) that no longer exists.</p>
+<p>If you want to turn Unicode characters back into HTML entities on
+output, rather than turning them into UTF-8 characters, you need to
+use an <a class="reference internal" href="#output-formatters"><em>output formatter</em></a>.</p>
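+<p>A minimal sketch of both directions, with an invented snippet and <tt class="docutils literal"><span class="pre">html.parser</span></tt> as an assumed parser. The entity is converted on the way in, and the <tt class="docutils literal"><span class="pre">"html"</span></tt> formatter turns it back on the way out:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf&eacute;</p>", "html.parser")

# On input, the entity becomes the corresponding Unicode character.
text = soup.p.string

# The default formatter leaves the character alone on output.
default_out = soup.p.decode()

# The "html" formatter converts it back into a named entity.
html_out = soup.p.decode(formatter="html")

print(text)
print(default_out)
print(html_out)
```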
+</div>
+<div class="section" id="id68">
+<h3>Miscellaneous<a class="headerlink" href="#id68" title="Permalink to this headline">¶</a></h3>
+<p><a class="reference internal" href="#string"><em>Tag.string</em></a> now operates recursively. If tag A
+contains a single tag B and nothing else, then A.string is the same as
+B.string. (Previously, it was None.)</p>
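+<p>The recursive behavior can be sketched as follows (tags A and B are played here by invented &lt;a&gt; and &lt;b&gt; tags, and <tt class="docutils literal"><span class="pre">html.parser</span></tt> is assumed):</p>

```python
from bs4 import BeautifulSoup

# <a> contains a single <b> and nothing else.
nested = BeautifulSoup("<a><b>text</b></a>", "html.parser")
a_string = nested.a.string
b_string = nested.b.string

# With a second child, <a> no longer has one unambiguous string.
mixed = BeautifulSoup("<a><b>text</b> more</a>", "html.parser")
mixed_string = mixed.a.string

print(a_string, b_string, mixed_string)
```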
+<p><a class="reference internal" href="#id15">Multi-valued attributes</a> like <tt class="docutils literal"><span class="pre">class</span></tt> have lists of strings as
+their values, not strings. This may affect the way you search by CSS
+class.</p>
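+<p>A brief sketch of the difference, using an invented paragraph and assuming <tt class="docutils literal"><span class="pre">html.parser</span></tt>:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title lead" id="intro">Hi</p>', "html.parser")

classes = soup.p["class"]   # a list, one entry per CSS class
ident = soup.p["id"]        # still a plain string

# Searching by a single CSS class matches tags that carry other classes too.
by_class = soup.find_all("p", class_="lead")

print(classes)
print(ident)
print(len(by_class))
```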
+<p>If you pass one of the <tt class="docutils literal"><span class="pre">find*</span></tt> methods both <a class="reference internal" href="#text"><em>text</em></a> <cite>and</cite>
+a tag-specific argument like <a class="reference internal" href="#name"><em>name</em></a>, Beautiful Soup will
+search for tags that match your tag-specific criteria and whose
+<a class="reference internal" href="#string"><em>Tag.string</em></a> matches your value for <a class="reference internal" href="#text"><em>text</em></a>. It will <cite>not</cite> find the strings themselves. Previously,
+Beautiful Soup ignored the tag-specific arguments and looked for
+strings.</p>
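+<p>The two cases can be sketched like this. The markup is invented and <tt class="docutils literal"><span class="pre">html.parser</span></tt> is assumed. Later releases also accept <tt class="docutils literal"><span class="pre">string</span></tt> as another name for this argument and may warn about <tt class="docutils literal"><span class="pre">text</span></tt>:</p>

```python
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup(
    '<a href="/e">Elsie</a> and <a href="/l">Lacie</a>', "html.parser"
)

# text plus a tag name: the result is the matching tag.
tags = soup.find_all("a", text="Elsie")

# text alone: the result is the matching string.
strings = soup.find_all(text="Elsie")

print(type(tags[0]).__name__, tags[0]["href"])
print(type(strings[0]).__name__, strings[0])
```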
+<p>The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor no longer recognizes the
+<cite>markupMassage</cite> argument. It&#8217;s now the parser&#8217;s responsibility to
+handle markup correctly.</p>
+<p>The rarely-used alternate parser classes like
+<tt class="docutils literal"><span class="pre">ICantBelieveItsBeautifulSoup</span></tt> and <tt class="docutils literal"><span class="pre">BeautifulSOAP</span></tt> have been
+removed. It&#8217;s now the parser&#8217;s decision how to handle ambiguous
+markup.</p>
+<p>The <tt class="docutils literal"><span class="pre">prettify()</span></tt> method now returns a Unicode string, not a bytestring.</p>
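+<p>A quick way to see the change, sketched under Python 3 where <tt class="docutils literal"><span class="pre">str</span></tt> is the Unicode string type (passing an encoding to get bytes back is shown only for contrast):</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")

pretty = soup.prettify()          # no encoding given: a Unicode string
encoded = soup.prettify("utf-8")  # encoding given: a bytestring

print(type(pretty).__name__)
print(type(encoded).__name__)
```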
+</div>
+</div>
+</div>
+
+
+ </div>
+ </div>
+ </div>
+ <div class="sphinxsidebar">
+ <div class="sphinxsidebarwrapper">
+ <h3><a href="#">Table Of Contents</a></h3>
+ <ul>
+<li><a class="reference internal" href="#">Beautiful Soup</a><ul>
+<li><a class="reference internal" href="#id2">(Translator's note) You can't eat soap</a></li>
+<li><a class="reference internal" href="#id3">About this document</a></li>
+<li><a class="reference internal" href="#id5">Getting help</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id7">Quick Start</a></li>
+<li><a class="reference internal" href="#id8">Installation</a><ul>
+<li><a class="reference internal" href="#id9">Problems after installation</a></li>
+<li><a class="reference internal" href="#parser-installation">Installing a parser</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id11">Making the soup</a></li>
+<li><a class="reference internal" href="#id12">Four kinds of objects</a><ul>
+<li><a class="reference internal" href="#tag-obj">Tag obj.</a><ul>
+<li><a class="reference internal" href="#id13">Name</a></li>
+<li><a class="reference internal" href="#id14">Attributes</a><ul>
+<li><a class="reference internal" href="#id15">Multi-valued attributes</a></li>
+</ul>
+</li>
+</ul>
+</li>
+<li><a class="reference internal" href="#navigablestring-obj">NavigableString obj.</a></li>
+<li><a class="reference internal" href="#beautifulsoup-obj">BeautifulSoup obj.</a></li>
+<li><a class="reference internal" href="#comments-obj">Comments obj. and others</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id16">Navigating the parse tree</a><ul>
+<li><a class="reference internal" href="#id17">Going down to child elements</a><ul>
+<li><a class="reference internal" href="#id18">Navigating by tag name</a></li>
+<li><a class="reference internal" href="#contents-children"><tt class="docutils literal"><span class="pre">.contents</span></tt> / <tt class="docutils literal"><span class="pre">.children</span></tt></a></li>
+<li><a class="reference internal" href="#descendants"><tt class="docutils literal"><span class="pre">.descendants</span></tt></a></li>
+<li><a class="reference internal" href="#string"><tt class="docutils literal"><span class="pre">.string</span></tt></a></li>
+<li><a class="reference internal" href="#strings-stripped-strings"><tt class="docutils literal"><span class="pre">.strings</span></tt> / <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt></a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id20">Going up to parent elements</a><ul>
+<li><a class="reference internal" href="#parent"><tt class="docutils literal"><span class="pre">.parent</span></tt></a></li>
+<li><a class="reference internal" href="#parents"><tt class="docutils literal"><span class="pre">.parents</span></tt></a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id23">Going sideways to sibling elements</a><ul>
+<li><a class="reference internal" href="#next-sibling-previous-sibling"><tt class="docutils literal"><span class="pre">.next_sibling</span></tt> / <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt></a></li>
+<li><a class="reference internal" href="#next-siblings-previous-siblings"><tt class="docutils literal"><span class="pre">.next_siblings</span></tt> / <tt class="docutils literal"><span class="pre">.previous_siblings</span></tt></a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id24">Moving to the previous and next elements</a><ul>
+<li><a class="reference internal" href="#next-element-previous-element"><tt class="docutils literal"><span class="pre">.next_element</span></tt> / <tt class="docutils literal"><span class="pre">.previous_element</span></tt></a></li>
+<li><a class="reference internal" href="#next-elements-previous-elements"><tt class="docutils literal"><span class="pre">.next_elements</span></tt> / <tt class="docutils literal"><span class="pre">.previous_elements</span></tt></a></li>
+</ul>
+</li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id25">Searching the parse tree</a><ul>
+<li><a class="reference internal" href="#id26">Kinds of filters</a><ul>
+<li><a class="reference internal" href="#id27">Strings</a></li>
+<li><a class="reference internal" href="#id28">Regular expressions</a></li>
+<li><a class="reference internal" href="#id29">Lists</a></li>
+<li><a class="reference internal" href="#true">True</a></li>
+<li><a class="reference internal" href="#id30">Functions</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#find-all"><tt class="docutils literal"><span class="pre">find_all()</span></tt></a><ul>
+<li><a class="reference internal" href="#id31">The name argument</a></li>
+<li><a class="reference internal" href="#id32">Keyword arguments</a></li>
+<li><a class="reference internal" href="#css">Searching by CSS class</a></li>
+<li><a class="reference internal" href="#id33">The text argument</a></li>
+<li><a class="reference internal" href="#limit">The limit argument</a></li>
+<li><a class="reference internal" href="#recursive">The recursive argument</a></li>
+<li><a class="reference internal" href="#id36">Shortcuts</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#find"><tt class="docutils literal"><span class="pre">find()</span></tt></a></li>
+<li><a class="reference internal" href="#find-parents-find-parent"><tt class="docutils literal"><span class="pre">find_parents()</span></tt> / <tt class="docutils literal"><span class="pre">find_parent()</span></tt></a></li>
+<li><a class="reference internal" href="#find-next-siblings-find-next-sibling"><tt class="docutils literal"><span class="pre">find_next_siblings()</span></tt> / <tt class="docutils literal"><span class="pre">find_next_sibling()</span></tt></a></li>
+<li><a class="reference internal" href="#find-previous-siblings-find-previous-sibling"><tt class="docutils literal"><span class="pre">find_previous_siblings()</span></tt> / <tt class="docutils literal"><span class="pre">find_previous_sibling()</span></tt></a></li>
+<li><a class="reference internal" href="#find-all-next-find-next"><tt class="docutils literal"><span class="pre">find_all_next()</span></tt> / <tt class="docutils literal"><span class="pre">find_next()</span></tt></a></li>
+<li><a class="reference internal" href="#find-all-previous-find-previous"><tt class="docutils literal"><span class="pre">find_all_previous()</span></tt> / <tt class="docutils literal"><span class="pre">find_previous()</span></tt></a></li>
+<li><a class="reference internal" href="#id37">CSS selectors</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id38">Modifying the parse tree</a><ul>
+<li><a class="reference internal" href="#id39">Changing names and attributes</a></li>
+<li><a class="reference internal" href="#id40">Modifying <tt class="docutils literal"><span class="pre">.string</span></tt></a></li>
+<li><a class="reference internal" href="#append"><tt class="docutils literal"><span class="pre">append()</span></tt></a></li>
+<li><a class="reference internal" href="#beautifulsoup-new-string-new-tag"><tt class="docutils literal"><span class="pre">BeautifulSoup.new_string()</span></tt> / <tt class="docutils literal"><span class="pre">.new_tag()</span></tt></a></li>
+<li><a class="reference internal" href="#insert"><tt class="docutils literal"><span class="pre">insert()</span></tt></a></li>
+<li><a class="reference internal" href="#insert-before-insert-after"><tt class="docutils literal"><span class="pre">insert_before()</span></tt> / <tt class="docutils literal"><span class="pre">insert_after()</span></tt></a></li>
+<li><a class="reference internal" href="#clear"><tt class="docutils literal"><span class="pre">clear()</span></tt></a></li>
+<li><a class="reference internal" href="#extract"><tt class="docutils literal"><span class="pre">extract()</span></tt></a></li>
+<li><a class="reference internal" href="#decompose"><tt class="docutils literal"><span class="pre">decompose()</span></tt></a></li>
+<li><a class="reference internal" href="#replace-with"><tt class="docutils literal"><span class="pre">replace_with()</span></tt></a></li>
+<li><a class="reference internal" href="#wrap"><tt class="docutils literal"><span class="pre">wrap()</span></tt></a></li>
+<li><a class="reference internal" href="#unwrap"><tt class="docutils literal"><span class="pre">unwrap()</span></tt></a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id42">Output</a><ul>
+<li><a class="reference internal" href="#id43">Pretty-printing</a></li>
+<li><a class="reference internal" href="#id44">Single-line output</a></li>
+<li><a class="reference internal" href="#id45">Specifying the format</a></li>
+<li><a class="reference internal" href="#get-text"><tt class="docutils literal"><span class="pre">get_text()</span></tt></a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id46">Specifying the parser</a><ul>
+<li><a class="reference internal" href="#id47">Differences between parsers</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id48">Encodings</a><ul>
+<li><a class="reference internal" href="#id49">Output encoding</a></li>
+<li><a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a><ul>
+<li><a class="reference internal" href="#id50">Smart quotes</a></li>
+<li><a class="reference internal" href="#id52">Multiple character encodings</a></li>
+</ul>
+</li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id53">Parsing part of a document</a><ul>
+<li><a class="reference internal" href="#soupstrainer"><tt class="docutils literal"><span class="pre">SoupStrainer</span></tt></a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id54">Troubleshooting</a><ul>
+<li><a class="reference internal" href="#diagnose"><tt class="docutils literal"><span class="pre">diagnose()</span></tt></a></li>
+<li><a class="reference internal" href="#id56">Errors when parsing</a></li>
+<li><a class="reference internal" href="#id57">Version mismatch problems</a></li>
+<li><a class="reference internal" href="#xml">Parsing XML</a></li>
+<li><a class="reference internal" href="#id58">Other parser problems</a></li>
+<li><a class="reference internal" href="#misc">Miscellaneous</a></li>
+<li><a class="reference internal" href="#id60">Improving performance</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#beautiful-soup-3">Beautiful Soup 3</a><ul>
+<li><a class="reference internal" href="#bs4">Porting to BS4</a><ul>
+<li><a class="reference internal" href="#id63">Parsers</a></li>
+<li><a class="reference internal" href="#id64">Method names</a></li>
+<li><a class="reference internal" href="#id65">Generators</a></li>
+<li><a class="reference internal" href="#id66">XML</a></li>
+<li><a class="reference internal" href="#id67">Entities</a></li>
+<li><a class="reference internal" href="#id68">Miscellaneous</a></li>
+</ul>
+</li>
+</ul>
+</li>
+</ul>
+
+ <h3>This Page</h3>
+ <ul class="this-page-menu">
+ <li><a href="_sources/index.txt"
+ rel="nofollow">Show Source</a></li>
+ </ul>
+
+<div id="searchbox" style="display: none">
+ <h3>Quick search</h3>
+ <form class="search" action="search.html" method="get">
+ <input type="text" name="q" />
+ <input type="submit" value="Go" />
+ <input type="hidden" name="check_keywords" value="yes" />
+ <input type="hidden" name="area" value="default" />
+ </form>
+ <p class="searchtip" style="font-size: 90%">
+ Enter search terms or a module, class or function name.
+ </p>
+</div>
+<script type="text/javascript">$('#searchbox').show(0);</script>
+
+<script>
+ (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
+ m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+ })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+
+ ga('create', 'UA-23795927-33', 'kondou.com');
+ ga('send', 'pageview');
+
+</script>
+
+
+ </div>
+ </div>
+ <div class="clearer"></div>
+ </div>
+ <div class="related">
+ <h3>Navigation</h3>
+ <ul>
+ <li class="right" style="margin-right: 10px">
+ <a href="genindex.html" title="General Index"
+ >index</a></li>
+ <li><a href="#">Beautiful Soup 4.2.0 Doc. Japanese translation (last updated 2013-11-19)</a> &raquo;</li>
+ </ul>
+ </div>
+ <div class="footer">
+ &copy; Copyright 2013, Leonard Richardson.
+ Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.2.
+ </div>
+ </body>
+</html>
diff --git a/doc.html/index.kr.html b/doc.html/index.kr.html
new file mode 100644
index 0000000..03319c6
--- /dev/null
+++ b/doc.html/index.kr.html
@@ -0,0 +1,2476 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml"><head>
+
+<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+
+
+
+ <title>Beautiful Soup Documentation — Beautiful Soup 4.0.0 documentation</title>
+
+
+
+
+
+
+
+ <link rel="top" title="Beautiful Soup 4.0.0 documentation" href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#">
+
+<link rel="stylesheet" type="text/css" href="/web/20150319200824cs_/http://coreapython.hosting.paran.com/etc/index_files/index.css" media="all">
+</head>
+<body>
+ <div class="related">
+ <h3>Contents</h3>
+ <ul>
+ <li class="right" style="margin-right: 10px;">
+ <a href="https://web.archive.org/web/20150319200824/http://www.crummy.com/software/BeautifulSoup/bs4/doc/genindex.html" title="General Index" accesskey="I">index</a></li>
+ <li><a href="#">Beautiful Soup 4.0.0 documentation</a> »</li>
+ </ul>
+ </div>
+
+ <div class="document">
+ <div class="documentwrapper">
+ <div class="bodywrapper">
+ <div class="body">
+
+ <div class="section" id="beautiful-soup-documentation">
+<h1>Beautiful Soup Documentation<a class="headerlink" href="#beautiful-soup-documentation" title="Permalink to this headline">¶</a></h1> Korean edition, johnsonj, 2012.11.08 <a href="https://web.archive.org/web/20150319200824/http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html">Original document</a>
+<img alt="&quot;The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself.&quot;" class="align-right" src="/web/20150319200824im_/http://coreapython.hosting.paran.com/etc/index_files/6.jpg">
+<p><a class="reference external" href="https://web.archive.org/web/20150319200824/http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers a good deal of work.</p>
+<p>This tutorial shows all the major features of Beautiful Soup 4, with examples. It shows what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.</p>
+<p>The examples in this document work the same way in Python 2.7 and Python 3.2.</p>
+<p>If you are looking for the documentation for <a class="reference external" href="https://web.archive.org/web/20150319200824/http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">Beautiful Soup 3</a>, be aware that Beautiful Soup 3 is no longer being developed. Beautiful Soup 4 is strongly recommended for new projects. For the differences between Beautiful Soup 3 and Beautiful Soup 4, see <a class="reference internal" href="#porting-code-to-bs4">Porting code to BS4</a>.</p>
+<div class="section" id="getting-help">
+<h2>Getting help<a class="headerlink" href="#getting-help" title="Permalink to this headline">¶</a></h2>
+<p>If you have questions about Beautiful Soup, or run into problems, <a class="reference external" href="https://web.archive.org/web/20150319200824/http://groups.google.com/group/beautifulsoup/">send mail to the discussion group</a>.</p>
+</div>
+</div>
+<div class="section" id="quick-start">
+<h1>Quick Start<a class="headerlink" href="#quick-start" title="Permalink to this headline">¶</a></h1>
+<p>Here is an HTML document used as an example throughout this document. It is part of a story from <cite>Alice in Wonderland</cite>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">html_doc</span> <span class="o">=</span> <span class="s">"""</span>
+<span class="s">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;</span>
+
+<span class="s">&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</span>
+
+<span class="s">&lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were</span>
+<span class="s">&lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="s">&lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt; and</span>
+<span class="s">&lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;;</span>
+<span class="s">and they lived at the bottom of a well.&lt;/p&gt;</span>
+
+<span class="s">&lt;p class="story"&gt;...&lt;/p&gt;</span>
+<span class="s">"""</span>
+</pre></div>
+</div>
+<p>Running the “three sisters” document through Beautiful Soup gives us a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object, which represents the document as a nested data structure:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;head&gt;</span>
+<span class="c"># &lt;title&gt;</span>
+<span class="c"># The Dormouse's story</span>
+<span class="c"># &lt;/title&gt;</span>
+<span class="c"># &lt;/head&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p class="title"&gt;</span>
+<span class="c"># &lt;b&gt;</span>
+<span class="c"># The Dormouse's story</span>
+<span class="c"># &lt;/b&gt;</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;p class="story"&gt;</span>
+<span class="c"># Once upon a time there were three little sisters; and their names were</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;</span>
+<span class="c"># Elsie</span>
+<span class="c"># &lt;/a&gt;</span>
+<span class="c"># ,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;</span>
+<span class="c"># Lacie</span>
+<span class="c"># &lt;/a&gt;</span>
+<span class="c"># and</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;</span>
+<span class="c"># Tillie</span>
+<span class="c"># &lt;/a&gt;</span>
+<span class="c"># ; and they lived at the bottom of a well.</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;p class="story"&gt;</span>
+<span class="c"># ...</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>Here are some simple ways to navigate that data structure:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">title</span>
+<span class="c"># &lt;title&gt;The Dormouse's story&lt;/title&gt;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">name</span>
+<span class="c"># u'title'</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">string</span>
+<span class="c"># u'The Dormouse's story'</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span>
+<span class="c"># u'head'</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">p</span>
+<span class="c"># &lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
+<span class="c"># [u'title']</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">"link3"</span><span class="p">)</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;</span>
+</pre></div>
+</div>
+<p>One common task is extracting all the URLs found within a page's &lt;a&gt; tags:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'a'</span><span class="p">):</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">link</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'href'</span><span class="p">))</span>
+<span class="c"># http://example.com/elsie</span>
+<span class="c"># http://example.com/lacie</span>
+<span class="c"># http://example.com/tillie</span>
+</pre></div>
+</div>
+<p>Another common task is extracting all the text from a page:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">())</span>
+<span class="c"># The Dormouse's story</span>
+<span class="c">#</span>
+<span class="c"># The Dormouse's story</span>
+<span class="c">#</span>
+<span class="c"># Once upon a time there were three little sisters; and their names were</span>
+<span class="c"># Elsie,</span>
+<span class="c"># Lacie and</span>
+<span class="c"># Tillie;</span>
+<span class="c"># and they lived at the bottom of a well.</span>
+<span class="c">#</span>
+<span class="c"># ...</span>
+</pre></div>
+</div>
+<p>Does this look like what you need? If so, read on.</p>
+</div>
+<div class="section" id="installing-beautiful-soup">
+<h1>Installing Beautiful Soup<a class="headerlink" href="#installing-beautiful-soup" title="Permalink to this headline">¶</a></h1>
+<p>If you are using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">python-bs4</span></tt></p>
+<p>
+Beautiful Soup 4 is also published through PyPI, so if you cannot install it with the system package manager, you can install it with <tt class="docutils literal"><span class="pre">easy_install</span></tt> or
+<tt class="docutils literal"><span class="pre">pip</span></tt>. The package name is <tt class="docutils literal"><span class="pre">beautifulsoup4</span></tt>, and the same package works on Python 2 and Python 3.</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">beautifulsoup4</span></tt></p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">beautifulsoup4</span></tt></p>
+<p>(The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> package is probably <cite>not</cite> what you want. That's the previous major release, <a class="reference external" href="https://web.archive.org/web/20150319200824/http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">Beautiful Soup 3</a>. Lots of software uses BS3, so it's still available, but if you're writing new code you should install <tt class="docutils literal"><span class="pre">beautifulsoup4</span></tt>.)</p>
+<p>If you don't have <tt class="docutils literal"><span class="pre">easy_install</span></tt> or <tt class="docutils literal"><span class="pre">pip</span></tt> installed, you can download the
+<a class="reference external" href="https://web.archive.org/web/20150319200824/http://www.crummy.com/software/BeautifulSoup/download/4.x/">Beautiful Soup 4 source</a> and install it with <tt class="docutils literal"><span class="pre">setup.py</span></tt>.</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">python</span> <span class="pre">setup.py</span> <span class="pre">install</span></tt></p>
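+<p>
+If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the source, copy its <tt class="docutils literal"><span class="pre">bs4</span></tt> directory into your application's codebase, and use Beautiful Soup without installing it at all.</p>
+<p>
+I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions.</p>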
+<div class="section" id="problems-after-installation">
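+<h2>Problems after installation<a class="headerlink" href="#problems-after-installation" title="Permalink to this headline">¶</a></h2>
+<p>Beautiful Soup is packaged as Python 2 code. When you install it for use with Python 3, it's automatically converted to Python 3 code. If you don't install the package, the code won't be converted. There have also been reports on Windows machines of the wrong version being installed.</p>
+<p>If you get the <tt class="docutils literal"><span class="pre">ImportError</span></tt> “No module named HTMLParser”, your problem is that you're running the Python 2 version of the code under Python 3.</p>
+<p>If you get the <tt class="docutils literal"><span class="pre">ImportError</span></tt> “No module named html.parser”, your problem is that you're running the Python 3 version of the code under Python 2.</p>
+<p>In both cases, your best bet is to completely remove the Beautiful Soup installation from your system (including any directory created when you unpacked the archive) and try the installation again.</p>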
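+<p>Both <tt class="docutils literal"><span class="pre">ImportError</span></tt> messages come from a module rename: the standard-library parser module is called <tt class="docutils literal"><span class="pre">HTMLParser</span></tt> on Python 2 and <tt class="docutils literal"><span class="pre">html.parser</span></tt> on Python 3. The following is a small diagnostic sketch, not part of Beautiful Soup; the helper name <tt class="docutils literal"><span class="pre">stdlib_parser_module</span></tt> is hypothetical. It reports which of the two names the running interpreter can import, which tells you which version of the code that interpreter expects:</p>

```python
import importlib
import sys


def stdlib_parser_module():
    """Return the name of the standard-library HTML parser module that
    the running interpreter can import (hypothetical helper)."""
    for name in ("html.parser", "HTMLParser"):
        try:
            importlib.import_module(name)
            return name
        except ImportError:
            # This name does not exist on the running Python version.
            continue
    return None


print("Python %d.%d provides: %s" % (
    sys.version_info[0], sys.version_info[1], stdlib_parser_module()))
```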
+<p>If you get the <tt class="docutils literal"><span class="pre">SyntaxError</span></tt> “Invalid syntax” on the line <tt class="docutils literal"><span class="pre">ROOT_TAG_NAME</span> <span class="pre">=</span> <span class="pre">u'[document]'</span></tt>, you need to convert the Python 2 code to Python 3. You can do this either by installing the package:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">python3</span> <span class="pre">setup.py</span> <span class="pre">install</span></tt></p>
+<p>or by manually running Python's <tt class="docutils literal"><span class="pre">2to3</span></tt> conversion script
+on the <tt class="docutils literal"><span class="pre">bs4</span></tt> directory:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">2to3-3.2</span> <span class="pre">-w</span> <span class="pre">bs4</span></tt></p>
+</div>
+<div class="section" id="installing-a-parser">
+<span id="parser-installation"></span><h2>Installing a parser<a class="headerlink" href="#installing-a-parser" title="Permalink to this headline">¶</a></h2>
+<p>Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the <a class="reference external" href="https://web.archive.org/web/20150319200824/http://lxml.de/">lxml parser</a>. Depending on your setup, you might install lxml with one of these commands:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">python-lxml</span></tt></p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">lxml</span></tt></p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">lxml</span></tt></p>
+<p>If you're using Python 2, another alternative is the pure-Python <a class="reference external" href="https://web.archive.org/web/20150319200824/http://code.google.com/p/html5lib/">html5lib parser</a>, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">python-html5lib</span></tt></p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">html5lib</span></tt></p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">html5lib</span></tt></p>
+<p>This table summarizes the advantages and disadvantages of each parser library:</p>
+<table class="docutils" border="1">
+<colgroup>
+<col width="18%">
+<col width="35%">
+<col width="26%">
+<col width="21%">
+</colgroup>
+<tbody valign="top">
+<tr class="row-odd"><td>Parser</td>
+<td>Typical usage</td>
+<td>Advantages</td>
+<td>Disadvantages</td>
+</tr>
+<tr class="row-even"><td>Python's html.parser</td>
+<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"html.parser")</span></tt></td>
+<td><ul class="first last simple">
+<li>Batteries included</li>
+<li>Decent speed</li>
+<li>Lenient (as of Python 2.7.3 and 3.2.2)</li>
+</ul>
+</td>
+<td><ul class="first last simple">
+<li>Not very lenient
+(before Python 2.7.3 or 3.2.2)</li>
+</ul>
+</td>
+</tr>
+<tr class="row-odd"><td>lxml's HTML parser</td>
+<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"lxml")</span></tt></td>
+<td><ul class="first last simple">
+<li>Very fast</li>
+<li>Lenient</li>
+</ul>
+</td>
+<td><ul class="first last simple">
+<li>External C dependency</li>
+</ul>
+</td>
+</tr>
+<tr class="row-even"><td>lxml's XML parser</td>
+<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">["lxml",</span> <span class="pre">"xml"])</span></tt>
+<tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"xml")</span></tt></td>
+<td><ul class="first last simple">
+<li>Very fast</li>
+<li>The only currently supported XML parser</li>
+</ul>
+</td>
+<td><ul class="first last simple">
+<li>External C dependency</li>
+</ul>
+</td>
+</tr>
+<tr class="row-odd"><td>html5lib</td>
+<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"html5lib")</span></tt></td>
+<td><ul class="first last simple">
+<li>Extremely lenient</li>
+<li>Parses pages the same way a web browser does</li>
+<li>Creates valid HTML5</li>
+</ul>
+</td>
+<td><ul class="first last simple">
+<li>Very slow</li>
+<li>External Python dependency</li>
+<li>Python 2 only</li>
+</ul>
+</td>
+</tr>
+</tbody>
+</table>
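+<p>If you can, I recommend you install and use lxml for speed. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's <cite>essential</cite> that you install lxml or html5lib, because Python's built-in HTML parser is just not very good in those older versions.</p>
+<p>Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See <a class="reference internal" href="#differences-between-parsers">Differences between parsers</a> for details.</p>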
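+<p>To see which of the parsers in the table are usable on a given machine, one option is to try each name and catch the error Beautiful Soup raises when no matching parser is installed. This is only a sketch: the helper name <tt class="docutils literal"><span class="pre">available_parsers</span></tt> is hypothetical, and it assumes a release of Beautiful Soup 4 recent enough to export <tt class="docutils literal"><span class="pre">FeatureNotFound</span></tt>:</p>

```python
from bs4 import BeautifulSoup, FeatureNotFound


def available_parsers(candidates=("lxml", "html5lib", "html.parser")):
    """Return the parser names from `candidates` that Beautiful Soup
    can use in this environment (hypothetical helper)."""
    usable = []
    for name in candidates:
        try:
            BeautifulSoup("<p>probe</p>", name)
            usable.append(name)
        except FeatureNotFound:
            # No installed tree builder matches the requested name.
            pass
    return usable


print(available_parsers())
```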
+</div>
+</div>
+<div class="section" id="making-the-soup">
+<h1>Making the soup<a class="headerlink" href="#making-the-soup" title="Permalink to this headline">¶</a></h1>
+<p>To parse a document, pass it into the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor. You can pass in a string or an open filehandle:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
+
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="s">"index.html"</span><span class="p">))</span>
+
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&lt;html&gt;data&lt;/html&gt;"</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:</p>
+<div class="highlight-python"><pre>BeautifulSoup("Sacr&amp;eacute; bleu!")
+&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;Sacré bleu!&lt;/body&gt;&lt;/html&gt;</pre>
+</div>
+<p>Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See <a class="reference internal" href="#id11">Parsing XML</a>.)</p>
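+<p>Because the parser that gets picked depends on what is installed, it can help to name one explicitly. The sketch below assumes Python 3 and passes <tt class="docutils literal"><span class="pre">"html.parser"</span></tt> as the second argument; unlike some other parsers, this one does not wrap the fragment in &lt;html&gt; and &lt;body&gt; tags:</p>

```python
from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup("Sacr&eacute; bleu!", "html.parser")

# The HTML entity has been converted to the Unicode character it stands for.
text = soup.get_text()
print(text)
```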
+</div>
+<div class="section" id="kinds-of-objects">
+<h1>Kinds of objects<a class="headerlink" href="#kinds-of-objects" title="Permalink to this headline">¶</a></h1>
+<p>Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four
+<cite>kinds</cite> of objects.</p>
+<div class="section" id="tag">
+<span id="id1"></span><h2><tt class="docutils literal"><span class="pre">Tag</span></tt><a class="headerlink" href="#tag" title="Permalink to this headline">¶</a></h2>
+<p>A <tt class="docutils literal"><span class="pre">Tag</span></tt> object corresponds to an XML or HTML tag in the original document:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'&lt;b class="boldest"&gt;Extremely bold&lt;/b&gt;'</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
+<span class="nb">type</span><span class="p">(</span><span class="n">tag</span><span class="p">)</span>
+<span class="c"># &lt;class 'bs4.element.Tag'&gt;</span>
+</pre></div>
+</div>
+<p>Tags have a lot of attributes and methods, and I'll cover most of them in <a class="reference internal" href="#navigating-the-tree">Navigating the tree</a> and <a class="reference internal" href="#searching-the-tree">Searching the tree</a>. For now, the most important features of a tag are its name and attributes.</p>
+<div class="section" id="name">
+<h3>Name<a class="headerlink" href="#name" title="Permalink to this headline">¶</a></h3>
+<p>Every tag has a name, accessible as <tt class="docutils literal"><span class="pre">.name</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">name</span>
+<span class="c"># u'b'</span>
+</pre></div>
+</div>
+<p>If you change a tag's name, the change will be reflected in any HTML markup generated by Beautiful Soup:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"blockquote"</span>
+<span class="n">tag</span>
+<span class="c"># &lt;blockquote class="boldest"&gt;Extremely bold&lt;/blockquote&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="attributes">
+<h3>Attributes<a class="headerlink" href="#attributes" title="Permalink to this headline">¶</a></h3>
+<p>A tag may have any number of attributes. The tag <tt class="docutils literal"><span class="pre">&lt;b</span>
+<span class="pre">class="boldest"&gt;</span></tt> has an attribute “class” whose value is
+“boldest”. You can access a tag's attributes by treating the tag like a dictionary:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
+<span class="c"># [u'boldest']</span>
+</pre></div>
+</div>
+<p>You can access that dictionary directly as <tt class="docutils literal"><span class="pre">.attrs</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">attrs</span>
+<span class="c"># {u'class': [u'boldest']}</span>
+</pre></div>
+</div>
+<p>You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'verybold'</span>
+<span class="n">tag</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
+<span class="n">tag</span>
+<span class="c"># &lt;blockquote class="verybold" id="1"&gt;Extremely bold&lt;/blockquote&gt;</span>
+
+<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
+<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span>
+<span class="n">tag</span>
+<span class="c"># &lt;blockquote&gt;Extremely bold&lt;/blockquote&gt;</span>
+
+<span class="n">tag</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
+<span class="c"># KeyError: 'class'</span>
+<span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'class'</span><span class="p">))</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
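+<p>The same dictionary-style operations can be sketched for Python 3 as follows. The markup is a hypothetical variation on the example above that uses <tt class="docutils literal"><span class="pre">id</span></tt> and <tt class="docutils literal"><span class="pre">lang</span></tt>, which are plain single-valued attributes, and it names the parser explicitly:</p>

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">Extremely bold</b>', "html.parser").b

tag["id"] = "verybold"      # modify an existing attribute
tag["lang"] = "en"          # add a new attribute
attrs_snapshot = dict(tag.attrs)

del tag["lang"]             # remove an attribute
missing = tag.get("lang")   # .get() returns None instead of raising KeyError
has_id = tag.has_attr("id")
print(tag, missing, has_id)
```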
+<div class="section" id="multi-valued-attributes">
+<span id="multivalue"></span><h4>Multi-valued attributes<a class="headerlink" href="#multi-valued-attributes" title="Permalink to this headline">¶</a></h4>
+<p>HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is <tt class="docutils literal"><span class="pre">class</span></tt> (that is, a tag can have more than one CSS class). Others include <tt class="docutils literal"><span class="pre">rel</span></tt>, <tt class="docutils literal"><span class="pre">rev</span></tt>, <tt class="docutils literal"><span class="pre">accept-charset</span></tt>,
+<tt class="docutils literal"><span class="pre">headers</span></tt>, and <tt class="docutils literal"><span class="pre">accesskey</span></tt>. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">css_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'&lt;p class="body strikeout"&gt;&lt;/p&gt;'</span><span class="p">)</span>
+<span class="n">css_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
+<span class="c"># ["body", "strikeout"]</span>
+
+<span class="n">css_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'&lt;p class="body"&gt;&lt;/p&gt;'</span><span class="p">)</span>
+<span class="n">css_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
+<span class="c"># ["body"]</span>
+</pre></div>
+</div>
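+<p>If an attribute <cite>looks</cite> like it has more than one value, but it's not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:</p>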
+<div class="highlight-python"><div class="highlight"><pre><span class="n">id_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'&lt;p id="my id"&gt;&lt;/p&gt;'</span><span class="p">)</span>
+<span class="n">id_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span>
+<span class="c"># 'my id'</span>
+</pre></div>
+</div>
+<p>When you turn a tag back into a string, multiple attribute values are consolidated:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">rel_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'&lt;p&gt;Back to the &lt;a rel="index"&gt;homepage&lt;/a&gt;&lt;/p&gt;'</span><span class="p">)</span>
+<span class="n">rel_soup</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s">'rel'</span><span class="p">]</span>
+<span class="c"># ['index']</span>
+<span class="n">rel_soup</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s">'rel'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="s">'index'</span><span class="p">,</span> <span class="s">'contents'</span><span class="p">]</span>
+<span class="k">print</span><span class="p">(</span><span class="n">rel_soup</span><span class="o">.</span><span class="n">p</span><span class="p">)</span>
+<span class="c"># &lt;p&gt;Back to the &lt;a rel="index contents"&gt;homepage&lt;/a&gt;&lt;/p&gt;</span>
+</pre></div>
+</div>
+<p>If you parse a document as XML, there are no multi-valued attributes:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">xml_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'&lt;p class="body strikeout"&gt;&lt;/p&gt;'</span><span class="p">,</span> <span class="s">'xml'</span><span class="p">)</span>
+<span class="n">xml_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
+<span class="c"># u'body strikeout'</span>
+</pre></div>
+</div>
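+<p>A short sketch of the list behavior under an HTML parser, assuming Python 3 and made-up markup that carries both a multi-valued <tt class="docutils literal"><span class="pre">class</span></tt> and a single-valued <tt class="docutils literal"><span class="pre">id</span></tt> on the same tag:</p>

```python
from bs4 import BeautifulSoup

p = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser").p

ident = p["id"]                  # not multi-valued: kept as one string
p["class"].append("highlight")   # multi-valued: the value is a list you can edit
classes = list(p["class"])

# Turning the tag back into markup joins the list with spaces.
markup = str(p)
print(classes, ident, markup)
```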
+</div>
+</div>
+</div>
+<div class="section" id="navigablestring">
+<h2><tt class="docutils literal"><span class="pre">NavigableString</span></tt><a class="headerlink" href="#navigablestring" title="Permalink to this headline">¶</a></h2>
+<p>A string corresponds to a bit of text within a tag. Beautiful Soup uses the <tt class="docutils literal"><span class="pre">NavigableString</span></tt> class to contain these bits of text:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">string</span>
+<span class="c"># u'Extremely bold'</span>
+<span class="nb">type</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
+<span class="c"># &lt;class 'bs4.element.NavigableString'&gt;</span>
+</pre></div>
+</div>
+<p>
+A <tt class="docutils literal"><span class="pre">NavigableString</span></tt> is just like a Python Unicode string, except that it also supports some of the features described in <a class="reference internal" href="#navigating-the-tree">Navigating the tree</a> and <a class="reference internal" href="#searching-the-tree">Searching the tree</a>.
+You can convert a <tt class="docutils literal"><span class="pre">NavigableString</span></tt> to a Unicode string with <tt class="docutils literal"><span class="pre">unicode()</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">unicode_string</span> <span class="o">=</span> <span class="nb">unicode</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
+<span class="n">unicode_string</span>
+<span class="c"># u'Extremely bold'</span>
+<span class="nb">type</span><span class="p">(</span><span class="n">unicode_string</span><span class="p">)</span>
+<span class="c"># &lt;type 'unicode'&gt;</span>
+</pre></div>
+</div>
+<p>You can't edit a string in place, but you can replace one string with another, using <a class="reference internal" href="#replace-with"><em>replace_with()</em></a>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="s">"No longer bold"</span><span class="p">)</span>
+<span class="n">tag</span>
+<span class="c"># &lt;blockquote&gt;No longer bold&lt;/blockquote&gt;</span>
+</pre></div>
+</div>
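+<p>The <tt class="docutils literal"><span class="pre">unicode()</span></tt> built-in exists only on Python 2. As a sketch under the assumption that you are running Python 3, the same conversion uses <tt class="docutils literal"><span class="pre">str()</span></tt>, and <tt class="docutils literal"><span class="pre">replace_with()</span></tt> works as shown above:</p>

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

tag = BeautifulSoup("<b>Extremely bold</b>", "html.parser").b

# tag.string is a NavigableString that is still attached to the tree.
is_navigable = isinstance(tag.string, NavigableString)

# str() produces a plain Python 3 string with no link back to the tree.
plain = str(tag.string)

tag.string.replace_with("No longer bold")
print(plain, tag)
```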
+<p><tt class="docutils literal"><span class="pre">NavigableString</span></tt> supports most of the features described in <a class="reference internal" href="#navigating-the-tree">Navigating the tree</a> and <a class="reference internal" href="#searching-the-tree">Searching the tree</a>, but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support the <tt class="docutils literal"><span class="pre">.contents</span></tt> or <tt class="docutils literal"><span class="pre">.string</span></tt> attributes, or the <tt class="docutils literal"><span class="pre">find()</span></tt> method.</p>
+</div>
+<div class="section" id="beautifulsoup">
+<h2><tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt><a class="headerlink" href="#beautifulsoup" title="Permalink to this headline">¶</a></h2>
+<p>
+The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself represents the document as a whole. For most purposes, you can treat it as a <a class="reference internal" href="#tag"><em>Tag</em></a> object. This means it supports most of the methods described in <a class="reference internal" href="#navigating-the-tree">Navigating the tree</a> and <a class="reference internal" href="#searching-the-tree">Searching the tree</a>.</p>
+<p>
+Since the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object doesn't correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it's useful to look at its <tt class="docutils literal"><span class="pre">.name</span></tt>, so it's been given the special
+<tt class="docutils literal"><span class="pre">.name</span></tt> “[document]”:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">name</span>
+<span class="c"># u'[document]'</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="comments-and-other-special-strings">
+<h2>Comments and other special strings<a class="headerlink" href="#comments-and-other-special-strings" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">Tag</span></tt>, <tt class="docutils literal"><span class="pre">NavigableString</span></tt>, and <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> cover almost everything you'll see in an HTML or XML file, but there are a few leftover bits. The only one you'll probably ever need to worry about is the comment:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">"&lt;b&gt;&lt;!--Hey, buddy. Want to buy a used parser?--&gt;&lt;/b&gt;"</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">comment</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span>
+<span class="nb">type</span><span class="p">(</span><span class="n">comment</span><span class="p">)</span>
+<span class="c"># &lt;class 'bs4.element.Comment'&gt;</span>
+</pre></div>
+</div>
+<p>
+The <tt class="docutils literal"><span class="pre">Comment</span></tt> object is just a special type of <tt class="docutils literal"><span class="pre">NavigableString</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">comment</span>
+<span class="c"># u'Hey, buddy. Want to buy a used parser?'</span>
+</pre></div>
+</div>
+<p>But when it appears as part of an HTML document, a <tt class="docutils literal"><span class="pre">Comment</span></tt> is displayed with special formatting:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;b&gt;</span>
+<span class="c"># &lt;!--Hey, buddy. Want to buy a used parser?--&gt;</span>
+<span class="c"># &lt;/b&gt;</span>
+</pre></div>
+</div>
+<p>Beautiful Soup defines classes for anything else that might show up in an XML document: <tt class="docutils literal"><span class="pre">CData</span></tt>, <tt class="docutils literal"><span class="pre">ProcessingInstruction</span></tt>,
+<tt class="docutils literal"><span class="pre">Declaration</span></tt>, and <tt class="docutils literal"><span class="pre">Doctype</span></tt>. Just like <tt class="docutils literal"><span class="pre">Comment</span></tt>, these classes are subclasses of <tt class="docutils literal"><span class="pre">NavigableString</span></tt> that add something extra to the string. Here's an example that replaces the comment with a CDATA block:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">CData</span>
+<span class="n">cdata</span> <span class="o">=</span> <span class="n">CData</span><span class="p">(</span><span class="s">"A CDATA block"</span><span class="p">)</span>
+<span class="n">comment</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="n">cdata</span><span class="p">)</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;b&gt;</span>
+<span class="c"># &lt;![CDATA[A CDATA block]]&gt;</span>
+<span class="c"># &lt;/b&gt;</span>
+</pre></div>
+</div>
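+<p>Since <tt class="docutils literal"><span class="pre">Comment</span></tt> is a subclass of <tt class="docutils literal"><span class="pre">NavigableString</span></tt>, an <tt class="docutils literal"><span class="pre">isinstance()</span></tt> check is enough to tell comments apart from ordinary text. The following sketch, written for Python 3 with hypothetical markup, collects every comment in a document by walking all of its nodes:</p>

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: one comment and one ordinary string.
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b><i>plain text</i>"
soup = BeautifulSoup(markup, "html.parser")

# Walk every node in the tree and keep only the Comment instances.
comments = [node for node in soup.descendants if isinstance(node, Comment)]
print(comments)
```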
+</div>
+</div>
+<div class="section" id="navigating-the-tree">
+<h1>Navigating the tree<a class="headerlink" href="#navigating-the-tree" title="Permalink to this headline">¶</a></h1>
+<p>Here's the “Three sisters” HTML document again:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">html_doc</span> <span class="o">=</span> <span class="s">"""</span>
+<span class="s">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;</span>
+
+<span class="s">&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</span>
+
+<span class="s">&lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were</span>
+<span class="s">&lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="s">&lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt; and</span>
+<span class="s">&lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;;</span>
+<span class="s">and they lived at the bottom of a well.&lt;/p&gt;</span>
+
+<span class="s">&lt;p class="story"&gt;...&lt;/p&gt;</span>
+<span class="s">"""</span>
+
+<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>I'll use this as an example to show you how to move from one part of a document to another.</p>
+<div class="section" id="going-down">
+<h2>Going down<a class="headerlink" href="#going-down" title="Permalink to this headline">¶</a></h2>
+<p>Tags may contain strings and other tags. These elements are the tag's <cite>children</cite>. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag's children.</p>
+<p>Note that Beautiful Soup strings don't support any of these attributes, because a string can't have children.</p>
+<div class="section" id="navigating-using-tag-names">
+<h3>Navigating using tag names<a class="headerlink" href="#navigating-using-tag-names" title="Permalink to this headline">¶</a></h3>
+<p>The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the &lt;head&gt; tag, just say <tt class="docutils literal"><span class="pre">soup.head</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">head</span>
+<span class="c"># &lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">title</span>
+<span class="c"># &lt;title&gt;The Dormouse's story&lt;/title&gt;</span>
+</pre></div>
+</div>
+<p>You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first &lt;b&gt; tag beneath the &lt;body&gt; tag:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">body</span><span class="o">.</span><span class="n">b</span>
+<span class="c"># &lt;b&gt;The Dormouse's story&lt;/b&gt;</span>
+</pre></div>
+</div>
+<p>Using a tag name as an attribute will give you only the <cite>first</cite> tag by that name:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;</span>
+</pre></div>
+</div>
+<p>
+If you need to get <cite>all</cite> the &lt;a&gt; tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in <a class="reference internal" href="#searching-the-tree">Searching the tree</a>, such as <cite>find_all()</cite>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
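+<p>To make the relationship between attribute-style access and the search methods concrete, here is a small sketch for Python 3 with made-up markup. It also shows that asking for a tag name that does not occur in the document yields <tt class="docutils literal"><span class="pre">None</span></tt> rather than an error:</p>

```python
from bs4 import BeautifulSoup

# Hypothetical two-link fragment in the spirit of the "Three sisters" document.
html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a              # attribute access: the first <a> only
same = soup.find("a")       # the equivalent explicit search
missing = soup.table        # there is no <table>, so this is None
ids = [a["id"] for a in soup.find_all("a")]
print(first, missing, ids)
```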
+</div>
+<div class="section" id="contents-and-children">
+<h3><tt class="docutils literal"><span class="pre">.contents</span></tt> and <tt class="docutils literal"><span class="pre">.children</span></tt><a class="headerlink" href="#contents-and-children" title="Permalink to this headline">¶</a></h3>
+<p>A tag's children are available in a list called <tt class="docutils literal"><span class="pre">.contents</span></tt>:</p>
+<div class="highlight-python"><pre>head_tag = soup.head
+head_tag
+# &lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;
+
+head_tag.contents
+# [&lt;title&gt;The Dormouse's story&lt;/title&gt;]
+
+title_tag = head_tag.contents[0]
+title_tag
+# &lt;title&gt;The Dormouse's story&lt;/title&gt;
+title_tag.contents
+# [u"The Dormouse's story"]</pre>
+</div>
+<p>The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself has children. In this case, the &lt;html&gt; tag is the child of the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="nb">len</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">contents</span><span class="p">)</span>
+<span class="c"># 1</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">contents</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">name</span>
+<span class="c"># u'html'</span>
+</pre></div>
+</div>
+<p>A string does not have <tt class="docutils literal"><span class="pre">.contents</span></tt>, because it can't contain anything:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">text</span> <span class="o">=</span> <span class="n">title_tag</span><span class="o">.</span><span class="n">contents</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
+<span class="n">text</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># AttributeError: 'NavigableString' object has no attribute 'contents'</span>
+</pre></div>
+</div>
+<p>Instead of getting them as a list, you can iterate over a tag's children using the <tt class="docutils literal"><span class="pre">.children</span></tt> generator:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">title_tag</span><span class="o">.</span><span class="n">children</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">child</span><span class="p">)</span>
+<span class="c"># The Dormouse's story</span>
+</pre></div>
+</div>
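+<p>The practical difference between the two attributes is that <tt class="docutils literal"><span class="pre">.contents</span></tt> is a list, which supports <tt class="docutils literal"><span class="pre">len()</span></tt> and indexing, while <tt class="docutils literal"><span class="pre">.children</span></tt> has to be consumed by iteration. A minimal sketch, assuming Python 3 and a cut-down version of the example document:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

as_list = head_tag.contents     # a list: supports len() and indexing
first_child = as_list[0]

# .children yields the same nodes one at a time.
child_names = [child.name for child in head_tag.children]
print(len(as_list), first_child.name, child_names)
```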
+</div>
+<div class="section" id="descendants">
+<h3><tt class="docutils literal"><span class="pre">.descendants</span></tt><a class="headerlink" href="#descendants" title="Permalink to this headline">¶</a></h3>
+<p>
+The <tt class="docutils literal"><span class="pre">.contents</span></tt> and <tt class="docutils literal"><span class="pre">.children</span></tt> attributes only consider a tag's
+<cite>direct</cite> children. For instance, the &lt;head&gt; tag has a single direct child, the &lt;title&gt; tag:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">head_tag</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># [&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span>
+</pre></div>
+</div>
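+<p>But the &lt;title&gt; tag itself has a child: the string “The Dormouse’s
+story”. There's a sense in which that string is also a child of the &lt;head&gt; tag. The <tt class="docutils literal"><span class="pre">.descendants</span></tt> attribute lets you iterate over <cite>all</cite> of a tag's children, recursively: its direct children, the children of its direct children, and so on:</p>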
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">head_tag</span><span class="o">.</span><span class="n">descendants</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">child</span><span class="p">)</span>
+<span class="c"># &lt;title&gt;The Dormouse's story&lt;/title&gt;</span>
+<span class="c"># The Dormouse's story</span>
+</pre></div>
+</div>
+<p>The &lt;head&gt; tag has only one child, but it has two descendants: the &lt;title&gt; tag and the &lt;title&gt; tag's child. The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object has only one direct child (the &lt;html&gt; tag), but it has many descendants:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">children</span><span class="p">))</span>
+<span class="c"># 1</span>
+<span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">descendants</span><span class="p">))</span>
+<span class="c"># 25</span>
+</pre></div>
+</div>
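+<p>The difference can be checked on a smaller input. This is a minimal sketch, assuming the <tt class="docutils literal"><span class="pre">bs4</span></tt> package is installed and using the standard library's <tt class="docutils literal"><span class="pre">html.parser</span></tt>; the one-line document is illustrative and is not the “three sisters” document:</p>

```python
from bs4 import BeautifulSoup

# A tiny illustrative document; html.parser adds no <html>/<body> wrapper.
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

# .children yields direct children only: just the <title> tag.
direct = list(head_tag.children)

# .descendants walks the subtree recursively: the <title> tag, then its string.
everything = list(head_tag.descendants)

print(len(direct), len(everything))
```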
+</div>
+<div class="section" id="string">
+<span id="id2"></span><h3><tt class="docutils literal"><span class="pre">.string</span></tt><a class="headerlink" href="#string" title="Permalink to this headline">¶</a></h3>
+<p>If a tag has only one child, and that child is a <tt class="docutils literal"><span class="pre">NavigableString</span></tt>, the child is available as <tt class="docutils literal"><span class="pre">.string</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">title_tag</span><span class="o">.</span><span class="n">string</span>
+<span class="c"># u'The Dormouse's story'</span>
+</pre></div>
+</div>
+<p>If a tag's only child is another tag, and <cite>that</cite> tag has a <tt class="docutils literal"><span class="pre">.string</span></tt>, then the parent tag is considered to have the same <tt class="docutils literal"><span class="pre">.string</span></tt> as its child:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">head_tag</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># [&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span>
+
+<span class="n">head_tag</span><span class="o">.</span><span class="n">string</span>
+<span class="c"># u'The Dormouse's story'</span>
+</pre></div>
+</div>
+<p>If a tag contains more than one thing, it is not clear what <tt class="docutils literal"><span class="pre">.string</span></tt> should refer to, so in that case <tt class="docutils literal"><span class="pre">.string</span></tt> is defined to be <tt class="docutils literal"><span class="pre">None</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="strings-and-stripped-strings">
+<span id="string-generators"></span><h3><tt class="docutils literal"><span class="pre">.strings</span></tt> and <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt><a class="headerlink" href="#strings-and-stripped-strings" title="Permalink to this headline">¶</a></h3>
+<p>If there is more than one thing inside a tag, you can still look at just the strings. Use the <tt class="docutils literal"><span class="pre">.strings</span></tt> generator:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">string</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">strings</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">string</span><span class="p">))</span>
+<span class="c"># u"The Dormouse's story"</span>
+<span class="c"># u'\n\n'</span>
+<span class="c"># u"The Dormouse's story"</span>
+<span class="c"># u'\n\n'</span>
+<span class="c"># u'Once upon a time there were three little sisters; and their names were\n'</span>
+<span class="c"># u'Elsie'</span>
+<span class="c"># u',\n'</span>
+<span class="c"># u'Lacie'</span>
+<span class="c"># u' and\n'</span>
+<span class="c"># u'Tillie'</span>
+<span class="c"># u';\nand they lived at the bottom of a well.'</span>
+<span class="c"># u'\n\n'</span>
+<span class="c"># u'...'</span>
+<span class="c"># u'\n'</span>
+</pre></div>
+</div>
+<p>These strings tend to have a lot of extra whitespace, which you can remove by using the <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt> generator instead:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">string</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">stripped_strings</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">string</span><span class="p">))</span>
+<span class="c"># u"The Dormouse's story"</span>
+<span class="c"># u"The Dormouse's story"</span>
+<span class="c"># u'Once upon a time there were three little sisters; and their names were'</span>
+<span class="c"># u'Elsie'</span>
+<span class="c"># u','</span>
+<span class="c"># u'Lacie'</span>
+<span class="c"># u'and'</span>
+<span class="c"># u'Tillie'</span>
+<span class="c"># u';\nand they lived at the bottom of a well.'</span>
+<span class="c"># u'...'</span>
+</pre></div>
+</div>
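+<p>Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of each string is removed. Whitespace inside a string, such as an embedded newline, is kept.</p>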
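+<p>The following sketch contrasts the attributes on a tag with mixed content. It assumes the <tt class="docutils literal"><span class="pre">bs4</span></tt> package is installed; the one-paragraph document and the variable names are made up for illustration:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>One <b>two</b> three</p>", "html.parser")
p_tag = soup.p

# <p> has three children (string, <b>, string), so .string is None.
ambiguous = p_tag.string

# <b> has exactly one NavigableString child, so .string returns it.
single = p_tag.b.string

# .stripped_strings still reaches every piece of text, without padding.
pieces = list(p_tag.stripped_strings)

print(ambiguous, single, pieces)
```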
+</div>
+</div>
+<div class="section" id="going-up">
+<h2>Going up<a class="headerlink" href="#going-up" title="Permalink to this headline">¶</a></h2>
+<p>Continuing the “family tree” analogy, every tag and every string has a <cite>parent</cite>: the tag that contains it.</p>
+<div class="section" id="parent">
+<span id="id3"></span><h3><tt class="docutils literal"><span class="pre">.parent</span></tt><a class="headerlink" href="#parent" title="Permalink to this headline">¶</a></h3>
+<p>You can access an element's parent with the <tt class="docutils literal"><span class="pre">.parent</span></tt> attribute. In the example “three sisters” document, the &lt;head&gt; tag is the parent of the &lt;title&gt; tag:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">title_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">title</span>
+<span class="n">title_tag</span>
+<span class="c"># &lt;title&gt;The Dormouse's story&lt;/title&gt;</span>
+<span class="n">title_tag</span><span class="o">.</span><span class="n">parent</span>
+<span class="c"># &lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;</span>
+</pre></div>
+</div>
+<p>The title string itself has a parent: the &lt;title&gt; tag that contains it:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">title_tag</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">parent</span>
+<span class="c"># &lt;title&gt;The Dormouse's story&lt;/title&gt;</span>
+</pre></div>
+</div>
+<p>The parent of a top-level tag like &lt;html&gt; is the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">html_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">html</span>
+<span class="nb">type</span><span class="p">(</span><span class="n">html_tag</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
+<span class="c"># &lt;class 'bs4.BeautifulSoup'&gt;</span>
+</pre></div>
+</div>
+<p>The <tt class="docutils literal"><span class="pre">.parent</span></tt> of a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object is defined as <tt class="docutils literal"><span class="pre">None</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="parents">
+<span id="id4"></span><h3><tt class="docutils literal"><span class="pre">.parents</span></tt><a class="headerlink" href="#parents" title="Permalink to this headline">¶</a></h3>
+<p>You can iterate over all of an element's parents with <tt class="docutils literal"><span class="pre">.parents</span></tt>. This example uses <tt class="docutils literal"><span class="pre">.parents</span></tt> to travel from an &lt;a&gt; tag buried deep within the document to the very top of the document:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="n">link</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;</span>
+<span class="k">for</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">link</span><span class="o">.</span><span class="n">parents</span><span class="p">:</span>
+    <span class="k">print</span><span class="p">(</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
+<span class="c"># p</span>
+<span class="c"># body</span>
+<span class="c"># html</span>
+<span class="c"># [document]</span>
+</pre></div>
+</div>
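+<p>As a self-contained check of the ancestor chain, here is a minimal sketch. It assumes the <tt class="docutils literal"><span class="pre">bs4</span></tt> package is installed; the short document and the name <tt class="docutils literal"><span class="pre">ancestor_names</span></tt> are illustrative only:</p>

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Collect the name of every ancestor, innermost first. The BeautifulSoup
# object itself comes last; its name is "[document]".
ancestor_names = [parent.name for parent in soup.a.parents]

print(ancestor_names)
```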
+</div>
+</div>
+<div class="section" id="going-sideways">
+<h2>Going sideways<a class="headerlink" href="#going-sideways" title="Permalink to this headline">¶</a></h2>
+<p>Consider a simple document like this:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">sibling_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&lt;a&gt;&lt;b&gt;text1&lt;/b&gt;&lt;c&gt;text2&lt;/c&gt;&lt;/a&gt;"</span><span class="p">)</span>
+<span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;a&gt;</span>
+<span class="c"># &lt;b&gt;</span>
+<span class="c"># text1</span>
+<span class="c"># &lt;/b&gt;</span>
+<span class="c"># &lt;c&gt;</span>
+<span class="c"># text2</span>
+<span class="c"># &lt;/c&gt;</span>
+<span class="c"># &lt;/a&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
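+<p>The &lt;b&gt; tag and the &lt;c&gt; tag are at the same level: they are both direct children of the same tag. We call them <cite>siblings</cite>. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.</p>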
+<div class="section" id="next-sibling-and-previous-sibling">
+<h3><tt class="docutils literal"><span class="pre">.next_sibling</span></tt> and <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt><a class="headerlink" href="#next-sibling-and-previous-sibling" title="Permalink to this headline">¶</a></h3>
+<p>You can use <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> and <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt> to navigate between page elements that are on the same level of the parse tree:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">next_sibling</span>
+<span class="c"># &lt;c&gt;text2&lt;/c&gt;</span>
+
+<span class="n">sibling_soup</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">previous_sibling</span>
+<span class="c"># &lt;b&gt;text1&lt;/b&gt;</span>
+</pre></div>
+</div>
+<p>The &lt;b&gt; tag has a <tt class="docutils literal"><span class="pre">.next_sibling</span></tt>, but no <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt>, because there is nothing before the &lt;b&gt; tag <cite>on the same level of the tree</cite>. For the same reason, the &lt;c&gt; tag has a <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt> but no <tt class="docutils literal"><span class="pre">.next_sibling</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">previous_sibling</span><span class="p">)</span>
+<span class="c"># None</span>
+<span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">next_sibling</span><span class="p">)</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
+<p>The strings “text1” and “text2” are <cite>not</cite> siblings, because they do not have the same parent:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span>
+<span class="c"># u'text1'</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">next_sibling</span><span class="p">)</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
+<p>In real documents, the <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> or <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt> of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:</p>
+<div class="highlight-python"><pre>&lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/a&gt;
+&lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt;
+&lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;</pre>
+</div>
+<p>You might think that the <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> of the first &lt;a&gt; tag would be the second &lt;a&gt; tag. But actually, it is a string: the comma and newline that separate the first &lt;a&gt; tag from the second:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="n">link</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;</span>
+
+<span class="n">link</span><span class="o">.</span><span class="n">next_sibling</span>
+<span class="c"># u',\n'</span>
+</pre></div>
+</div>
+<p>The second &lt;a&gt; tag is actually the <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> of that comma:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">link</span><span class="o">.</span><span class="n">next_sibling</span><span class="o">.</span><span class="n">next_sibling</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;</span>
+</pre></div>
+</div>
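+<p>When only tags are of interest, the intervening strings can be skipped. The sketch below uses <tt class="docutils literal"><span class="pre">find_next_sibling()</span></tt>, a Beautiful Soup method that is not covered in this part of the text; the two-link document is made up for illustration and the <tt class="docutils literal"><span class="pre">bs4</span></tt> package is assumed to be installed:</p>

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

# The immediate sibling is the text between the two tags.
text_between = first_link.next_sibling

# find_next_sibling() skips strings and returns the next matching tag.
second_link = first_link.find_next_sibling("a")

print(repr(text_between), second_link["id"])
```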
+</div>
+<div class="section" id="next-siblings-and-previous-siblings">
+<span id="sibling-generators"></span><h3><tt class="docutils literal"><span class="pre">.next_siblings</span></tt> and <tt class="docutils literal"><span class="pre">.previous_siblings</span></tt><a class="headerlink" href="#next-siblings-and-previous-siblings" title="Permalink to this headline">¶</a></h3>
+<p>You can iterate over a tag's siblings with <tt class="docutils literal"><span class="pre">.next_siblings</span></tt> or <tt class="docutils literal"><span class="pre">.previous_siblings</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">sibling</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">next_siblings</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">sibling</span><span class="p">))</span>
+<span class="c"># u',\n'</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;</span>
+<span class="c"># u' and\n'</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;</span>
+<span class="c"># u';\nand they lived at the bottom of a well.'</span>
+
+<span class="k">for</span> <span class="n">sibling</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">"link3"</span><span class="p">)</span><span class="o">.</span><span class="n">previous_siblings</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">sibling</span><span class="p">))</span>
+<span class="c"># u' and\n'</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;</span>
+<span class="c"># u',\n'</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;</span>
+<span class="c"># u'Once upon a time there were three little sisters; and their names were\n'</span>
+</pre></div>
+</div>
+</div>
+</div>
+<div class="section" id="going-back-and-forth">
+<h2>Going back and forth<a class="headerlink" href="#going-back-and-forth" title="Permalink to this headline">¶</a></h2>
+<p>Take a look at the beginning of the “three sisters” document:</p>
+<div class="highlight-python"><pre>&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;
+&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</pre>
+</div>
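+<p>An HTML parser takes this string of characters and turns it into a series of events: “open an &lt;html&gt; tag”, “open a &lt;head&gt; tag”, “open a &lt;title&gt; tag”, “add a string”, “close the &lt;title&gt; tag”, “open a &lt;p&gt; tag”, and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.</p>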
+<div class="section" id="next-element-and-previous-element">
+<span id="element-generators"></span><h3><tt class="docutils literal"><span class="pre">.next_element</span></tt> and <tt class="docutils literal"><span class="pre">.previous_element</span></tt><a class="headerlink" href="#next-element-and-previous-element" title="Permalink to this headline">¶</a></h3>
+<p>The <tt class="docutils literal"><span class="pre">.next_element</span></tt> attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as <tt class="docutils literal"><span class="pre">.next_sibling</span></tt>, but it is usually drastically different.</p>
+<p>Here is the final &lt;a&gt; tag in the “three sisters” document. Its <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> is a string: the conclusion of the sentence that was interrupted by the start of the &lt;a&gt; tag:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">last_a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="nb">id</span><span class="o">=</span><span class="s">"link3"</span><span class="p">)</span>
+<span class="n">last_a_tag</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;</span>
+
+<span class="n">last_a_tag</span><span class="o">.</span><span class="n">next_sibling</span>
+<span class="c"># u';\nand they lived at the bottom of a well.'</span>
+</pre></div>
+</div>
+<p>But the <tt class="docutils literal"><span class="pre">.next_element</span></tt> of that &lt;a&gt; tag, the thing that was parsed immediately after the &lt;a&gt; tag, is <cite>not</cite> the rest of that sentence: it is the word “Tillie”:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">last_a_tag</span><span class="o">.</span><span class="n">next_element</span>
+<span class="c"># u'Tillie'</span>
+</pre></div>
+</div>
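+<p>That is because in the original markup, the word “Tillie” appeared before that semicolon. The parser encountered an &lt;a&gt; tag, then the word “Tillie”, then the closing &lt;/a&gt; tag, then the semicolon and the rest of the sentence. The semicolon is on the same level as the &lt;a&gt; tag, but the word “Tillie” was encountered first.</p>
+<p>The <tt class="docutils literal"><span class="pre">.previous_element</span></tt> attribute is the exact opposite of <tt class="docutils literal"><span class="pre">.next_element</span></tt>. It points to whatever element was parsed immediately before this one:</p>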
+<div class="highlight-python"><div class="highlight"><pre><span class="n">last_a_tag</span><span class="o">.</span><span class="n">previous_element</span>
+<span class="c"># u' and\n'</span>
+<span class="n">last_a_tag</span><span class="o">.</span><span class="n">previous_element</span><span class="o">.</span><span class="n">next_element</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;</span>
+</pre></div>
+</div>
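+<p>The contrast between sibling order and parse order can be reproduced on a one-line input. This is a minimal sketch, assuming the <tt class="docutils literal"><span class="pre">bs4</span></tt> package is installed; the shortened sentence is illustrative and is not the original document:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="link3">Tillie</a>; the end.</p>', "html.parser")
a_tag = soup.a

# Same level of the tree: the text that follows the closing </a>.
sibling = a_tag.next_sibling

# Parse order: the string inside the tag was seen before that text.
element = a_tag.next_element

# Continuing in parse order from "Tillie" reaches the sibling text.
after_element = element.next_element

print(repr(sibling), repr(element), repr(after_element))
```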
+</div>
+<div class="section" id="next-elements-and-previous-elements">
+<h3><tt class="docutils literal"><span class="pre">.next_elements</span></tt> and <tt class="docutils literal"><span class="pre">.previous_elements</span></tt><a class="headerlink" href="#next-elements-and-previous-elements" title="Permalink to this headline">¶</a></h3>
+<p>You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">element</span> <span class="ow">in</span> <span class="n">last_a_tag</span><span class="o">.</span><span class="n">next_elements</span><span class="p">:</span>
+ <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">element</span><span class="p">))</span>
+<span class="c"># u'Tillie'</span>
+<span class="c"># u';\nand they lived at the bottom of a well.'</span>
+<span class="c"># u'\n\n'</span>
+<span class="c"># &lt;p class="story"&gt;...&lt;/p&gt;</span>
+<span class="c"># u'...'</span>
+<span class="c"># u'\n'</span>
+</pre></div>
+</div>
+</div>
+</div>
+</div>
+<div class="section" id="searching-the-tree">
+<h1>Searching the tree<a class="headerlink" href="#searching-the-tree" title="Permalink to this headline">¶</a></h1>
+<p>Beautiful Soup defines a lot of methods for searching the parse tree, but they are all very similar. I am going to spend a lot of time explaining the two most popular methods: <tt class="docutils literal"><span class="pre">find()</span></tt> and <tt class="docutils literal"><span class="pre">find_all()</span></tt>. The other methods take almost exactly the same arguments, so I will cover them only briefly.</p>
+<p>Once again, I will use the “three sisters” document as an example:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">html_doc</span> <span class="o">=</span> <span class="s">"""</span>
+<span class="s">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;</span>
+
+<span class="s">&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</span>
+
+<span class="s">&lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were</span>
+<span class="s">&lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="s">&lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt; and</span>
+<span class="s">&lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;;</span>
+<span class="s">and they lived at the bottom of a well.&lt;/p&gt;</span>
+
+<span class="s">&lt;p class="story"&gt;...&lt;/p&gt;</span>
+<span class="s">"""</span>
+
+<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>By passing a filter as an argument to a method like <tt class="docutils literal"><span class="pre">find_all()</span></tt>, you can zoom in on the parts of the document you are interested in.</p>
+<div class="section" id="kinds-of-filters">
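+<h2>Kinds of filters<a class="headerlink" href="#kinds-of-filters" title="Permalink to this headline">¶</a></h2>
+<p>Before talking in detail about <tt class="docutils literal"><span class="pre">find_all()</span></tt> and similar methods, I want to show examples of the different filters you can pass into these methods. These filters show up again and again throughout the search API. You can use them to filter based on a tag's name, on its attributes, on the text of a string, or on some combination of these.</p>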
+<div class="section" id="a-string">
+<span id="id5"></span><h3>A string<a class="headerlink" href="#a-string" title="Permalink to this headline">¶</a></h3>
+<p>The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the &lt;b&gt; tags in the document:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'b'</span><span class="p">)</span>
+<span class="c"># [&lt;b&gt;The Dormouse's story&lt;/b&gt;]</span>
+</pre></div>
+</div>
+<p>If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.</p>
+</div>
+<div class="section" id="a-regular-expression">
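+<span id="id6"></span><h3>A regular expression<a class="headerlink" href="#a-regular-expression" title="Permalink to this headline">¶</a></h3>
+<p>If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its <tt class="docutils literal"><span class="pre">search()</span></tt> method, so the pattern may match anywhere in the name unless it is anchored. This code finds all the tags whose names start with the letter “b”; in this case, the &lt;body&gt; tag and the &lt;b&gt; tag:</p>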
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">import</span> <span class="nn">re</span>
+<span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"^b"</span><span class="p">)):</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
+<span class="c"># body</span>
+<span class="c"># b</span>
+</pre></div>
+</div>
+<p>This code finds all the tags whose names contain the letter ‘t’:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"t"</span><span class="p">)):</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
+<span class="c"># html</span>
+<span class="c"># title</span>
+</pre></div>
+</div>
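+<p>Both patterns can be tried on a compact input. This is a minimal sketch, assuming the <tt class="docutils literal"><span class="pre">bs4</span></tt> package is installed; the skeleton document and the list names are illustrative only:</p>

```python
import re

from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Anchored pattern: only tag names that begin with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored pattern: any tag name containing "t", wherever it occurs.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b, contains_t)
```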
+</div>
+<div class="section" id="a-list">
+<span id="id7"></span><h3>A list<a class="headerlink" href="#a-list" title="Permalink to this headline">¶</a></h3>
+<p>If you pass in a list, Beautiful Soup will allow a string match against <cite>any</cite> item in that list. This code finds all the &lt;a&gt; tags <cite>and</cite> all the &lt;b&gt; tags:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">([</span><span class="s">"a"</span><span class="p">,</span> <span class="s">"b"</span><span class="p">])</span>
+<span class="c"># [&lt;b&gt;The Dormouse's story&lt;/b&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="true">
+<span id="the-value-true"></span><h3><tt class="docutils literal"><span class="pre">True</span></tt><a class="headerlink" href="#true" title="Permalink to this headline">¶</a></h3>
+<p>The value <tt class="docutils literal"><span class="pre">True</span></tt> matches everything it can. This code finds <cite>all</cite> the tags in the document, but none of the text strings:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="bp">True</span><span class="p">):</span>
+ <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
+<span class="c"># html</span>
+<span class="c"># head</span>
+<span class="c"># title</span>
+<span class="c"># body</span>
+<span class="c"># p</span>
+<span class="c"># b</span>
+<span class="c"># p</span>
+<span class="c"># a</span>
+<span class="c"># a</span>
+<span class="c"># a</span>
+<span class="c"># p</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="a-function">
+<h3>A function<a class="headerlink" href="#a-function" title="Permalink to this headline">¶</a></h3>
+<p>If none of the other matches work for you, define a function that takes an element as its only argument. The function should return <tt class="docutils literal"><span class="pre">True</span></tt> if the argument matches, and <tt class="docutils literal"><span class="pre">False</span></tt> otherwise.</p>
+<p>Here is a function that returns <tt class="docutils literal"><span class="pre">True</span></tt> if a tag defines the “class” attribute but does not define the “id” attribute:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">def</span> <span class="nf">has_class_but_no_id</span><span class="p">(</span><span class="n">tag</span><span class="p">):</span>
+    <span class="k">return</span> <span class="n">tag</span><span class="o">.</span><span class="n">has_attr</span><span class="p">(</span><span class="s">'class'</span><span class="p">)</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">tag</span><span class="o">.</span><span class="n">has_attr</span><span class="p">(</span><span class="s">'id'</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>Pass this function into <tt class="docutils literal"><span class="pre">find_all()</span></tt> and you will pick up all the &lt;p&gt; tags:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">has_class_but_no_id</span><span class="p">)</span>
+<span class="c"># [&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;,</span>
+<span class="c"># &lt;p class="story"&gt;Once upon a time there were...&lt;/p&gt;,</span>
+<span class="c"># &lt;p class="story"&gt;...&lt;/p&gt;]</span>
+</pre></div>
+</div>
+<p>This function only picks up the &lt;p&gt; tags. It does not pick up the &lt;a&gt; tags, because those tags define both “class” and “id”. It does not pick up tags like &lt;html&gt; and &lt;title&gt;, because those tags do not define “class”.</p>
+<p>Here is a function that returns <tt class="docutils literal"><span class="pre">True</span></tt> if a tag is surrounded by string objects:</p>
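+<p>A function filter of this kind can be exercised on a small input. This sketch assumes the <tt class="docutils literal"><span class="pre">bs4</span></tt> package is installed and uses <tt class="docutils literal"><span class="pre">has_attr()</span></tt> to test for attributes; the three-tag document is made up for illustration:</p>

```python
from bs4 import BeautifulSoup

html = '<p class="story"><a class="sister" id="link1">Elsie</a></p><p>plain</p>'
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    # True only for tags that define "class" and do not define "id".
    return tag.has_attr("class") and not tag.has_attr("id")


# The <a> tag is rejected (it has an id); the second <p> is rejected
# (it has no class); only the first <p> remains.
matches = soup.find_all(has_class_but_no_id)

print([tag.name for tag in matches])
```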
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">NavigableString</span>
+<span class="k">def</span> <span class="nf">surrounded_by_strings</span><span class="p">(</span><span class="n">tag</span><span class="p">):</span>
+ <span class="k">return</span> <span class="p">(</span><span class="nb">isinstance</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">next_element</span><span class="p">,</span> <span class="n">NavigableString</span><span class="p">)</span>
+ <span class="ow">and</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">previous_element</span><span class="p">,</span> <span class="n">NavigableString</span><span class="p">))</span>
+
+<span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">surrounded_by_strings</span><span class="p">):</span>
+    <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
+<span class="c"># p</span>
+<span class="c"># a</span>
+<span class="c"># a</span>
+<span class="c"># a</span>
+<span class="c"># p</span>
+</pre></div>
+</div>
+<p>Now we're ready to look at the search methods in detail.</p>
+</div>
+</div>
+<div class="section" id="find-all">
+<h2><tt class="docutils literal"><span class="pre">find_all()</span></tt><a class="headerlink" href="#find-all" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find_all(<a class="reference internal" href="#id8"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#recursive"><em>recursive</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#limit"><em>limit</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>
+The <tt class="docutils literal"><span class="pre">find_all()</span></tt> method looks through a tag's descendants and retrieves <cite>all</cite> descendants that match your filters. Several examples appeared in <a class="reference internal" href="#kinds-of-filters">Kinds of filters</a>, but here are a few more:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"title"</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="s">"title"</span><span class="p">)</span>
+<span class="c"># [&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">"link2"</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;]</span>
+
+<span class="kn">import</span> <span class="nn">re</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"sisters"</span><span class="p">))</span>
+<span class="c"># u'Once upon a time there were three little sisters; and their names were\n'</span>
+</pre></div>
+</div>
+<p>Some of these should look familiar, but others are new. What does it mean to pass in a value for <tt class="docutils literal"><span class="pre">text</span></tt>, or for <tt class="docutils literal"><span class="pre">id</span></tt>? Why does
+<tt class="docutils literal"><span class="pre">find_all("p",</span> <span class="pre">"title")</span></tt> find a &lt;p&gt; tag with the CSS class “title”?
+ Let's look at the arguments to <tt class="docutils literal"><span class="pre">find_all()</span></tt>.</p>
+<div class="section" id="the-name-argument">
+<span id="id8"></span><h3>The <tt class="docutils literal"><span class="pre">name</span></tt> argument<a class="headerlink" href="#the-name-argument" title="Permalink to this headline">¶</a></h3>
+<p>Pass in a value for <tt class="docutils literal"><span class="pre">name</span></tt> and Beautiful Soup will only consider tags with certain names. Text strings are ignored, as are tags whose names don't match.</p>
+<p>This is the simplest usage:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"title"</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span>
+</pre></div>
+</div>
+<p>Recall from <a class="reference internal" href="#kinds-of-filters">Kinds of filters</a> that the value passed to <tt class="docutils literal"><span class="pre">name</span></tt> can be <a class="reference internal" href="#a-string">a string</a>, <a class="reference internal" href="#a-regular-expression">a regular expression</a>, <a class="reference internal" href="#a-list">a list</a>, <a class="reference internal" href="#a-function">a function</a>, or <a class="reference internal" href="#the-value-true">the value True</a>.</p>
+</div>
+<div class="section" id="the-keyword-arguments">
+<span id="kwargs"></span><h3>The keyword arguments<a class="headerlink" href="#the-keyword-arguments" title="Permalink to this headline">¶</a></h3>
+<p>Any argument that is not recognized is turned into a filter on one of a tag's attributes.
+
+If you pass in a value for an argument called <tt class="docutils literal"><span class="pre">id</span></tt>, Beautiful Soup filters against each tag's ‘id’ attribute:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">'link2'</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>If you pass in a value for <tt class="docutils literal"><span class="pre">href</span></tt>, Beautiful Soup filters against each tag's ‘href’ attribute:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"elsie"</span><span class="p">))</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>You can filter an attribute based on <a class="reference internal" href="#a-string">a string</a>, <a class="reference internal" href="#a-regular-expression">a regular expression</a>, <a class="reference internal" href="#a-list">a list</a>, <a class="reference internal" href="#a-function">a function</a>, or <a class="reference internal" href="#the-value-true">the value True</a>.</p>
+<p>This code finds all tags that have an <tt class="docutils literal"><span class="pre">id</span></tt> attribute, regardless of what the value is:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>You can filter on multiple attributes at once by passing in more than one keyword argument:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"elsie"</span><span class="p">),</span> <span class="nb">id</span><span class="o">=</span><span class="s">'link1'</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="searching-by-css-class">
+<span id="attrs"></span><h3>Searching by CSS class<a class="headerlink" href="#searching-by-css-class" title="Permalink to this headline">¶</a></h3>
+<p>It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using <tt class="docutils literal"><span class="pre">class</span></tt> as a keyword argument gives you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument <tt class="docutils literal"><span class="pre">class_</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">"sister"</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>As with any keyword argument, you can pass <tt class="docutils literal"><span class="pre">class_</span></tt> a string, a regular expression, a function, or <tt class="docutils literal"><span class="pre">True</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">class_</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"itl"</span><span class="p">))</span>
+<span class="c"># [&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;]</span>
+
+<span class="k">def</span> <span class="nf">has_six_characters</span><span class="p">(</span><span class="n">css_class</span><span class="p">):</span>
+ <span class="k">return</span> <span class="n">css_class</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">css_class</span><span class="p">)</span> <span class="o">==</span> <span class="mi">6</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">class_</span><span class="o">=</span><span class="n">has_six_characters</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p><a class="reference internal" href="#multivalue"><em>Remember</em></a> that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you're matching against <cite>any</cite> of its CSS classes:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">css_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'&lt;p class="body strikeout"&gt;&lt;/p&gt;'</span><span class="p">)</span>
+<span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">"strikeout"</span><span class="p">)</span>
+<span class="c"># [&lt;p class="body strikeout"&gt;&lt;/p&gt;]</span>
+
+<span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">"body"</span><span class="p">)</span>
+<span class="c"># [&lt;p class="body strikeout"&gt;&lt;/p&gt;]</span>
+</pre></div>
+</div>
+<p>
+You can also search for the exact string value of the <tt class="docutils literal"><span class="pre">class</span></tt> attribute:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">"body strikeout"</span><span class="p">)</span>
+<span class="c"># [&lt;p class="body strikeout"&gt;&lt;/p&gt;]</span>
+</pre></div>
+</div>
+<p>
+But searching for a variant of the string value, such as the same classes in a different order, won't work:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">"strikeout body"</span><span class="p">)</span>
+<span class="c"># []</span>
+</pre></div>
+</div>
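+<p>If the order of the classes should not matter, one option is to pass a function as the filter and compare sets of class names. The following is only a sketch: <tt class="docutils literal"><span class="pre">has_both_classes</span></tt> is a hypothetical helper, not part of Beautiful Soup, and the parser name <tt class="docutils literal"><span class="pre">"html.parser"</span></tt> is spelled out as an assumption so the result does not depend on which parsers are installed.</p>

```python
from bs4 import BeautifulSoup

# Same one-tag document as in the examples above.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

def has_both_classes(tag):
    """Hypothetical helper: True for a <p> tag carrying both classes, in any order."""
    classes = set(tag.get("class", []))
    return tag.name == "p" and {"body", "strikeout"} <= classes

# A function passed as the first argument is called once per tag.
matches = css_soup.find_all(has_both_classes)
print(matches)
```

+<p>Because the comparison is made on a set, a tag written as <tt class="docutils literal"><span class="pre">class="strikeout</span> <span class="pre">body"</span></tt> would match as well.</p>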
+<p>There is a shortcut for <tt class="docutils literal"><span class="pre">class_</span></tt> that is present in all versions of Beautiful Soup. The second argument to any <tt class="docutils literal"><span class="pre">find()</span></tt>-type method is called <tt class="docutils literal"><span class="pre">attrs</span></tt>, and passing in a string for <tt class="docutils literal"><span class="pre">attrs</span></tt> searches for that string as a CSS class:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="s">"sister"</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
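+<p>You can also pass in a regular expression, a function, or True. Anything you pass in for <tt class="docutils literal"><span class="pre">attrs</span></tt> that is not a dictionary is used to search against the CSS class. It's the same as passing it to the <tt class="docutils literal"><span class="pre">class_</span></tt> keyword argument:</p>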
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"itl"</span><span class="p">))</span>
+<span class="c"># [&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;]</span>
+</pre></div>
+</div>
+<p>Passing a dictionary in for <tt class="docutils literal"><span class="pre">attrs</span></tt> lets you search many HTML attributes at once, not just the CSS class. These two lines of code are equivalent:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"elsie"</span><span class="p">),</span> <span class="nb">id</span><span class="o">=</span><span class="s">'link1'</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">'href'</span> <span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"elsie"</span><span class="p">),</span> <span class="s">'id'</span><span class="p">:</span> <span class="s">'link1'</span><span class="p">})</span>
+</pre></div>
+</div>
+<p>This is not a very useful feature, because keyword arguments are usually easier to use.</p>
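+<p>One case where the dictionary form may still be the practical choice is an attribute whose name is not a valid Python identifier, such as an HTML5 <tt class="docutils literal"><span class="pre">data-*</span></tt> attribute. The markup below is a hypothetical example and not part of the “three sisters” document; treat it as a minimal sketch.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical markup with an attribute name containing a hyphen.
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")

# Writing find_all(data-foo="value") would be a SyntaxError in Python,
# so the attribute name is given as a dictionary key instead.
matches = data_soup.find_all(attrs={"data-foo": "value"})
print(matches)
```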
+</div>
+<div class="section" id="the-text-argument">
+<span id="text"></span><h3>The <tt class="docutils literal"><span class="pre">text</span></tt> argument<a class="headerlink" href="#the-text-argument" title="Permalink to this headline">¶</a></h3>
+<p>
+With <tt class="docutils literal"><span class="pre">text</span></tt> you can search for strings instead of tags. As with <tt class="docutils literal"><span class="pre">name</span></tt> and the keyword arguments, you can pass in <a class="reference internal" href="#a-string">a string</a>, <a class="reference internal" href="#a-regular-expression">a regular expression</a>, <a class="reference internal" href="#a-list">a list</a>, <a class="reference internal" href="#a-function">a function</a>, or <a class="reference internal" href="#the-value-true">the value True</a>.
+Here are some examples:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="s">"Elsie"</span><span class="p">)</span>
+<span class="c"># [u'Elsie']</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="p">[</span><span class="s">"Tillie"</span><span class="p">,</span> <span class="s">"Elsie"</span><span class="p">,</span> <span class="s">"Lacie"</span><span class="p">])</span>
+<span class="c"># [u'Elsie', u'Lacie', u'Tillie']</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">"Dormouse"</span><span class="p">))</span>
+<span class="c"># [u"The Dormouse's story", u"The Dormouse's story"]</span>
+
+<span class="k">def</span> <span class="nf">is_the_only_string_within_a_tag</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
+ <span class="sd">"""Return True if this string is the only child of its parent tag."""</span>
+ <span class="k">return</span> <span class="p">(</span><span class="n">s</span> <span class="o">==</span> <span class="n">s</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">is_the_only_string_within_a_tag</span><span class="p">)</span>
+<span class="c"># [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']</span>
+</pre></div>
+</div>
+<p>
+Although <tt class="docutils literal"><span class="pre">text</span></tt> is for finding strings, you can combine it with arguments that find tags: Beautiful Soup then finds all tags whose <tt class="docutils literal"><span class="pre">.string</span></tt> matches your value for <tt class="docutils literal"><span class="pre">text</span></tt>.
+
+This code finds the &lt;a&gt; tags whose <tt class="docutils literal"><span class="pre">.string</span></tt> is “Elsie”:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="s">"Elsie"</span><span class="p">)</span>
+<span class="c"># [&lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="the-limit-argument">
+<span id="limit"></span><h3>The <tt class="docutils literal"><span class="pre">limit</span></tt> argument<a class="headerlink" href="#the-limit-argument" title="Permalink to this headline">¶</a></h3>
+<p><tt class="docutils literal"><span class="pre">find_all()</span></tt> returns all the tags and strings that match your filters. This can take a while if the document is large. If you don't need <cite>all</cite> the results, you can pass in a number for <tt class="docutils literal"><span class="pre">limit</span></tt>. This works just like the LIMIT keyword in SQL: it tells Beautiful Soup to stop gathering results after it has found a certain number.</p>
+<p>There are three links in the “three sisters” document, but this code only finds the first two:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;]</span>
+</pre></div>
+</div>
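+<p>As a small self-contained check of this behaviour, the sketch below builds a hypothetical three-link fragment (a shortened stand-in for the “three sisters” document) and shows that <tt class="docutils literal"><span class="pre">limit</span></tt> is an upper bound: asking for more results than exist simply returns everything that matched.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three links, parsed with the built-in parser.
fragment = (
    '<p><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>'
)
link_soup = BeautifulSoup(fragment, "html.parser")

# Stop collecting after two matches, in document order.
first_two_ids = [a["id"] for a in link_soup.find_all("a", limit=2)]
print(first_two_ids)

# A limit larger than the number of matches returns all of them.
all_links = link_soup.find_all("a", limit=10)
print(len(all_links))
```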
+</div>
+<div class="section" id="the-recursive-argument">
+<span id="recursive"></span><h3>The <tt class="docutils literal"><span class="pre">recursive</span></tt> argument<a class="headerlink" href="#the-recursive-argument" title="Permalink to this headline">¶</a></h3>
+<p>If you call <tt class="docutils literal"><span class="pre">mytag.find_all()</span></tt>, Beautiful Soup examines all the descendants of <tt class="docutils literal"><span class="pre">mytag</span></tt>: its children, its children's children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in <tt class="docutils literal"><span class="pre">recursive=False</span></tt>. See the difference here:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"title"</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"title"</span><span class="p">,</span> <span class="n">recursive</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
+<span class="c"># []</span>
+</pre></div>
+</div>
+<p>Here is that part of the example document:</p>
+<div class="highlight-python"><pre>&lt;html&gt;
+ &lt;head&gt;
+ &lt;title&gt;
+ The Dormouse's story
+ &lt;/title&gt;
+ &lt;/head&gt;
+...</pre>
+</div>
+<p>
+The &lt;title&gt; tag is beneath the &lt;html&gt; tag, but it is not <cite>directly</cite> beneath it: the &lt;head&gt; tag is in the way. Beautiful Soup finds the &lt;title&gt; tag when it is allowed to look at all descendants of the &lt;html&gt; tag, but when <tt class="docutils literal"><span class="pre">recursive=False</span></tt> restricts the search to
+the &lt;html&gt; tag's immediate children, it finds nothing.</p>
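+<p>The same point can be seen from the other direction. In the sketch below (a shortened, hypothetical version of the example document, parsed with <tt class="docutils literal"><span class="pre">"html.parser"</span></tt>), a non-recursive search on &lt;html&gt; lists only its direct children, while a non-recursive search on &lt;head&gt; does find the &lt;title&gt; tag, because there it is a direct child.</p>

```python
from bs4 import BeautifulSoup

doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>text</p></body></html>"
)
tree = BeautifulSoup(doc, "html.parser")

# True matches every tag; recursive=False keeps only direct children.
direct_children = [tag.name for tag in tree.html.find_all(True, recursive=False)]
print(direct_children)

# <title> sits directly under <head>, so this search succeeds.
titles = tree.head.find_all("title", recursive=False)
print(titles)
```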
+<p>Beautiful Soup offers a lot of tree-searching methods (covered below), and they mostly take the same arguments as <tt class="docutils literal"><span class="pre">find_all()</span></tt>: <tt class="docutils literal"><span class="pre">name</span></tt>,
+<tt class="docutils literal"><span class="pre">attrs</span></tt>, <tt class="docutils literal"><span class="pre">text</span></tt>, <tt class="docutils literal"><span class="pre">limit</span></tt>, and the keyword arguments. But the <tt class="docutils literal"><span class="pre">recursive</span></tt> argument is different: <tt class="docutils literal"><span class="pre">find_all()</span></tt> and <tt class="docutils literal"><span class="pre">find()</span></tt> are the only methods that support it. Passing <tt class="docutils literal"><span class="pre">recursive=False</span></tt> into a method like <tt class="docutils literal"><span class="pre">find_parents()</span></tt> wouldn't be very useful.</p>
+</div>
+</div>
+<div class="section" id="calling-a-tag-is-like-calling-find-all">
+<h2>Calling a tag is like calling <tt class="docutils literal"><span class="pre">find_all()</span></tt><a class="headerlink" href="#calling-a-tag-is-like-calling-find-all" title="Permalink to this headline">¶</a></h2>
+<p>
+Because <tt class="docutils literal"><span class="pre">find_all()</span></tt> is the most popular method in the Beautiful Soup search API, there is a shortcut for it. If you treat the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object or a <tt class="docutils literal"><span class="pre">Tag</span></tt> object as though it were a function, it is the same as calling <tt class="docutils literal"><span class="pre">find_all()</span></tt> on that object. These two lines of code are equivalent:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span>
+<span class="n">soup</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>These two lines are also equivalent:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="find">
+<h2><tt class="docutils literal"><span class="pre">find()</span></tt><a class="headerlink" href="#find" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find(<a class="reference internal" href="#id8"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#recursive"><em>recursive</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>
+The <tt class="docutils literal"><span class="pre">find_all()</span></tt> method scans the entire document looking for results, but sometimes you only want one result. If you know a document has only one &lt;body&gt; tag, it is a waste of time to scan the entire document looking for more. Rather than passing in <tt class="docutils literal"><span class="pre">limit=1</span></tt> every time you call <tt class="docutils literal"><span class="pre">find_all</span></tt>, you can use the <tt class="docutils literal"><span class="pre">find()</span></tt> method. These two lines of code are <cite>nearly</cite> equivalent:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'title'</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'title'</span><span class="p">)</span>
+<span class="c"># &lt;title&gt;The Dormouse's story&lt;/title&gt;</span>
+</pre></div>
+</div>
+<p>The only difference is that <tt class="docutils literal"><span class="pre">find_all()</span></tt> returns a list containing the single result, and <tt class="docutils literal"><span class="pre">find()</span></tt> just returns the result.</p>
+<p>If <tt class="docutils literal"><span class="pre">find_all()</span></tt> can't find anything, it returns an empty list. If <tt class="docutils literal"><span class="pre">find()</span></tt> can't find anything, it returns <tt class="docutils literal"><span class="pre">None</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"nosuchtag"</span><span class="p">))</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
+<p>
+Remember the <tt class="docutils literal"><span class="pre">soup.head.title</span></tt> trick from <a class="reference internal" href="#navigating-using-tag-names">Navigating using tag names</a>? That trick works by repeatedly calling <tt class="docutils literal"><span class="pre">find()</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">head</span><span class="o">.</span><span class="n">title</span>
+<span class="c"># &lt;title&gt;The Dormouse's story&lt;/title&gt;</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"head"</span><span class="p">)</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"title"</span><span class="p">)</span>
+<span class="c"># &lt;title&gt;The Dormouse's story&lt;/title&gt;</span>
+</pre></div>
+</div>
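+<p>Since <tt class="docutils literal"><span class="pre">find()</span></tt> returns <tt class="docutils literal"><span class="pre">None</span></tt> when nothing matches, a chain of <tt class="docutils literal"><span class="pre">find()</span></tt> calls raises <tt class="docutils literal"><span class="pre">AttributeError</span></tt> as soon as one step comes back empty. One way to guard against that is sketched below; <tt class="docutils literal"><span class="pre">find_path</span></tt> is a hypothetical helper written for this illustration, not a Beautiful Soup function, and the document is a shortened stand-in.</p>

```python
from bs4 import BeautifulSoup

page = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

def find_path(root, *names):
    """Hypothetical helper: chain find() calls, stopping at the first miss."""
    node = root
    for name in names:
        if node is None:
            return None
        node = node.find(name)
    return node

found = find_path(page, "head", "title")    # every step matches
missing = find_path(page, "body", "title")  # there is no <body> tag here
print(found)
print(missing)
```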
+</div>
+<div class="section" id="find-parents-and-find-parent">
+<h2><tt class="docutils literal"><span class="pre">find_parents()</span></tt> and <tt class="docutils literal"><span class="pre">find_parent()</span></tt><a class="headerlink" href="#find-parents-and-find-parent" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find_parents(<a class="reference internal" href="#id8"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#limit"><em>limit</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>Signature: find_parent(<a class="reference internal" href="#id8"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>
+A lot of time was spent above covering <tt class="docutils literal"><span class="pre">find_all()</span></tt> and
+<tt class="docutils literal"><span class="pre">find()</span></tt>. The Beautiful Soup API defines ten other methods for searching the tree, but don't be afraid. Five of these methods are basically the same as <tt class="docutils literal"><span class="pre">find_all()</span></tt>, and the other five are basically the same as <tt class="docutils literal"><span class="pre">find()</span></tt>. The only difference is which part of the tree they search.</p>
+<p>
+First let's consider <tt class="docutils literal"><span class="pre">find_parents()</span></tt> and
+<tt class="docutils literal"><span class="pre">find_parent()</span></tt>. Remember that <tt class="docutils literal"><span class="pre">find_all()</span></tt> and <tt class="docutils literal"><span class="pre">find()</span></tt> work their way down the tree, looking at a tag's descendants. These methods do the opposite: they work their way <cite>up</cite> the tree, looking at a tag's (or a string's) parents. Let's try them out, starting from a string buried deep in the “three daughters” document:</p>
+<div class="highlight-python"><pre>a_string = soup.find(text="Lacie")
+a_string
+# u'Lacie'
+
+a_string.find_parents("a")
+# [&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;]
+
+a_string.find_parent("p")
+# &lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were
+# &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,
+# &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt; and
+# &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;;
+# and they lived at the bottom of a well.&lt;/p&gt;
+
+a_string.find_parents("p", class_="title")
+# []</pre>
+</div>
+<p>One of the three &lt;a&gt; tags is the direct parent of the string in question, so the search finds it. One of the three &lt;p&gt; tags is an indirect parent of the string, and the search finds that as well. There is a &lt;p&gt; tag with the CSS class “title” <cite>somewhere</cite> in the document, but it is not one of this string's parents, so <tt class="docutils literal"><span class="pre">find_parents()</span></tt> cannot find it.</p>
+<p>You may have guessed that <tt class="docutils literal"><span class="pre">find_parent()</span></tt> and <tt class="docutils literal"><span class="pre">find_parents()</span></tt> are connected to the <a class="reference internal" href="#parent">.parent</a> and <a class="reference internal" href="#parents">.parents</a> attributes mentioned earlier. The connection is very strong. These search methods actually use <tt class="docutils literal"><span class="pre">.parents</span></tt> to iterate over all the parents, and check each one against the provided filter to see if it matches.</p>
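+<p>That relationship can be illustrated by filtering <tt class="docutils literal"><span class="pre">.parents</span></tt> by hand and comparing the result with <tt class="docutils literal"><span class="pre">find_parents()</span></tt>. This is a sketch of the idea rather than the library's actual implementation; <tt class="docutils literal"><span class="pre">parents_named</span></tt> is a hypothetical helper and the markup is a shortened stand-in for the example document.</p>

```python
from bs4 import BeautifulSoup

html = '<p class="story">Once upon a time <a id="link2">Lacie</a> lived in a well.</p>'
story = BeautifulSoup(html, "html.parser")
a_string = story.find("a").string  # the string inside the <a> tag

def parents_named(element, name):
    """Hypothetical helper: walk up .parents and keep tags with the given name."""
    return [parent for parent in element.parents if parent.name == name]

manual = parents_named(a_string, "p")
builtin = a_string.find_parents("p")
print(manual == builtin)
```

+<p>The hand-written version only filters on the tag name; <tt class="docutils literal"><span class="pre">find_parents()</span></tt> accepts the full range of filters described above.</p>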
+</div>
+<div class="section" id="find-next-siblings-and-find-next-sibling">
+<h2><tt class="docutils literal"><span class="pre">find_next_siblings()</span></tt> and <tt class="docutils literal"><span class="pre">find_next_sibling()</span></tt><a class="headerlink" href="#find-next-siblings-and-find-next-sibling" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find_next_siblings(<a class="reference internal" href="#id8"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#limit"><em>limit</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>Signature: find_next_sibling(<a class="reference internal" href="#id8"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>These methods use <a class="reference internal" href="#sibling-generators"><em>.next_siblings</em></a> to iterate over the rest of an element's siblings in the tree. The <tt class="docutils literal"><span class="pre">find_next_siblings()</span></tt> method returns all the siblings that match, and <tt class="docutils literal"><span class="pre">find_next_sibling()</span></tt> returns only the first one:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">first_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="n">first_link</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;</span>
+
+<span class="n">first_link</span><span class="o">.</span><span class="n">find_next_siblings</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">first_story_paragraph</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="s">"story"</span><span class="p">)</span>
+<span class="n">first_story_paragraph</span><span class="o">.</span><span class="n">find_next_sibling</span><span class="p">(</span><span class="s">"p"</span><span class="p">)</span>
+<span class="c"># &lt;p class="story"&gt;...&lt;/p&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="find-previous-siblings-and-find-previous-sibling">
+<h2><tt class="docutils literal"><span class="pre">find_previous_siblings()</span></tt> and <tt class="docutils literal"><span class="pre">find_previous_sibling()</span></tt><a class="headerlink" href="#find-previous-siblings-and-find-previous-sibling" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find_previous_siblings(<a class="reference internal" href="#id8"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#limit"><em>limit</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>Signature: find_previous_sibling(<a class="reference internal" href="#id8"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>
+These methods use <a class="reference internal" href="#sibling-generators"><em>.previous_siblings</em></a> to iterate over the siblings that precede an element in the tree. The <tt class="docutils literal"><span class="pre">find_previous_siblings()</span></tt> method returns all the siblings that match, and <tt class="docutils literal"><span class="pre">find_previous_sibling()</span></tt> returns only the first one:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">last_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="nb">id</span><span class="o">=</span><span class="s">"link3"</span><span class="p">)</span>
+<span class="n">last_link</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;</span>
+
+<span class="n">last_link</span><span class="o">.</span><span class="n">find_previous_siblings</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;]</span>
+
+<span class="n">first_story_paragraph</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="s">"story"</span><span class="p">)</span>
+<span class="n">first_story_paragraph</span><span class="o">.</span><span class="n">find_previous_sibling</span><span class="p">(</span><span class="s">"p"</span><span class="p">)</span>
+<span class="c"># &lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="find-all-next-and-find-next">
+<h2><tt class="docutils literal"><span class="pre">find_all_next()</span></tt> and <tt class="docutils literal"><span class="pre">find_next()</span></tt><a class="headerlink" href="#find-all-next-and-find-next" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find_all_next(<a class="reference internal" href="#id8"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#limit"><em>limit</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>Signature: find_next(<a class="reference internal" href="#id8"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>
+These methods use <a class="reference internal" href="#element-generators"><em>.next_elements</em></a> to iterate over whatever tags and strings come after an element in the document. The <tt class="docutils literal"><span class="pre">find_all_next()</span></tt> method returns all matches, and <tt class="docutils literal"><span class="pre">find_next()</span></tt> returns only the first match:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">first_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="n">first_link</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;</span>
+
+<span class="n">first_link</span><span class="o">.</span><span class="n">find_all_next</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
+<span class="c"># [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',</span>
+<span class="c"># u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']</span>
+
+<span class="n">first_link</span><span class="o">.</span><span class="n">find_next</span><span class="p">(</span><span class="s">"p"</span><span class="p">)</span>
+<span class="c"># &lt;p class="story"&gt;...&lt;/p&gt;</span>
+</pre></div>
+</div>
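+<p>In the first example, the string "Elsie" showed up, even though it is contained within the &lt;a&gt; tag we started from.
+
+In the second example, the last &lt;p&gt; tag in the document showed up, even though it is not in the same part of the tree as the &lt;a&gt; tag we started from. For these methods, all that matters is that an element matches the filter and appears later in the document than the starting element.</p>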
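+<p>To make the "later in the document" rule concrete, the following sketch filters <tt class="docutils literal"><span class="pre">.next_elements</span></tt> by hand and compares the result with <tt class="docutils literal"><span class="pre">find_all_next()</span></tt>. It is an illustration under stated assumptions, not the library's internals: the markup is a shortened stand-in for the example document, and the code assumes a recent Beautiful Soup release with the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> backend.</p>

```python
from bs4 import BeautifulSoup, Tag

# Shortened stand-in for the example document (an assumption).
html_doc = (
    '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
    '<p class="story">...</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.a

# The first element after the <a> tag is the string inside that same tag.
first_string = next(iter(first_link.next_elements))

# Manual filter: every <p> tag that starts after first_link in document order.
manual = [
    element
    for element in first_link.next_elements
    if isinstance(element, Tag) and element.name == "p"
]
builtin = list(first_link.find_all_next("p"))
print(manual == builtin)
```

+<p>The &lt;p&gt; tag that contains the starting &lt;a&gt; tag is absent from both lists, because it opens before the starting element.</p>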
+</div>
+<div class="section" id="find-all-previous-and-find-previous">
+<h2><tt class="docutils literal"><span class="pre">find_all_previous()</span></tt> and <tt class="docutils literal"><span class="pre">find_previous()</span></tt><a class="headerlink" href="#find-all-previous-and-find-previous" title="Permalink to this headline">¶</a></h2>
+<p>Signature: find_all_previous(<a class="reference internal" href="#id8"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#limit"><em>limit</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>Signature: find_previous(<a class="reference internal" href="#id8"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>)</p>
+<p>
+These methods use <a class="reference internal" href="#element-generators"><em>.previous_elements</em></a> to iterate over the tags and strings that come before an element in the document. The <tt class="docutils literal"><span class="pre">find_all_previous()</span></tt> method returns all matches, and
+<tt class="docutils literal"><span class="pre">find_previous()</span></tt> returns only the first match:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">first_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="n">first_link</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;</span>
+
+<span class="n">first_link</span><span class="o">.</span><span class="n">find_all_previous</span><span class="p">(</span><span class="s">"p"</span><span class="p">)</span>
+<span class="c"># [&lt;p class="story"&gt;Once upon a time there were three little sisters; ...&lt;/p&gt;,</span>
+<span class="c"># &lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;]</span>
+
+<span class="n">first_link</span><span class="o">.</span><span class="n">find_previous</span><span class="p">(</span><span class="s">"title"</span><span class="p">)</span>
+<span class="c"># &lt;title&gt;The Dormouse's story&lt;/title&gt;</span>
+</pre></div>
+</div>
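+<p>The call to <tt class="docutils literal"><span class="pre">find_all_previous("p")</span></tt> found the first paragraph in the document (the one with class="title"), but it also found the second paragraph, the &lt;p&gt; tag that contains the &lt;a&gt; tag we started with. This should not be too surprising: we are looking at all the tags that show up earlier in the document than the one we started with.
+A &lt;p&gt; tag that contains an &lt;a&gt; tag must open before the &lt;a&gt; tag it contains.</p>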
+</div>
+<div class="section" id="css-selectors">
+<h2>CSS selectors<a class="headerlink" href="#css-selectors" title="Permalink to this headline">¶</a></h2>
+<p>Beautiful Soup supports a subset of the <a class="reference external" href="https://web.archive.org/web/20150319200824/http://www.w3.org/TR/CSS2/selector.html">CSS selector standard</a>. Just construct the selector as a string and pass it into the <tt class="docutils literal"><span class="pre">.select()</span></tt> method of a <tt class="docutils literal"><span class="pre">Tag</span></tt> or of the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself.</p>
+<p>You can find tags by name:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"title"</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span>
+</pre></div>
+</div>
+<p>Find tags beneath other tags:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"body a"</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"html head title"</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span>
+</pre></div>
+</div>
+<p>Find tags <cite>directly</cite> beneath other tags:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"head &gt; title"</span><span class="p">)</span>
+<span class="c"># [&lt;title&gt;The Dormouse's story&lt;/title&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"p &gt; a"</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"body &gt; a"</span><span class="p">)</span>
+<span class="c"># []</span>
+</pre></div>
+</div>
+<p>Find tags by CSS class:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">".sister"</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"[class~=sister]"</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>Find tags by ID:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"#link1"</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">"a#link2"</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>Test for the existence of an attribute:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'a[href]'</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>Find tags by attribute value:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'a[href="http://example.com/elsie"]'</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'a[href^="http://example.com/"]'</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;,</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'a[href$="tillie"]'</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;]</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'a[href*=".com/el"]'</span><span class="p">)</span>
+<span class="c"># [&lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;]</span>
+</pre></div>
+</div>
+<p>Match language codes:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">multilingual_markup</span> <span class="o">=</span> <span class="s">"""</span>
+<span class="s"> &lt;p lang="en"&gt;Hello&lt;/p&gt;</span>
+<span class="s"> &lt;p lang="en-us"&gt;Howdy, y'all&lt;/p&gt;</span>
+<span class="s"> &lt;p lang="en-gb"&gt;Pip-pip, old fruit&lt;/p&gt;</span>
+<span class="s"> &lt;p lang="fr"&gt;Bonjour mes amis&lt;/p&gt;</span>
+<span class="s">"""</span>
+<span class="n">multilingual_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">multilingual_markup</span><span class="p">)</span>
+<span class="n">multilingual_soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">'p[lang|=en]'</span><span class="p">)</span>
+<span class="c"># [&lt;p lang="en"&gt;Hello&lt;/p&gt;,</span>
+<span class="c"># &lt;p lang="en-us"&gt;Howdy, y'all&lt;/p&gt;,</span>
+<span class="c"># &lt;p lang="en-gb"&gt;Pip-pip, old fruit&lt;/p&gt;]</span>
+</pre></div>
+</div>
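+<p>This is a convenience for users who already know the CSS selector syntax. Everything shown here can also be done with the Beautiful Soup API. If CSS selectors are all you need, you might as well use lxml directly, because it is faster. What this feature adds is the ability to <cite>combine</cite> simple CSS selectors with the Beautiful Soup API.</p>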
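+<p>As a sketch of such a combination, the example below uses a selector to locate a starting tag and then continues with ordinary search methods. The markup is a shortened stand-in for the "three sisters" document, and the code assumes a recent Beautiful Soup release (where <tt class="docutils literal"><span class="pre">.select()</span></tt> is backed by the Soup Sieve package) and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> backend.</p>

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the "three sisters" document (an assumption).
html_doc = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# A CSS selector narrows the search to a starting tag...
first_sister = soup.select("p.story > a.sister")[0]

# ...and the regular search API takes over from there.
later_names = [a.get_text() for a in first_sister.find_next_siblings("a")]

# Selector results are ordinary Tag objects, so attribute access works as usual.
hrefs = [a["href"] for a in soup.select('a[href^="http://example.com/"]')]
print(later_names, len(hrefs))
```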
+</div>
+</div>
+<div class="section" id="modifying-the-tree">
+<h1>Modifying the tree<a class="headerlink" href="#modifying-the-tree" title="Permalink to this headline">¶</a></h1>
+<p>Beautiful Soup's main strength is in searching the parse tree, but you can also modify the tree and save your changes as a new HTML or XML document.</p>
+<div class="section" id="changing-tag-names-and-attributes">
+<h2>Changing tag names and attributes<a class="headerlink" href="#changing-tag-names-and-attributes" title="Permalink to this headline">¶</a></h2>
+<p>This was covered earlier, in <a class="reference internal" href="#attributes">Attributes</a>, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'&lt;b class="boldest"&gt;Extremely bold&lt;/b&gt;'</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
+
+<span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"blockquote"</span>
+<span class="n">tag</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'verybold'</span>
+<span class="n">tag</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
+<span class="n">tag</span>
+<span class="c"># &lt;blockquote class="verybold" id="1"&gt;Extremely bold&lt;/blockquote&gt;</span>
+
+<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">'class'</span><span class="p">]</span>
+<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span>
+<span class="n">tag</span>
+<span class="c"># &lt;blockquote&gt;Extremely bold&lt;/blockquote&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="modifying-string">
+<h2>Modifying <tt class="docutils literal"><span class="pre">.string</span></tt><a class="headerlink" href="#modifying-string" title="Permalink to this headline">¶</a></h2>
+<p>If you set a tag's <tt class="docutils literal"><span class="pre">.string</span></tt> attribute, the tag's contents are replaced with the string you give:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">'&lt;a href="http://example.com/"&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;'</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">"New link text."</span>
+<span class="n">tag</span>
+<span class="c"># &lt;a href="http://example.com/"&gt;New link text.&lt;/a&gt;</span>
+</pre></div>
+</div>
+<p>Be careful: if the tag contained other tags, those tags and all their contents are destroyed.</p>
+</div>
+<div class="section" id="append">
+<h2><tt class="docutils literal"><span class="pre">append()</span></tt><a class="headerlink" href="#append" title="Permalink to this headline">¶</a></h2>
+<p>You can add to a tag's contents with <tt class="docutils literal"><span class="pre">Tag.append()</span></tt>. It works just like calling <tt class="docutils literal"><span class="pre">.append()</span></tt> on a Python list:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&lt;a&gt;Foo&lt;/a&gt;"</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">"Bar"</span><span class="p">)</span>
+
+<span class="n">soup</span>
+<span class="c"># &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;a&gt;FooBar&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># [u'Foo', u'Bar']</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="beautifulsoup-new-string-and-new-tag">
+<h2><tt class="docutils literal"><span class="pre">BeautifulSoup.new_string()</span></tt> and <tt class="docutils literal"><span class="pre">.new_tag()</span></tt><a class="headerlink" href="#beautifulsoup-new-string-and-new-tag" title="Permalink to this headline">¶</a></h2>
+<p>If you need to add a string to a document, you can simply pass a Python string to <tt class="docutils literal"><span class="pre">append()</span></tt>, or you can call
+the <tt class="docutils literal"><span class="pre">BeautifulSoup.new_string()</span></tt> factory method:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&lt;b&gt;&lt;/b&gt;"</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">"Hello"</span><span class="p">)</span>
+<span class="n">new_string</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_string</span><span class="p">(</span><span class="s">" there"</span><span class="p">)</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_string</span><span class="p">)</span>
+<span class="n">tag</span>
+<span class="c"># &lt;b&gt;Hello there&lt;/b&gt;</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># [u'Hello', u' there']</span>
+</pre></div>
+</div>
+<p>What if you need to create a whole new tag? The best solution is to call the <tt class="docutils literal"><span class="pre">BeautifulSoup.new_tag()</span></tt> factory method:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&lt;b&gt;&lt;/b&gt;"</span><span class="p">)</span>
+<span class="n">original_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
+
+<span class="n">new_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">href</span><span class="o">=</span><span class="s">"http://www.example.com"</span><span class="p">)</span>
+<span class="n">original_tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_tag</span><span class="p">)</span>
+<span class="n">original_tag</span>
+<span class="c"># &lt;b&gt;&lt;a href="http://www.example.com"&gt;&lt;/a&gt;&lt;/b&gt;</span>
+
+<span class="n">new_tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">"Link text."</span>
+<span class="n">original_tag</span>
+<span class="c"># &lt;b&gt;&lt;a href="http://www.example.com"&gt;Link text.&lt;/a&gt;&lt;/b&gt;</span>
+</pre></div>
+</div>
+<p>Only the first argument, the tag name, is required.</p>
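+<p>A minimal sketch of both forms follows: one tag created from a name alone, and one created with an attribute passed as a keyword argument. It assumes a recent Beautiful Soup release and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> backend, which (unlike the parser used elsewhere in this document) does not wrap the fragment in &lt;html&gt; and &lt;body&gt; tags.</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")

# Only the tag name is given here; the tag starts out with no attributes.
plain = soup.new_tag("i")
plain.string = "emphasis"
soup.b.append(plain)

# Attributes can be supplied as keyword arguments.
link = soup.new_tag("a", href="http://www.example.com")
soup.b.append(link)

rendered = str(soup.b)
print(rendered)
# <b><i>emphasis</i><a href="http://www.example.com"></a></b>
```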
+</div>
+<div class="section" id="insert">
+<h2><tt class="docutils literal"><span class="pre">insert()</span></tt><a class="headerlink" href="#insert" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">Tag.insert()</span></tt> is just like <tt class="docutils literal"><span class="pre">Tag.append()</span></tt>, except that the new element does not necessarily go at the end of its parent's <tt class="docutils literal"><span class="pre">.contents</span></tt>. It is inserted at whatever numeric position you specify. It works just like <tt class="docutils literal"><span class="pre">.insert()</span></tt> on a Python list:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">'&lt;a href="http://example.com/"&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;'</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+
+<span class="n">tag</span><span class="o">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">"but did not endorse "</span><span class="p">)</span>
+<span class="n">tag</span>
+<span class="c"># &lt;a href="http://example.com/"&gt;I linked to but did not endorse &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># [u'I linked to ', u'but did not endorse ', &lt;i&gt;example.com&lt;/i&gt;]</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="insert-before-and-insert-after">
+<h2><tt class="docutils literal"><span class="pre">insert_before()</span></tt> and <tt class="docutils literal"><span class="pre">insert_after()</span></tt><a class="headerlink" href="#insert-before-and-insert-after" title="Permalink to this headline">¶</a></h2>
+<p>
+The <tt class="docutils literal"><span class="pre">insert_before()</span></tt> method inserts a tag or string immediately before something else in the parse tree:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&lt;b&gt;stop&lt;/b&gt;"</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">"i"</span><span class="p">)</span>
+<span class="n">tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">"Don't"</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">insert_before</span><span class="p">(</span><span class="n">tag</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">b</span>
+<span class="c"># &lt;b&gt;&lt;i&gt;Don't&lt;/i&gt;stop&lt;/b&gt;</span>
+</pre></div>
+</div>
+<p>
+The <tt class="docutils literal"><span class="pre">insert_after()</span></tt> method inserts a tag or string so that it immediately follows something else in the parse tree:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">insert_after</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_string</span><span class="p">(</span><span class="s">" ever "</span><span class="p">))</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">b</span>
+<span class="c"># &lt;b&gt;&lt;i&gt;Don't&lt;/i&gt; ever stop&lt;/b&gt;</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">contents</span>
+<span class="c"># [&lt;i&gt;Don't&lt;/i&gt;, u' ever ', u'stop']</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="clear">
+<h2><tt class="docutils literal"><span class="pre">clear()</span></tt><a class="headerlink" href="#clear" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">Tag.clear()</span></tt> removes the contents of a tag:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">'&lt;a href="http://example.com/"&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;'</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+
+<span class="n">tag</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
+<span class="n">tag</span>
+<span class="c"># &lt;a href="http://example.com/"&gt;&lt;/a&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="extract">
+<h2><tt class="docutils literal"><span class="pre">extract()</span></tt><a class="headerlink" href="#extract" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">PageElement.extract()</span></tt> removes a tag or string from the parse tree. It returns the tag or string that was extracted:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">'&lt;a href="http://example.com/"&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;'</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+
+<span class="n">i_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
+
+<span class="n">a_tag</span>
+<span class="c"># &lt;a href="http://example.com/"&gt;I linked to &lt;/a&gt;</span>
+
+<span class="n">i_tag</span>
+<span class="c"># &lt;i&gt;example.com&lt;/i&gt;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">i_tag</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
+<span class="c"># None</span>
+</pre></div>
+</div>
+<p>At this point you effectively have two parse trees: one rooted at the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call <tt class="docutils literal"><span class="pre">extract</span></tt> on a descendant of the element you extracted:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">my_string</span> <span class="o">=</span> <span class="n">i_tag</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
+<span class="n">my_string</span>
+<span class="c"># u'example.com'</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">my_string</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
+<span class="c"># None</span>
+<span class="n">i_tag</span>
+<span class="c"># &lt;i&gt;&lt;/i&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="decompose">
+<h2><tt class="docutils literal"><span class="pre">decompose()</span></tt><a class="headerlink" href="#decompose" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">Tag.decompose()</span></tt> removes a tag from the tree, then <cite>completely destroys</cite> it and its contents:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">'&lt;a href="http://example.com/"&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;'</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">decompose</span><span class="p">()</span>
+
+<span class="n">a_tag</span>
+<span class="c"># &lt;a href="http://example.com/"&gt;I linked to&lt;/a&gt;</span>
+</pre></div>
+</div>
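+<p>The practical difference between <tt class="docutils literal"><span class="pre">extract()</span></tt> and <tt class="docutils literal"><span class="pre">decompose()</span></tt> is whether the removed tag is meant to be used afterwards. The following is a minimal sketch rather than part of the original documentation. It assumes Python 3 and names <tt class="docutils literal"><span class="pre">"html.parser"</span></tt> explicitly so that the result does not depend on which parsers happen to be installed:</p>

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# extract(): the removed tag survives and can be attached somewhere else.
soup = BeautifulSoup(markup, "html.parser")
i_tag = soup.i.extract()
soup.a.insert(0, i_tag)  # move <i> to the front of <a>
moved = str(soup.a)

# decompose(): the tag is removed and should not be used again.
soup2 = BeautifulSoup(markup, "html.parser")
soup2.i.decompose()
remaining = str(soup2.a)
```

+<p>Under these assumptions, <tt class="docutils literal"><span class="pre">extract()</span></tt> is the choice when the tag will be reused, and <tt class="docutils literal"><span class="pre">decompose()</span></tt> when it is simply being thrown away.</p>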
+</div>
+<div class="section" id="replace-with">
+<span id="id9"></span><h2><tt class="docutils literal"><span class="pre">replace_with()</span></tt><a class="headerlink" href="#replace-with" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">PageElement.replace_with()</span></tt> removes a tag or string from the tree, and replaces it with the tag or string of your choice:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">'&lt;a href="http://example.com/"&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;'</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+
+<span class="n">new_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">"b"</span><span class="p">)</span>
+<span class="n">new_tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">"example.net"</span>
+<span class="n">a_tag</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="n">new_tag</span><span class="p">)</span>
+
+<span class="n">a_tag</span>
+<span class="c"># &lt;a href="http://example.com/"&gt;I linked to &lt;b&gt;example.net&lt;/b&gt;&lt;/a&gt;</span>
+</pre></div>
+</div>
+<p><tt class="docutils literal"><span class="pre">replace_with()</span></tt> returns the tag or string that was replaced, so that you can examine it or add it back to another part of the tree.</p>
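+<p>A short sketch of using that return value, assuming Python 3 and the built-in <tt class="docutils literal"><span class="pre">"html.parser"</span></tt>. The markup and the variable names are made up for illustration:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Old <i>text</i></p><div></div>", "html.parser")

new_tag = soup.new_tag("b")
new_tag.string = "text"

# replace_with() hands back the element that was swapped out...
old = soup.i.replace_with(new_tag)

# ...so it can be inspected or re-attached in another part of the tree.
soup.div.append(old)
result = str(soup)
```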
+</div>
+<div class="section" id="wrap">
+<h2><tt class="docutils literal"><span class="pre">wrap()</span></tt><a class="headerlink" href="#wrap" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">PageElement.wrap()</span></tt> wraps an element in the tag you specify. It returns the new wrapper:</p>
+<div class="highlight-python"><pre>soup = BeautifulSoup("&lt;p&gt;I wish I was bold.&lt;/p&gt;")
+soup.p.string.wrap(soup.new_tag("b"))
+# &lt;b&gt;I wish I was bold.&lt;/b&gt;
+
+soup.p.wrap(soup.new_tag("div"))
+# &lt;div&gt;&lt;p&gt;&lt;b&gt;I wish I was bold.&lt;/b&gt;&lt;/p&gt;&lt;/div&gt;</pre>
+</div>
+<p>This method is new in Beautiful Soup 4.0.5.</p>
+</div>
+<div class="section" id="unwrap">
+<h2><tt class="docutils literal"><span class="pre">unwrap()</span></tt><a class="headerlink" href="#unwrap" title="Permalink to this headline">¶</a></h2>
+<p><tt class="docutils literal"><span class="pre">Tag.unwrap()</span></tt> is the opposite of <tt class="docutils literal"><span class="pre">wrap()</span></tt>. It replaces a tag with whatever is inside that tag. It is good for stripping out markup:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">'&lt;a href="http://example.com/"&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;'</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+
+<span class="n">a_tag</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">unwrap</span><span class="p">()</span>
+<span class="n">a_tag</span>
+<span class="c"># &lt;a href="http://example.com/"&gt;I linked to example.com&lt;/a&gt;</span>
+</pre></div>
+</div>
+<p>Like <tt class="docutils literal"><span class="pre">replace_with()</span></tt>, <tt class="docutils literal"><span class="pre">unwrap()</span></tt> returns the tag that was replaced.</p>
+<p>(In earlier versions of Beautiful Soup, <tt class="docutils literal"><span class="pre">unwrap()</span></tt> was called <tt class="docutils literal"><span class="pre">replace_with_children()</span></tt>, and that name will still work.)</p>
+</div>
+</div>
+<div class="section" id="output">
+<h1>Output<a class="headerlink" href="#output" title="Permalink to this headline">¶</a></h1>
+<div class="section" id="pretty-printing">
+<span id="prettyprinting"></span><h2>Pretty-printing<a class="headerlink" href="#pretty-printing" title="Permalink to this headline">¶</a></h2>
+<p>The <tt class="docutils literal"><span class="pre">prettify()</span></tt> method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with each HTML/XML tag on its own line:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">'&lt;a href="http://example.com/"&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;'</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">()</span>
+<span class="c"># '&lt;html&gt;\n &lt;head&gt;\n &lt;/head&gt;\n &lt;body&gt;\n &lt;a href="http://example.com/"&gt;\n...'</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;head&gt;</span>
+<span class="c"># &lt;/head&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;a href="http://example.com/"&gt;</span>
+<span class="c"># I linked to</span>
+<span class="c"># &lt;i&gt;</span>
+<span class="c"># example.com</span>
+<span class="c"># &lt;/i&gt;</span>
+<span class="c"># &lt;/a&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>You can call <tt class="docutils literal"><span class="pre">prettify()</span></tt> on the top-level <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object, or on any of its <tt class="docutils literal"><span class="pre">Tag</span></tt> objects:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;a href="http://example.com/"&gt;</span>
+<span class="c"># I linked to</span>
+<span class="c"># &lt;i&gt;</span>
+<span class="c"># example.com</span>
+<span class="c"># &lt;/i&gt;</span>
+<span class="c"># &lt;/a&gt;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="non-pretty-printing">
+<h2>Non-pretty printing<a class="headerlink" href="#non-pretty-printing" title="Permalink to this headline">¶</a></h2>
+<p>If you just want a string, with no fancy formatting, you can call <tt class="docutils literal"><span class="pre">unicode()</span></tt> or <tt class="docutils literal"><span class="pre">str()</span></tt> on a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object, or on a <tt class="docutils literal"><span class="pre">Tag</span></tt> within it:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
+<span class="c"># '&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;a href="http://example.com/"&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;'</span>
+
+<span class="nb">unicode</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="p">)</span>
+<span class="c"># u'&lt;a href="http://example.com/"&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;'</span>
+</pre></div>
+</div>
+<p>The <tt class="docutils literal"><span class="pre">str()</span></tt> function returns a string encoded in UTF-8. See <a class="reference internal" href="#encodings">Encodings</a> for other options.</p>
+<p>You can also call <tt class="docutils literal"><span class="pre">encode()</span></tt> to get a bytestring, and <tt class="docutils literal"><span class="pre">decode()</span></tt> to get Unicode.</p>
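+<p>The examples in this section are written for Python 2. As a hedged sketch of the same calls under Python 3 (an assumption, since this document does not cover it), <tt class="docutils literal"><span class="pre">encode()</span></tt> produces <tt class="docutils literal"><span class="pre">bytes</span></tt> and <tt class="docutils literal"><span class="pre">decode()</span></tt> produces <tt class="docutils literal"><span class="pre">str</span></tt>:</p>

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

as_bytes = soup.a.encode("utf-8")  # bytestring in the requested encoding
as_text = soup.a.decode()          # Unicode text, no pretty-printing
```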
+</div>
+<div class="section" id="output-formatters">
+<span id="id10"></span><h2>Output formatters<a class="headerlink" href="#output-formatters" title="Permalink to this headline">¶</a></h2>
+<p>If you give Beautiful Soup a document that contains HTML entities like “&amp;ldquo;”, they will be converted to Unicode characters:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&amp;ldquo;Dammit!&amp;rdquo; he said."</span><span class="p">)</span>
+<span class="nb">unicode</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
+<span class="c"># u'&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;\u201cDammit!\u201d he said.&lt;/body&gt;&lt;/html&gt;'</span>
+</pre></div>
+</div>
+<p>If you then convert the document to a string, the Unicode characters will be encoded as UTF-8. You won't get the HTML entities back:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
+<span class="c"># '&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;\xe2\x80\x9cDammit!\xe2\x80\x9d he said.&lt;/body&gt;&lt;/html&gt;'</span>
+</pre></div>
+</div>
+<p>By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into “&amp;amp;”, “&amp;lt;”, and “&amp;gt;”, so that Beautiful Soup doesn't inadvertently generate invalid HTML or XML:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&lt;p&gt;The law firm of Dewey, Cheatem, &amp; Howe&lt;/p&gt;"</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">p</span>
+<span class="c"># &lt;p&gt;The law firm of Dewey, Cheatem, &amp;amp; Howe&lt;/p&gt;</span>
+
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'&lt;a href="http://example.com/?foo=val1&amp;bar=val2"&gt;A link&lt;/a&gt;'</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">a</span>
+<span class="c"># &lt;a href="http://example.com/?foo=val1&amp;amp;bar=val2"&gt;A link&lt;/a&gt;</span>
+</pre></div>
+</div>
+<p>You can change this behavior by providing a value for the <tt class="docutils literal"><span class="pre">formatter</span></tt> argument to <tt class="docutils literal"><span class="pre">prettify()</span></tt>, <tt class="docutils literal"><span class="pre">encode()</span></tt>, or <tt class="docutils literal"><span class="pre">decode()</span></tt>.
+Beautiful Soup recognizes four possible values for <tt class="docutils literal"><span class="pre">formatter</span></tt>.</p>
+<p>The default is <tt class="docutils literal"><span class="pre">formatter="minimal"</span></tt>. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">french</span> <span class="o">=</span> <span class="s">"&lt;p&gt;Il a dit &amp;lt;&amp;lt;Sacr&amp;eacute; bleu!&amp;gt;&amp;gt;&lt;/p&gt;"</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">french</span><span class="p">)</span>
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="s">"minimal"</span><span class="p">))</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p&gt;</span>
+<span class="c"># Il a dit &amp;lt;&amp;lt;Sacré bleu!&amp;gt;&amp;gt;</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>
+If you pass in <tt class="docutils literal"><span class="pre">formatter="html"</span></tt>, Beautiful Soup will convert Unicode characters to HTML entities whenever possible:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="s">"html"</span><span class="p">))</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p&gt;</span>
+<span class="c"># Il a dit &amp;lt;&amp;lt;Sacr&amp;eacute; bleu!&amp;gt;&amp;gt;</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>If you pass in <tt class="docutils literal"><span class="pre">formatter=None</span></tt>, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="bp">None</span><span class="p">))</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p&gt;</span>
+<span class="c"># Il a dit &lt;&lt;Sacré bleu!&gt;&gt;</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+
+<span class="n">link_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">'&lt;a href="http://example.com/?foo=val1&amp;bar=val2"&gt;A link&lt;/a&gt;'</span><span class="p">)</span>
+<span class="k">print</span><span class="p">(</span><span class="n">link_soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="bp">None</span><span class="p">))</span>
+<span class="c"># &lt;a href="http://example.com/?foo=val1&amp;bar=val2"&gt;A link&lt;/a&gt;</span>
+</pre></div>
+</div>
+<p>
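+Finally, if you pass in a function for <tt class="docutils literal"><span class="pre">formatter</span></tt>, Beautiful Soup will call that function once for every string and attribute value in the document. You can do whatever you want in this function. Here is a formatter that converts strings to uppercase and does absolutely nothing else:</p>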
+<div class="highlight-python"><div class="highlight"><pre><span class="k">def</span> <span class="nf">uppercase</span><span class="p">(</span><span class="nb">str</span><span class="p">):</span>
+ <span class="k">return</span> <span class="nb">str</span><span class="o">.</span><span class="n">upper</span><span class="p">()</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="n">uppercase</span><span class="p">))</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p&gt;</span>
+<span class="c"># IL A DIT &lt;&lt;SACRÉ BLEU!&gt;&gt;</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">link_soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="n">uppercase</span><span class="p">))</span>
+<span class="c"># &lt;a href="HTTP://EXAMPLE.COM/?FOO=VAL1&amp;BAR=VAL2"&gt;</span>
+<span class="c"># A LINK</span>
+<span class="c"># &lt;/a&gt;</span>
+</pre></div>
+</div>
+<p>If you're writing your own function, you should know about the <tt class="docutils literal"><span class="pre">EntitySubstitution</span></tt> class in the <tt class="docutils literal"><span class="pre">bs4.dammit</span></tt> module. This class implements Beautiful Soup's standard formatters as class methods:
+the “html” formatter is <tt class="docutils literal"><span class="pre">EntitySubstitution.substitute_html</span></tt>, and the “minimal” formatter is <tt class="docutils literal"><span class="pre">EntitySubstitution.substitute_xml</span></tt>. You can use these functions to simulate <tt class="docutils literal"><span class="pre">formatter="html"</span></tt> or
+<tt class="docutils literal"><span class="pre">formatter="minimal"</span></tt>, but then do something extra.</p>
+<p>Here is an example that replaces Unicode characters with HTML entities whenever possible, but <cite>also</cite> converts all strings to uppercase:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4.dammit</span> <span class="kn">import</span> <span class="n">EntitySubstitution</span>
+<span class="k">def</span> <span class="nf">uppercase_and_substitute_html_entities</span><span class="p">(</span><span class="nb">str</span><span class="p">):</span>
+ <span class="k">return</span> <span class="n">EntitySubstitution</span><span class="o">.</span><span class="n">substitute_html</span><span class="p">(</span><span class="nb">str</span><span class="o">.</span><span class="n">upper</span><span class="p">())</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="n">uppercase_and_substitute_html_entities</span><span class="p">))</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p&gt;</span>
+<span class="c"># IL A DIT &amp;lt;&amp;lt;SACR&amp;Eacute; BLEU!&amp;gt;&amp;gt;</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
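+<p>The two <tt class="docutils literal"><span class="pre">EntitySubstitution</span></tt> class methods can also be called directly on a plain string, which is a convenient way to see what each standard formatter does before wrapping it in a custom function. This is a sketch assuming Python 3; the sample strings are made up:</p>

```python
from bs4.dammit import EntitySubstitution

# The "minimal" formatter: escape only what would break HTML/XML.
minimal = EntitySubstitution.substitute_xml("a < b & c")

# The "html" formatter: also turn Unicode characters into named entities.
named = EntitySubstitution.substitute_html("Sacr\xe9 bleu!")
```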
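+<p>One last caveat: if you create a <tt class="docutils literal"><span class="pre">CData</span></tt> object, the text inside that object is always presented <cite>exactly as it appears, with no formatting</cite>. Beautiful Soup will call the formatter method, just in case you've written a custom method that counts all the strings in the document or something similar, but it will ignore the return value:</p>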
+<div class="highlight-python"><pre>from bs4.element import CData
+soup = BeautifulSoup("&lt;a&gt;&lt;/a&gt;")
+soup.a.string = CData("one &lt; three")
+print(soup.a.prettify(formatter="xml"))
+# &lt;a&gt;
+#  &lt;![CDATA[one &lt; three]]&gt;
+# &lt;/a&gt;</pre>
+</div>
+</div>
+<div class="section" id="get-text">
+<h2><tt class="docutils literal"><span class="pre">get_text()</span></tt><a class="headerlink" href="#get-text" title="Permalink to this headline">¶</a></h2>
+<p>If you only want the text part of a document or tag, you can use the <tt class="docutils literal"><span class="pre">get_text()</span></tt> method. It returns all the text in a document or beneath a tag, as a single Unicode string:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">'&lt;a href="http://example.com/"&gt;</span><span class="se">\n</span><span class="s">I linked to &lt;i&gt;example.com&lt;/i&gt;</span><span class="se">\n</span><span class="s">&lt;/a&gt;'</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span>
+<span class="s">u'</span><span class="se">\n</span><span class="s">I linked to example.com</span><span class="se">\n</span><span class="s">'</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span>
+<span class="s">u'example.com'</span>
+</pre></div>
+</div>
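+<p>You can specify a string to be used to join the bits of text together:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">(</span><span class="s">"|"</span><span class="p">)</span>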
+<span class="s">u'</span><span class="se">\n</span><span class="s">I linked to |example.com|</span><span class="se">\n</span><span class="s">'</span>
+</pre></div>
+</div>
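+<p>You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">(</span><span class="s">"|"</span><span class="p">,</span> <span class="n">strip</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>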
+<span class="s">u'I linked to|example.com'</span>
+</pre></div>
+</div>
+<p>But at that point you might want to use the <a class="reference internal" href="#string-generators"><em>.stripped_strings</em></a> generator instead, and process the text yourself:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="p">[</span><span class="n">text</span> <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">stripped_strings</span><span class="p">]</span>
+<span class="c"># [u'I linked to', u'example.com']</span>
+</pre></div>
+</div>
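+<p>The relationship between <tt class="docutils literal"><span class="pre">get_text()</span></tt> and <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt> can be checked side by side. This sketch assumes Python 3 and the built-in <tt class="docutils literal"><span class="pre">"html.parser"</span></tt>:</p>

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Join the stripped pieces of text with a separator...
joined = soup.get_text("|", strip=True)

# ...or collect the same pieces yourself and decide how to combine them.
pieces = list(soup.stripped_strings)
manual = "|".join(pieces)
```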
+</div>
+</div>
+<div class="section" id="specifying-the-parser-to-use">
+<h1>Specifying the parser to use<a class="headerlink" href="#specifying-the-parser-to-use" title="Permalink to this headline">¶</a></h1>
+<p>If you just need to parse some HTML, you can dump the markup into the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor, and it will probably be fine. Beautiful Soup will pick a parser for you and parse the data. But there are a few additional arguments you can pass in to the constructor to change which parser is used.</p>
+<p>The first argument to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor is a string or an open filehandle: the markup you want parsed. The second argument is <cite>how</cite> you'd like the markup parsed.</p>
+<p>If you don't specify anything, you'll get the best HTML parser that is installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser. You can override this by specifying one of the following:</p>
+<ul class="simple">
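+<li>What type of markup you want to parse. Currently supported are “html”, “xml”, and “html5”.</li>
+<li>The name of the parser library you want to use. Currently supported options are “lxml”, “html5lib”, and “html.parser” (Python's built-in HTML parser).</li>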
+</ul>
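+<p>The section <a class="reference internal" href="#installing-a-parser">Installing a parser</a> contrasts the supported parsers.</p>
+<p>If you don't have an appropriate parser installed, Beautiful Soup will ignore your request and pick a different parser. Right now, the only supported XML parser is lxml. If you don't have lxml installed, asking for an XML parser won't give you one, and asking for “lxml” won't work either.</p>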
+<div class="section" id="differences-between-parsers">
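+<h2>Differences between parsers<a class="headerlink" href="#differences-between-parsers" title="Permalink to this headline">¶</a></h2>
+<p>Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here is a short document, parsed as HTML:</p>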
+<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&lt;a&gt;&lt;b /&gt;&lt;/a&gt;"</span><span class="p">)</span>
+<span class="c"># &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;a&gt;&lt;b&gt;&lt;/b&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;</span>
+</pre></div>
+</div>
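+<p>Since an empty &lt;b /&gt; tag is not valid HTML, the parser turns it into a &lt;b&gt;&lt;/b&gt; tag pair.</p>
+<p>Here is the same document parsed as XML (running this requires that you have lxml installed). Note that the empty &lt;b /&gt; tag is left alone, and that the document is given an XML declaration instead of being put into an &lt;html&gt; tag:</p>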
+<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&lt;a&gt;&lt;b /&gt;&lt;/a&gt;"</span><span class="p">,</span> <span class="s">"xml"</span><span class="p">)</span>
+<span class="c"># &lt;?xml version="1.0" encoding="utf-8"?&gt;</span>
+<span class="c"># &lt;a&gt;&lt;b /&gt;&lt;/a&gt;</span>
+</pre></div>
+</div>
+<p>
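+There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won't matter. One parser will be faster than another, but they will all give you a data structure that looks exactly like the original HTML document.</p>
+<p>But if the document is not perfectly-formed, different parsers will give different results. Here is a short, invalid document parsed using lxml's HTML parser. Note that the dangling &lt;/p&gt; tag is simply ignored:</p>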
+<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&lt;a&gt;&lt;/p&gt;"</span><span class="p">,</span> <span class="s">"lxml"</span><span class="p">)</span>
+<span class="c"># &lt;html&gt;&lt;body&gt;&lt;a&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>Here is the same document parsed using html5lib:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&lt;a&gt;&lt;/p&gt;"</span><span class="p">,</span> <span class="s">"html5lib"</span><span class="p">)</span>
+<span class="c"># &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;a&gt;&lt;p&gt;&lt;/p&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>Instead of ignoring the dangling &lt;/p&gt; tag, html5lib pairs it with an opening &lt;p&gt; tag. This parser also adds an empty &lt;head&gt; tag to the document.</p>
+<p>Here is the same document parsed with Python's built-in HTML parser:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">"&lt;a&gt;&lt;/p&gt;"</span><span class="p">,</span> <span class="s">"html.parser"</span><span class="p">)</span>
+<span class="c"># &lt;a&gt;&lt;/a&gt;</span>
+</pre></div>
+</div>
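+<p>Like lxml, this parser ignores the closing &lt;/p&gt; tag. Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a &lt;body&gt; tag. Unlike lxml, it doesn't even bother to add an &lt;html&gt; tag.</p>
+<p>Since the document “&lt;a&gt;&lt;/p&gt;” is invalid, none of these techniques is the “correct” way to handle it. The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim on being the “correct” way, but all three techniques are legitimate.</p>
+<p>Differences between parsers can affect your script. If you're planning on distributing your script to other people, or running it on multiple machines, you should specify a parser in the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor. That will reduce the chances that your users parse a document differently from the way you parse it.</p>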
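+<p>One way to follow that advice is to route every parse through a small helper that always names the parser. The helper name <tt class="docutils literal"><span class="pre">parse_html</span></tt> is hypothetical, and the sketch assumes Python 3 with the built-in <tt class="docutils literal"><span class="pre">"html.parser"</span></tt>, which needs no extra installation:</p>

```python
from bs4 import BeautifulSoup

def parse_html(markup, parser="html.parser"):
    """Parse markup with an explicitly named parser (hypothetical helper)."""
    return BeautifulSoup(markup, parser)

# The invalid document from this section, parsed the same way on every machine.
soup = parse_html("<a></p>")
result = str(soup)
```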
+</div>
+</div>
+<div class="section" id="encodings">
+<h1>Encodings<a class="headerlink" href="#encodings" title="Permalink to this headline">¶</a></h1>
+<p>Any HTML or XML document is written in a specific encoding like ASCII or UTF-8. But when you load that document into Beautiful Soup, you'll discover it has been converted to Unicode:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">"&lt;h1&gt;Sacr</span><span class="se">\xc3\xa9</span><span class="s"> bleu!&lt;/h1&gt;"</span>
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">h1</span>
+<span class="c"># &lt;h1&gt;Sacré bleu!&lt;/h1&gt;</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">h1</span><span class="o">.</span><span class="n">string</span>
+<span class="c"># u'Sacr\xe9 bleu!'</span>
+</pre></div>
+</div>
+<p>It's not magic. (That sure would be nice.) Beautiful Soup uses a sub-library called <a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> to detect a document's encoding and convert it to Unicode. The autodetected encoding is available as the <tt class="docutils literal"><span class="pre">.original_encoding</span></tt> attribute of the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">original_encoding</span>
+<span class="s">'utf-8'</span>
+</pre></div>
+</div>
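+<p>Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly, but only after a byte-by-byte search of the document that takes a very long time. If you happen to know a document's encoding ahead of time, you can avoid mistakes and delays by passing it to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor as <tt class="docutils literal"><span class="pre">from_encoding</span></tt>.</p>
+<p>Here is a document written in ISO-8859-8. The document is so short that Unicode, Dammit can't get a good lock on it, and misidentifies it as ISO-8859-7:</p>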
+<div class="highlight-python"><pre>markup = b"&lt;h1&gt;\xed\xe5\xec\xf9&lt;/h1&gt;"
+soup = BeautifulSoup(markup)
+soup.h1
+&lt;h1&gt;νεμω&lt;/h1&gt;
+soup.original_encoding
+'ISO-8859-7'</pre>
+</div>
+<p>We can fix this by passing in the correct <tt class="docutils literal"><span class="pre">from_encoding</span></tt>:</p>
+<div class="highlight-python"><pre>soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
+soup.h1
+&lt;h1&gt;םולש&lt;/h1&gt;
+soup.original_encoding
+'iso8859-8'</pre>
+</div>
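+<p>The misidentification is possible because the same four bytes are valid text in both encodings. A small sketch using only Python's built-in codecs (no Beautiful Soup calls, Python 3 assumed) shows the two readings of the bytes from the example:</p>

```python
# The four bytes inside the <h1> tag of the example document.
raw = b"\xed\xe5\xec\xf9"

# Read as ISO-8859-7 (Greek), the bytes are lowercase Greek letters.
as_greek = raw.decode("iso-8859-7")

# Read as ISO-8859-8 (Hebrew), the same bytes are Hebrew letters.
as_hebrew = raw.decode("iso-8859-8")
```

+<p>Neither decode raises an error, which is why a detector looking at so little data has no reliable way to choose between them.</p>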
+<p>In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character “REPLACEMENT CHARACTER” (U+FFFD, �). If Unicode, Dammit needs to do this, it will set the <tt class="docutils literal"><span class="pre">.contains_replacement_characters</span></tt> attribute to <tt class="docutils literal"><span class="pre">True</span></tt> on the <tt class="docutils literal"><span class="pre">UnicodeDammit</span></tt> or <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object.
+This lets you know that the Unicode representation is not an exact representation of the original: some data was lost. If a document contains �, but <tt class="docutils literal"><span class="pre">.contains_replacement_characters</span></tt> is <tt class="docutils literal"><span class="pre">False</span></tt>, you'll know that the � was there originally and doesn't stand in for missing data.</p>
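+<p>The effect of the replacement character can be reproduced with Python's own decoder. This is only an illustration of what U+FFFD substitution looks like, assuming Python 3; it is not a description of how Unicode, Dammit works internally:</p>

```python
# A Latin-1 byte (0xE9) embedded in what is otherwise UTF-8 text.
raw = b"Sacr\xe9 bleu!"

# errors="replace" swaps each undecodable byte for U+FFFD instead of raising.
text = raw.decode("utf-8", errors="replace")

# The original byte cannot be recovered from the decoded text.
lossy = "\ufffd" in text
```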
+<div class="section" id="output-encoding">
+<h2>Output encoding<a class="headerlink" href="#output-encoding" title="Permalink to this headline">¶</a></h2>
+<p>When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn't in UTF-8 to begin with. Here is a document written in the Latin-1 encoding:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="n">b</span><span class="s">'''</span>
+<span class="s"> &lt;html&gt;</span>
+<span class="s"> &lt;head&gt;</span>
+<span class="s"> &lt;meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" /&gt;</span>
+<span class="s"> &lt;/head&gt;</span>
+<span class="s"> &lt;body&gt;</span>
+<span class="s"> &lt;p&gt;Sacr</span><span class="se">\xe9</span><span class="s"> bleu!&lt;/p&gt;</span>
+<span class="s"> &lt;/body&gt;</span>
+<span class="s"> &lt;/html&gt;</span>
+<span class="s">'''</span>
+
+<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;head&gt;</span>
+<span class="c"># &lt;meta content="text/html; charset=utf-8" http-equiv="Content-type" /&gt;</span>
+<span class="c"># &lt;/head&gt;</span>
+<span class="c"># &lt;body&gt;</span>
+<span class="c"># &lt;p&gt;</span>
+<span class="c"># Sacré bleu!</span>
+<span class="c"># &lt;/p&gt;</span>
+<span class="c"># &lt;/body&gt;</span>
+<span class="c"># &lt;/html&gt;</span>
+</pre></div>
+</div>
+<p>
+Note that the &lt;meta&gt; tag has been rewritten to reflect the fact that the document is now in UTF-8.</p>
+<p>If you don't want UTF-8, you can pass an encoding into <tt class="docutils literal"><span class="pre">prettify()</span></tt>:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">))</span>
+<span class="c"># &lt;html&gt;</span>
+<span class="c"># &lt;head&gt;</span>
+<span class="c"># &lt;meta content="text/html; charset=latin-1" http-equiv="Content-type" /&gt;</span>
+<span class="c"># ...</span>
+</pre></div>
+</div>
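+<p>You can also call <tt class="docutils literal"><span class="pre">encode()</span></tt> on the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object, or on any element in the soup, just as if it were a Python string:</p>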
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">)</span>
+<span class="c"># '&lt;p&gt;Sacr\xe9 bleu!&lt;/p&gt;'</span>
+
+<span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"utf-8"</span><span class="p">)</span>
+<span class="c"># '&lt;p&gt;Sacr\xc3\xa9 bleu!&lt;/p&gt;'</span>
+</pre></div>
+</div>
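+<p>Any characters that can't be represented in your chosen encoding will be converted into numeric XML entity references. Here's a document that includes the Unicode character SNOWMAN:</p>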
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">u"&lt;b&gt;</span><span class="se">\N{SNOWMAN}</span><span class="s">&lt;/b&gt;"</span>
+<span class="n">snowman_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
+<span class="n">tag</span> <span class="o">=</span> <span class="n">snowman_soup</span><span class="o">.</span><span class="n">b</span>
+</pre></div>
+</div>
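+<p>The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but there's no representation for that character in ISO-Latin-1 or ASCII, so for those encodings it's converted into “&amp;#9731;”:</p>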
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"utf-8"</span><span class="p">))</span>
+<span class="c"># &lt;b&gt;☃&lt;/b&gt;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">))</span>
+<span class="c"># &lt;b&gt;&amp;#9731;&lt;/b&gt;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"ascii"</span><span class="p">))</span>
+<span class="c"># &lt;b&gt;&amp;#9731;&lt;/b&gt;</span>
+</pre></div>
+</div>
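+<p>The same fallback can be observed with nothing but the Python standard library. The sketch below does not call Beautiful Soup and is not its internal code path. It assumes Python 3 and uses the built-in <tt class="docutils literal"><span class="pre">xmlcharrefreplace</span></tt> error handler of <tt class="docutils literal"><span class="pre">str.encode()</span></tt> to illustrate the behavior described above:</p>

```python
# Standard-library illustration only; this is not Beautiful Soup's own code path.
markup = "<b>\N{SNOWMAN}</b>"

# UTF-8 can represent the snowman directly (three bytes: E2 98 83).
as_utf8 = markup.encode("utf-8")

# Latin-1 and ASCII cannot, so the error handler substitutes a numeric
# character reference built from the code point (U+2603 == 9731).
as_latin1 = markup.encode("latin-1", errors="xmlcharrefreplace")
as_ascii = markup.encode("ascii", errors="xmlcharrefreplace")
```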
+</div>
+<div class="section" id="unicode-dammit">
+<h2>Unicode, Dammit<a class="headerlink" href="#unicode-dammit" title="Permalink to this headline">¶</a></h2>
+<p>You can use Unicode, Dammit without using Beautiful Soup. It's useful whenever you have data in an unknown encoding and you just want it to become Unicode:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">UnicodeDammit</span>
+<span class="n">dammit</span> <span class="o">=</span> <span class="n">UnicodeDammit</span><span class="p">(</span><span class="s">"Sacr</span><span class="se">\xc3\xa9</span><span class="s"> bleu!"</span><span class="p">)</span>
+<span class="k">print</span><span class="p">(</span><span class="n">dammit</span><span class="o">.</span><span class="n">unicode_markup</span><span class="p">)</span>
+<span class="c"># Sacré bleu!</span>
+<span class="n">dammit</span><span class="o">.</span><span class="n">original_encoding</span>
+<span class="c"># 'utf-8'</span>
+</pre></div>
+</div>
+<p>
+The more data you give Unicode, Dammit, the more accurately it will guess. If you have your own suspicions as to what the encoding might be, you can pass them in as a list:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">dammit</span> <span class="o">=</span> <span class="n">UnicodeDammit</span><span class="p">(</span><span class="s">"Sacr</span><span class="se">\xe9</span><span class="s"> bleu!"</span><span class="p">,</span> <span class="p">[</span><span class="s">"latin-1"</span><span class="p">,</span> <span class="s">"iso-8859-1"</span><span class="p">])</span>
+<span class="k">print</span><span class="p">(</span><span class="n">dammit</span><span class="o">.</span><span class="n">unicode_markup</span><span class="p">)</span>
+<span class="c"># Sacré bleu!</span>
+<span class="n">dammit</span><span class="o">.</span><span class="n">original_encoding</span>
+<span class="c"># 'latin-1'</span>
+</pre></div>
+</div>
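+<p>The idea of trying a list of candidate encodings can be sketched with the standard library alone. The helper below, <tt class="docutils literal"><span class="pre">decode_with_candidates</span></tt>, is a hypothetical name introduced for illustration. It is much simpler than whatever Unicode, Dammit actually does, and it assumes Python 3. When every candidate fails it falls back to a lossy decode, which is where the REPLACEMENT CHARACTER (U+FFFD) comes from:</p>

```python
def decode_with_candidates(data, candidates):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess (or unknown codec name): try the next candidate.
            continue
    # Last resort: a lossy decode that inserts U+FFFD for undecodable bytes.
    return data.decode("utf-8", errors="replace"), None


text, used = decode_with_candidates(b"Sacr\xe9 bleu!", ["utf-8", "latin-1"])
lossy, fallback = decode_with_candidates(b"Sacr\xe9 bleu!", ["ascii"])
has_replacement = "\ufffd" in lossy
```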
+<p>Unicode, Dammit has two special features that Beautiful Soup doesn't use.</p>
+<div class="section" id="smart-quotes">
+<h3>Smart quotes<a class="headerlink" href="#smart-quotes" title="Permalink to this headline">¶</a></h3>
+<p>You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="n">b</span><span class="s">"&lt;p&gt;I just </span><span class="se">\x93</span><span class="s">love</span><span class="se">\x94</span><span class="s"> Microsoft Word</span><span class="se">\x92</span><span class="s">s smart quotes&lt;/p&gt;"</span>
+
+<span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">"windows-1252"</span><span class="p">],</span> <span class="n">smart_quotes_to</span><span class="o">=</span><span class="s">"html"</span><span class="p">)</span><span class="o">.</span><span class="n">unicode_markup</span>
+<span class="c"># u'&lt;p&gt;I just &amp;ldquo;love&amp;rdquo; Microsoft Word&amp;rsquo;s smart quotes&lt;/p&gt;'</span>
+
+<span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">"windows-1252"</span><span class="p">],</span> <span class="n">smart_quotes_to</span><span class="o">=</span><span class="s">"xml"</span><span class="p">)</span><span class="o">.</span><span class="n">unicode_markup</span>
+<span class="c"># u'&lt;p&gt;I just &amp;#x201C;love&amp;#x201D; Microsoft Word&amp;#x2019;s smart quotes&lt;/p&gt;'</span>
+</pre></div>
+</div>
+<p>You can also convert Microsoft smart quotes to ASCII quotes:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">"windows-1252"</span><span class="p">],</span> <span class="n">smart_quotes_to</span><span class="o">=</span><span class="s">"ascii"</span><span class="p">)</span><span class="o">.</span><span class="n">unicode_markup</span>
+<span class="c"># u'&lt;p&gt;I just "love" Microsoft Word\'s smart quotes&lt;/p&gt;'</span>
+</pre></div>
+</div>
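+<p>Hopefully you'll find this feature useful, but Beautiful Soup doesn't use it. Beautiful Soup prefers the default behavior, which is to convert Microsoft smart quotes to Unicode characters along with everything else:</p>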
+<div class="highlight-python"><div class="highlight"><pre><span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">"windows-1252"</span><span class="p">])</span><span class="o">.</span><span class="n">unicode_markup</span>
+<span class="c"># u'&lt;p&gt;I just \u201clove\u201d Microsoft Word\u2019s smart quotes&lt;/p&gt;'</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="inconsistent-encodings">
+<h3>Inconsistent encodings<a class="headerlink" href="#inconsistent-encodings" title="Permalink to this headline">¶</a></h3>
+<p>Sometimes a document is mostly in UTF-8, but contains Windows-1252 characters such as (again) Microsoft smart quotes. This can happen when a website includes data from multiple sources.
+You can use <tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> to turn such a document into pure UTF-8. Here's a simple example:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">snowmen</span> <span class="o">=</span> <span class="p">(</span><span class="s">u"</span><span class="se">\N{SNOWMAN}</span><span class="s">"</span> <span class="o">*</span> <span class="mi">3</span><span class="p">)</span>
+<span class="n">quote</span> <span class="o">=</span> <span class="p">(</span><span class="s">u"</span><span class="se">\N{LEFT DOUBLE QUOTATION MARK}</span><span class="s">I like snowmen!</span><span class="se">\N{RIGHT DOUBLE QUOTATION MARK}</span><span class="s">"</span><span class="p">)</span>
+<span class="n">doc</span> <span class="o">=</span> <span class="n">snowmen</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"utf8"</span><span class="p">)</span> <span class="o">+</span> <span class="n">quote</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"windows_1252"</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252. You can display the snowmen or the quotes, but not both:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span>
+<span class="c"># ☃☃☃�I like snowmen!�</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">doc</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s">"windows-1252"</span><span class="p">))</span>
+<span class="c"># â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”</span>
+</pre></div>
+</div>
+<p>Decoding the document as UTF-8 raises a <tt class="docutils literal"><span class="pre">UnicodeDecodeError</span></tt>, and decoding it as Windows-1252 gives you gibberish. Fortunately, <tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> will convert the string to pure UTF-8, allowing you to decode it to Unicode and display the snowmen and quote marks simultaneously:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">new_doc</span> <span class="o">=</span> <span class="n">UnicodeDammit</span><span class="o">.</span><span class="n">detwingle</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span>
+<span class="k">print</span><span class="p">(</span><span class="n">new_doc</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s">"utf8"</span><span class="p">))</span>
+<span class="c"># ☃☃☃“I like snowmen!”</span>
+</pre></div>
+</div>
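+<p>The two failure modes mentioned before the <tt class="docutils literal"><span class="pre">detwingle()</span></tt> call can be checked without Beautiful Soup. This is a standard-library sketch that assumes Python 3. It rebuilds the same mixed-encoding byte string and records what each decoding attempt does:</p>

```python
# Rebuild the mixed document: UTF-8 snowmen followed by Windows-1252 quotes.
snowmen = "\N{SNOWMAN}" * 3
quote = ("\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!"
         "\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# Strict UTF-8 decoding fails on the stray Windows-1252 quote bytes.
try:
    doc.decode("utf8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Windows-1252 decoding succeeds, but each snowman's three UTF-8 bytes
# are read as three unrelated characters, so only the quotes survive.
as_cp1252 = doc.decode("windows-1252")
snowmen_garbled = not as_cp1252.startswith(snowmen)
quotes_intact = as_cp1252.endswith(quote)
```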
+<p><tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> only knows how to handle Windows-1252 embedded in UTF-8 (or vice versa, I suppose), but this is the most common case.</p>
+<p>Note that you must know to call <tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> on your data before passing it into the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> or <tt class="docutils literal"><span class="pre">UnicodeDammit</span></tt> constructor. Beautiful Soup assumes that a document has a single encoding, whatever it might be. If you pass it a document that contains both UTF-8 and Windows-1252, it's likely to think the whole document is Windows-1252, and the document will come out looking like <cite>â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”</cite>.</p>
+<p><tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> is new in Beautiful Soup 4.1.0.</p>
+</div>
+</div>
+</div>
+<div class="section" id="parsing-only-part-of-a-document">
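+<h1>Parsing only part of a document<a class="headerlink" href="#parsing-only-part-of-a-document" title="Permalink to this headline">¶</a></h1>
+<p>Let's say you want to use Beautiful Soup to look at a document's &lt;a&gt; tags. It's a waste of time and memory to parse the entire document and then go over it again looking for &lt;a&gt; tags. It would be much faster to ignore everything that isn't an &lt;a&gt; tag in the first place. The <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> class lets you choose which parts of an incoming document are parsed. You just create a <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> and pass it in to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor as the <tt class="docutils literal"><span class="pre">parse_only</span></tt> argument.</p>
+<p>(<em>Note that this feature won't work if you're using the html5lib parser</em>. If you use html5lib, the whole document will be parsed, no matter what. This is because html5lib constantly rearranges the parse tree as it works, and if some part of the document didn't actually make it into the parse tree, it'll crash. To avoid confusion, in the examples below I'll be forcing Beautiful Soup to use Python's built-in parser.)</p>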
+<div class="section" id="soupstrainer">
+<h2><tt class="docutils literal"><span class="pre">SoupStrainer</span></tt><a class="headerlink" href="#soupstrainer" title="Permalink to this headline">¶</a></h2>
+<p>The <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> class takes the same arguments as a typical method from <a class="reference internal" href="#searching-the-tree">Searching the tree</a>: <a class="reference internal" href="#id8"><em>name</em></a>, <a class="reference internal" href="#attrs"><em>attrs</em></a>, <a class="reference internal" href="#text"><em>text</em></a>, and <a class="reference internal" href="#kwargs"><em>**kwargs</em></a>. Here are three <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> objects:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">SoupStrainer</span>
+
+<span class="n">only_a_tags</span> <span class="o">=</span> <span class="n">SoupStrainer</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span>
+
+<span class="n">only_tags_with_id_link2</span> <span class="o">=</span> <span class="n">SoupStrainer</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">"link2"</span><span class="p">)</span>
+
+<span class="k">def</span> <span class="nf">is_short_string</span><span class="p">(</span><span class="n">string</span><span class="p">):</span>
+ <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">string</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">10</span>
+
+<span class="n">only_short_strings</span> <span class="o">=</span> <span class="n">SoupStrainer</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">is_short_string</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>I'm going to bring back the “three sisters” document one more time, and we'll see what the document looks like when it's parsed with these three <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> objects:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">html_doc</span> <span class="o">=</span> <span class="s">"""</span>
+<span class="s">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;</span>
+
+<span class="s">&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</span>
+
+<span class="s">&lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were</span>
+<span class="s">&lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/a&gt;,</span>
+<span class="s">&lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt; and</span>
+<span class="s">&lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;;</span>
+<span class="s">and they lived at the bottom of a well.&lt;/p&gt;</span>
+
+<span class="s">&lt;p class="story"&gt;...&lt;/p&gt;</span>
+<span class="s">"""</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">,</span> <span class="s">"html.parser"</span><span class="p">,</span> <span class="n">parse_only</span><span class="o">=</span><span class="n">only_a_tags</span><span class="p">)</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;</span>
+<span class="c"># Elsie</span>
+<span class="c"># &lt;/a&gt;</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;</span>
+<span class="c"># Lacie</span>
+<span class="c"># &lt;/a&gt;</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;</span>
+<span class="c"># Tillie</span>
+<span class="c"># &lt;/a&gt;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">,</span> <span class="s">"html.parser"</span><span class="p">,</span> <span class="n">parse_only</span><span class="o">=</span><span class="n">only_tags_with_id_link2</span><span class="p">)</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;</span>
+<span class="c"># Lacie</span>
+<span class="c"># &lt;/a&gt;</span>
+
+<span class="k">print</span><span class="p">(</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">,</span> <span class="s">"html.parser"</span><span class="p">,</span> <span class="n">parse_only</span><span class="o">=</span><span class="n">only_short_strings</span><span class="p">)</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
+<span class="c"># Elsie</span>
+<span class="c"># ,</span>
+<span class="c"># Lacie</span>
+<span class="c"># and</span>
+<span class="c"># Tillie</span>
+<span class="c"># ...</span>
+<span class="c">#</span>
+</pre></div>
+</div>
+<p>You can also pass a <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> into any of the methods covered in <a class="reference internal" href="#searching-the-tree">Searching the tree</a>. This probably isn't terribly useful, but I thought I'd mention it:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
+<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">only_short_strings</span><span class="p">)</span>
+<span class="c"># [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',</span>
+<span class="c"># u'\n\n', u'...', u'\n']</span>
+</pre></div>
+</div>
+</div>
+</div>
+<div class="section" id="troubleshooting">
+<h1>Troubleshooting<a class="headerlink" href="#troubleshooting" title="Permalink to this headline">¶</a></h1>
+<div class="section" id="version-mismatch-problems">
+<h2>Version mismatch problems<a class="headerlink" href="#version-mismatch-problems" title="Permalink to this headline">¶</a></h2>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">SyntaxError:</span> <span class="pre">Invalid</span> <span class="pre">syntax</span></tt> (on the line <tt class="docutils literal"><span class="pre">ROOT_TAG_NAME</span> <span class="pre">=</span>
+<span class="pre">u'[document]'</span></tt>): Caused by running the Python 2 version of Beautiful Soup under Python 3, without converting the code.</li>
+<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">HTMLParser</span></tt> - Caused by running the Python 2 version of Beautiful Soup under Python 3.</li>
+<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">html.parser</span></tt> - Caused by running the Python 3 version of Beautiful Soup under Python 2.</li>
+<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">BeautifulSoup</span></tt> - Caused by running Beautiful Soup 3 code on a system that doesn't have BS3 installed, or by writing Beautiful Soup 4 code without knowing that the package name has changed to <tt class="docutils literal"><span class="pre">bs4</span></tt>.</li>
+<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">bs4</span></tt> - Caused by running Beautiful Soup 4 code on a system that doesn't have BS4 installed.</li>
+</ul>
+</div>
+<div class="section" id="parsing-xml">
+<span id="id11"></span><h2>Parsing XML<a class="headerlink" href="#parsing-xml" title="Permalink to this headline">¶</a></h2>
+<p>By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass in “xml” as the second argument to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="s">"xml"</span><span class="p">)</span>
+</pre></div>
+</div>
+<p>You'll need to <a class="reference internal" href="#parser-installation"><em>have lxml installed</em></a>.</p>
+</div>
+<div class="section" id="other-parser-problems">
+<h2>Other parser problems<a class="headerlink" href="#other-parser-problems" title="Permalink to this headline">¶</a></h2>
+<ul class="simple">
+<li>If your script works on one computer but not another, it's probably because the two computers have different parser libraries available. For example, you may have developed the script on a computer that has lxml installed, and then tried to run it on a computer that only has html5lib installed. See <a class="reference internal" href="#differences-between-parsers">Differences between parsers</a> for why this matters, and fix the problem by specifying a particular parser library in the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor.</li>
+<li><tt class="docutils literal"><span class="pre">HTMLParser.HTMLParseError:</span> <span class="pre">malformed</span> <span class="pre">start</span> <span class="pre">tag</span></tt> or
+<tt class="docutils literal"><span class="pre">HTMLParser.HTMLParseError:</span> <span class="pre">bad</span> <span class="pre">end</span> <span class="pre">tag</span></tt> - Caused by giving Python's built-in HTML parser a document it can't handle. Any other <tt class="docutils literal"><span class="pre">HTMLParseError</span></tt> is probably the same problem. Solution:
+<a class="reference internal" href="#parser-installation"><em>Install lxml or html5lib.</em></a></li>
+<li>If you can't find a tag that you know is in the document (that is,
+<tt class="docutils literal"><span class="pre">find_all()</span></tt> returned <tt class="docutils literal"><span class="pre">[]</span></tt> or <tt class="docutils literal"><span class="pre">find()</span></tt> returned <tt class="docutils literal"><span class="pre">None</span></tt>), you're probably using Python's built-in HTML parser, which sometimes skips tags it doesn't understand. Solution: <a class="reference internal" href="#parser-installation"><em>Install lxml or html5lib.</em></a></li>
+<li>
+<a class="reference external" href="https://web.archive.org/web/20150319200824/http://www.w3.org/TR/html5/syntax.html#syntax">HTML tags and attributes are case-insensitive</a>, so all three HTML parsers convert tag and attribute names to lowercase. That is, the markup &lt;TAG&gt;&lt;/TAG&gt; is converted to &lt;tag&gt;&lt;/tag&gt;. If you want to preserve mixed-case or uppercase tags and attributes, you'll need to <a class="reference internal" href="#parsing-xml"><em>parse the document as XML.</em></a></li>
+</ul>
+</div>
+<div class="section" id="miscellaneous">
+<h2>Miscellaneous<a class="headerlink" href="#miscellaneous" title="Permalink to this headline">¶</a></h2>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">KeyError:</span> <span class="pre">[attr]</span></tt> - Caused by accessing <tt class="docutils literal"><span class="pre">tag['attr']</span></tt> when the tag in question doesn't define the <tt class="docutils literal"><span class="pre">attr</span></tt> attribute. The most common errors are <tt class="docutils literal"><span class="pre">KeyError:</span> <span class="pre">'href'</span></tt> and <tt class="docutils literal"><span class="pre">KeyError:</span>
+<span class="pre">'class'</span></tt>. Use <tt class="docutils literal"><span class="pre">tag.get('attr')</span></tt> if you're not sure <tt class="docutils literal"><span class="pre">attr</span></tt> is defined, just as you would with a Python dictionary.</li>
+<li><tt class="docutils literal"><span class="pre">UnicodeEncodeError:</span> <span class="pre">'charmap'</span> <span class="pre">codec</span> <span class="pre">can't</span> <span class="pre">encode</span> <span class="pre">character</span>
+<span class="pre">u'\xfoo'</span> <span class="pre">in</span> <span class="pre">position</span> <span class="pre">bar</span></tt> (or just about any other <tt class="docutils literal"><span class="pre">UnicodeEncodeError</span></tt>) - This is not a problem with Beautiful Soup. It shows up in two main situations. First, when you try to print a Unicode character that your console doesn't know how to display. (See <a class="reference external" href="https://web.archive.org/web/20150319200824/http://wiki.python.org/moin/PrintFails">this page on the Python wiki</a> for help.) Second, when you're writing to a file and you pass in a Unicode character that's not supported by your default encoding. In this case, the simplest solution is to explicitly encode the Unicode string into UTF-8 with <tt class="docutils literal"><span class="pre">u.encode("utf8")</span></tt>.</li>
+</ul>
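+<p>The second <tt class="docutils literal"><span class="pre">UnicodeEncodeError</span></tt> situation and its fix can be reproduced with the standard library alone. This is a sketch that assumes Python 3. The <tt class="docutils literal"><span class="pre">io.BytesIO</span></tt> buffer is only a stand-in for a real file opened in binary mode, and the <tt class="docutils literal"><span class="pre">ascii</span></tt> codec stands in for a restrictive default encoding:</p>

```python
import io

text = "Sacr\u00e9 bleu! \N{SNOWMAN}"

# A codec that cannot represent every character raises UnicodeEncodeError.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding explicitly to UTF-8 always succeeds for valid Unicode text.
encoded = text.encode("utf8")

# BytesIO stands in for a file opened with mode "wb".
buffer = io.BytesIO()
buffer.write(encoded)
round_tripped = buffer.getvalue().decode("utf8")
```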
+</div>
+<div class="section" id="improving-performance">
+<h2>Improving performance<a class="headerlink" href="#improving-performance" title="Permalink to this headline">¶</a></h2>
+<p>Beautiful Soup will never be as fast as the parsers it sits on top of. If response time is critical, if you're paying for computer time by the hour, or if there's any other reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop <a class="reference external" href="https://web.archive.org/web/20150319200824/http://lxml.de/">lxml</a>.</p>
+<p>That said, there are things you can do to speed up Beautiful Soup. If you're not using lxml as the underlying parser, my advice is to <a class="reference internal" href="#parser-installation"><em>start</em></a>. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib.</p>
+<p>You can speed up encoding detection significantly by installing the <a class="reference external" href="https://web.archive.org/web/20150319200824/http://pypi.python.org/pypi/cchardet/">cchardet</a> library.</p>
+<p>Sometimes <a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> can only detect the encoding by examining the file byte by byte. This slows Beautiful Soup to a crawl. My tests indicate that this only happened on 2.x versions of Python, and that it happened most often with documents using Russian or Chinese encodings. If this is happening to you, you can fix it by installing cchardet, or by using Python 3 for your script. If you happen to know a document's encoding, you can pass it into the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor as <tt class="docutils literal"><span class="pre">from_encoding</span></tt>, and encoding detection will be skipped entirely.</p>
+<p><a class="reference internal" href="#parsing-only-part-of-a-document">Parsing only part of a document</a> won't save you much time parsing the document, but it can save a lot of memory, and it'll make <cite>searching</cite> the document much faster.</p>
+</div>
+</div>
+<div class="section" id="id12">
+<h1>Beautiful Soup 3<a class="headerlink" href="#id12" title="Permalink to this headline">¶</a></h1>
+<p>Beautiful Soup 3 is the previous release series, and is no longer being actively developed. It's currently packaged with all major Linux distributions:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">python-beautifulsoup</span></tt></p>
+<p>It's also published through PyPi as <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt>:</p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">BeautifulSoup</span></tt></p>
+<p><tt class="kbd docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">BeautifulSoup</span></tt></p>
+<p>
+You can also download <a class="reference external" href="https://web.archive.org/web/20150319200824/http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz">a tarball of Beautiful Soup 3.2.0</a>.</p>
+<p>If you ran <tt class="docutils literal"><span class="pre">easy_install</span> <span class="pre">beautifulsoup</span></tt> or <tt class="docutils literal"><span class="pre">easy_install</span> <span class="pre">BeautifulSoup</span></tt>, but your code doesn't work, you installed Beautiful Soup 3 by mistake. You need to run <tt class="docutils literal"><span class="pre">easy_install</span> <span class="pre">beautifulsoup4</span></tt>.</p>
+<p><a class="reference external" href="https://web.archive.org/web/20150319200824/http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">The documentation for Beautiful Soup 3 is archived online</a>. If your first language is Chinese, it might be easier for you to read <a class="reference external" href="https://web.archive.org/web/20150319200824/http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html">the Chinese translation of the Beautiful Soup 3 documentation</a>, then read this document to find out about the changes made in Beautiful Soup 4.</p>
+<div class="section" id="porting-code-to-bs4">
+<h2>Porting code to BS4<a class="headerlink" href="#porting-code-to-bs4" title="Permalink to this headline">¶</a></h2>
+<p>
+Most code written against Beautiful Soup 3 will work against Beautiful Soup 4 with one simple change. All you should have to do is change the package name from <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> to <tt class="docutils literal"><span class="pre">bs4</span></tt>. So this:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">BeautifulSoup</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
+</pre></div>
+</div>
+<p>becomes this:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
+</pre></div>
+</div>
+<ul class="simple">
+<li>If you get the <tt class="docutils literal"><span class="pre">ImportError</span></tt> “No module named BeautifulSoup”, your problem is that you're trying to run Beautiful Soup 3 code, but you only have Beautiful Soup 4 installed.</li>
+<li>If you get the <tt class="docutils literal"><span class="pre">ImportError</span></tt> “No module named bs4”, your problem is that you're trying to run Beautiful Soup 4 code, but you only have Beautiful Soup 3 installed.</li>
+</ul>
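+<p>While a codebase is being migrated, one way to see which package a given environment actually provides is to try both imports. This is only a sketch: the name <tt class="docutils literal"><span class="pre">BS_MAJOR_VERSION</span></tt> is made up for illustration, and because the two releases are not fully compatible, code that falls back to Beautiful Soup 3 may still need version-specific handling:</p>

```python
# Probe for Beautiful Soup 4 first, then fall back to Beautiful Soup 3.
# BS_MAJOR_VERSION is a hypothetical name used only in this sketch.
try:
    from bs4 import BeautifulSoup  # Beautiful Soup 4 package name
    BS_MAJOR_VERSION = 4
except ImportError:
    try:
        from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 package name
        BS_MAJOR_VERSION = 3
    except ImportError:
        # Neither release is installed in this environment.
        BeautifulSoup = None
        BS_MAJOR_VERSION = None
```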
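+<p>Although BS4 is mostly backwards-compatible with BS3, most of its methods have been deprecated and given new names for <a class="reference external" href="https://web.archive.org/web/20150319200824/http://www.python.org/dev/peps/pep-0008/">PEP 8 compliance</a>. There are numerous other renames and changes, and a few of them break backwards compatibility.</p>
+<p>Here's what you'll need to know to convert your BS3 code and habits to BS4:</p>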
+<div class="section" id="you-need-a-parser">
+<h3>You need a parser<a class="headerlink" href="#you-need-a-parser" title="Permalink to this headline">¶</a></h3>
+<p>Beautiful Soup 3 used Python's <tt class="docutils literal"><span class="pre">SGMLParser</span></tt>. This module was removed in Python 3.0. Beautiful Soup 4 uses <tt class="docutils literal"><span class="pre">html.parser</span></tt> by default, but you can install lxml or html5lib and use that instead. See <a class="reference internal" href="#installing-a-parser">Installing a parser</a> for a comparison.</p>
+<p>
+Since <tt class="docutils literal"><span class="pre">html.parser</span></tt> is not the same parser as <tt class="docutils literal"><span class="pre">SGMLParser</span></tt>, it will treat invalid markup differently. Usually the “difference” is that <tt class="docutils literal"><span class="pre">html.parser</span></tt> crashes. In that case, you'll need to install another parser. But sometimes <tt class="docutils literal"><span class="pre">html.parser</span></tt> just creates a different parse tree than <tt class="docutils literal"><span class="pre">SGMLParser</span></tt> would. If this happens, you may need to update your BS3 code to deal with the new tree.</p>
+</div>
+<div class="section" id="method-names">
+<h3>Method names<a class="headerlink" href="#method-names" title="Permalink to this headline">¶</a></h3>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">renderContents</span></tt> -&gt; <tt class="docutils literal"><span class="pre">encode_contents</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">replaceWith</span></tt> -&gt; <tt class="docutils literal"><span class="pre">replace_with</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">replaceWithChildren</span></tt> -&gt; <tt class="docutils literal"><span class="pre">unwrap</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findAll</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_all</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findAllNext</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_all_next</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findAllPrevious</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_all_previous</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findNext</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_next</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findNextSibling</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_next_sibling</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findNextSiblings</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_next_siblings</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findParent</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_parent</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findParents</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_parents</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findPrevious</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_previous</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findPreviousSibling</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_previous_sibling</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">findPreviousSiblings</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_previous_siblings</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">nextSibling</span></tt> -&gt; <tt class="docutils literal"><span class="pre">next_sibling</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">previousSibling</span></tt> -&gt; <tt class="docutils literal"><span class="pre">previous_sibling</span></tt></li>
+</ul>
+<p>Some arguments to the Beautiful Soup constructor were renamed for the same reason:</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">BeautifulSoup(parseOnlyThese=...)</span></tt> -&gt; <tt class="docutils literal"><span class="pre">BeautifulSoup(parse_only=...)</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">BeautifulSoup(fromEncoding=...)</span></tt> -&gt; <tt class="docutils literal"><span class="pre">BeautifulSoup(from_encoding=...)</span></tt></li>
+</ul>
+<p>One method was renamed for compatibility with Python 3:</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">Tag.has_key()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">Tag.has_attr()</span></tt></li>
+</ul>
+<p>One attribute was renamed to use more accurate terminology:</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">Tag.isSelfClosing</span></tt> -&gt; <tt class="docutils literal"><span class="pre">Tag.is_empty_element</span></tt></li>
+</ul>
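+<p>Three attributes were renamed to avoid words that have special meaning in Python. Unlike the other changes, these are <em>not backwards compatible.</em> If you used these attributes in BS3, your code will break when ported to BS4 until you change them.</p>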
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">UnicodeDammit.unicode</span></tt> -&gt; <tt class="docutils literal"><span class="pre">UnicodeDammit.unicode_markup</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">Tag.next</span></tt> -&gt; <tt class="docutils literal"><span class="pre">Tag.next_element</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">Tag.previous</span></tt> -&gt; <tt class="docutils literal"><span class="pre">Tag.previous_element</span></tt></li>
+</ul>
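+<p>The renames above can be sketched on a small document. The markup and variable names here are invented for illustration; the BS3 spellings appear only in comments.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document.
soup = BeautifulSoup('<p id="a">one</p><p>two</p>', "html.parser")

first = soup.find_all("p")[0]       # BS3 spelling: soup.findAll("p")
has_id = first.has_attr("id")       # BS3 spelling: first.has_key("id")
sibling = first.next_sibling        # BS3 spelling: first.nextSibling
following = first.next_element      # BS3 spelling: first.next (not kept in BS4)

print(has_id, sibling, repr(following))
```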
+</div>
+<div class="section" id="generators">
+<h3>Generators<a class="headerlink" href="#generators" title="Permalink to this headline">¶</a></h3>
+<p>The generators were given PEP 8-compliant names and turned into properties:</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">childGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">children</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">nextGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">next_elements</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">nextSiblingGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">next_siblings</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">previousGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">previous_elements</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">previousSiblingGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">previous_siblings</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">recursiveChildGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">descendants</span></tt></li>
+<li><tt class="docutils literal"><span class="pre">parentGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">parents</span></tt></li>
+</ul>
+<p>So instead of this:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">tag</span><span class="o">.</span><span class="n">parentGenerator</span><span class="p">():</span>
+ <span class="o">...</span>
+</pre></div>
+</div>
+<p>You can write this:</p>
+<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">tag</span><span class="o">.</span><span class="n">parents</span><span class="p">:</span>
+ <span class="o">...</span>
+</pre></div>
+</div>
+<p>(But the old code will still work.)</p>
+<p>Some of the generators used to yield <tt class="docutils literal"><span class="pre">None</span></tt> after they were done. That was a bug. Now the generators just stop.</p>
+<p>There are two new generators, <a class="reference internal" href="#string-generators"><em>.strings and .stripped_strings</em></a>. <tt class="docutils literal"><span class="pre">.strings</span></tt> yields NavigableString objects, and <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt> yields Python strings with the whitespace stripped.</p>
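+<p>A short sketch of the generator properties, using a made-up fragment. It contrasts <tt class="docutils literal"><span class="pre">.strings</span></tt> with <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt> and walks <tt class="docutils literal"><span class="pre">.parents</span></tt> as a property rather than a method call.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with whitespace between and inside the tags.
soup = BeautifulSoup("<div>\n <p> one </p>\n <p>two</p>\n</div>", "html.parser")

# .strings yields every string under the tag, whitespace included.
raw = list(soup.div.strings)

# .stripped_strings drops whitespace-only strings and strips the rest.
clean = list(soup.div.stripped_strings)

# .parents is a property now; no call parentheses are needed.
parent_names = [parent.name for parent in soup.p.parents]

print(raw)
print(clean)
print(parent_names)
```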
+</div>
+<div class="section" id="xml">
+<h3>XML<a class="headerlink" href="#xml" title="Permalink to this headline">¶</a></h3>
+<p>There is no longer a <tt class="docutils literal"><span class="pre">BeautifulStoneSoup</span></tt> class for parsing XML. To parse XML, pass “xml” as the second argument to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor. For the same reason, the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor no longer recognizes the <tt class="docutils literal"><span class="pre">isHTML</span></tt> argument.</p>
+<p>Beautiful Soup's handling of empty-element XML tags has been improved. Previously, when you parsed XML you had to say explicitly which tags were considered empty-element tags. The <tt class="docutils literal"><span class="pre">selfClosingTags</span></tt> argument to the constructor is no longer recognized. Instead,
+Beautiful Soup considers any empty tag to be an empty-element tag. If you add a child to an empty-element tag, it stops being an empty-element tag.</p>
+</div>
+<div class="section" id="entities">
+<h3>Entities<a class="headerlink" href="#entities" title="Permalink to this headline">¶</a></h3>
+<p>
+An incoming HTML or XML entity is always converted into the corresponding Unicode character. Beautiful Soup 3 had a number of overlapping ways of dealing with entities; that duplication has been removed. The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor no longer recognizes the
+<tt class="docutils literal"><span class="pre">smartQuotesTo</span></tt> or <tt class="docutils literal"><span class="pre">convertEntities</span></tt> arguments. (<a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> still has <tt class="docutils literal"><span class="pre">smart_quotes_to</span></tt>, but its default is now to turn smart quotes into Unicode.)
+The constants <tt class="docutils literal"><span class="pre">HTML_ENTITIES</span></tt>,
+<tt class="docutils literal"><span class="pre">XML_ENTITIES</span></tt>, and <tt class="docutils literal"><span class="pre">XHTML_ENTITIES</span></tt> have been removed, since they configured a feature (converting some, but not all, entities into Unicode characters) that no longer exists.</p>
+<p>If you want to turn Unicode characters back into HTML entities on output, rather than writing them out as UTF-8 characters, you need to use an <a class="reference internal" href="#output-formatters"><em>output formatter</em></a>.</p>
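+<p>A small sketch of the round trip, assuming the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> and an invented input string: entities become Unicode characters on the way in, and the <tt class="docutils literal"><span class="pre">"html"</span></tt> formatter turns them back into named entities on the way out.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical input that contains two named entities.
soup = BeautifulSoup("<p>caf&eacute; &amp; bar</p>", "html.parser")

# Inside the tree the entities are already plain Unicode characters.
text = soup.p.get_text()

# The default output escapes only what is needed for valid markup.
default_out = soup.p.decode()

# The "html" formatter also converts characters that have named entities.
html_out = soup.p.decode(formatter="html")

print(text)
print(default_out)
print(html_out)
```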
+</div>
+<div class="section" id="id13">
+<h3>Miscellaneous<a class="headerlink" href="#id13" title="Permalink to this headline">¶</a></h3>
+<p><a class="reference internal" href="#string"><em>Tag.string</em></a> now operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string. (Previously, it was None.)</p>
+<p><a class="reference internal" href="#multi-valued-attributes">Multi-valued attributes</a> like <tt class="docutils literal"><span class="pre">class</span></tt> have lists of strings as their values, not strings. This affects the way you search by CSS class.</p>
+<p>
+If you pass one of the <tt class="docutils literal"><span class="pre">find*</span></tt> methods both <a class="reference internal" href="#text"><em>text</em></a> <cite>and</cite> a tag-specific argument like <a class="reference internal" href="#id8"><em>name</em></a>, Beautiful Soup searches for tags that match your tag-specific criteria and whose <a class="reference internal" href="#string"><em>Tag.string</em></a> matches your value for <a class="reference internal" href="#text"><em>text</em></a>. It will <cite>not</cite> find the strings themselves. Previously, Beautiful Soup ignored the tag-specific arguments and looked for strings.</p>
+<p>
+The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor no longer recognizes the <cite>markupMassage</cite> argument. It is now the parser's responsibility to handle markup correctly.</p>
+<p>The rarely-used alternate parser classes like <tt class="docutils literal"><span class="pre">ICantBelieveItsBeautifulSoup</span></tt> and <tt class="docutils literal"><span class="pre">BeautifulSOAP</span></tt> have been removed. It is now the parser's decision how to handle ambiguous markup.</p>
+<p>
+The <tt class="docutils literal"><span class="pre">prettify()</span></tt> method now returns a Unicode string, not a bytestring.</p>
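+<p>Three of the changes above can be checked on a tiny, made-up document. This is a sketch that assumes Python 3, where a Unicode string is <tt class="docutils literal"><span class="pre">str</span></tt>.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <a> contains a single <b> and nothing else.
soup = BeautifulSoup('<a class="x y"><b>text</b></a>', "html.parser")

# Tag.string recurses through the only child.
link_string = soup.a.string

# A multi-valued attribute such as class is a list of strings.
classes = soup.a["class"]

# prettify() returns text, not bytes.
pretty = soup.prettify()

print(link_string, classes, type(pretty).__name__)
```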
+</div>
+</div>
+</div>
+
+
+ </div>
+ </div>
+ </div>
+ <div class="sphinxsidebar">
+ <div class="sphinxsidebarwrapper">
+ <h3><a href="#">Table Of Contents</a></h3>
+ <ul>
+<li><a class="reference internal" href="#">Beautiful Soup Documentation</a><ul>
+<li><a class="reference internal" href="#getting-help">Getting help</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#quick-start">Quick Start</a></li>
+<li><a class="reference internal" href="#installing-beautiful-soup">Installing Beautiful Soup</a><ul>
+<li><a class="reference internal" href="#problems-after-installation">Problems after installation</a></li>
+<li><a class="reference internal" href="#installing-a-parser">Installing a parser</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#making-the-soup">Making the soup</a></li>
+<li><a class="reference internal" href="#kinds-of-objects">Kinds of objects</a><ul>
+<li><a class="reference internal" href="#tag"><tt class="docutils literal"><span class="pre">Tag</span></tt></a><ul>
+<li><a class="reference internal" href="#name">Name</a></li>
+<li><a class="reference internal" href="#attributes">Attributes</a><ul>
+<li><a class="reference internal" href="#multi-valued-attributes">Multi-valued attributes</a></li>
+</ul>
+</li>
+</ul>
+</li>
+<li><a class="reference internal" href="#navigablestring"><tt class="docutils literal"><span class="pre">NavigableString</span></tt></a></li>
+<li><a class="reference internal" href="#beautifulsoup"><tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt></a></li>
+<li><a class="reference internal" href="#comments-and-other-special-strings">Comments and other special strings</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#navigating-the-tree">Navigating the tree</a><ul>
+<li><a class="reference internal" href="#going-down">Going down</a><ul>
+<li><a class="reference internal" href="#navigating-using-tag-names">Navigating using tag names</a></li>
+<li><a class="reference internal" href="#contents-and-children"><tt class="docutils literal"><span class="pre">.contents</span></tt> and <tt class="docutils literal"><span class="pre">.children</span></tt></a></li>
+<li><a class="reference internal" href="#descendants"><tt class="docutils literal"><span class="pre">.descendants</span></tt></a></li>
+<li><a class="reference internal" href="#string"><tt class="docutils literal"><span class="pre">.string</span></tt></a></li>
+<li><a class="reference internal" href="#strings-and-stripped-strings"><tt class="docutils literal"><span class="pre">.strings</span></tt> and <tt class="docutils literal"><span class="pre">stripped_strings</span></tt></a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#going-up">Going up</a><ul>
+<li><a class="reference internal" href="#parent"><tt class="docutils literal"><span class="pre">.parent</span></tt></a></li>
+<li><a class="reference internal" href="#parents"><tt class="docutils literal"><span class="pre">.parents</span></tt></a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#going-sideways">Going sideways</a><ul>
+<li><a class="reference internal" href="#next-sibling-and-previous-sibling"><tt class="docutils literal"><span class="pre">.next_sibling</span></tt> and <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt></a></li>
+<li><a class="reference internal" href="#next-siblings-and-previous-siblings"><tt class="docutils literal"><span class="pre">.next_siblings</span></tt> and <tt class="docutils literal"><span class="pre">.previous_siblings</span></tt></a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#going-back-and-forth">Going back and forth</a><ul>
+<li><a class="reference internal" href="#next-element-and-previous-element"><tt class="docutils literal"><span class="pre">.next_element</span></tt> and <tt class="docutils literal"><span class="pre">.previous_element</span></tt></a></li>
+<li><a class="reference internal" href="#next-elements-and-previous-elements"><tt class="docutils literal"><span class="pre">.next_elements</span></tt> and <tt class="docutils literal"><span class="pre">.previous_elements</span></tt></a></li>
+</ul>
+</li>
+</ul>
+</li>
+<li><a class="reference internal" href="#searching-the-tree">Searching the tree</a><ul>
+<li><a class="reference internal" href="#kinds-of-filters">Kinds of filters</a><ul>
+<li><a class="reference internal" href="#a-string">A string</a></li>
+<li><a class="reference internal" href="#a-regular-expression">A regular expression</a></li>
+<li><a class="reference internal" href="#a-list">A list</a></li>
+<li><a class="reference internal" href="#true"><tt class="docutils literal"><span class="pre">True</span></tt></a></li>
+<li><a class="reference internal" href="#a-function">A function</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#find-all"><tt class="docutils literal"><span class="pre">find_all()</span></tt></a><ul>
+<li><a class="reference internal" href="#the-name-argument">The <tt class="docutils literal"><span class="pre">name</span></tt> argument</a></li>
+<li><a class="reference internal" href="#the-keyword-arguments">The keyword arguments</a></li>
+<li><a class="reference internal" href="#searching-by-css-class">Searching by CSS class</a></li>
+<li><a class="reference internal" href="#the-text-argument">The <tt class="docutils literal"><span class="pre">text</span></tt> argument</a></li>
+<li><a class="reference internal" href="#the-limit-argument">The <tt class="docutils literal"><span class="pre">limit</span></tt> argument</a></li>
+<li><a class="reference internal" href="#the-recursive-argument">The <tt class="docutils literal"><span class="pre">recursive</span></tt> argument</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#calling-a-tag-is-like-calling-find-all">Calling a tag is like calling <tt class="docutils literal"><span class="pre">find_all()</span></tt></a></li>
+<li><a class="reference internal" href="#find"><tt class="docutils literal"><span class="pre">find()</span></tt></a></li>
+<li><a class="reference internal" href="#find-parents-and-find-parent"><tt class="docutils literal"><span class="pre">find_parents()</span></tt> and <tt class="docutils literal"><span class="pre">find_parent()</span></tt></a></li>
+<li><a class="reference internal" href="#find-next-siblings-and-find-next-sibling"><tt class="docutils literal"><span class="pre">find_next_siblings()</span></tt> and <tt class="docutils literal"><span class="pre">find_next_sibling()</span></tt></a></li>
+<li><a class="reference internal" href="#find-previous-siblings-and-find-previous-sibling"><tt class="docutils literal"><span class="pre">find_previous_siblings()</span></tt> and <tt class="docutils literal"><span class="pre">find_previous_sibling()</span></tt></a></li>
+<li><a class="reference internal" href="#find-all-next-and-find-next"><tt class="docutils literal"><span class="pre">find_all_next()</span></tt> and <tt class="docutils literal"><span class="pre">find_next()</span></tt></a></li>
+<li><a class="reference internal" href="#find-all-previous-and-find-previous"><tt class="docutils literal"><span class="pre">find_all_previous()</span></tt> and <tt class="docutils literal"><span class="pre">find_previous()</span></tt></a></li>
+<li><a class="reference internal" href="#css-selectors">CSS selectors</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#modifying-the-tree">Modifying the tree</a><ul>
+<li><a class="reference internal" href="#changing-tag-names-and-attributes">Changing tag names and attributes</a></li>
+<li><a class="reference internal" href="#modifying-string">Modifying <tt class="docutils literal"><span class="pre">.string</span></tt></a></li>
+<li><a class="reference internal" href="#append"><tt class="docutils literal"><span class="pre">append()</span></tt></a></li>
+<li><a class="reference internal" href="#beautifulsoup-new-string-and-new-tag"><tt class="docutils literal"><span class="pre">BeautifulSoup.new_string()</span></tt> and <tt class="docutils literal"><span class="pre">.new_tag()</span></tt></a></li>
+<li><a class="reference internal" href="#insert"><tt class="docutils literal"><span class="pre">insert()</span></tt></a></li>
+<li><a class="reference internal" href="#insert-before-and-insert-after"><tt class="docutils literal"><span class="pre">insert_before()</span></tt> and <tt class="docutils literal"><span class="pre">insert_after()</span></tt></a></li>
+<li><a class="reference internal" href="#clear"><tt class="docutils literal"><span class="pre">clear()</span></tt></a></li>
+<li><a class="reference internal" href="#extract"><tt class="docutils literal"><span class="pre">extract()</span></tt></a></li>
+<li><a class="reference internal" href="#decompose"><tt class="docutils literal"><span class="pre">decompose()</span></tt></a></li>
+<li><a class="reference internal" href="#replace-with"><tt class="docutils literal"><span class="pre">replace_with()</span></tt></a></li>
+<li><a class="reference internal" href="#wrap"><tt class="docutils literal"><span class="pre">wrap()</span></tt></a></li>
+<li><a class="reference internal" href="#unwrap"><tt class="docutils literal"><span class="pre">unwrap()</span></tt></a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#output">Output</a><ul>
+<li><a class="reference internal" href="#pretty-printing">Pretty-printing</a></li>
+<li><a class="reference internal" href="#non-pretty-printing">Non-pretty printing</a></li>
+<li><a class="reference internal" href="#output-formatters">Output formatters</a></li>
+<li><a class="reference internal" href="#get-text"><tt class="docutils literal"><span class="pre">get_text()</span></tt></a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#specifying-the-parser-to-use">Specifying the parser to use</a><ul>
+<li><a class="reference internal" href="#differences-between-parsers">Differences between parsers</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#encodings">Encodings</a><ul>
+<li><a class="reference internal" href="#output-encoding">Output encoding</a></li>
+<li><a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a><ul>
+<li><a class="reference internal" href="#smart-quotes">Smart quotes</a></li>
+<li><a class="reference internal" href="#inconsistent-encodings">Inconsistent encodings</a></li>
+</ul>
+</li>
+</ul>
+</li>
+<li><a class="reference internal" href="#parsing-only-part-of-a-document">Parsing only part of a document</a><ul>
+<li><a class="reference internal" href="#soupstrainer"><tt class="docutils literal"><span class="pre">SoupStrainer</span></tt></a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#troubleshooting">Troubleshooting</a><ul>
+<li><a class="reference internal" href="#version-mismatch-problems">Version mismatch problems</a></li>
+<li><a class="reference internal" href="#parsing-xml">Parsing XML</a></li>
+<li><a class="reference internal" href="#other-parser-problems">Other parser problems</a></li>
+<li><a class="reference internal" href="#miscellaneous">Miscellaneous</a></li>
+<li><a class="reference internal" href="#improving-performance">Improving performance</a></li>
+</ul>
+</li>
+<li><a class="reference internal" href="#id12">Beautiful Soup 3</a><ul>
+<li><a class="reference internal" href="#porting-code-to-bs4">Porting code to BS4</a><ul>
+<li><a class="reference internal" href="#you-need-a-parser">You need a parser</a></li>
+<li><a class="reference internal" href="#method-names">Method names</a></li>
+<li><a class="reference internal" href="#generators">Generators</a></li>
+<li><a class="reference internal" href="#xml">XML</a></li>
+<li><a class="reference internal" href="#entities">Entities</a></li>
+<li><a class="reference internal" href="#id13">Miscellaneous</a></li>
+</ul>
+</li>
+</ul>
+</li>
+</ul>
+
+ <h3>This Page</h3>
+ <ul class="this-page-menu">
+ <li><a href="https://web.archive.org/web/20150319200824/http://www.crummy.com/software/BeautifulSoup/bs4/doc/_sources/index.txt" rel="nofollow">Show Source</a></li>
+ </ul>
+<div id="searchbox" style="">
+ <h3>Quick search</h3>
+ <form class="search" action="https://web.archive.org/web/20150319200824/http://www.crummy.com/software/BeautifulSoup/bs4/doc/search.html" method="get">
+ <input name="q" type="text">
+ <input value="Go" type="submit">
+ <input name="check_keywords" value="yes" type="hidden">
+ <input name="area" value="default" type="hidden">
+ </form>
+ <p class="searchtip" style="font-size: 90%;">
+ Enter search terms or a module, class or function name.
+ </p>
+</div>
+
+ </div>
+ </div>
+ <div class="clearer"></div>
+ </div>
+ <div class="related">
+ <h3>Navigation</h3>
+ <ul>
+ <li class="right" style="margin-right: 10px;">
+ <a href="https://web.archive.org/web/20150319200824/http://www.crummy.com/software/BeautifulSoup/bs4/doc/genindex.html" title="General Index">index</a></li>
+ <li><a href="#">Beautiful Soup 4.0.0 documentation</a> »</li>
+ </ul>
+ </div>
+ <div class="footer">
+ © Copyright 2012, Leonard Richardson.
+ Created using <a href="https://web.archive.org/web/20150319200824/http://sphinx.pocoo.org/">Sphinx</a> 1.1.3.
+ </div>
+
+
+
+
+</body></html>
diff --git a/doc.ptbr/source/6.1.jpg b/doc.ptbr/source/6.1.jpg
new file mode 100644
index 0000000..97014f0
--- /dev/null
+++ b/doc.ptbr/source/6.1.jpg
Binary files differ
diff --git a/doc.ptbr/source/conf.py b/doc.ptbr/source/conf.py
new file mode 100644
index 0000000..cd679b5
--- /dev/null
+++ b/doc.ptbr/source/conf.py
@@ -0,0 +1,256 @@
+# -*- coding: utf-8 -*-
+#
+# Beautiful Soup documentation build configuration file, created by
+# sphinx-quickstart on Thu Jan 26 11:22:55 2012.
+#
+# This file is execfile()d with the current directory set to its containing dir.
+#
+# Note that not all possible configuration values are present in this
+# autogenerated file.
+#
+# All configuration values have a default; values that are commented out
+# serve to show the default.
+
+import sys, os
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#sys.path.insert(0, os.path.abspath('.'))
+
+# -- General configuration -----------------------------------------------------
+
+# If your documentation needs a minimal Sphinx version, state it here.
+#needs_sphinx = '1.0'
+
+# Add any Sphinx extension module names here, as strings. They can be extensions
+# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
+extensions = []
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix of source filenames.
+source_suffix = '.rst'
+
+# The encoding of source files.
+#source_encoding = 'utf-8-sig'
+
+# The master toctree document.
+master_doc = 'index'
+
+# General information about the project.
+project = u'Beautiful Soup'
+copyright = u'2004-2015, Leonard Richardson'
+
+# The version info for the project you're documenting, acts as replacement for
+# |version| and |release|, also used in various other places throughout the
+# built documents.
+#
+# The short X.Y version.
+version = '4'
+# The full version, including alpha/beta/rc tags.
+release = '4.4.0'
+
+# The language for content autogenerated by Sphinx. Refer to documentation
+# for a list of supported languages.
+#language = None
+
+# There are two options for replacing |today|: either, you set today to some
+# non-false value, then it is used:
+#today = ''
+# Else, today_fmt is used as the format for a strftime call.
+#today_fmt = '%B %d, %Y'
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+exclude_patterns = []
+
+# The reST default role (used for this markup: `text`) to use for all documents.
+#default_role = None
+
+# If true, '()' will be appended to :func: etc. cross-reference text.
+#add_function_parentheses = True
+
+# If true, the current module name will be prepended to all description
+# unit titles (such as .. function::).
+#add_module_names = True
+
+# If true, sectionauthor and moduleauthor directives will be shown in the
+# output. They are ignored by default.
+#show_authors = False
+
+# The name of the Pygments (syntax highlighting) style to use.
+pygments_style = 'sphinx'
+
+# A list of ignored prefixes for module index sorting.
+#modindex_common_prefix = []
+
+
+# -- Options for HTML output ---------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages. See the documentation for
+# a list of builtin themes.
+html_theme = 'default'
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further. For a list of options available for each theme, see the
+# documentation.
+#html_theme_options = {}
+
+# Add any paths that contain custom themes here, relative to this directory.
+#html_theme_path = []
+
+# The name for this set of Sphinx documents. If None, it defaults to
+# "<project> v<release> documentation".
+#html_title = None
+
+# A shorter title for the navigation bar. Default is the same as html_title.
+#html_short_title = None
+
+# The name of an image file (relative to this directory) to place at the top
+# of the sidebar.
+#html_logo = None
+
+# The name of an image file (within the static path) to use as favicon of the
+# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
+# pixels large.
+#html_favicon = None
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
+# using the given strftime format.
+#html_last_updated_fmt = '%b %d, %Y'
+
+# If true, SmartyPants will be used to convert quotes and dashes to
+# typographically correct entities.
+#html_use_smartypants = True
+
+# Custom sidebar templates, maps document names to template names.
+#html_sidebars = {}
+
+# Additional templates that should be rendered to pages, maps page names to
+# template names.
+#html_additional_pages = {}
+
+# If false, no module index is generated.
+#html_domain_indices = True
+
+# If false, no index is generated.
+#html_use_index = True
+
+# If true, the index is split into individual pages for each letter.
+#html_split_index = False
+
+# If true, links to the reST sources are added to the pages.
+#html_show_sourcelink = True
+
+# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
+#html_show_sphinx = True
+
+# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
+#html_show_copyright = True
+
+# If true, an OpenSearch description file will be output, and all pages will
+# contain a <link> tag referring to it. The value of this option must be the
+# base URL from which the finished HTML is served.
+#html_use_opensearch = ''
+
+# This is the file name suffix for HTML files (e.g. ".xhtml").
+#html_file_suffix = None
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'BeautifulSoupdoc'
+
+
+# -- Options for LaTeX output --------------------------------------------------
+
+# The paper size ('letter' or 'a4').
+#latex_paper_size = 'letter'
+
+# The font size ('10pt', '11pt' or '12pt').
+#latex_font_size = '10pt'
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title, author, documentclass [howto/manual]).
+latex_documents = [
+ ('index', 'BeautifulSoup.tex', u'Beautiful Soup Documentation',
+ u'Leonard Richardson', 'manual'),
+]
+
+# The name of an image file (relative to this directory) to place at the top of
+# the title page.
+#latex_logo = None
+
+# For "manual" documents, if this is true, then toplevel headings are parts,
+# not chapters.
+#latex_use_parts = False
+
+# If true, show page references after internal links.
+#latex_show_pagerefs = False
+
+# If true, show URL addresses after external links.
+#latex_show_urls = False
+
+# Additional stuff for the LaTeX preamble.
+#latex_preamble = ''
+
+# Documents to append as an appendix to all manuals.
+#latex_appendices = []
+
+# If false, no module index is generated.
+#latex_domain_indices = True
+
+
+# -- Options for manual page output --------------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [
+ ('index', 'beautifulsoup', u'Beautiful Soup Documentation',
+ [u'Leonard Richardson'], 1)
+]
+
+
+# -- Options for Epub output ---------------------------------------------------
+
+# Bibliographic Dublin Core info.
+epub_title = u'Beautiful Soup'
+epub_author = u'Leonard Richardson'
+epub_publisher = u'Leonard Richardson'
+epub_copyright = u'2012, Leonard Richardson'
+
+# The language of the text. It defaults to the language option
+# or en if the language is not set.
+#epub_language = ''
+
+# The scheme of the identifier. Typical schemes are ISBN or URL.
+#epub_scheme = ''
+
+# The unique identifier of the text. This can be a ISBN number
+# or the project homepage.
+#epub_identifier = ''
+
+# A unique identification for the text.
+#epub_uid = ''
+
+# HTML files that should be inserted before the pages created by sphinx.
+# The format is a list of tuples containing the path and title.
+#epub_pre_files = []
+
+# HTML files that should be inserted after the pages created by sphinx.
+# The format is a list of tuples containing the path and title.
+#epub_post_files = []
+
+# A list of files that should not be packed into the epub file.
+#epub_exclude_files = []
+
+# The depth of the table of contents in toc.ncx.
+#epub_tocdepth = 3
+
+# Allow duplicate toc entries.
+#epub_tocdup = True
diff --git a/doc/source/index.ptbr.rst b/doc.ptbr/source/index.rst
index f596d44..f596d44 100644
--- a/doc/source/index.ptbr.rst
+++ b/doc.ptbr/source/index.rst
diff --git a/doc.zh/source/6.1.jpg b/doc.zh/source/6.1.jpg
new file mode 100644
index 0000000..97014f0
--- /dev/null
+++ b/doc.zh/source/6.1.jpg
Binary files differ
diff --git a/doc.zh/source/index.rst b/doc.zh/source/index.rst
new file mode 100644
index 0000000..03d2e05
--- /dev/null
+++ b/doc.zh/source/index.rst
@@ -0,0 +1,2739 @@
+.. BeautifulSoup文档 documentation master file, created by
+ Deron Wang on Fri Nov 29 13:49:30 2013.
+ You can adapt this file completely to your liking, but it should at least
+ contain the root `toctree` directive.
+
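+Beautiful Soup 4.4.0 Documentation
+==================================
+
+`Beautiful Soup <http://www.crummy.com/software/BeautifulSoup/>`_ is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.
+
+This document covers all the major features of Beautiful Soup 4, with small examples. It shows what the library is good for, how it works, how to use it, how to make it do what you want, and how to handle unexpected situations.
+
+The examples in this documentation give the same results in Python 2.7 and Python 3.2.
+
+You might be looking for the documentation for `Beautiful Soup3 <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_. Beautiful Soup 3 is no longer being developed, and we recommend Beautiful Soup 4 for current projects; see `Porting code to BS4 <http://www.baidu.com>`_
+
+This documentation has been translated into other languages: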
+
+* `这篇文档当然还有中文版. <http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.cn.html>`_
+* このページは日本語で利用できます(`外部リンク <http://kondou.com/BS4/>`_)
+* 이 문서는 한국어 번역도 가능합니다. (`외부 링크 <http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_)
+
+
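+.. _寻求帮助:
+
+Getting help
+------------
+
+If you have questions about Beautiful Soup, send mail to the `discussion group <https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup>`_. If your problem involves parsing an HTML document, be sure to include in your question what the `diagnose() function <代码诊断_>`__ says about that document. [1]_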
+
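+.. _快速开始:
+
+Quick Start
+===========
+
+Here's an HTML document used as an example throughout this documentation. It's part of a story from *Alice in Wonderland* (referred to below as the *Alice* document):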
+
+::
+
+ html_doc = """
+ <html><head><title>The Dormouse's story</title></head>
+ <body>
+ <p class="title"><b>The Dormouse's story</b></p>
+
+ <p class="story">Once upon a time there were three little sisters; and their names were
+ <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
+ <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
+ <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
+ and they lived at the bottom of a well.</p>
+
+ <p class="story">...</p>
+ """
+
+Running this document through Beautiful Soup gives us a ``BeautifulSoup`` object, which can print the document as a nested structure with standard indentation:
+
+::
+
+ from bs4 import BeautifulSoup
+ soup = BeautifulSoup(html_doc, 'html.parser')
+
+ print(soup.prettify())
+ # <html>
+ # <head>
+ # <title>
+ # The Dormouse's story
+ # </title>
+ # </head>
+ # <body>
+ # <p class="title">
+ # <b>
+ # The Dormouse's story
+ # </b>
+ # </p>
+ # <p class="story">
+ # Once upon a time there were three little sisters; and their names were
+ # <a class="sister" href="http://example.com/elsie" id="link1">
+ # Elsie
+ # </a>
+ # ,
+ # <a class="sister" href="http://example.com/lacie" id="link2">
+ # Lacie
+ # </a>
+ # and
+ # <a class="sister" href="http://example.com/tillie" id="link3">
+ # Tillie
+ # </a>
+ # ; and they lived at the bottom of a well.
+ # </p>
+ # <p class="story">
+ # ...
+ # </p>
+ # </body>
+ # </html>
+
+Here are some simple ways to navigate that data structure:
+
+::
+
+ soup.title
+ # <title>The Dormouse's story</title>
+
+ soup.title.name
+ # u'title'
+
+ soup.title.string
+ # u'The Dormouse's story'
+
+ soup.title.parent.name
+ # u'head'
+
+ soup.p
+ # <p class="title"><b>The Dormouse's story</b></p>
+
+ soup.p['class']
+ # [u'title']
+
+ soup.a
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+
+ soup.find_all('a')
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+ soup.find(id="link3")
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
+
+One common task is extracting all the URLs found within a page's <a> tags:
+
+::
+
+ for link in soup.find_all('a'):
+     print(link.get('href'))
+ # http://example.com/elsie
+ # http://example.com/lacie
+ # http://example.com/tillie
+
+Another common task is extracting all the text from a page:
+
+::
+
+ print(soup.get_text())
+ # The Dormouse's story
+ #
+ # The Dormouse's story
+ #
+ # Once upon a time there were three little sisters; and their names were
+ # Elsie,
+ # Lacie and
+ # Tillie;
+ # and they lived at the bottom of a well.
+ #
+ # ...
+
+Does this look like what you need? Read on; there is much more.
+
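+.. _`安装 Beautiful Soup`:
+
+Installing Beautiful Soup
+=========================
+
+If you're using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:
+
+``$ apt-get install python-bs4``
+
+Beautiful Soup 4 is published through PyPI, so if you can't install it with the system package manager, you can install it with ``easy_install`` or ``pip``. The package name is ``beautifulsoup4``, and the same package works on Python 2 and Python 3.
+
+``$ easy_install beautifulsoup4``
+
+``$ pip install beautifulsoup4``
+
+(PyPI also has a package named ``BeautifulSoup``, but that's probably not what you want. It's the `Beautiful Soup 3 <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_ release. Lots of projects still use BS3, so the ``BeautifulSoup`` package is still available, but if you're writing new code you should install ``beautifulsoup4``.)
+
+If you don't have ``easy_install`` or ``pip`` installed, you can `download the Beautiful Soup 4 source <http://www.crummy.com/software/BeautifulSoup/download/4.x/>`_ and install it with ``setup.py``.
+
+``$ python setup.py install``
+
+If all else fails, the license for Beautiful Soup allows you to package the entire library with your application, so you can use it without installing it at all.
+
+The author develops Beautiful Soup with Python 2.7 and Python 3.2, but it should work with other current versions of Python.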
+
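+.. _安装完成后的问题:
+
+Problems after installation
+---------------------------
+
+Beautiful Soup is packaged as Python 2 code. When you install it for use with Python 3, it's automatically converted to Python 3 code. If you don't install the package, the code won't be converted.
+
+If you get the ``ImportError`` "No module named HTMLParser", your problem is that you're running the Python 2 version of the code under Python 3.
+
+If you get the ``ImportError`` "No module named html.parser", your problem is that you're running the Python 3 version of the code under Python 2.
+
+In both cases, your best bet is to reinstall Beautiful Soup 4.
+
+If you get the ``SyntaxError`` "Invalid syntax" on the line ``ROOT_TAG_NAME = u'[document]'``, you need to convert the Python 2 code to Python 3. You can do this by reinstalling the package:
+
+``$ python3 setup.py install``
+
+or by running Python's ``2to3`` conversion script on the ``bs4`` directory: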
+
+``$ 2to3-3.2 -w bs4``
+
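+.. _安装解析器:
+
+Installing a parser
+-------------------
+
+Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party parsers. One of them is `lxml <http://lxml.de/>`_. Depending on your setup, you might install lxml with one of these commands:
+
+``$ apt-get install python-lxml``
+
+``$ easy_install lxml``
+
+``$ pip install lxml``
+
+Another alternative is the pure-Python `html5lib <http://code.google.com/p/html5lib/>`_, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:
+
+``$ apt-get install python-html5lib``
+
+``$ easy_install html5lib``
+
+``$ pip install html5lib``
+
+This table summarizes the advantages and disadvantages of each parser library: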
+
++----------------------+------------------------------------------+---------------------------------+------------------------------+
+| Parser               | Typical usage                            | Advantages                      | Disadvantages                |
++======================+==========================================+=================================+==============================+
+| Python's html.parser | ``BeautifulSoup(markup, "html.parser")`` | - Included with Python          | - Not very lenient before    |
+|                      |                                          | - Decent speed                  |   Python 2.7.3 or 3.2.2      |
+|                      |                                          | - Lenient                       |                              |
++----------------------+------------------------------------------+---------------------------------+------------------------------+
+| lxml's HTML parser   | ``BeautifulSoup(markup, "lxml")``        | - Very fast                     | - External C dependency      |
+|                      |                                          | - Lenient                       |                              |
++----------------------+------------------------------------------+---------------------------------+------------------------------+
+| lxml's XML parser    | ``BeautifulSoup(markup, "lxml-xml")``    | - Very fast                     | - External C dependency      |
+|                      |                                          | - The only supported XML parser |                              |
+|                      | ``BeautifulSoup(markup, "xml")``         |                                 |                              |
++----------------------+------------------------------------------+---------------------------------+------------------------------+
+| html5lib             | ``BeautifulSoup(markup, "html5lib")``    | - Extremely lenient             | - Very slow                  |
+|                      |                                          | - Parses pages the same way a   | - External Python dependency |
+|                      |                                          |   web browser does              |                              |
+|                      |                                          | - Creates valid HTML5           |                              |
++----------------------+------------------------------------------+---------------------------------+------------------------------+
+
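+If you can, install and use lxml for speed. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's essential that you install lxml or html5lib, because the HTML parser built into those versions of Python is not robust enough.
+
+Note that if a document is invalid, different parsers may generate different results for it. See `Differences between parsers <解析器之间的区别_>`__ for details.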
+
+.. _如何使用:
+
+Making the soup
+===============
+
+To parse a document, pass it into the ``BeautifulSoup`` constructor. You can pass in a string or an open filehandle:
+
+::
+
+ from bs4 import BeautifulSoup
+
+ soup = BeautifulSoup(open("index.html"))
+
+ soup = BeautifulSoup("<html>data</html>")
+
+First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
+
+::
+
+ BeautifulSoup("Sacr&eacute; bleu!")
+ # <html><head></head><body>Sacré bleu!</body></html>
+
+Beautiful Soup then parses the document using the best available parser. If you specify a parser explicitly, Beautiful Soup uses that one instead (see `Parsing XML <解析成XML_>`__).
+
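+.. _对象的种类:
+
+Kinds of objects
+================
+
+Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. You only ever have to deal with four kinds of objects:
+``Tag``, ``NavigableString``, ``BeautifulSoup``, and ``Comment``.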
+
+Tag
+-----
+
+A ``Tag`` object corresponds to an XML or HTML tag in the original document:
+
+::
+
+ soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
+ tag = soup.b
+ type(tag)
+ # <class 'bs4.element.Tag'>
+
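+Tags have a lot of attributes and methods, most of which are covered in `Navigating the tree <遍历文档树_>`__ and `Searching the tree <搜索文档树_>`__. For now, the most important features of a tag are its name and attributes.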
+
+Name
+.....
+
+Every tag has a name, accessible as ``.name``:
+
+::
+
+ tag.name
+ # u'b'
+
+If you change a tag's name, the change will be reflected in any HTML markup generated from the current Beautiful Soup object:
+
+::
+
+ tag.name = "blockquote"
+ tag
+ # <blockquote class="boldest">Extremely bold</blockquote>
+
+Attributes
+............
+
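+A tag may have any number of attributes. The tag ``<b class="boldest">`` has an attribute "class" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary: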
+
+::
+
+ tag['class']
+ # [u'boldest']
+
+You can access that dictionary directly as ``.attrs``:
+
+::
+
+ tag.attrs
+ # {u'class': [u'boldest']}
+
+You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:
+
+::
+
+ tag['class'] = 'verybold'
+ tag['id'] = 1
+ tag
+ # <blockquote class="verybold" id="1">Extremely bold</blockquote>
+
+ del tag['class']
+ del tag['id']
+ tag
+ # <blockquote>Extremely bold</blockquote>
+
+ tag['class']
+ # KeyError: 'class'
+ print(tag.get('class'))
+ # None
+
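+.. _多值属性:
+
+Multi-valued attributes
+```````````````````````
+
+HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is ``class`` (a tag can have more than one CSS class). Others include ``rel``, ``rev``, ``accept-charset``, ``headers``, and ``accesskey``. Beautiful Soup presents the value of a multi-valued attribute as a list: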
+
+::
+
+ css_soup = BeautifulSoup('<p class="body strikeout"></p>')
+ css_soup.p['class']
+ # ["body", "strikeout"]
+
+ css_soup = BeautifulSoup('<p class="body"></p>')
+ css_soup.p['class']
+ # ["body"]
+
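+If an attribute looks like it has more than one value, but it's not defined as a multi-valued attribute by any version of the HTML standard, Beautiful Soup returns the attribute as a plain string: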
+
+::
+
+ id_soup = BeautifulSoup('<p id="my id"></p>')
+ id_soup.p['id']
+ # 'my id'
+
+When you turn a tag back into a string, multiple attribute values are consolidated into one value:
+
+::
+
+ rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
+ rel_soup.a['rel']
+ # ['index']
+ rel_soup.a['rel'] = ['index', 'contents']
+ print(rel_soup.p)
+ # <p>Back to the <a rel="index contents">homepage</a></p>
+
+If you parse a document as XML, there are no multi-valued attributes:
+
+::
+
+ xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
+ xml_soup.p['class']
+ # u'body strikeout'
+
+.. _可以遍历的字符串:
+
+NavigableString
+---------------
+
+A string corresponds to a bit of text within a tag. Beautiful Soup uses the ``NavigableString`` class to contain these bits of text:
+
+::
+
+ tag.string
+ # u'Extremely bold'
+ type(tag.string)
+ # <class 'bs4.element.NavigableString'>
+
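+A ``NavigableString`` is just like a Python Unicode string, except that it also supports some of the features described in `Navigating the tree <遍历文档树_>`__ and `Searching the tree <搜索文档树_>`__. You can convert a ``NavigableString`` to a Unicode string with ``unicode()``: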
+
+::
+
+ unicode_string = unicode(tag.string)
+ unicode_string
+ # u'Extremely bold'
+ type(unicode_string)
+ # <type 'unicode'>
+
+You can't edit a string in place, but you can replace one string with another, using `replace_with()`_:
+
+::
+
+ tag.string.replace_with("No longer bold")
+ tag
+ # <blockquote>No longer bold</blockquote>
+
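+``NavigableString`` supports most of the features described in `Navigating the tree <遍历文档树_>`__ and `Searching the tree <搜索文档树_>`__, but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support the ``.contents`` or ``.string`` attributes, or the ``find()`` method.
+
+If you want to use a ``NavigableString`` outside of Beautiful Soup, you should call ``unicode()`` on it to turn it into a normal Python Unicode string. If you don't, the string carries around a reference to the entire Beautiful Soup parse tree, even when you're done using Beautiful Soup. This wastes memory.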
+
+BeautifulSoup
+----------------
+
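+The ``BeautifulSoup`` object represents the document as a whole. For most purposes, you can treat it as a ``Tag`` object. It supports most of the methods described in `Navigating the tree <遍历文档树_>`__ and `Searching the tree <搜索文档树_>`__.
+
+Since the ``BeautifulSoup`` object doesn't correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it's useful to look at its ``.name``, so it has been given the special ``.name`` "[document]":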
+
+::
+
+ soup.name
+ # u'[document]'
+
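+.. _注释及特殊字符串:
+
+Comments and other special strings
+----------------------------------
+
+``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost everything you'll see in an HTML or XML file, but there are a few leftover bits. The only one you'll probably ever need to worry about is the comment: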
+
+::
+
+ markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
+ soup = BeautifulSoup(markup)
+ comment = soup.b.string
+ type(comment)
+ # <class 'bs4.element.Comment'>
+
+The ``Comment`` object is just a special type of ``NavigableString``:
+
+::
+
+ comment
+ # u'Hey, buddy. Want to buy a used parser?'
+
+But when it appears as part of an HTML document, a ``Comment`` is displayed with special formatting:
+
+::
+
+ print(soup.b.prettify())
+ # <b>
+ # <!--Hey, buddy. Want to buy a used parser?-->
+ # </b>
+
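+Beautiful Soup defines classes for anything else that might show up in an XML document: ``CData``, ``ProcessingInstruction``, ``Declaration``, and ``Doctype``. Just like ``Comment``, these classes are subclasses of ``NavigableString`` that add something extra to the string. Here's an example that replaces the comment with a CDATA block: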
+
+::
+
+ from bs4 import CData
+ cdata = CData("A CDATA block")
+ comment.replace_with(cdata)
+
+ print(soup.b.prettify())
+ # <b>
+ # <![CDATA[A CDATA block]]>
+ # </b>
+
+.. _遍历文档树:
+
+Navigating the tree
+===================
+
+Here's the *Alice in Wonderland* HTML document again:
+
+::
+
+ html_doc = """
+ <html><head><title>The Dormouse's story</title></head>
+ <body>
+ <p class="title"><b>The Dormouse's story</b></p>
+
+ <p class="story">Once upon a time there were three little sisters; and their names were
+ <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
+ <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
+ <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
+ and they lived at the bottom of a well.</p>
+
+ <p class="story">...</p>
+ """
+
+ from bs4 import BeautifulSoup
+ soup = BeautifulSoup(html_doc, 'html.parser')
+
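+This example is used to show how to move from one part of a document to another.
+
+.. _子节点:
+
+Going down
+----------
+
+Tags may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag's children.
+
+Note that Beautiful Soup strings don't support any of these attributes, because a string can't have children.
+
+.. _tag的名字:
+
+Navigating using tag names
+..........................
+
+The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say ``soup.head``: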
+
+::
+
+ soup.head
+ # <head><title>The Dormouse's story</title></head>
+
+ soup.title
+ # <title>The Dormouse's story</title>
+
+You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <b> tag beneath the <body> tag:
+
+::
+
+ soup.body.b
+ # <b>The Dormouse's story</b>
+
+Using a tag name as an attribute gives you only the first tag by that name:
+
+::
+
+ soup.a
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+
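+If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in `Searching the tree <搜索文档树_>`__, such as ``find_all()``: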
+
+::
+
+ soup.find_all('a')
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+.. _`.contents 和 .children`:
+
+.contents and .children
+.......................
+
+A tag's children are available in a list called ``.contents``:
+
+::
+
+ head_tag = soup.head
+ head_tag
+ # <head><title>The Dormouse's story</title></head>
+
+ head_tag.contents
+ # [<title>The Dormouse's story</title>]
+
+ title_tag = head_tag.contents[0]
+ title_tag
+ # <title>The Dormouse's story</title>
+ title_tag.contents
+ # [u'The Dormouse's story']
+
+The ``BeautifulSoup`` object itself has children. In this case, the <html> tag is the child of the ``BeautifulSoup`` object:
+
+::
+
+ len(soup.contents)
+ # 1
+ soup.contents[0].name
+ # u'html'
+
+A string does not have ``.contents``, because it can't contain anything:
+
+::
+
+ text = title_tag.contents[0]
+ text.contents
+ # AttributeError: 'NavigableString' object has no attribute 'contents'
+
+Instead of getting them as a list, you can iterate over a tag's children using the ``.children`` generator:
+
+::
+
+ for child in title_tag.children:
+     print(child)
+ # The Dormouse's story
+
+.descendants
+..............
+
+The ``.contents`` and ``.children`` attributes only consider a tag's *direct* children. For instance, the <head> tag has a single direct child, the <title> tag:
+
+::
+
+ head_tag.contents
+ # [<title>The Dormouse's story</title>]
+
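+But the <title> tag itself has a child: the string "The Dormouse's story". In that sense the string is also a descendant of the <head> tag. The ``.descendants`` attribute lets you iterate over *all* of a tag's descendants, recursively [5]_ :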
+
+::
+
+ for child in head_tag.descendants:
+     print(child)
+ # <title>The Dormouse's story</title>
+ # The Dormouse's story
+
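+In the example above, the <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag's child. The ``BeautifulSoup`` object has only one direct child (the <html> tag), but it has a whole lot of descendants: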
+
+::
+
+ len(list(soup.children))
+ # 1
+ len(list(soup.descendants))
+ # 25
+
+.string
+........
+
+If a tag has only one child, and that child is a ``NavigableString``, the child is made available as ``.string``:
+
+::
+
+ title_tag.string
+ # u'The Dormouse's story'
+
+If a tag's only child is another tag, and *that* tag has a ``.string``, then the parent tag is considered to have the same ``.string`` as its child:
+
+::
+
+ head_tag.contents
+ # [<title>The Dormouse's story</title>]
+
+ head_tag.string
+ # u'The Dormouse's story'
+
+If a tag contains more than one thing, then it's not clear what ``.string`` should refer to, so ``.string`` is defined to be ``None``:
+
+::
+
+ print(soup.html.string)
+ # None
+
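+.. _`.strings 和 stripped_strings`:
+
+.strings and .stripped_strings
+..............................
+
+If there's more than one string inside a tag [2]_ , you can still look at just the strings. Use the ``.strings`` generator: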
+
+::
+
+ for string in soup.strings:
+     print(repr(string))
+ # u"The Dormouse's story"
+ # u'\n\n'
+ # u"The Dormouse's story"
+ # u'\n\n'
+ # u'Once upon a time there were three little sisters; and their names were\n'
+ # u'Elsie'
+ # u',\n'
+ # u'Lacie'
+ # u' and\n'
+ # u'Tillie'
+ # u';\nand they lived at the bottom of a well.'
+ # u'\n\n'
+ # u'...'
+ # u'\n'
+
+These strings tend to have a lot of extra whitespace, which you can remove by using the ``.stripped_strings`` generator instead:
+
+::
+
+ for string in soup.stripped_strings:
+     print(repr(string))
+ # u"The Dormouse's story"
+ # u"The Dormouse's story"
+ # u'Once upon a time there were three little sisters; and their names were'
+ # u'Elsie'
+ # u','
+ # u'Lacie'
+ # u'and'
+ # u'Tillie'
+ # u';\nand they lived at the bottom of a well.'
+ # u'...'
+
+Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.
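+
+The behaviour described above can be approximated in plain Python. This is a minimal sketch, not Beautiful Soup's implementation; the helper name ``stripped`` is hypothetical:
+
```python
def stripped(strings):
    """Yield each string with surrounding whitespace removed, skipping blanks."""
    for s in strings:
        text = s.strip()
        if text:  # whitespace-only strings become "" and are dropped
            yield text

# A few of the raw strings from the example output above.
raw = ["The Dormouse's story", "\n\n", "Elsie", ",\n", " and\n", "\n"]
result = list(stripped(raw))
print(result)
```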
+
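+.. _父节点:
+
+Going up
+--------
+
+Continuing through the document tree, every tag and every string has a parent: the tag that contains it.
+
+.parent
+........
+
+You can access an element's parent with the ``.parent`` attribute. In the example "Alice" document, the <head> tag is the parent of the <title> tag: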
+
+::
+
+ title_tag = soup.title
+ title_tag
+ # <title>The Dormouse's story</title>
+ title_tag.parent
+ # <head><title>The Dormouse's story</title></head>
+
+The title string itself has a parent: the <title> tag that contains it:
+
+::
+
+ title_tag.string.parent
+ # <title>The Dormouse's story</title>
+
+The parent of a top-level tag like <html> is the ``BeautifulSoup`` object itself:
+
+::
+
+ html_tag = soup.html
+ type(html_tag.parent)
+ # <class 'bs4.BeautifulSoup'>
+
+And the ``.parent`` of a ``BeautifulSoup`` object is defined as ``None``:
+
+::
+
+ print(soup.parent)
+ # None
+
+.parents
+..........
+
+You can iterate over all of an element's parents with ``.parents``. This example uses ``.parents`` to travel from an <a> tag up to the root of the document:
+
+::
+
+ link = soup.a
+ link
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+ for parent in link.parents:
+     if parent is None:
+         print(parent)
+     else:
+         print(parent.name)
+ # p
+ # body
+ # html
+ # [document]
+ # None
+
+.. _兄弟节点:
+
+Going sideways
+--------------
+
+Consider a simple document like this:
+
+::
+
+ sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
+ print(sibling_soup.prettify())
+ # <html>
+ # <body>
+ # <a>
+ # <b>
+ # text1
+ # </b>
+ # <c>
+ # text2
+ # </c>
+ # </a>
+ # </body>
+ # </html>
+
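+The <b> tag and the <c> tag are at the same level: they're both direct children of the same tag. We call them *siblings*. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.
+
+.. _`.next_sibling 和 .previous_sibling`:
+
+.next_sibling and .previous_sibling
+...................................
+
+You can use ``.next_sibling`` and ``.previous_sibling`` to navigate between page elements that are on the same level of the parse tree: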
+
+::
+
+ sibling_soup.b.next_sibling
+ # <c>text2</c>
+
+ sibling_soup.c.previous_sibling
+ # <b>text1</b>
+
+The <b> tag has a ``.next_sibling``, but no ``.previous_sibling``, because there's nothing before the <b> tag on the same level of the tree. For the same reason, the <c> tag has a ``.previous_sibling`` but no ``.next_sibling``:
+
+::
+
+ print(sibling_soup.b.previous_sibling)
+ # None
+ print(sibling_soup.c.next_sibling)
+ # None
+
+The strings "text1" and "text2" are *not* siblings, because they don't have the same parent:
+
+::
+
+ sibling_soup.b.string
+ # u'text1'
+
+ print(sibling_soup.b.string.next_sibling)
+ # None
+
+In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a tag will usually be a string or whitespace. Going back to the "Alice" document:
+
+::
+
+ <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
+ <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
+ <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
+
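+You might think that the ``.next_sibling`` of the first <a> tag would be the second <a> tag. But actually, it's a string: the comma and newline that separate the first <a> tag from the second: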
+
+::
+
+ link = soup.a
+ link
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+
+ link.next_sibling
+ # u',\n'
+
+The second <a> tag is actually the ``.next_sibling`` of the comma:
+
+::
+
+ link.next_sibling.next_sibling
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
+
+.. _`.next_siblings 和 .previous_siblings`:
+
+.next_siblings and .previous_siblings
+.....................................
+
+You can iterate over a tag's siblings with ``.next_siblings`` or ``.previous_siblings``:
+
+::
+
+ for sibling in soup.a.next_siblings:
+     print(repr(sibling))
+ # u',\n'
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
+ # u' and\n'
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
+ # u';\nand they lived at the bottom of a well.'
+ # None
+
+ for sibling in soup.find(id="link3").previous_siblings:
+     print(repr(sibling))
+ # u' and\n'
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
+ # u',\n'
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+ # u'Once upon a time there were three little sisters; and their names were\n'
+ # None
+
+.. _回退和前进:
+
+Going back and forth
+--------------------
+
+Take a look at the beginning of the "Alice" document:
+
+::
+
+ <html><head><title>The Dormouse's story</title></head>
+ <p class="title"><b>The Dormouse's story</b></p>
+
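+An HTML parser takes this string of characters and turns it into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.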
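+
+The event sequence described above can be observed with the standard library's ``html.parser.HTMLParser``. The following is only a sketch under stated assumptions: it targets Python 3, the ``EventLogger`` class is a hypothetical helper, and it illustrates the idea of a parse-event stream rather than Beautiful Soup's own internals.
+
```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record parser events in the order they occur (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags to keep the log short.
        if data.strip():
            self.events.append(("string", data))


# The beginning of the "Alice" document, as shown above.
markup = ("<html><head><title>The Dormouse's story</title></head>"
          '<p class="title"><b>The Dormouse\'s story</b></p>')

logger = EventLogger()
logger.feed(markup)
logger.close()
for event in logger.events:
    print(event)
```
+
+The order of the ``open`` and ``string`` entries in the log is the order in which the corresponding tags and strings were parsed.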
+
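+.. _`.next_element 和 .previous_element`:
+
+.next_element and .previous_element
+...................................
+
+The ``.next_element`` attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as ``.next_sibling``, but it's usually different.
+
+Here's the final <a> tag in the "Alice" document. Its ``.next_sibling`` is a string: the rest of the sentence that was interrupted by the start of the <a> tag [2]_ :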
+
+::
+
+ last_a_tag = soup.find("a", id="link3")
+ last_a_tag
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
+
+ last_a_tag.next_sibling
+ # u';\nand they lived at the bottom of a well.'
+
+But the ``.next_element`` of that <a> tag, the thing that was parsed immediately after the <a> tag, is *not* the rest of that sentence: it's the string "Tillie":
+
+::
+
+ last_a_tag.next_element
+ # u'Tillie'
+
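+That's because in the original markup, the string "Tillie" appeared before the semicolon. The parser encountered the <a> tag, then the string "Tillie", then the closing </a> tag, then the semicolon and the rest of the sentence. The semicolon is on the same level as the <a> tag, but the string "Tillie" was encountered first.
+
+The ``.previous_element`` attribute is the exact opposite of ``.next_element``. It points to whatever element was parsed immediately before this one: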
+
+::
+
+ last_a_tag.previous_element
+ # u' and\n'
+ last_a_tag.previous_element.next_element
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
+
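+.. _`.next_elements 和 .previous_elements`:
+
+.next_elements and .previous_elements
+.....................................
+
+You can use the ``.next_elements`` and ``.previous_elements`` iterators to move forward or backward through the document as it was parsed: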
+
+::
+
+ for element in last_a_tag.next_elements:
+     print(repr(element))
+ # u'Tillie'
+ # u';\nand they lived at the bottom of a well.'
+ # u'\n\n'
+ # <p class="story">...</p>
+ # u'...'
+ # u'\n'
+ # None
+
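+.. _搜索文档树:
+
+Searching the tree
+==================
+
+Beautiful Soup defines a lot of methods for searching the parse tree. This section concentrates on the two most popular: ``find()`` and ``find_all()``. The other methods take almost exactly the same arguments and are used in similar ways.
+
+Once again, the "Alice" document serves as the example: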
+
+::
+
+ html_doc = """
+ <html><head><title>The Dormouse's story</title></head>
+ <body>
+ <p class="title"><b>The Dormouse's story</b></p>
+
+ <p class="story">Once upon a time there were three little sisters; and their names were
+ <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
+ <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
+ <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
+ and they lived at the bottom of a well.</p>
+
+ <p class="story">...</p>
+ """
+
+ from bs4 import BeautifulSoup
+ soup = BeautifulSoup(html_doc, 'html.parser')
+
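+By passing a filter to a method like ``find_all()``, you can zoom in on the parts of the document you're interested in.
+
+.. _过滤器:
+
+Kinds of filters
+----------------
+
+Before looking at ``find_all()`` in detail, here are the kinds of filters you can pass into these methods [3]_ . These filters show up again and again throughout the search API. You can use them to filter based on a tag's name, on its attributes, on the text of a string, or on some combination of these.
+
+.. _字符串:
+
+A string
+........
+
+The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the <b> tags in the document: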
+
+::
+
+ soup.find_all('b')
+ # [<b>The Dormouse's story</b>]
+
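+If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid decoding errors by passing in a Unicode string instead.
+
+.. _正则表达式:
+
+A regular expression
+....................
+
+If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its ``search()`` method. This code finds all the tags whose names start with the letter "b"; in this case, the <body> tag and the <b> tag: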
+
+::
+
+ import re
+ for tag in soup.find_all(re.compile("^b")):
+     print(tag.name)
+ # body
+ # b
+
+This code finds all the tags whose names contain the letter "t":
+
+::
+
+ for tag in soup.find_all(re.compile("t")):
+     print(tag.name)
+ # html
+ # title
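+
+The output above includes ``html``, whose name contains "t" but does not start with it, which is the behaviour of a regular expression's ``search()`` rather than ``match()``. A small standard-library sketch of the difference, using the tag names of the "Alice" document (the variable names are chosen for this illustration only):
+
```python
import re

tag_names = ["html", "head", "title", "body", "p", "b", "a"]
pattern = re.compile("t")

# search() finds the pattern anywhere in the name.
found_by_search = [name for name in tag_names if pattern.search(name)]
# match() only succeeds when the name starts with the pattern.
found_by_match = [name for name in tag_names if pattern.match(name)]

print(found_by_search)
print(found_by_match)
```
+
+Anchoring the pattern, as in ``re.compile("^b")``, is the way to ask for names that start with a given letter.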
+
+.. _列表:
+
+A list
+......
+
+If you pass in a list, Beautiful Soup will allow a match against *any* item in that list. This code finds all the <a> tags *and* all the <b> tags:
+
+::
+
+ soup.find_all(["a", "b"])
+ # [<b>The Dormouse's story</b>,
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+True
+.....
+
+The value ``True`` matches everything it can. This code finds *all* the tags in the document, but none of the text strings:
+
+::
+
+ for tag in soup.find_all(True):
+     print(tag.name)
+ # html
+ # head
+ # title
+ # body
+ # p
+ # b
+ # p
+ # a
+ # a
+ # a
+ # p
+
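+.. _方法:
+
+A function
+..........
+
+If none of the other filters work for you, define a function that takes an element as its only argument [4]_ . The function should return ``True`` if the element matches, and ``False`` otherwise.
+
+Here's a function that returns ``True`` if a tag defines the ``class`` attribute but doesn't define the ``id`` attribute: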
+
+::
+
+ def has_class_but_no_id(tag):
+     return tag.has_attr('class') and not tag.has_attr('id')
+
+Pass this function into ``find_all()`` and you'll pick up all the <p> tags:
+
+::
+
+ soup.find_all(has_class_but_no_id)
+ # [<p class="title"><b>The Dormouse's story</b></p>,
+ # <p class="story">Once upon a time there were...</p>,
+ # <p class="story">...</p>]
+
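+This function only picks up the <p> tags. It doesn't pick up the <a> tags, because those tags define both "class" and "id". It doesn't pick up tags like <html> and <head>, because those tags don't define "class".
+
+If you pass in a function to filter on a specific attribute, the argument passed into the function is the attribute value, not the whole tag.
+Here's a function that finds all ``a`` tags whose ``href`` attribute does *not* match a regular expression: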
+
+::
+
+
+ def not_lacie(href):
+     return href and not re.compile("lacie").search(href)
+ soup.find_all(href=not_lacie)
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
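+The function can be as complicated as you need it to be. Here's a function that returns ``True`` if a tag is surrounded by string objects: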
+
+::
+
+ from bs4 import NavigableString
+ def surrounded_by_strings(tag):
+     return (isinstance(tag.next_element, NavigableString)
+             and isinstance(tag.previous_element, NavigableString))
+
+ for tag in soup.find_all(surrounded_by_strings):
+     print(tag.name)
+ # p
+ # a
+ # a
+ # a
+ # p
+
+Now we're ready to look at the search methods in detail.
+
+find_all()
+-----------
+
+find_all( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
+
+The ``find_all()`` method looks through a tag's descendants and retrieves all descendants that match your filters. Here are a few examples:
+
+::
+
+ soup.find_all("title")
+ # [<title>The Dormouse's story</title>]
+
+ soup.find_all("p", "title")
+ # [<p class="title"><b>The Dormouse's story</b></p>]
+
+ soup.find_all("a")
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+ soup.find_all(id="link2")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+
+ import re
+ soup.find(string=re.compile("sisters"))
+ # u'Once upon a time there were three little sisters; and their names were\n'
+
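+Some of these should look familiar, but others are new. What does it mean to pass in a value for ``string``, or ``id``? Why does ``find_all("p", "title")`` find a <p> tag with the CSS class "title"? Let's look at the arguments to ``find_all()``.
+
+.. _`name 参数`:
+
+The name argument
+.................
+
+Pass in a value for ``name`` and Beautiful Soup will only consider tags with that name. Text strings are ignored.
+
+This is the simplest usage: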
+
+::
+
+ soup.find_all("title")
+ # [<title>The Dormouse's story</title>]
+
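+Recall from `Kinds of filters <过滤器_>`__ that the value of ``name`` can be a string, a regular expression, a list, a function, or the value ``True``.
+
+.. _`keyword 参数`:
+
+The keyword arguments
+.....................
+
+Any keyword argument that is not one of the built-in search arguments is turned into a filter on one of a tag's attributes. If you pass in a value for an argument called ``id``, Beautiful Soup will filter against each tag's "id" attribute: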
+
+::
+
+ soup.find_all(id='link2')
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+
+If you pass in a value for ``href``, Beautiful Soup will filter against each tag's "href" attribute:
+
+::
+
+ soup.find_all(href=re.compile("elsie"))
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
+
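+You can filter an attribute based on `a string <字符串_>`__, `a regular expression <正则表达式_>`__, `a list <列表_>`__, or `the value True <True_>`__.
+
+This code finds all tags that have an ``id`` attribute, regardless of what its value is: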
+
+::
+
+ soup.find_all(id=True)
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+You can filter on multiple attributes at once by passing in more than one keyword argument:
+
+::
+
+ soup.find_all(href=re.compile("elsie"), id='link1')
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
+
+Some attributes, like the data-* attributes in HTML 5, have names that can't be used as the names of keyword arguments:
+
+::
+
+ data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
+ data_soup.find_all(data-foo="value")
+ # SyntaxError: keyword can't be an expression
+
+You can still use these attributes in searches by putting them into a dictionary and passing the dictionary into ``find_all()`` as the ``attrs`` argument:
+
+::
+
+ data_soup.find_all(attrs={"data-foo": "value"})
+ # [<div data-foo="value">foo!</div>]
+
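+.. _按CSS搜索:
+
+Searching by CSS class
+......................
+
+It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using ``class`` as a keyword argument gives you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument ``class_``: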
+
+::
+
+ soup.find_all("a", class_="sister")
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+Like the other `filters <过滤器_>`__, ``class_`` accepts a string, a regular expression, a function, or ``True``:
+
+::
+
+ soup.find_all(class_=re.compile("itl"))
+ # [<p class="title"><b>The Dormouse's story</b></p>]
+
+ def has_six_characters(css_class):
+     return css_class is not None and len(css_class) == 6
+
+ soup.find_all(class_=has_six_characters)
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
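+Remember that a single tag can have `multiple values <多值属性_>`__ for its ``class`` attribute. When you search for a tag that matches a certain CSS class, you're matching against *any* of its CSS classes: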
+
+::
+
+ css_soup = BeautifulSoup('<p class="body strikeout"></p>')
+ css_soup.find_all("p", class_="strikeout")
+ # [<p class="body strikeout"></p>]
+
+ css_soup.find_all("p", class_="body")
+ # [<p class="body strikeout"></p>]
+
+You can also search for the exact string value of the ``class`` attribute:
+
+::
+
+ css_soup.find_all("p", class_="body strikeout")
+ # [<p class="body strikeout"></p>]
+
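+But an exact match fails when the CSS classes are given in a different order than in the document: with the ``css_soup`` above, ``class_="strikeout body"`` returns an empty list. In older versions of Beautiful Soup, which don't have the ``class_`` shortcut, you can pass a dictionary as ``attrs`` instead: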
+
+::
+
+ soup.find_all("a", attrs={"class": "sister"})
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
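+.. _`string 参数`:
+
+The ``string`` argument
+.......................
+
+With ``string`` you can search for strings instead of tags. As with ``name``, you can pass in `a string <字符串_>`__, `a regular expression <正则表达式_>`__, `a list <列表_>`__, or `the value True <True_>`__. Here are some examples: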
+
+::
+
+ soup.find_all(string="Elsie")
+ # [u'Elsie']
+
+ soup.find_all(string=["Tillie", "Elsie", "Lacie"])
+ # [u'Elsie', u'Lacie', u'Tillie']
+
+ soup.find_all(string=re.compile("Dormouse"))
+ # [u"The Dormouse's story", u"The Dormouse's story"]
+
+ def is_the_only_string_within_a_tag(s):
+     """Return True if this string is the only child of its parent tag."""
+     return (s == s.parent.string)
+
+ soup.find_all(string=is_the_only_string_within_a_tag)
+ # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
+
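+Although ``string`` is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose ``.string`` matches your value for ``string``. This code finds the <a> tags whose ``.string`` is "Elsie":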
+
+::
+
+ soup.find_all("a", string="Elsie")
+ # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
+
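+.. _`limit 参数`:
+
+The ``limit`` argument
+......................
+
+``find_all()`` returns all the tags and strings that match your filters. This can take a while if the document is large. If you don't need *all* the results, you can pass in a number for ``limit``. This works just like the LIMIT keyword in SQL: Beautiful Soup stops searching and returns as soon as it has found that many results.
+
+There are three links in the "Alice" document, but this code only finds the first two: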
+
+::
+
+ soup.find_all("a", limit=2)
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
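+
+The early stop that ``limit`` provides can be sketched without Beautiful Soup. In this hedged illustration the function ``find_limited`` and the flat list of tag names are assumptions made for the example; the real search walks a tree rather than a list:
+
```python
def find_limited(names, wanted, limit):
    """Return up to `limit` matches and how many names were examined."""
    results = []
    examined = 0
    for name in names:
        examined += 1
        if name == wanted:
            results.append(name)
            if len(results) >= limit:
                break  # stop scanning as soon as the limit is reached
    return results, examined


# Tag names of the "Alice" document in document order.
document_order = ["html", "head", "title", "body", "p", "b",
                  "p", "a", "a", "a", "p"]
matches, examined = find_limited(document_order, "a", 2)
print(matches, examined)
```
+
+Only nine of the eleven names are examined, because the scan ends at the second match.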
+
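+.. _`recursive 参数`:
+
+The ``recursive`` argument
+..........................
+
+If you call ``find_all()`` on a tag, Beautiful Soup examines all of that tag's descendants: its children, its children's children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in ``recursive=False``.
+
+Here is a simple document fragment: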
+
+::
+
+ <html>
+ <head>
+ <title>
+ The Dormouse's story
+ </title>
+ </head>
+ ...
+
+These are the search results with and without the ``recursive`` argument:
+
+::
+
+ soup.html.find_all("title")
+ # [<title>The Dormouse's story</title>]
+
+ soup.html.find_all("title", recursive=False)
+ # []
+
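+The <title> tag is beneath the <html> tag, but it's not *directly* beneath it: the <head> tag is the direct child.
+Beautiful Soup finds the <title> tag when it's allowed to look at all descendants of the <html> tag.
+But when ``recursive=False`` restricts the search to direct children, it can no longer find the <title> tag.
+
+Beautiful Soup offers a lot of tree-searching methods, and they mostly take the same arguments
+as ``find_all()``: ``name``, ``attrs``, ``string``, and ``limit``.
+But only ``find_all()`` and ``find()`` support the ``recursive`` argument.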
+
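+.. _`像调用 find_all() 一样调用tag`:
+
+Calling a tag is like calling ``find_all()``
+--------------------------------------------
+
+Because ``find_all()`` is the most popular method in the Beautiful Soup search API, there is a shortcut for it. If you treat the ``BeautifulSoup`` object or a ``Tag`` object as though it were a function, the result is the same as calling ``find_all()`` on that object. These two lines of code are equivalent: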
+
+::
+
+ soup.find_all("a")
+ soup("a")
+
+These two lines are also equivalent:
+
+::
+
+ soup.title.find_all(string=True)
+ soup.title(string=True)
+
+find()
+-------
+
+find( `name`_ , `attrs`_ , `recursive`_ , `string`_ , `**kwargs`_ )
+
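+The ``find_all()`` method returns every tag in the document that matches, but sometimes you only want one result. If you know a document has only one <body> tag, scanning for more with ``find_all()`` is wasteful. Rather than passing in ``limit=1`` every time you call ``find_all()``, you can use the ``find()`` method. These two lines of code are *nearly* equivalent: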
+
+::
+
+ soup.find_all('title', limit=1)
+ # [<title>The Dormouse's story</title>]
+
+ soup.find('title')
+ # <title>The Dormouse's story</title>
+
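+The only difference is that ``find_all()`` returns a list containing the single result, and ``find()`` just returns the result.
+
+If ``find_all()`` can't find anything, it returns an empty list. If ``find()`` can't find anything, it returns ``None``: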
+
+::
+
+ print(soup.find("nosuchtag"))
+ # None
+
+Remember the ``soup.head.title`` shortcut from `Navigating using tag names <tag的名字_>`__? It works by repeatedly calling ``find()`` on the current tag:
+
+::
+
+ soup.head.title
+ # <title>The Dormouse's story</title>
+
+ soup.find("head").find("title")
+ # <title>The Dormouse's story</title>
+
+find_parents() and find_parent()
+--------------------------------
+
+find_parents( `name`_ , `attrs`_ , `string`_ , `**kwargs`_ )
+
+find_parent( `name`_ , `attrs`_ , `string`_ , `**kwargs`_ )
+
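+A lot of space has gone to ``find_all()`` and ``find()``. The Beautiful Soup API defines ten other methods for searching the tree. Five of them take basically the same arguments as ``find_all()``, and the other five take basically the same arguments as ``find()``. The only difference is which part of the tree they search.
+
+Remember that ``find_all()`` and ``find()`` work their way down the tree, looking at a tag's children, grandchildren, and so on. ``find_parents()`` and ``find_parent()`` do the opposite: they work their way up the tree, looking at the parents of a tag or a string. Let's start from a string buried deep in the document: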
+
+::
+
+ a_string = soup.find(string="Lacie")
+ a_string
+ # u'Lacie'
+
+ a_string.find_parents("a")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+
+ a_string.find_parent("p")
+ # <p class="story">Once upon a time there were three little sisters; and their names were
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
+ # and they lived at the bottom of a well.</p>
+
+ a_string.find_parents("p", class_="title")
+ # []
+
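+One of the <a> tags in the document is the direct parent of the string, so the search finds it. One of the <p> tags is an indirect parent of the string, so the search finds that too. The <p> tag with the CSS class "title" is not a parent of the string, so ``find_parents()`` cannot find it.
+
+``find_parent()`` and ``find_parents()`` may remind you of the `.parent`_ and `.parents`_ attributes. The connection is very strong: these search methods work by iterating over ``.parents`` and checking each element against the filter.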
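+
+The relationship with ``.parents`` can be checked by hand. The following is a minimal sketch, assuming Beautiful Soup 4.4.0 or later (for the ``string`` argument) and the built-in ``html.parser``; the markup is a shortened, made-up stand-in for the sample document:
+

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the "three sisters" markup.
html = '<p class="story">Once upon a time <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
a_string = soup.find(string="Lacie")

# Names of every ancestor, nearest first, read from the .parents generator.
ancestor_names = [parent.name for parent in a_string.parents]

# Filtering .parents by hand selects the same tags as find_parents().
manual = [parent for parent in a_string.parents if parent.name == "p"]
found = a_string.find_parents("p")
```

+
+The last ancestor is the ``BeautifulSoup`` object itself, whose name is ``[document]``.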
+
+find_next_siblings() and find_next_sibling()
+--------------------------------------------
+
+find_next_siblings( `name`_ , `attrs`_ , `string`_ , `**kwargs`_ )
+
+find_next_sibling( `name`_ , `attrs`_ , `string`_ , `**kwargs`_ )
+
+These methods use `.next_siblings`_ to iterate over the siblings that are parsed after the current tag [5]_ . ``find_next_siblings()`` returns all the later siblings that match, and ``find_next_sibling()`` returns only the first one:
+
+::
+
+ first_link = soup.a
+ first_link
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+
+ first_link.find_next_siblings("a")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+ first_story_paragraph = soup.find("p", "story")
+ first_story_paragraph.find_next_sibling("p")
+ # <p class="story">...</p>
+
+find_previous_siblings() and find_previous_sibling()
+----------------------------------------------------
+
+find_previous_siblings( `name`_ , `attrs`_ , `string`_ , `**kwargs`_ )
+
+find_previous_sibling( `name`_ , `attrs`_ , `string`_ , `**kwargs`_ )
+
+These methods use `.previous_siblings`_ to iterate over the siblings that are parsed before the current tag [5]_ . ``find_previous_siblings()`` returns all the earlier siblings that match, and ``find_previous_sibling()`` returns only the first one:
+
+::
+
+ last_link = soup.find("a", id="link3")
+ last_link
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
+
+ last_link.find_previous_siblings("a")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
+
+ first_story_paragraph = soup.find("p", "story")
+ first_story_paragraph.find_previous_sibling("p")
+ # <p class="title"><b>The Dormouse's story</b></p>
+
+find_all_next() and find_next()
+-------------------------------
+
+find_all_next( `name`_ , `attrs`_ , `string`_ , `**kwargs`_ )
+
+find_next( `name`_ , `attrs`_ , `string`_ , `**kwargs`_ )
+
+These methods use `.next_elements`_ to iterate over the tags and strings that come after the current tag [5]_ . ``find_all_next()`` returns all matches, and ``find_next()`` returns only the first match:
+
+::
+
+ first_link = soup.a
+ first_link
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+
+ first_link.find_all_next(string=True)
+ # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
+ # u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']
+
+ first_link.find_next("p")
+ # <p class="story">...</p>
+
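+In the first example, the string "Elsie" shows up even though it is contained in the <a> tag we started from. In the second example, the last <p> tag in the document shows up even though it is not in the same part of the tree as the <a> tag we started from. For these methods, all that matters is that an element matches the filter and appears later in the document than the starting element.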
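+
+The same behavior can be reproduced on a smaller document. This is only a sketch on made-up markup, parsed with the built-in ``html.parser`` and assuming Beautiful Soup 4.4.0 or later:
+

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document.
html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p><p>...</p>'
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

# Every string parsed after the opening <a> tag, including its own text.
following_strings = [str(s) for s in first_link.find_all_next(string=True)]

# The first <p> that starts after first_link; the enclosing <p> started earlier.
next_paragraph = first_link.find_next("p")
```
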
+
+find_all_previous() and find_previous()
+---------------------------------------
+
+find_all_previous( `name`_ , `attrs`_ , `string`_ , `**kwargs`_ )
+
+find_previous( `name`_ , `attrs`_ , `string`_ , `**kwargs`_ )
+
+These methods use `.previous_elements`_ to iterate over the tags and strings that come before the current element [5]_ . ``find_all_previous()`` returns all matches, and ``find_previous()`` returns only the first match:
+
+::
+
+ first_link = soup.a
+ first_link
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+
+ first_link.find_all_previous("p")
+ # [<p class="story">Once upon a time there were three little sisters; ...</p>,
+ # <p class="title"><b>The Dormouse's story</b></p>]
+
+ first_link.find_previous("title")
+ # <title>The Dormouse's story</title>
+
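+The call to ``find_all_previous("p")`` found the first paragraph in the document (the one with class="title"), but it also found the second paragraph, the <p> tag that contains the <a> tag we started from. This should not be surprising: the search covers every <p> tag that appears earlier in the document than the starting <a> tag, and a <p> tag that contains an <a> tag must have appeared before the <a> tag it contains.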
+
+.. _CSS选择器:
+
+CSS selectors
+-------------
+
+Beautiful Soup supports most CSS selectors `<http://www.w3.org/TR/CSS2/selector.html>`_ [6]_ .
+Pass a selector string into the ``.select()`` method of a ``Tag`` or ``BeautifulSoup`` object
+to find tags using CSS selector syntax:
+
+::
+
+ soup.select("title")
+ # [<title>The Dormouse's story</title>]
+
+ soup.select("p:nth-of-type(3)")
+ # [<p class="story">...</p>]
+
+Find tags beneath other tags:
+
+::
+
+ soup.select("body a")
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+ soup.select("html head title")
+ # [<title>The Dormouse's story</title>]
+
+Find tags directly beneath other tags [6]_ :
+
+::
+
+ soup.select("head > title")
+ # [<title>The Dormouse's story</title>]
+
+ soup.select("p > a")
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+ soup.select("p > a:nth-of-type(2)")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+
+ soup.select("p > #link1")
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
+
+ soup.select("body > a")
+ # []
+
+Find the siblings of tags:
+
+::
+
+ soup.select("#link1 ~ .sister")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+ soup.select("#link1 + .sister")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+
+Find tags by CSS class:
+
+::
+
+ soup.select(".sister")
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+ soup.select("[class~=sister]")
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+Find tags by ID:
+
+::
+
+ soup.select("#link1")
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
+
+ soup.select("a#link2")
+ # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+
+Find tags that match any selector from a list of selectors:
+
+::
+
+ soup.select("#link1,#link2")
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+
+
+Test for the existence of an attribute:
+
+::
+
+ soup.select('a[href]')
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+Find tags by attribute value:
+
+::
+
+ soup.select('a[href="http://example.com/elsie"]')
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
+
+ soup.select('a[href^="http://example.com/"]')
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+ # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+ # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+ soup.select('a[href$="tillie"]')
+ # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+ soup.select('a[href*=".com/el"]')
+ # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
+
+Match language codes:
+
+::
+
+ multilingual_markup = """
+ <p lang="en">Hello</p>
+ <p lang="en-us">Howdy, y'all</p>
+ <p lang="en-gb">Pip-pip, old fruit</p>
+ <p lang="fr">Bonjour mes amis</p>
+ """
+ multilingual_soup = BeautifulSoup(multilingual_markup)
+ multilingual_soup.select('p[lang|=en]')
+ # [<p lang="en">Hello</p>,
+ # <p lang="en-us">Howdy, y'all</p>,
+ # <p lang="en-gb">Pip-pip, old fruit</p>]
+
+Find only the first tag that matches a selector:
+
+::
+
+ soup.select_one(".sister")
+ # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
+
+
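+This is a convenience for people who already know CSS selector syntax. Everything shown here can also be done with the Beautiful Soup API.
+If CSS selectors are all you need, you might as well use ``lxml`` directly:
+it is faster and supports more CSS selectors. The advantage of Beautiful Soup is that it combines CSS selectors with its own convenient API.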
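+
+As a rough illustration of that combination, the sketch below runs a selector and an equivalent ``find_all()`` call side by side. It assumes a Beautiful Soup release that provides ``select_one()`` and uses made-up markup with the built-in ``html.parser``:
+

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links that share a CSS class.
html = '<p><a class="sister" id="link1">Elsie</a><a class="sister" id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# The same two tags, found once with a CSS selector and once with find_all().
by_selector = soup.select("p > a.sister")
by_find_all = soup.find_all("a", class_="sister")

# select_one() returns the first match rather than a list.
first_match = soup.select_one("#link2")
```
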
+
+
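+.. _修改文档树:
+
+Modifying the tree
+==================
+
+Beautiful Soup's main strength is searching the parse tree, but it also makes modifying the tree easy.
+
+.. _修改tag的名称和属性:
+
+Changing tag names and attributes
+---------------------------------
+
+This was covered earlier, in `Attributes`_ , but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes: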
+
+::
+
+ soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
+ tag = soup.b
+
+ tag.name = "blockquote"
+ tag['class'] = 'verybold'
+ tag['id'] = 1
+ tag
+ # <blockquote class="verybold" id="1">Extremely bold</blockquote>
+
+ del tag['class']
+ del tag['id']
+ tag
+ # <blockquote>Extremely bold</blockquote>
+
+.. _修改 .string:
+
+Modifying .string
+-----------------
+
+Assigning to a tag's ``.string`` attribute replaces the tag's contents with the string you give:
+
+::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+
+ tag = soup.a
+ tag.string = "New link text."
+ tag
+ # <a href="http://example.com/">New link text.</a>
+
+Be careful: if the tag contains other tags, assigning to ``.string`` destroys all of its existing contents, including those child tags.
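+
+The effect on child tags can be seen by counting the children before and after the assignment. A minimal sketch, using the built-in ``html.parser``:
+

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

children_before = len(tag.contents)  # one text node plus the <i> tag

tag.string = "New link text."
children_after = len(tag.contents)   # only the new string remains
i_tag_after = soup.i                 # the <i> tag is no longer in the tree
```
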
+
+append()
+----------
+
+``Tag.append()`` adds content to a tag. It works just like calling ``.append()`` on a Python list:
+
+::
+
+ soup = BeautifulSoup("<a>Foo</a>")
+ soup.a.append("Bar")
+
+ soup
+ # <html><head></head><body><a>FooBar</a></body></html>
+ soup.a.contents
+ # [u'Foo', u'Bar']
+
+NavigableString() and .new_tag()
+--------------------------------
+
+To add a string to a document, pass a Python string to ``append()``,
+or call the ``NavigableString`` constructor:
+
+::
+
+ soup = BeautifulSoup("<b></b>")
+ tag = soup.b
+ tag.append("Hello")
+ new_string = NavigableString(" there")
+ tag.append(new_string)
+ tag
+ # <b>Hello there</b>
+ tag.contents
+ # [u'Hello', u' there']
+
+To create a comment or any other subclass of ``NavigableString``, pass that class as the second argument to ``new_string()``:
+
+::
+
+ from bs4 import Comment
+ new_comment = soup.new_string("Nice to see you.", Comment)
+ tag.append(new_comment)
+ tag
+ # <b>Hello there<!--Nice to see you.--></b>
+ tag.contents
+ # [u'Hello', u' there', u'Nice to see you.']
+
+(This is a new feature in Beautiful Soup 4.2.1.)
+
+The best way to create a whole new tag is to call the factory method ``BeautifulSoup.new_tag()``:
+
+::
+
+ soup = BeautifulSoup("<b></b>")
+ original_tag = soup.b
+
+ new_tag = soup.new_tag("a", href="http://www.example.com")
+ original_tag.append(new_tag)
+ original_tag
+ # <b><a href="http://www.example.com"></a></b>
+
+ new_tag.string = "Link text."
+ original_tag
+ # <b><a href="http://www.example.com">Link text.</a></b>
+
+Only the first argument, the tag name, is required; the others are optional.
+
+insert()
+--------
+
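+``Tag.insert()`` is just like ``Tag.append()``, except the new element does not necessarily go at the end of its parent's ``.contents``. It is inserted at the numeric position you specify, just like ``.insert()`` on a Python list: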
+
+::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ tag = soup.a
+
+ tag.insert(1, "but did not endorse ")
+ tag
+ # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
+ tag.contents
+ # [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
+
+insert_before() and insert_after()
+----------------------------------
+
+``insert_before()`` inserts content immediately before the current tag or string:
+
+::
+
+ soup = BeautifulSoup("<b>stop</b>")
+ tag = soup.new_tag("i")
+ tag.string = "Don't"
+ soup.b.string.insert_before(tag)
+ soup.b
+ # <b><i>Don't</i>stop</b>
+
+``insert_after()`` inserts content immediately after the current tag or string:
+
+::
+
+ soup.b.i.insert_after(soup.new_string(" ever "))
+ soup.b
+ # <b><i>Don't</i> ever stop</b>
+ soup.b.contents
+ # [<i>Don't</i>, u' ever ', u'stop']
+
+clear()
+--------
+
+``Tag.clear()`` removes the contents of a tag:
+
+::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ tag = soup.a
+
+ tag.clear()
+ tag
+ # <a href="http://example.com/"></a>
+
+extract()
+----------
+
+``PageElement.extract()`` removes a tag or string from the tree and returns the element that was extracted:
+
+::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ a_tag = soup.a
+
+ i_tag = soup.i.extract()
+
+ a_tag
+ # <a href="http://example.com/">I linked to</a>
+
+ i_tag
+ # <i>example.com</i>
+
+ print(i_tag.parent)
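+ # None
+
+At this point you effectively have two parse trees: one rooted at the ``BeautifulSoup`` object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call ``extract`` on a child of the extracted element: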
+
+::
+
+ my_string = i_tag.string.extract()
+ my_string
+ # u'example.com'
+
+ print(my_string.parent)
+ # None
+ i_tag
+ # <i></i>
+
+decompose()
+------------
+
+``Tag.decompose()`` removes a tag from the tree, then completely destroys it and its contents:
+
+::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ a_tag = soup.a
+
+ soup.i.decompose()
+
+ a_tag
+ # <a href="http://example.com/">I linked to</a>
+
+replace_with()
+---------------
+
+``PageElement.replace_with()`` removes a tag or string from the tree and replaces it with the tag or string of your choice:
+
+::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ a_tag = soup.a
+
+ new_tag = soup.new_tag("b")
+ new_tag.string = "example.net"
+ a_tag.i.replace_with(new_tag)
+
+ a_tag
+ # <a href="http://example.com/">I linked to <b>example.net</b></a>
+
+``replace_with()`` returns the tag or string that was replaced, so that you can examine it or add it back to another part of the tree.
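+
+A short sketch of reusing that return value, again with the built-in ``html.parser``. Re-attaching the old tag at the end of the link is an arbitrary choice made for illustration:
+

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

new_tag = soup.new_tag("b")
new_tag.string = "example.net"

# replace_with() hands back the detached <i> tag.
old_tag = soup.a.i.replace_with(new_tag)
detached_markup = str(old_tag)

# The detached tag can be added back somewhere else in the tree.
soup.a.append(old_tag)
```
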
+
+wrap()
+------
+
+``PageElement.wrap()`` wraps an element in the tag you specify [8]_ and returns the new wrapper:
+
+::
+
+ soup = BeautifulSoup("<p>I wish I was bold.</p>")
+ soup.p.string.wrap(soup.new_tag("b"))
+ # <b>I wish I was bold.</b>
+
+ soup.p.wrap(soup.new_tag("div"))
+ # <div><p><b>I wish I was bold.</b></p></div>
+
+This method is new in Beautiful Soup 4.0.5.
+
+unwrap()
+---------
+
+``Tag.unwrap()`` is the opposite of ``wrap()``. It replaces a tag with whatever is inside that tag, which makes it useful for stripping out markup:
+
+::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ a_tag = soup.a
+
+ a_tag.i.unwrap()
+ a_tag
+ # <a href="http://example.com/">I linked to example.com</a>
+
+Like ``replace_with()``, ``unwrap()`` returns the tag that was removed.
+
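+.. _输出:
+
+Output
+======
+
+.. _格式化输出:
+
+Pretty-printing
+---------------
+
+The ``prettify()`` method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with each HTML/XML tag on its own line: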
+
+::
+
+ markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
+ soup = BeautifulSoup(markup)
+ soup.prettify()
+ # '<html>\n <head>\n </head>\n <body>\n <a href="http://example.com/">\n...'
+
+ print(soup.prettify())
+ # <html>
+ # <head>
+ # </head>
+ # <body>
+ # <a href="http://example.com/">
+ # I linked to
+ # <i>
+ # example.com
+ # </i>
+ # </a>
+ # </body>
+ # </html>
+
+You can call ``prettify()`` on the top-level ``BeautifulSoup`` object or on any of its ``Tag`` objects:
+
+::
+
+ print(soup.a.prettify())
+ # <a href="http://example.com/">
+ # I linked to
+ # <i>
+ # example.com
+ # </i>
+ # </a>
+
+.. _压缩输出:
+
+Non-pretty printing
+-------------------
+
+If you just want a string, with no fancy formatting, you can call Python's ``unicode()`` or ``str()`` on a ``BeautifulSoup`` object or a ``Tag`` object:
+
+::
+
+ str(soup)
+ # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
+
+ unicode(soup.a)
+ # u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
+
+The ``str()`` function returns a string encoded in UTF-8. See `编码`_ (Encodings) for other options.
+
+You can also call ``encode()`` to get a bytestring, and ``decode()`` to get Unicode.
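+
+The following sketch shows the two calls on a single tag. It was written with Python 3 in mind, where ``encode()`` returns ``bytes`` and ``decode()`` returns ``str``; the surrounding examples show Python 2 style output:
+

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

as_bytes = soup.a.encode("utf-8")  # a bytestring in the requested encoding
as_text = soup.a.decode()          # a Unicode string
```
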
+
+.. _输出格式:
+
+Output formatters
+-----------------
+
+When Beautiful Soup parses a document, HTML entities such as "&ldquo;" are converted to Unicode characters:
+
+::
+
+ soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
+ unicode(soup)
+ # u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'
+
+If you then convert the document to a string, the Unicode characters are encoded as UTF-8. You do not get the HTML entities back:
+
+::
+
+ str(soup)
+ # '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'
+
+get_text()
+----------
+
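+If you only want the text part of a document or tag, you can use the ``get_text()`` method. It returns all the text beneath a tag, including the text of its descendants, as a single Unicode string: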
+
+::
+
+ markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
+ soup = BeautifulSoup(markup)
+
+ soup.get_text()
+ # u'\nI linked to example.com\n'
+ soup.i.get_text()
+ # u'example.com'
+
+You can specify a string to be used to join the bits of text together:
+
+::
+
+ soup.get_text("|")
+ # u'\nI linked to |example.com|\n'
+
+You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:
+
+::
+
+ soup.get_text("|", strip=True)
+ # u'I linked to|example.com'
+
+Or use the `.stripped_strings`_ generator and process the text yourself:
+
+::
+
+ [text for text in soup.stripped_strings]
+ # [u'I linked to', u'example.com']
+
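+.. _指定文档解析器:
+
+Specifying the parser to use
+============================
+
+If you just need to parse some HTML, you can pass the markup to the ``BeautifulSoup`` constructor and it will probably be fine. Beautiful Soup picks a parser for you and parses the data. You can also pass an additional argument to choose which parser is used.
+
+The first argument to the ``BeautifulSoup`` constructor is a string or an open filehandle: the markup you want parsed. The second argument says how you would like the markup parsed. If you omit it, Beautiful Soup picks a parser from the libraries installed on the system, in this order of preference: lxml, html5lib, Python's built-in parser. You can override this choice by specifying one of the following:
+
+ * The type of markup to parse. Currently supported: "html", "xml", and "html5".
+ * The name of the parser library to use. Currently supported: "lxml", "html5lib", and "html.parser".
+
+The section `安装解析器`_ (Installing a parser) describes the available parsers and how to install them.
+
+If you do not have an appropriate parser installed, Beautiful Soup ignores your request and picks a different parser. Right now, the only supported XML parser is lxml. If lxml is not installed, asking for an XML parser will not give you one, and asking for "lxml" will not work either.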
+
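+.. _解析器之间的区别:
+
+Differences between parsers
+---------------------------
+
+Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers can create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here is a short document, parsed as HTML: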
+
+::
+
+ BeautifulSoup("<a><b /></a>")
+ # <html><head></head><body><a><b></b></a></body></html>
+
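+Since an empty <b /> tag is not valid HTML, the parser turns it into a <b></b> tag pair.
+
+Here is the same document parsed as XML (running this requires lxml to be installed). Note that the empty <b /> tag is left alone, and that the document is given an XML declaration instead of being put inside an <html> tag: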
+
+::
+
+ BeautifulSoup("<a><b /></a>", "xml")
+ # <?xml version="1.0" encoding="utf-8"?>
+ # <a><b/></a>
+
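+There are also differences between HTML parsers. If you give Beautiful Soup a perfectly formed HTML document, these differences do not matter. One parser will be faster than another, but they all return a correct parse tree.
+
+But if the document is not perfectly formed, different parsers can give different results. Here is a short, invalid document parsed using lxml's HTML parser. Note that the dangling </p> tag is simply ignored: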
+
+::
+
+ BeautifulSoup("<a></p>", "lxml")
+ # <html><body><a></a></body></html>
+
+Here is the same document parsed using html5lib:
+
+::
+
+ BeautifulSoup("<a></p>", "html5lib")
+ # <html><head></head><body><a><p></p></a></body></html>
+
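+Instead of ignoring the dangling </p> tag, html5lib pairs it with an opening <p> tag. This parser also adds an empty <head> tag to the document.
+
+Here is the same document parsed with Python's built-in HTML parser: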
+
+::
+
+ BeautifulSoup("<a></p>", "html.parser")
+ # <a></a>
+
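+Like lxml [7]_ , this parser ignores the closing </p> tag. Unlike html5lib, it makes no attempt to create a well-formed HTML document or to put the fragment inside a <body> tag. Unlike lxml, it does not even bother to add an <html> tag.
+
+Since the document "<a></p>" is invalid, none of these techniques is the single "correct" way to handle it. The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim to being "correct", but all three results are legitimate.
+
+Differences between parsers can affect your script. If you plan to distribute code that uses ``BeautifulSoup`` to other people, state which parser it uses, ideally by naming it in the constructor. That reduces the chance of unnecessary trouble.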
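+
+One way to follow that advice is to name the parser in a single, visible place. The sketch below is an illustration only; the ``PARSER`` constant is a hypothetical name, and ``html.parser`` is chosen because it ships with Python:
+

```python
from bs4 import BeautifulSoup

# Hypothetical module-level constant; swap in "lxml" or "html5lib" if installed.
PARSER = "html.parser"

# With the parser named explicitly, the result no longer depends on
# which optional parser libraries happen to be installed.
soup = BeautifulSoup("<a></p>", PARSER)
result = str(soup)
```
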
+
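+.. _编码:
+
+Encodings
+=========
+
+Any HTML or XML document is written in a specific encoding such as ASCII or UTF-8. But when you load that document into Beautiful Soup, it is converted to Unicode: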
+
+::
+
+ markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
+ soup = BeautifulSoup(markup)
+ soup.h1
+ # <h1>Sacré bleu!</h1>
+ soup.h1.string
+ # u'Sacr\xe9 bleu!'
+
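+It is not magic (though that would be nice). Beautiful Soup uses a sub-library, `编码自动检测`_ (Unicode, Dammit), to detect a document's encoding and convert it to Unicode. The autodetected encoding is available as the ``.original_encoding`` attribute of the ``BeautifulSoup`` object: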
+
+::
+
+ soup.original_encoding
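+ # 'utf-8'
+
+`编码自动检测`_ (Unicode, Dammit) guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly, but only after a byte-by-byte search of the whole document, which is slow. If you know a document's encoding ahead of time, you can avoid mistakes and delays by passing it to the ``BeautifulSoup`` constructor as ``from_encoding``.
+
+Here is a document written in ISO-8859-8. The document is so short that Beautiful Soup misidentifies it as ISO-8859-7: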
+
+::
+
+ markup = b"<h1>\xed\xe5\xec\xf9</h1>"
+ soup = BeautifulSoup(markup)
+ soup.h1
+ # <h1>νεμω</h1>
+ soup.original_encoding
+ # 'ISO-8859-7'
+
+We can fix this by passing in the correct ``from_encoding``:
+
+::
+
+ soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
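+ soup.h1
+ # <h1>םולש</h1>
+ soup.original_encoding
+ # 'iso8859-8'
+
+If you do not know the correct encoding, but you can see that the guess is wrong (the text is still garbled),
+you can pass the wrong guesses in as ``exclude_encodings``, and those encodings will not be tried.
+Translator's note: when no encoding is given, Beautiful Soup guesses one; ruling out encodings known to be wrong makes a correct guess more likely.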
+
+::
+
+ soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
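+ soup.h1
+ # <h1>םולש</h1>
+ soup.original_encoding
+ # 'WINDOWS-1255'
+
+Windows-1255 is not exactly right, but that encoding is a compatible superset of ISO-8859-8,
+so the guess is close enough for practical use. (``exclude_encodings`` is a new feature in Beautiful Soup 4.4.0.)
+
+In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character "REPLACEMENT CHARACTER" (U+FFFD, �) [9]_ . If Beautiful Soup has to do this while detecting the encoding, it sets the ``.contains_replacement_characters`` attribute to ``True`` on the ``UnicodeDammit`` or ``BeautifulSoup`` object. This tells you that the Unicode representation is not an exact copy of the original and that some data was lost. If a document contains �, but ``.contains_replacement_characters`` is ``False``, the � was there originally and does not stand in for missing data.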
+
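+.. _输出编码:
+
+Output encoding
+---------------
+
+When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document was not in UTF-8 to begin with. Here is a document written in the Latin-1 encoding: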
+
+::
+
+ markup = b'''
+ <html>
+ <head>
+ <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
+ </head>
+ <body>
+ <p>Sacr\xe9 bleu!</p>
+ </body>
+ </html>
+ '''
+
+ soup = BeautifulSoup(markup)
+ print(soup.prettify())
+ # <html>
+ # <head>
+ # <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
+ # </head>
+ # <body>
+ # <p>
+ # Sacré bleu!
+ # </p>
+ # </body>
+ # </html>
+
+Note that the <meta> tag in the output has been rewritten to reflect the fact that the document is now in UTF-8.
+
+If you do not want UTF-8, you can pass an encoding into ``prettify()``:
+
+::
+
+ print(soup.prettify("latin-1"))
+ # <html>
+ # <head>
+ # <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
+ # ...
+
+You can also call ``encode()`` on the ``BeautifulSoup`` object, or on any element in the soup, just as if it were a Python string:
+
+::
+
+ soup.p.encode("latin-1")
+ # '<p>Sacr\xe9 bleu!</p>'
+
+ soup.p.encode("utf-8")
+ # '<p>Sacr\xc3\xa9 bleu!</p>'
+
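+Any characters that cannot be represented in your chosen encoding are converted into numeric XML character references. Here is a document that includes the Unicode character SNOWMAN: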
+
+::
+
+ markup = u"<b>\N{SNOWMAN}</b>"
+ snowman_soup = BeautifulSoup(markup)
+ tag = snowman_soup.b
+
+The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but there is no representation for it in ISO-Latin-1 or ASCII, so it is converted into "&#9731;" for those encodings:
+
+::
+
+ print(tag.encode("utf-8"))
+ # <b>☃</b>
+
+ print(tag.encode("latin-1"))
+ # <b>&#9731;</b>
+
+ print(tag.encode("ascii"))
+ # <b>&#9731;</b>
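+
+Python's own string encoding offers the same kind of fallback through the ``xmlcharrefreplace`` error handler. The sketch below uses only the standard library to show where a reference such as "&#9731;" comes from; it is an illustration of the idea, not a description of Beautiful Soup's internals:
+

```python
snowman = u"\N{SNOWMAN}"

# UTF-8 can represent the character directly.
utf8_bytes = snowman.encode("utf-8")

# ASCII and Latin-1 cannot, so the error handler substitutes
# a numeric character reference (9731 is the code point of SNOWMAN).
ascii_bytes = snowman.encode("ascii", "xmlcharrefreplace")
latin1_bytes = snowman.encode("latin-1", "xmlcharrefreplace")
```
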
+
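+Unicode, Dammit!
+----------------
+
+Translator's note: UnicodeDammit is a library bundled with Beautiful Soup, used mainly to guess a document's encoding.
+
+You can use `编码自动检测`_ (Unicode, Dammit) without using Beautiful Soup. It is useful whenever you have data in an unknown encoding and just want it to become Unicode: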
+
+::
+
+ from bs4 import UnicodeDammit
+ dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
+ print(dammit.unicode_markup)
+ # Sacré bleu!
+ dammit.original_encoding
+ # 'utf-8'
+
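+The guesses get much more accurate if the ``chardet`` or ``cchardet`` library is installed.
+The more data you supply, the more accurate the guess. If you have your own suspicions about what the encoding might be,
+you can pass them in as a list, and those encodings are tried first: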
+
+::
+
+
+ dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
+ print(dammit.unicode_markup)
+ # Sacré bleu!
+ dammit.original_encoding
+ # 'latin-1'
+
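+`编码自动检测`_ (Unicode, Dammit) has two special features that Beautiful Soup itself does not use.
+
+.. _智能引号:
+
+Smart quotes
+............
+
+You can use Unicode, Dammit to convert Microsoft smart quotes [10]_ to HTML or XML entities: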
+
+::
+
+ markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
+
+ UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
+ # u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'
+
+ UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
+ # u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'
+
+You can also convert the smart quotes to ASCII quotes:
+
+::
+
+ UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
+ # u'<p>I just "love" Microsoft Word\'s smart quotes</p>'
+
+This feature is useful, but Beautiful Soup does not use it. By default, Beautiful Soup converts smart quotes to Unicode characters along with everything else:
+
+::
+
+ UnicodeDammit(markup, ["windows-1252"]).unicode_markup
+ # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'
+
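+.. _矛盾的编码:
+
+Inconsistent encodings
+......................
+
+Sometimes a document is mostly in UTF-8 but contains Windows-1252 characters such as (again) Microsoft smart quotes [10]_ .
+This can happen when a website includes data from multiple sources. You can use ``UnicodeDammit.detwingle()``
+to turn such a document into pure UTF-8. Here is a simple example: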
+
+::
+
+ snowmen = (u"\N{SNOWMAN}" * 3)
+ quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
+ doc = snowmen.encode("utf8") + quote.encode("windows_1252")
+
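+This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252. You can display the snowmen or the quotes, but not both, because they use different encodings: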
+
+::
+
+ print(doc)
+ # ☃☃☃�I like snowmen!�
+
+ print(doc.decode("windows-1252"))
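+ # â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”
+
+Decoding the document as UTF-8 raises a ``UnicodeDecodeError``, and decoding it as Windows-1252 gives you gibberish.
+Fortunately, ``UnicodeDammit.detwingle()`` converts the string to pure UTF-8, so that the snowmen and the quotation marks can be displayed at the same time: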
+
+::
+
+ new_doc = UnicodeDammit.detwingle(doc)
+ print(new_doc.decode("utf8"))
+ # ☃☃☃“I like snowmen!”
+
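+``UnicodeDammit.detwingle()`` only knows how to handle Windows-1252 content embedded in UTF-8, but this is the most common case.
+
+Note that you must call ``UnicodeDammit.detwingle()`` on your data before passing it to the ``BeautifulSoup`` or ``UnicodeDammit`` constructor. If you try to parse a UTF-8 document that contains Windows-1252 content, the result is likely to be gibberish such as ``â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”``.
+
+``UnicodeDammit.detwingle()`` is new in Beautiful Soup 4.1.0.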
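+
+The failure modes described above can be reproduced with the standard library alone. This sketch rebuilds the mixed-encoding bytes from the example and decodes them both ways; it does not call Beautiful Soup:
+

```python
snowmen = u"\N{SNOWMAN}" * 3
quote = u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# Strict UTF-8 decoding fails on the Windows-1252 quote bytes.
try:
    doc.decode("utf8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Windows-1252 decoding succeeds, but each snowman turns into three wrong characters.
garbled = doc.decode("windows-1252")
```
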
+
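+.. _比较对象是否相同:
+
+Comparing objects for equality
+==============================
+
+Beautiful Soup says that two ``NavigableString`` or ``Tag`` objects are equal when they represent the same HTML or XML markup.
+In this example, the two <b> tags are treated as equal
+even though they live in different parts of the tree, because they both look like "<b>pizza</b>":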
+
+::
+
+ markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
+ soup = BeautifulSoup(markup, 'html.parser')
+ first_b, second_b = soup.find_all('b')
+ print(first_b == second_b)
+ # True
+
+ print(first_b.previous_element == second_b.previous_element)
+ # False
+
+If you want to see whether two variables refer to exactly the same object, use ``is``:
+
+::
+
+ print(first_b is second_b)
+ # False
+
+.. _复制Beautiful Soup对象:
+
+Copying Beautiful Soup objects
+==============================
+
+You can use ``copy.copy()`` to create a copy of any ``Tag`` or ``NavigableString``:
+
+::
+
+ import copy
+ p_copy = copy.copy(soup.p)
+ print(p_copy)
+ # <p>I want <b>pizza</b> and more <b>pizza</b>!</p>
+
+The copy is equal to the original, but it is a different object in memory:
+
+::
+
+ print(soup.p == p_copy)
+ # True
+
+ print(soup.p is p_copy)
+ # False
+
+The only real difference is that the original is part of the tree, while the copy is completely detached from it,
+just as if ``extract()`` had been called on it:
+
+::
+
+ print(p_copy.parent)
+ # None
+
+This is because two different ``Tag`` objects cannot occupy the same position in the tree at the same time.
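+
+Because the copy is detached, it can be edited without touching the original. A minimal sketch, assuming Beautiful Soup 4.4.0 or later and the built-in ``html.parser``; the replacement text "pasta" is made up for the example:
+

```python
import copy

from bs4 import BeautifulSoup

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, "html.parser")
p_copy = copy.copy(soup.p)

# Edit the first <b> tag of the copy only.
p_copy.b.string = "pasta"

original_markup = str(soup.p)
copied_markup = str(p_copy)
```
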
+
+
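+.. _解析部分文档:
+
+Parsing only part of a document
+===============================
+
+Suppose you want to use Beautiful Soup to look at a document's <a> tags. Parsing the entire document just to find them wastes time and memory. It is much faster to ignore everything that is not an <a> tag in the first place. The ``SoupStrainer`` class lets you choose which parts of an incoming document are parsed, so that only the parts it describes end up in the tree. Create a ``SoupStrainer`` and pass it to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.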
+
+SoupStrainer
+-------------
+
+The ``SoupStrainer`` class takes the same arguments as a typical search method: `name`_ , `attrs`_ , `string`_ , and `**kwargs`_ . Here are three ``SoupStrainer`` objects:
+
+::
+
+ from bs4 import SoupStrainer
+
+ only_a_tags = SoupStrainer("a")
+
+ only_tags_with_id_link2 = SoupStrainer(id="link2")
+
+ def is_short_string(string):
+ return len(string) < 10
+
+ only_short_strings = SoupStrainer(string=is_short_string)
+
+Here is the "Alice" document again, to show what it looks like when parsed with each of these three ``SoupStrainer`` objects:
+
+::
+
+ html_doc = """
+ <html><head><title>The Dormouse's story</title></head>
+ <body>
+ <p class="title"><b>The Dormouse's story</b></p>
+
+ <p class="story">Once upon a time there were three little sisters; and their names were
+ <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
+ <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
+ <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
+ and they lived at the bottom of a well.</p>
+
+ <p class="story">...</p>
+ """
+
+ print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
+ # <a class="sister" href="http://example.com/elsie" id="link1">
+ # Elsie
+ # </a>
+ # <a class="sister" href="http://example.com/lacie" id="link2">
+ # Lacie
+ # </a>
+ # <a class="sister" href="http://example.com/tillie" id="link3">
+ # Tillie
+ # </a>
+
+ print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
+ # <a class="sister" href="http://example.com/lacie" id="link2">
+ # Lacie
+ # </a>
+
+ print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
+ # Elsie
+ # ,
+ # Lacie
+ # and
+ # Tillie
+ # ...
+ #
+
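+You can also pass a ``SoupStrainer`` into any of the methods covered in `搜索文档树`_ (Searching the tree). This is probably not a common use, but it is worth mentioning: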
+
+::
+
+ soup = BeautifulSoup(html_doc)
+ soup.find_all(only_short_strings)
+ # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
+ # u'\n\n', u'...', u'\n']
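+
+To see what a strainer leaves out, compare what can still be found afterwards. The sketch below uses a shortened, made-up document and the built-in ``html.parser``:
+

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical, shortened document: one paragraph containing two links.
html_doc = '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
only_a_tags = SoupStrainer("a")

soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

kept_ids = [tag["id"] for tag in soup.find_all("a")]
paragraph = soup.find("p")  # the <p> tag was never added to the tree
```
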
+
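+.. _常见问题:
+
+Troubleshooting
+===============
+
+.. _代码诊断:
+
+diagnose()
+----------
+
+If you are having trouble understanding what Beautiful Soup does to a document, pass the document into the ``diagnose()`` function (new in Beautiful Soup 4.2.0). Beautiful Soup prints a report showing how different parsers handle the document, and tells you if you are missing a parser that Beautiful Soup could be using: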
+
+::
+
+ from bs4.diagnose import diagnose
+ data = open("bad.html").read()
+ diagnose(data)
+
+ # Diagnostic running on Beautiful Soup 4.2.0
+ # Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
+ # I noticed that html5lib is not installed. Installing it may help.
+ # Found lxml version 2.3.2.0
+ #
+ # Trying to parse your data with html.parser
+ # Here's what html.parser did with the document:
+ # ...
+
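+The output of ``diagnose()`` may show you how to solve the problem. If not, you can paste the output when asking for help.
+
+.. _文档解析错误:
+
+Errors when parsing a document
+------------------------------
+
+There are two kinds of parse errors. There are crashes, where you feed a document to Beautiful Soup and it raises an exception, usually an ``HTMLParser.HTMLParseError``. And there is unexpected behavior, where the parse tree looks very different from the document used to create it.
+
+Almost none of these problems turn out to be problems with Beautiful Soup. This is not because Beautiful Soup is an amazingly well-written piece of software. It is because Beautiful Soup does not include any parsing code; it relies on external parsers. If one parser does not handle a certain document well, the best solution is to try a different parser. See `安装解析器`_ (Installing a parser) for details.
+
+The most common parse errors are ``HTMLParser.HTMLParseError: malformed start tag`` and ``HTMLParser.HTMLParseError: bad end tag``. Both are generated by Python's built-in HTML parser, and the solution is to install lxml or html5lib (see `安装lxml或html5lib`_ ).
+
+The most common type of unexpected behavior is that you cannot find a tag that you can plainly see in the document: ``find_all()`` returns ``[]`` or ``find()`` returns ``None``. This is another common problem with Python's built-in HTML parser, which sometimes skips tags it does not understand. Again, the solution is to install lxml or html5lib (see `安装lxml或html5lib`_ ).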
+
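+.. _版本错误:
+
+Version mismatch problems
+-------------------------
+
+* ``SyntaxError: Invalid syntax`` (on the line ``ROOT_TAG_NAME = u'[document]'``): caused by running the Python 2 version of Beautiful Soup under Python 3 without converting the code.
+
+* ``ImportError: No module named HTMLParser``: caused by running the Python 2 version of Beautiful Soup under Python 3.
+
+* ``ImportError: No module named html.parser``: caused by running the Python 3 version of Beautiful Soup under Python 2.
+
+* ``ImportError: No module named BeautifulSoup``: caused by running Beautiful Soup 3 code on a system that does not have BS3 installed, or by writing Beautiful Soup 4 code without importing it from the ``bs4`` package.
+
+* ``ImportError: No module named bs4``: caused by running Beautiful Soup 4 code on a system that does not have BS4 installed.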
+
+.. _解析成XML:
+
+Parsing XML
+-----------
+
+By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass in "xml" as the second argument to the ``BeautifulSoup`` constructor:
+
+::
+
+ soup = BeautifulSoup(markup, "xml")
+
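+You will also need to have lxml installed (see `安装lxml`_ ).
+
+.. _解析器的错误:
+
+Other parser problems
+---------------------
+
+* If your script works on one computer but not another, it is probably because the two environments use different parsers. For example, one environment may have lxml installed while the other only has html5lib. See `解析器之间的区别`_ (Differences between parsers) for why this matters, and fix the problem by specifying a parser in the ``BeautifulSoup`` constructor.
+
+* Because `HTML tags and attributes are case-insensitive <http://www.w3.org/TR/html5/syntax.html#syntax>`_ , all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve uppercase tag names, parse the document as XML (see `解析成XML`_ ).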
+
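+.. _杂项错误:
+
+Miscellaneous
+-------------
+
+* ``UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar`` (or just about any other ``UnicodeEncodeError``): this is not a problem with Beautiful Soup. It shows up in two main situations. First, when you try to print a Unicode character that your console cannot display (see `this page on the Python wiki <http://wiki.Python.org/moin/PrintFails>`_ ). Second, when you write to a file that does not support some Unicode characters; in that case, encode the string explicitly as UTF-8 with ``u.encode("utf8")``.
+
+* ``KeyError: [attr]``: caused by accessing ``tag['attr']`` when the tag does not define that attribute. The most common errors are ``KeyError: 'href'`` and ``KeyError: 'class'``. If you are not sure an attribute is defined, use ``tag.get('attr')``, just as you would with a Python dictionary.
+
+* ``AttributeError: 'ResultSet' object has no attribute 'foo'``: this usually happens because you treated the result of ``find_all()`` as a single tag or string. But ``find_all()`` returns a list of tags and strings, a ``ResultSet`` object. You need to iterate over the list and look at the ``.foo`` of each item, or use ``find()`` if you only want one result.
+
+* ``AttributeError: 'NoneType' object has no attribute 'foo'``: this usually happens because you called ``find()`` and then tried to access the ``.foo`` attribute of the result. But ``find()`` did not find anything, so it returned ``None``. You need to figure out why your ``find()`` call is not returning anything.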
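+
+Two of the errors listed above can be avoided with defensive lookups. The following is a minimal sketch on made-up markup; the fallback value of an empty string is an arbitrary choice:
+

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first link has no href attribute.
markup = '<a id="link1">Elsie</a><a href="http://example.com/lacie">Lacie</a>'
soup = BeautifulSoup(markup, "html.parser")

# tag.get() returns None instead of raising KeyError for a missing attribute.
hrefs = [a.get("href") for a in soup.find_all("a")]

# find() returns None when nothing matches, so test the result before using it.
table = soup.find("table")
caption = table.get_text() if table is not None else ""
```
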
+
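+.. _如何提高效率:
+
+Improving performance
+---------------------
+
+Beautiful Soup will never be faster than the parsers it sits on top of. If response time is critical, or if computer time is more valuable than programmer time, you should work directly with `lxml <http://lxml.de/>`_ .
+
+That said, there are ways to speed up Beautiful Soup. If you are not using lxml as the underlying parser, start there: Beautiful Soup parses documents significantly faster with lxml than with html5lib or Python's built-in parser.
+
+Installing the `cchardet <http://pypi.Python.org/pypi/cchardet/>`_ library speeds up encoding detection.
+
+Parsing only part of a document (see `解析部分文档`_ ) will not save much parsing time, but it can save a lot of memory, and it makes searching the document faster.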
+
+Beautiful Soup 3
+=================
+
+Beautiful Soup 3 is the previous release series, and it is no longer maintained. It is currently packaged by several major Linux distributions:
+
+``$ apt-get install python-beautifulsoup``
+
+It is also published through PyPI as ``BeautifulSoup``:
+
+``$ easy_install BeautifulSoup``
+
+``$ pip install BeautifulSoup``
+
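+You can also install it from the `Beautiful Soup 3.2.0 source tarball <http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz>`_ .
+
+The documentation for Beautiful Soup 3 is `archived online <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_ .
+
+.. _迁移到BS4:
+
+Porting code to BS4
+-------------------
+
+Most code written against Beautiful Soup 3 will work against Beautiful Soup 4 with one simple change: the import of the ``BeautifulSoup`` class. Change this: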
+
+::
+
+ from BeautifulSoup import BeautifulSoup
+
+to this:
+
+::
+
+ from bs4 import BeautifulSoup
+
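+* If you get the ``ImportError`` "No module named BeautifulSoup", you are probably trying to run Beautiful Soup 3 code in an environment that only has Beautiful Soup 4 installed.
+
+* If you get the ``ImportError`` "No module named bs4", you are probably trying to run Beautiful Soup 4 code in an environment that only has Beautiful Soup 3 installed.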
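
Before porting, it can help to check which of the two import names is actually available. The following is a small sketch using only the standard library; ``installed_soup_packages`` is a hypothetical helper name, and the two module names are the ones given in the text above.

```python
import importlib.util


def installed_soup_packages():
    """Report which Beautiful Soup import names can be found in this environment."""
    names = ("bs4", "BeautifulSoup")  # BS4 and BS3 import names, respectively
    return {name: importlib.util.find_spec(name) is not None for name in names}


result = installed_soup_packages()
print(result)
```

If only ``BeautifulSoup`` is reported as available, code that imports ``bs4`` will fail with the second ``ImportError`` described above, and vice versa.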
+
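+Although BS4 is mostly backwards-compatible with BS3, most BS3 methods have been deprecated and given new names that follow `PEP 8 <http://www.python.org/dev/peps/pep-0008/>`_. There are many renames and other changes, but only a few of them break backwards compatibility.
+
+Here is what you need to know to convert BS3 code to BS4: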
+
+You need a parser
+.................
+
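+Beautiful Soup 3 used Python's ``SGMLParser``, a module that was removed in Python 3. Beautiful Soup 4 uses the built-in ``html.parser`` by default, but you can plug in lxml or html5lib and use that instead. See `安装解析器`_ for details.
+
+Because ``html.parser`` is not the same parser as ``SGMLParser``, BS4 may give you a different parse tree than BS3 for the same markup. If you swap ``html.parser`` for lxml or html5lib, the parse tree may change yet again. If this happens, you will need to update the code that processes the parsed document to deal with the new tree.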
+
+Method names
+............
+
+* ``renderContents`` -> ``encode_contents``
+
+* ``replaceWith`` -> ``replace_with``
+
+* ``replaceWithChildren`` -> ``unwrap``
+
+* ``findAll`` -> ``find_all``
+
+* ``findAllNext`` -> ``find_all_next``
+
+* ``findAllPrevious`` -> ``find_all_previous``
+
+* ``findNext`` -> ``find_next``
+
+* ``findNextSibling`` -> ``find_next_sibling``
+
+* ``findNextSiblings`` -> ``find_next_siblings``
+
+* ``findParent`` -> ``find_parent``
+
+* ``findParents`` -> ``find_parents``
+
+* ``findPrevious`` -> ``find_previous``
+
+* ``findPreviousSibling`` -> ``find_previous_sibling``
+
+* ``findPreviousSiblings`` -> ``find_previous_siblings``
+
+* ``nextSibling`` -> ``next_sibling``
+
+* ``previousSibling`` -> ``previous_sibling``
+
+Some arguments to the Beautiful Soup constructor were renamed as well:
+
+* ``BeautifulSoup(parseOnlyThese=...)`` -> ``BeautifulSoup(parse_only=...)``
+
+* ``BeautifulSoup(fromEncoding=...)`` -> ``BeautifulSoup(from_encoding=...)``
+
+One method was renamed for compatibility with Python 3:
+
+* ``Tag.has_key()`` -> ``Tag.has_attr()``
+
+One attribute was renamed to use more accurate terminology:
+
+* ``Tag.isSelfClosing`` -> ``Tag.is_empty_element``
+
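+The following three attributes were renamed to avoid clashing with words that have special meaning in Python. Unlike the other renames, these changes are not backwards compatible: if you used these attributes in BS3, your code will break on BS4 until you change them.
+
+* ``UnicodeDammit.unicode`` -> ``UnicodeDammit.unicode_markup``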
+
+* ``Tag.next`` -> ``Tag.next_element``
+
+* ``Tag.previous`` -> ``Tag.previous_element``
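
The renames above are mechanical, so a first pass over old code can be scripted. This is only a rough sketch under stated assumptions: ``RENAMES`` holds a handful of the pairs listed above rather than the full list, ``port_names`` is a hypothetical helper, and the substitution is purely textual, so it may also touch unrelated objects that happen to use the same attribute names (for example ``dict.has_key`` in Python 2 code).

```python
import re

# A few of the BS3 -> BS4 renames listed above (not the complete list).
RENAMES = {
    "findAll": "find_all",
    "findNextSibling": "find_next_sibling",
    "replaceWith": "replace_with",
    "nextSibling": "next_sibling",
    "has_key": "has_attr",
}

# Match ".oldName" only as a whole attribute name, trying longer names first.
_PATTERN = re.compile(
    r"\.(" + "|".join(sorted(map(re.escape, RENAMES), key=len, reverse=True)) + r")\b"
)


def port_names(source):
    """Rewrite BS3 method and attribute names in a source string to BS4 names."""
    return _PATTERN.sub(lambda match: "." + RENAMES[match.group(1)], source)


old = "for a in soup.findAll('a'): print(a.nextSibling)"
print(port_names(old))
```

Names that are not in the table, such as ``findAllNext`` here, are left untouched because the pattern only matches whole names. Review the result by hand before relying on it.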
+
+Generators
+..........
+
+The following generators were given PEP 8-compliant names and turned into properties:
+
+* ``childGenerator()`` -> ``children``
+
+* ``nextGenerator()`` -> ``next_elements``
+
+* ``nextSiblingGenerator()`` -> ``next_siblings``
+
+* ``previousGenerator()`` -> ``previous_elements``
+
+* ``previousSiblingGenerator()`` -> ``previous_siblings``
+
+* ``recursiveChildGenerator()`` -> ``descendants``
+
+* ``parentGenerator()`` -> ``parents``
+
+So when porting to BS4, instead of this:
+
+::
+
+ for parent in tag.parentGenerator():
+ ...
+
+you can write this:
+
+::
+
+ for parent in tag.parents:
+ ...
+
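+(The old code will still work.)
+
+Some of the BS3 generators used to yield ``None`` after they were done, and then stop. That was a bug. The new generators simply stop.
+
+BS4 adds two new generators, ``.strings`` and ``.stripped_strings`` (see `.strings 和 stripped_strings`_). ``.strings`` yields NavigableString objects, and ``.stripped_strings`` yields Python strings with leading and trailing whitespace removed.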
+
+XML
+....
+
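+BS4 no longer has a ``BeautifulStoneSoup`` class for parsing XML. To parse XML, pass "xml" as the second argument to the ``BeautifulSoup`` constructor. For the same reason, the ``BeautifulSoup`` constructor no longer recognizes the ``isHTML`` argument.
+
+Beautiful Soup's handling of empty-element XML tags has been improved. Previously, when parsing XML you had to say explicitly which tags were empty-element tags. The ``selfClosingTags`` argument to the constructor is no longer recognized. Instead, Beautiful Soup treats any empty tag as an empty-element tag. If you add a child to an empty-element tag, it stops being an empty-element tag.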
+
+Entities
+........
+
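+An incoming HTML or XML entity is always converted into the corresponding Unicode character. Beautiful Soup 3 had a number of overlapping ways of dealing with entities, which have been removed. The ``BeautifulSoup`` constructor no longer recognizes the ``smartQuotesTo`` or ``convertEntities`` arguments. Unicode, Dammit (see `编码自动检测`_) still has ``smart_quotes_to``, but its default is now to turn smart quotes into Unicode. The constants ``HTML_ENTITIES``, ``XML_ENTITIES``, and ``XHTML_ENTITIES`` have been removed, since the feature they configured is no longer supported.
+
+If you want to turn Unicode characters back into HTML entities on output, rather than writing them out as UTF-8, you need to use an output formatter (see `输出格式`_).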
+
+Miscellaneous
+.............
+
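+`Tag.string`_ now operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string.
+
+Multi-valued attributes (see `多值属性`_) such as ``class`` have lists of strings as their values, not strings. This may affect the way you search by CSS class.
+
+If you pass one of the ``find*`` methods both a `string 参数`_ and a tag-specific argument such as the `name 参数`_, Beautiful Soup searches for tags that match the tag-specific criteria and whose `Tag.string`_ matches the value of ``string``. It will not find the strings themselves. Previously, Beautiful Soup ignored the tag-specific arguments and looked for strings.
+
+The ``BeautifulSoup`` constructor no longer recognizes the ``markupMassage`` argument. It is now the parser's responsibility to handle markup correctly.
+
+The rarely used alternate parser classes such as ``ICantBelieveItsBeautifulSoup`` and ``BeautifulSOAP`` have been removed. It is now entirely the parser's decision how to handle ambiguous markup.
+
+The ``prettify()`` method now returns a Unicode string, not a bytestring.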
+
+Appendix
+========
+
+.. _`BeautifulSoup3 文档`: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html
+.. _name: `name 参数`_
+.. _attrs: `按CSS搜索`_
+.. _recursive: `recursive 参数`_
+.. _string: `string 参数`_
+.. _**kwargs: `keyword 参数`_
+.. _.next_siblings: `.next_siblings 和 .previous_siblings`_
+.. _.previous_siblings: `.next_siblings 和 .previous_siblings`_
+.. _.next_elements: `.next_elements 和 .previous_elements`_
+.. _.previous_elements: `.next_elements 和 .previous_elements`_
+.. _.stripped_strings: `.strings 和 stripped_strings`_
+.. _安装lxml: `安装解析器`_
+.. _安装lxml或html5lib: `安装解析器`_
+.. _编码自动检测: `Unicode, Dammit! (乱码, 靠!)`_
+.. _Tag.string: `.string`_
+
+
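+.. [1] The Beautiful Soup Google discussion group is not very active, perhaps because the library is already fairly mature, but the author still does his best to help with questions.
+.. [2] The document is parsed into a tree, so the next element to be parsed should be a child of the current node.
+.. [3] A filter can only be used as an argument when searching the document, so "argument type" might be a more fitting name. The original text uses ``filter``, hence the term "filter" here.
+.. [4] An element argument is a tag node in the HTML document; it cannot be a text node.
+.. [5] Using pre-order traversal.
+.. [6] CSS selectors are a separate syntax for searching documents; see http://www.w3school.com.cn/css/css_selector_type.asp
+.. [7] The original text says html5lib; the translator believes this is a slip in the original document.
+.. [8] To wrap means to package, but here the wrapping is not applied from the outside: the inner content of the current tag is wrapped in a new tag, and that new tag remains inside the tag on which `wrap()`_ was called.
+.. [9] Replacing characters that cannot be decoded with a special character (usually �) is done automatically by Beautiful Soup. If a document that mixes several encodings must be converted completely correctly, it has to be preprocessed by hand so that it uses a single encoding.
+.. [10] Smart quotes are common in Microsoft Word: within a paragraph, each quotation mark is automatically converted to a left or right quotation mark according to the order in which it appears.
+
+Original: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
+
+Translation: Deron Wang
+
+See the `BeautifulSoup3 文档`_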
diff --git a/doc.zh/source/index.zh.html b/doc.zh/source/index.zh.html
deleted file mode 100644
index 71ea360..0000000
--- a/doc.zh/source/index.zh.html
+++ /dev/null
@@ -1,2398 +0,0 @@
-<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
- "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
-
-
-<html xmlns="http://www.w3.org/1999/xhtml">
- <head>
- <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-
- <title>Beautiful Soup 4.2.0 文档 &mdash; Beautiful Soup 4.2.0 documentation</title>
-
- <link rel="stylesheet" href="_static/default.css" type="text/css" />
- <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
-
- <script type="text/javascript">
- var DOCUMENTATION_OPTIONS = {
- URL_ROOT: './',
- VERSION: '4.2.0',
- COLLAPSE_INDEX: false,
- FILE_SUFFIX: '.html',
- HAS_SOURCE: true
- };
- </script>
- <script type="text/javascript" src="_static/jquery.js"></script>
- <script type="text/javascript" src="_static/underscore.js"></script>
- <script type="text/javascript" src="_static/doctools.js"></script>
- <link rel="top" title="Beautiful Soup 4.2.0 documentation" href="index.html" />
- </head>
- <body>
- <div class="related">
- <h3>Navigation</h3>
- <ul>
- <li class="right" style="margin-right: 10px">
- <a href="genindex.html" title="General Index"
- accesskey="I">index</a></li>
- <li><a href="index.html">Beautiful Soup 4.2.0 documentation</a> &raquo;</li>
- </ul>
- </div>
-
- <div class="document">
- <div class="documentwrapper">
- <div class="bodywrapper">
- <div class="body">
-
- <div class="section" id="beautiful-soup-4-2-0">
-<h1>Beautiful Soup 4.2.0 文档<a class="headerlink" href="#beautiful-soup-4-2-0" title="Permalink to this headline">¶</a></h1>
-<img alt="_static/cover.jpg" class="align-right" src="_static/cover.jpg" />
-<p><a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式.Beautiful Soup会帮你节省数小时甚至数天的工作时间.</p>
-<p>这篇文档介绍了BeautifulSoup4中所有主要特性,并切有小例子.让我来向你展示它适合做什么,如何工作,怎样使用,如何达到你想要的效果,和处理异常情况.</p>
-<p>文档中出现的例子在Python2.7和Python3.2中的执行结果相同</p>
-<p>你可能在寻找 <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">Beautiful Soup3</a> 的文档,Beautiful Soup 3 目前已经停止开发,我们推荐在现在的项目中使用Beautiful Soup 4, <a class="reference external" href="http://www.baidu.com">移植到BS4</a></p>
-<div class="section" id="id1">
-<h2>寻求帮助<a class="headerlink" href="#id1" title="Permalink to this headline">¶</a></h2>
-<p>如果你有关于BeautifulSoup的问题,可以发送邮件到 <a class="reference external" href="https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup">讨论组</a> .如果你的问题包含了一段需要转换的HTML代码,那么确保你提的问题描述中附带这段HTML文档的 <a class="reference internal" href="#id60">代码诊断</a> <a class="footnote-reference" href="#id82" id="id3">[1]</a></p>
-</div>
-</div>
-<div class="section" id="id4">
-<h1>快速开始<a class="headerlink" href="#id4" title="Permalink to this headline">¶</a></h1>
-<p>下面的一段HTML代码将作为例子被多次用到.这是 <em>爱丽丝梦游仙境的</em> 的一段内容(以后内容中简称为 <em>爱丽丝</em> 的文档):</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">html_doc</span> <span class="o">=</span> <span class="s">&quot;&quot;&quot;</span>
-<span class="s">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;</span>
-<span class="s">&lt;body&gt;</span>
-<span class="s">&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;</span>
-
-<span class="s">&lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were</span>
-<span class="s">&lt;a href=&quot;http://example.com/elsie&quot; class=&quot;sister&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="s">&lt;a href=&quot;http://example.com/lacie&quot; class=&quot;sister&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and</span>
-<span class="s">&lt;a href=&quot;http://example.com/tillie&quot; class=&quot;sister&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;</span>
-<span class="s">and they lived at the bottom of a well.&lt;/p&gt;</span>
-
-<span class="s">&lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
-<span class="s">&quot;&quot;&quot;</span>
-</pre></div>
-</div>
-<p>使用BeautifulSoup解析这段代码,能够得到一个 <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> 的对象,并能按照标准的缩进格式的结构输出:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
-
-<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
-<span class="c"># &lt;html&gt;</span>
-<span class="c"># &lt;head&gt;</span>
-<span class="c"># &lt;title&gt;</span>
-<span class="c"># The Dormouse&#39;s story</span>
-<span class="c"># &lt;/title&gt;</span>
-<span class="c"># &lt;/head&gt;</span>
-<span class="c"># &lt;body&gt;</span>
-<span class="c"># &lt;p class=&quot;title&quot;&gt;</span>
-<span class="c"># &lt;b&gt;</span>
-<span class="c"># The Dormouse&#39;s story</span>
-<span class="c"># &lt;/b&gt;</span>
-<span class="c"># &lt;/p&gt;</span>
-<span class="c"># &lt;p class=&quot;story&quot;&gt;</span>
-<span class="c"># Once upon a time there were three little sisters; and their names were</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;</span>
-<span class="c"># Elsie</span>
-<span class="c"># &lt;/a&gt;</span>
-<span class="c"># ,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;</span>
-<span class="c"># Lacie</span>
-<span class="c"># &lt;/a&gt;</span>
-<span class="c"># and</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link2&quot;&gt;</span>
-<span class="c"># Tillie</span>
-<span class="c"># &lt;/a&gt;</span>
-<span class="c"># ; and they lived at the bottom of a well.</span>
-<span class="c"># &lt;/p&gt;</span>
-<span class="c"># &lt;p class=&quot;story&quot;&gt;</span>
-<span class="c"># ...</span>
-<span class="c"># &lt;/p&gt;</span>
-<span class="c"># &lt;/body&gt;</span>
-<span class="c"># &lt;/html&gt;</span>
-</pre></div>
-</div>
-<p>几个简单的浏览结构化数据的方法:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">title</span>
-<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">name</span>
-<span class="c"># u&#39;title&#39;</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">string</span>
-<span class="c"># u&#39;The Dormouse&#39;s story&#39;</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span>
-<span class="c"># u&#39;head&#39;</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">p</span>
-<span class="c"># &lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
-<span class="c"># u&#39;title&#39;</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&#39;a&#39;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">&quot;link3&quot;</span><span class="p">)</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;</span>
-</pre></div>
-</div>
-<p>从文档中找到所有&lt;a&gt;标签的链接:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&#39;a&#39;</span><span class="p">):</span>
- <span class="k">print</span><span class="p">(</span><span class="n">link</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">&#39;href&#39;</span><span class="p">))</span>
- <span class="c"># http://example.com/elsie</span>
- <span class="c"># http://example.com/lacie</span>
- <span class="c"># http://example.com/tillie</span>
-</pre></div>
-</div>
-<p>从文档中获取所有文字内容:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">())</span>
-<span class="c"># The Dormouse&#39;s story</span>
-<span class="c">#</span>
-<span class="c"># The Dormouse&#39;s story</span>
-<span class="c">#</span>
-<span class="c"># Once upon a time there were three little sisters; and their names were</span>
-<span class="c"># Elsie,</span>
-<span class="c"># Lacie and</span>
-<span class="c"># Tillie;</span>
-<span class="c"># and they lived at the bottom of a well.</span>
-<span class="c">#</span>
-<span class="c"># ...</span>
-</pre></div>
-</div>
-<p>这是你想要的吗?别着急,还有更好用的</p>
-</div>
-<div class="section" id="id5">
-<h1>安装 Beautiful Soup<a class="headerlink" href="#id5" title="Permalink to this headline">¶</a></h1>
-<p>如果你用的是新版的Debain或ubuntu,那么可以通过系统的软件包管理来安装:</p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">Python-bs4</span></tt></p>
-<p>Beautiful Soup 4 通过PyPi发布,所以如果你无法使用系统包管理安装,那么也可以通过 <tt class="docutils literal"><span class="pre">easy_install</span></tt> 或 <tt class="docutils literal"><span class="pre">pip</span></tt> 来安装.包的名字是 <tt class="docutils literal"><span class="pre">beautifulsoup4</span></tt> ,这个包兼容Python2和Python3.</p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">beautifulsoup4</span></tt></p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">beautifulsoup4</span></tt></p>
-<p>(在PyPi中还有一个名字是 <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> 的包,但那可能不是你想要的,那是 <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">Beautiful Soup3</a> 的发布版本,因为很多项目还在使用BS3, 所以 <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> 包依然有效.但是如果你在编写新项目,那么你应该安装的 <tt class="docutils literal"><span class="pre">beautifulsoup4</span></tt> )</p>
-<p>如果你没有安装 <tt class="docutils literal"><span class="pre">easy_install</span></tt> 或 <tt class="docutils literal"><span class="pre">pip</span></tt> ,那你也可以 <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/download/4.x/">下载BS4的源码</a> ,然后通过setup.py来安装.</p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">Python</span> <span class="pre">setup.py</span> <span class="pre">install</span></tt></p>
-<p>如果上述安装方法都行不通,Beautiful Soup的发布协议允许你将BS4的代码打包在你的项目中,这样无须安装即可使用.</p>
-<p>作者在Python2.7和Python3.2的版本下开发Beautiful Soup, 理论上Beautiful Soup应该在所有当前的Python版本中正常工作</p>
-<div class="section" id="id8">
-<h2>安装完成后的问题<a class="headerlink" href="#id8" title="Permalink to this headline">¶</a></h2>
-<p>Beautiful Soup发布时打包成Python2版本的代码,在Python3环境下安装时,会自动转换成Python3的代码,如果没有一个安装的过程,那么代码就不会被转换.</p>
-<p>如果代码抛出了 <tt class="docutils literal"><span class="pre">ImportError</span></tt> 的异常: &#8220;No module named HTMLParser&#8221;, 这是因为你在Python3版本中执行Python2版本的代码.</p>
-<p>如果代码抛出了 <tt class="docutils literal"><span class="pre">ImportError</span></tt> 的异常: &#8220;No module named html.parser&#8221;, 这是因为你在Python2版本中执行Python3版本的代码.</p>
-<p>如果遇到上述2种情况,最好的解决方法是重新安装BeautifulSoup4.</p>
-<p>如果在ROOT_TAG_NAME = u&#8217;[document]&#8217;代码处遇到 <tt class="docutils literal"><span class="pre">SyntaxError</span></tt> &#8220;Invalid syntax&#8221;错误,需要将把BS4的Python代码版本从Python2转换到Python3. 可以重新安装BS4:</p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">Python3</span> <span class="pre">setup.py</span> <span class="pre">install</span></tt></p>
-<p>或在bs4的目录中执行Python代码版本转换脚本</p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">2to3-3.2</span> <span class="pre">-w</span> <span class="pre">bs4</span></tt></p>
-</div>
-<div class="section" id="id9">
-<h2>安装解析器<a class="headerlink" href="#id9" title="Permalink to this headline">¶</a></h2>
-<p>Beautiful Soup支持Python标准库中的HTML解析器,还支持一些第三方的解析器,其中一个是 <a class="reference external" href="http://lxml.de/">lxml</a> .根据操作系统不同,可以选择下列方法来安装lxml:</p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">Python-lxml</span></tt></p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">lxml</span></tt></p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">lxml</span></tt></p>
-<p>另一个可供选择的解析器是纯Python实现的 <a class="reference external" href="http://code.google.com/p/html5lib/">html5lib</a> , html5lib的解析方式与浏览器相同,可以选择下列方法来安装html5lib:</p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">Python-html5lib</span></tt></p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">html5lib</span></tt></p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">html5lib</span></tt></p>
-<p>下表列出了主要的解析器,以及它们的优缺点:</p>
-<table border="1" class="docutils">
-<colgroup>
-<col width="22%" />
-<col width="26%" />
-<col width="26%" />
-<col width="26%" />
-</colgroup>
-<thead valign="bottom">
-<tr class="row-odd"><th class="head">解析器</th>
-<th class="head">使用方法</th>
-<th class="head">优势</th>
-<th class="head">劣势</th>
-</tr>
-</thead>
-<tbody valign="top">
-<tr class="row-even"><td>Python标准库</td>
-<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span>
-<span class="pre">&quot;html.parser&quot;)</span></tt></td>
-<td><ul class="first last simple">
-<li>Python的内置标准库</li>
-<li>执行速度适中</li>
-<li>文档容错能力强</li>
-</ul>
-</td>
-<td><ul class="first last simple">
-<li>Python 2.7.3 or 3.2.2)前
-的版本中文档容错能力差</li>
-</ul>
-</td>
-</tr>
-<tr class="row-odd"><td>lxml HTML 解析器</td>
-<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span>
-<span class="pre">&quot;lxml&quot;)</span></tt></td>
-<td><ul class="first last simple">
-<li>速度快</li>
-<li>文档容错能力强</li>
-</ul>
-</td>
-<td><ul class="first last simple">
-<li>需要安装C语言库</li>
-</ul>
-</td>
-</tr>
-<tr class="row-even"><td>lxml XML 解析器</td>
-<td><p class="first"><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span>
-<span class="pre">[&quot;lxml&quot;,</span> <span class="pre">&quot;xml&quot;])</span></tt></p>
-<p class="last"><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span>
-<span class="pre">&quot;xml&quot;)</span></tt></p>
-</td>
-<td><ul class="first last simple">
-<li>速度快</li>
-<li>唯一支持XML的解析器</li>
-</ul>
-</td>
-<td><ul class="first last simple">
-<li>需要安装C语言库</li>
-</ul>
-</td>
-</tr>
-<tr class="row-odd"><td>html5lib</td>
-<td><tt class="docutils literal"><span class="pre">BeautifulSoup(markup,</span>
-<span class="pre">&quot;html5lib&quot;)</span></tt></td>
-<td><ul class="first last simple">
-<li>最好的容错性</li>
-<li>以浏览器的方式解析文档</li>
-<li>生成HTML5格式的文档</li>
-</ul>
-</td>
-<td><ul class="first last simple">
-<li>速度慢</li>
-<li>不依赖外部扩展</li>
-</ul>
-</td>
-</tr>
-</tbody>
-</table>
-<p>推荐使用lxml作为解析器,因为效率更高. 在Python2.7.3之前的版本和Python3中3.2.2之前的版本,必须安装lxml或html5lib, 因为那些Python版本的标准库中内置的HTML解析方法不够稳定.</p>
-<p>提示: 如果一段HTML或XML文档格式不正确的话,那么在不同的解析器中返回的结果可能是不一样的,查看 <a class="reference internal" href="#id49">解析器之间的区别</a> 了解更多细节</p>
-</div>
-</div>
-<div class="section" id="id10">
-<h1>如何使用<a class="headerlink" href="#id10" title="Permalink to this headline">¶</a></h1>
-<p>将一段文档传入BeautifulSoup 的构造方法,就能得到一个文档的对象, 可以传入一段字符串或一个文件句柄.</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
-
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="s">&quot;index.html&quot;</span><span class="p">))</span>
-
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;html&gt;data&lt;/html&gt;&quot;</span><span class="p">)</span>
-</pre></div>
-</div>
-<p>首先,文档被转换成Unicode,并且HTML的实例都被转换成Unicode编码</p>
-<div class="highlight-python"><pre>BeautifulSoup("Sacr&amp;eacute; bleu!")
-&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;Sacré bleu!&lt;/body&gt;&lt;/html&gt;</pre>
-</div>
-<p>然后,Beautiful Soup选择最合适的解析器来解析这段文档,如果手动指定解析器那么Beautiful Soup会选择指定的解析器来解析文档.(参考 <a class="reference internal" href="#xml">解析成XML</a> ).</p>
-</div>
-<div class="section" id="id11">
-<h1>对象的种类<a class="headerlink" href="#id11" title="Permalink to this headline">¶</a></h1>
-<p>Beautiful Soup将复杂HTML文档转换成一个复杂的树形结构,每个节点都是Python对象,所有对象可以归纳为4种: <tt class="docutils literal"><span class="pre">Tag</span></tt> , <tt class="docutils literal"><span class="pre">NavigableString</span></tt> , <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> , <tt class="docutils literal"><span class="pre">Comment</span></tt> .</p>
-<div class="section" id="tag">
-<h2>Tag<a class="headerlink" href="#tag" title="Permalink to this headline">¶</a></h2>
-<p><tt class="docutils literal"><span class="pre">Tag</span></tt> 对象与XML或HTML原生文档中的tag相同:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;b class=&quot;boldest&quot;&gt;Extremely bold&lt;/b&gt;&#39;</span><span class="p">)</span>
-<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
-<span class="nb">type</span><span class="p">(</span><span class="n">tag</span><span class="p">)</span>
-<span class="c"># &lt;class &#39;bs4.element.Tag&#39;&gt;</span>
-</pre></div>
-</div>
-<p>Tag有很多方法和属性,在 <a class="reference internal" href="#id15">遍历文档树</a> 和 <a class="reference internal" href="#id24">搜索文档树</a> 中有详细解释.现在介绍一下tag中最重要的属性: name和attributes</p>
-<div class="section" id="name">
-<h3>Name<a class="headerlink" href="#name" title="Permalink to this headline">¶</a></h3>
-<p>每个tag都有自己的名字,通过 <tt class="docutils literal"><span class="pre">.name</span></tt> 来获取:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">name</span>
-<span class="c"># u&#39;b&#39;</span>
-</pre></div>
-</div>
-<p>如果改变了tag的name,那将影响所有通过当前Beautiful Soup对象生成的HTML文档:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">&quot;blockquote&quot;</span>
-<span class="n">tag</span>
-<span class="c"># &lt;blockquote class=&quot;boldest&quot;&gt;Extremely bold&lt;/blockquote&gt;</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="attributes">
-<h3>Attributes<a class="headerlink" href="#attributes" title="Permalink to this headline">¶</a></h3>
-<p>一个tag可能有很多个属性. tag <tt class="docutils literal"><span class="pre">&lt;b</span> <span class="pre">class=&quot;boldest&quot;&gt;</span></tt> 有一个 &#8220;class&#8221; 的属性,值为 &#8220;boldest&#8221; . tag的属性的操作方法与字典相同:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
-<span class="c"># u&#39;boldest&#39;</span>
-</pre></div>
-</div>
-<p>也可以直接&#8221;点&#8221;取属性, 比如: <tt class="docutils literal"><span class="pre">.attrs</span></tt> :</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">attrs</span>
-<span class="c"># {u&#39;class&#39;: u&#39;boldest&#39;}</span>
-</pre></div>
-</div>
-<p>tag的属性可以被添加,删除或修改. 再说一次, tag的属性操作方法与字典一样</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#39;verybold&#39;</span>
-<span class="n">tag</span><span class="p">[</span><span class="s">&#39;id&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
-<span class="n">tag</span>
-<span class="c"># &lt;blockquote class=&quot;verybold&quot; id=&quot;1&quot;&gt;Extremely bold&lt;/blockquote&gt;</span>
-
-<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
-<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">&#39;id&#39;</span><span class="p">]</span>
-<span class="n">tag</span>
-<span class="c"># &lt;blockquote&gt;Extremely bold&lt;/blockquote&gt;</span>
-
-<span class="n">tag</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
-<span class="c"># KeyError: &#39;class&#39;</span>
-<span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">&#39;class&#39;</span><span class="p">))</span>
-<span class="c"># None</span>
-</pre></div>
-</div>
-<div class="section" id="id12">
-<h4>多值属性<a class="headerlink" href="#id12" title="Permalink to this headline">¶</a></h4>
-<p>HTML 4定义了一系列可以包含多个值的属性.在HTML5中移除了一些,却增加更多.最常见的多值的属性是 class (一个tag可以有多个CSS的class). 还有一些属性 <tt class="docutils literal"><span class="pre">rel</span></tt> , <tt class="docutils literal"><span class="pre">rev</span></tt> , <tt class="docutils literal"><span class="pre">accept-charset</span></tt> , <tt class="docutils literal"><span class="pre">headers</span></tt> , <tt class="docutils literal"><span class="pre">accesskey</span></tt> . 在Beautiful Soup中多值属性的返回类型是list:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">css_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;p class=&quot;body strikeout&quot;&gt;&lt;/p&gt;&#39;</span><span class="p">)</span>
-<span class="n">css_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
-<span class="c"># [&quot;body&quot;, &quot;strikeout&quot;]</span>
-
-<span class="n">css_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;p class=&quot;body&quot;&gt;&lt;/p&gt;&#39;</span><span class="p">)</span>
-<span class="n">css_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
-<span class="c"># [&quot;body&quot;]</span>
-</pre></div>
-</div>
-<p>如果某个属性看起来好像有多个值,但在任何版本的HTML定义中都没有被定义为多值属性,那么Beautiful Soup会将这个属性作为字符串返回</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">id_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;p id=&quot;my id&quot;&gt;&lt;/p&gt;&#39;</span><span class="p">)</span>
-<span class="n">id_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">&#39;id&#39;</span><span class="p">]</span>
-<span class="c"># &#39;my id&#39;</span>
-</pre></div>
-</div>
-<p>将tag转换成字符串时,多值属性会合并为一个值</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">rel_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;p&gt;Back to the &lt;a rel=&quot;index&quot;&gt;homepage&lt;/a&gt;&lt;/p&gt;&#39;</span><span class="p">)</span>
-<span class="n">rel_soup</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s">&#39;rel&#39;</span><span class="p">]</span>
-<span class="c"># [&#39;index&#39;]</span>
-<span class="n">rel_soup</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s">&#39;rel&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39;index&#39;</span><span class="p">,</span> <span class="s">&#39;contents&#39;</span><span class="p">]</span>
-<span class="k">print</span><span class="p">(</span><span class="n">rel_soup</span><span class="o">.</span><span class="n">p</span><span class="p">)</span>
-<span class="c"># &lt;p&gt;Back to the &lt;a rel=&quot;index contents&quot;&gt;homepage&lt;/a&gt;&lt;/p&gt;</span>
-</pre></div>
-</div>
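The multi-valued attribute rules can be checked side by side in one short script. This is a minimal sketch, assuming Python 3 and the standard-library `html.parser` backend (the examples in this document rely on the default parser instead); the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# "class" is defined as multi-valued in HTML, "id" is not.
soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")

class_value = soup.p["class"]   # parsed into a list of values
id_value = soup.p["id"]         # left as one string, spaces included

# Assigning a list and serialising the tag joins the values with spaces.
soup.p["class"] = ["body", "bold"]
rendered = str(soup.p)
print(class_value, id_value, rendered)
```

Checking `isinstance(value, list)` before using an attribute value is one way to handle both cases in the same code path.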
-<p>If you parse a document as XML, there are no multi-valued attributes, so the attribute comes back as a single string:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">xml_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;p class=&quot;body strikeout&quot;&gt;&lt;/p&gt;&#39;</span><span class="p">,</span> <span class="s">&#39;xml&#39;</span><span class="p">)</span>
-<span class="n">xml_soup</span><span class="o">.</span><span class="n">p</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
-<span class="c"># u&#39;body strikeout&#39;</span>
-</pre></div>
-</div>
-</div>
-</div>
-</div>
-<div class="section" id="id13">
-<h2>Navigable strings<a class="headerlink" href="#id13" title="Permalink to this headline">¶</a></h2>
-<p>A string corresponds to a bit of text within a tag. Beautiful Soup uses the <tt class="docutils literal"><span class="pre">NavigableString</span></tt> class to wrap these bits of text:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">string</span>
-<span class="c"># u&#39;Extremely bold&#39;</span>
-<span class="nb">type</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
-<span class="c"># &lt;class &#39;bs4.element.NavigableString&#39;&gt;</span>
-</pre></div>
-</div>
-<p>A <tt class="docutils literal"><span class="pre">NavigableString</span></tt> is just like a Python Unicode string, except that it also supports some of the features described in <a class="reference internal" href="#id15">Navigating the tree</a> and <a class="reference internal" href="#id24">Searching the tree</a>. You can convert a <tt class="docutils literal"><span class="pre">NavigableString</span></tt> to a Unicode string with <tt class="docutils literal"><span class="pre">unicode()</span></tt>:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">unicode_string</span> <span class="o">=</span> <span class="nb">unicode</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
-<span class="n">unicode_string</span>
-<span class="c"># u&#39;Extremely bold&#39;</span>
-<span class="nb">type</span><span class="p">(</span><span class="n">unicode_string</span><span class="p">)</span>
-<span class="c"># &lt;type &#39;unicode&#39;&gt;</span>
-</pre></div>
-</div>
-<p>You cannot edit a string in place, but you can replace one string with another, using <a class="reference internal" href="#replace-with">replace_with()</a>:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">tag</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="s">&quot;No longer bold&quot;</span><span class="p">)</span>
-<span class="n">tag</span>
-<span class="c"># &lt;blockquote&gt;No longer bold&lt;/blockquote&gt;</span>
-</pre></div>
-</div>
-<p><tt class="docutils literal"><span class="pre">NavigableString</span></tt> supports most of the features described in <a class="reference internal" href="#id15">Navigating the tree</a> and <a class="reference internal" href="#id24">Searching the tree</a>, but not all of them. In particular, since a string cannot contain anything (the way a tag may contain a string or another tag), strings do not support the <tt class="docutils literal"><span class="pre">.contents</span></tt> or <tt class="docutils literal"><span class="pre">.string</span></tt> attributes, or the <tt class="docutils literal"><span class="pre">find()</span></tt> method.</p>
-<p>If you want to use a <tt class="docutils literal"><span class="pre">NavigableString</span></tt> outside of Beautiful Soup, you should call <tt class="docutils literal"><span class="pre">unicode()</span></tt> on it to turn it into a normal Python Unicode string. If you don't, the string keeps a reference to the entire Beautiful Soup parse tree even after you have finished using Beautiful Soup, which wastes a lot of memory.</p>
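Under Python 3 the built-in `str()` plays the role that `unicode()` plays in the examples above. The following is a minimal sketch, assuming Python 3 and the standard-library `html.parser` backend; the variable names are illustrative and not part of the original text.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

nav = soup.b.string   # a NavigableString, still attached to the parse tree
plain = str(nav)      # a plain str with no link back to the tree

print(type(nav).__name__, type(plain).__name__, plain)
```

Only `plain` is safe to keep once the `soup` object is no longer needed, since `nav` still points at its parent tag.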
-</div>
-<div class="section" id="beautifulsoup">
-<h2>BeautifulSoup<a class="headerlink" href="#beautifulsoup" title="Permalink to this headline">¶</a></h2>
-<p>The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object represents the document as a whole. For most purposes you can treat it as a <tt class="docutils literal"><span class="pre">Tag</span></tt> object: it supports most of the methods described in <a class="reference internal" href="#id15">Navigating the tree</a> and <a class="reference internal" href="#id24">Searching the tree</a>.</p>
-<p>Since the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object does not correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it is useful to look at its <tt class="docutils literal"><span class="pre">.name</span></tt>, so the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object has been given the special <tt class="docutils literal"><span class="pre">.name</span></tt> &#8220;[document]&#8221;:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">name</span>
-<span class="c"># u&#39;[document]&#39;</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="id14">
-<h2>Comments and other special strings<a class="headerlink" href="#id14" title="Permalink to this headline">¶</a></h2>
-<p><tt class="docutils literal"><span class="pre">Tag</span></tt> , <tt class="docutils literal"><span class="pre">NavigableString</span></tt> , and <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> cover almost everything you will see in an HTML or XML file, but there are a few leftover bits. The only one you will probably ever need to worry about is the comment:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&quot;&lt;b&gt;&lt;!--Hey, buddy. Want to buy a used parser?--&gt;&lt;/b&gt;&quot;</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
-<span class="n">comment</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span>
-<span class="nb">type</span><span class="p">(</span><span class="n">comment</span><span class="p">)</span>
-<span class="c"># &lt;class &#39;bs4.element.Comment&#39;&gt;</span>
-</pre></div>
-</div>
-<p>The <tt class="docutils literal"><span class="pre">Comment</span></tt> object is just a special type of <tt class="docutils literal"><span class="pre">NavigableString</span></tt> :</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">comment</span>
-<span class="c"># u&#39;Hey, buddy. Want to buy a used parser?&#39;</span>
-</pre></div>
-</div>
-<p>But when it appears as part of an HTML document, a <tt class="docutils literal"><span class="pre">Comment</span></tt> is displayed with special formatting:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
-<span class="c"># &lt;b&gt;</span>
-<span class="c">#  &lt;!--Hey, buddy. Want to buy a used parser?--&gt;</span>
-<span class="c"># &lt;/b&gt;</span>
-</pre></div>
-</div>
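Because `Comment` is a subclass of `NavigableString`, an `isinstance` check is enough to tell comments apart from ordinary text while walking a tree. A minimal sketch, assuming Python 3 and the standard-library `html.parser` backend; the extra `<i>` tag is added here only for contrast.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b><i>text</i>"
soup = BeautifulSoup(markup, "html.parser")

# Walk every node and separate comments from ordinary strings.
comments = [n for n in soup.descendants if isinstance(n, Comment)]
plain_text = [
    n for n in soup.descendants
    if isinstance(n, NavigableString) and not isinstance(n, Comment)
]
print(comments, plain_text)
```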
-<p>Beautiful Soup defines classes for anything else that might show up in an XML document: <tt class="docutils literal"><span class="pre">CData</span></tt> , <tt class="docutils literal"><span class="pre">ProcessingInstruction</span></tt> , <tt class="docutils literal"><span class="pre">Declaration</span></tt> , and <tt class="docutils literal"><span class="pre">Doctype</span></tt> . Just like <tt class="docutils literal"><span class="pre">Comment</span></tt> , these classes are subclasses of <tt class="docutils literal"><span class="pre">NavigableString</span></tt> that add something extra to the string. Here is an example that replaces the comment with a CDATA block:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">CData</span>
-<span class="n">cdata</span> <span class="o">=</span> <span class="n">CData</span><span class="p">(</span><span class="s">&quot;A CDATA block&quot;</span><span class="p">)</span>
-<span class="n">comment</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="n">cdata</span><span class="p">)</span>
-
-<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
-<span class="c"># &lt;b&gt;</span>
-<span class="c">#  &lt;![CDATA[A CDATA block]]&gt;</span>
-<span class="c"># &lt;/b&gt;</span>
-</pre></div>
-</div>
-</div>
-</div>
-<div class="section" id="id15">
-<h1>Navigating the tree<a class="headerlink" href="#id15" title="Permalink to this headline">¶</a></h1>
-<p>Here is the &#8220;Alice in Wonderland&#8221; HTML document again:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">html_doc</span> <span class="o">=</span> <span class="s">&quot;&quot;&quot;</span>
-<span class="s">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;</span>
-
-<span class="s">&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;</span>
-
-<span class="s">&lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were</span>
-<span class="s">&lt;a href=&quot;http://example.com/elsie&quot; class=&quot;sister&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="s">&lt;a href=&quot;http://example.com/lacie&quot; class=&quot;sister&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and</span>
-<span class="s">&lt;a href=&quot;http://example.com/tillie&quot; class=&quot;sister&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;</span>
-<span class="s">and they lived at the bottom of a well.&lt;/p&gt;</span>
-
-<span class="s">&lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
-<span class="s">&quot;&quot;&quot;</span>
-
-<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
-</pre></div>
-</div>
-<p>This example is used to show how to move from one part of a document to another.</p>
-<div class="section" id="id16">
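-<h2>Child nodes<a class="headerlink" href="#id16" title="Permalink to this headline">¶</a></h2>
-<p>A tag may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.</p>
-<p>Note that Beautiful Soup strings do not support any of these attributes, because a string cannot have children.</p>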
-<div class="section" id="id17">
-<h3>Navigating using tag names<a class="headerlink" href="#id17" title="Permalink to this headline">¶</a></h3>
-<p>The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the &lt;head&gt; tag, just say <tt class="docutils literal"><span class="pre">soup.head</span></tt> :</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">head</span>
-<span class="c"># &lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">title</span>
-<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
-</pre></div>
-</div>
-<p>You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first &lt;b&gt; tag beneath the &lt;body&gt; tag:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">body</span><span class="o">.</span><span class="n">b</span>
-<span class="c"># &lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;</span>
-</pre></div>
-</div>
-<p>Using a tag name as an attribute gives you only the first tag by that name:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
-</pre></div>
-</div>
-<p>If you need all the &lt;a&gt; tags, or anything more complicated than the first tag with a certain name, use one of the methods described in <cite>Searching the tree</cite>, such as find_all():</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&#39;a&#39;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="contents-children">
-<h3>.contents and .children<a class="headerlink" href="#contents-children" title="Permalink to this headline">¶</a></h3>
-<p>A tag's children are available in a list called <tt class="docutils literal"><span class="pre">.contents</span></tt> :</p>
-<div class="highlight-python"><pre>head_tag = soup.head
-head_tag
-# &lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;
-
-head_tag.contents
-# [&lt;title&gt;The Dormouse's story&lt;/title&gt;]
-
-title_tag = head_tag.contents[0]
-title_tag
-# &lt;title&gt;The Dormouse's story&lt;/title&gt;
-title_tag.contents
-# [u"The Dormouse's story"]</pre>
-</div>
-<p>The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself has children. In this case, the &lt;html&gt; tag is the child of the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="nb">len</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">contents</span><span class="p">)</span>
-<span class="c"># 1</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">contents</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">name</span>
-<span class="c"># u&#39;html&#39;</span>
-</pre></div>
-</div>
-<p>A string does not have <tt class="docutils literal"><span class="pre">.contents</span></tt> , because it cannot contain anything:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">text</span> <span class="o">=</span> <span class="n">title_tag</span><span class="o">.</span><span class="n">contents</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
-<span class="n">text</span><span class="o">.</span><span class="n">contents</span>
-<span class="c"># AttributeError: &#39;NavigableString&#39; object has no attribute &#39;contents&#39;</span>
-</pre></div>
-</div>
-<p>Instead of getting them as a list, you can iterate over a tag's children using the <tt class="docutils literal"><span class="pre">.children</span></tt> generator:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">title_tag</span><span class="o">.</span><span class="n">children</span><span class="p">:</span>
- <span class="k">print</span><span class="p">(</span><span class="n">child</span><span class="p">)</span>
- <span class="c"># The Dormouse&#39;s story</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="descendants">
-<h3>.descendants<a class="headerlink" href="#descendants" title="Permalink to this headline">¶</a></h3>
-<p>The <tt class="docutils literal"><span class="pre">.contents</span></tt> and <tt class="docutils literal"><span class="pre">.children</span></tt> attributes only consider a tag's direct children. For instance, the &lt;head&gt; tag has a single direct child, the &lt;title&gt; tag:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">head_tag</span><span class="o">.</span><span class="n">contents</span>
-<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
-</pre></div>
-</div>
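-<p>But the &lt;title&gt; tag itself has a child: the string “The Dormouse’s story”. In that sense, the string is also a descendant of the &lt;head&gt; tag. The <tt class="docutils literal"><span class="pre">.descendants</span></tt> attribute lets you iterate recursively over all of a tag's descendants <a class="footnote-reference" href="#id86" id="id18">[5]</a> : its direct children, the children of its direct children, and so on:</p>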
-<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">head_tag</span><span class="o">.</span><span class="n">descendants</span><span class="p">:</span>
- <span class="k">print</span><span class="p">(</span><span class="n">child</span><span class="p">)</span>
- <span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
- <span class="c"># The Dormouse&#39;s story</span>
-</pre></div>
-</div>
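-<p>In the example above, the &lt;head&gt; tag has only one child, but it has two descendants: the &lt;title&gt; tag and the &lt;title&gt; tag's child. The <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object has only one direct child (the &lt;html&gt; tag), but it has many descendants:</p>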
-<div class="highlight-python"><div class="highlight"><pre><span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">children</span><span class="p">))</span>
-<span class="c"># 1</span>
-<span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">descendants</span><span class="p">))</span>
-<span class="c"># 25</span>
-</pre></div>
-</div>
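The difference between direct children and descendants is easiest to see on a very small tree. The sketch below assumes Python 3 and the standard-library `html.parser` backend, which does not wrap the fragment in extra `<html>` or `<body>` tags; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

children = list(head_tag.children)        # direct children only
descendants = list(head_tag.descendants)  # children, grandchildren, and so on

print(len(children), len(descendants))
```

Here `.descendants` yields the `<title>` tag first and then the string inside it, in document order.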
-</div>
-<div class="section" id="string">
-<h3>.string<a class="headerlink" href="#string" title="Permalink to this headline">¶</a></h3>
-<p>If a tag has only one child, and that child is a <tt class="docutils literal"><span class="pre">NavigableString</span></tt> , the child is made available as <tt class="docutils literal"><span class="pre">.string</span></tt> :</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">title_tag</span><span class="o">.</span><span class="n">string</span>
-<span class="c"># u&quot;The Dormouse&#39;s story&quot;</span>
-</pre></div>
-</div>
-<p>If a tag's only child is another tag, and that tag has a <tt class="docutils literal"><span class="pre">.string</span></tt> , then the parent tag is considered to have the same <tt class="docutils literal"><span class="pre">.string</span></tt> as its child:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">head_tag</span><span class="o">.</span><span class="n">contents</span>
-<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
-
-<span class="n">head_tag</span><span class="o">.</span><span class="n">string</span>
-<span class="c"># u&quot;The Dormouse&#39;s story&quot;</span>
-</pre></div>
-</div>
-<p>If a tag contains more than one child, it is not clear which child <tt class="docutils literal"><span class="pre">.string</span></tt> should refer to, so <tt class="docutils literal"><span class="pre">.string</span></tt> is defined to be <tt class="docutils literal"><span class="pre">None</span></tt> :</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">string</span><span class="p">)</span>
-<span class="c"># None</span>
-</pre></div>
-</div>
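The three `.string` cases described above can be compared in one place. A minimal sketch, assuming Python 3 and the standard-library `html.parser` backend; the markup is a made-up fragment, not the “Alice” document.

```python
from bs4 import BeautifulSoup

markup = "<head><title>Story</title></head><p>one<b>two</b></p>"
soup = BeautifulSoup(markup, "html.parser")

only_string = soup.title.string   # single NavigableString child
via_child = soup.head.string      # single child tag, so its .string is reused
ambiguous = soup.p.string         # two children, so no single string applies

print(only_string, via_child, ambiguous)
```

When `.string` is `None`, iterating over `.strings` or calling `get_text()` is the usual way to reach the text inside the tag.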
-</div>
-<div class="section" id="strings-stripped-strings">
-<h3>.strings and .stripped_strings<a class="headerlink" href="#strings-stripped-strings" title="Permalink to this headline">¶</a></h3>
-<p>If there is more than one string inside a tag <a class="footnote-reference" href="#id83" id="id19">[2]</a> , you can use the <tt class="docutils literal"><span class="pre">.strings</span></tt> generator to iterate over them:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">string</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">strings</span><span class="p">:</span>
- <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">string</span><span class="p">))</span>
- <span class="c"># u&quot;The Dormouse&#39;s story&quot;</span>
- <span class="c"># u&#39;\n\n&#39;</span>
- <span class="c"># u&quot;The Dormouse&#39;s story&quot;</span>
- <span class="c"># u&#39;\n\n&#39;</span>
- <span class="c"># u&#39;Once upon a time there were three little sisters; and their names were\n&#39;</span>
- <span class="c"># u&#39;Elsie&#39;</span>
- <span class="c"># u&#39;,\n&#39;</span>
- <span class="c"># u&#39;Lacie&#39;</span>
- <span class="c"># u&#39; and\n&#39;</span>
- <span class="c"># u&#39;Tillie&#39;</span>
- <span class="c"># u&#39;;\nand they lived at the bottom of a well.&#39;</span>
- <span class="c"># u&#39;\n\n&#39;</span>
- <span class="c"># u&#39;...&#39;</span>
- <span class="c"># u&#39;\n&#39;</span>
-</pre></div>
-</div>
-<p>These strings tend to contain a lot of extra whitespace, which you can remove by using the <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt> generator instead:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">string</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">stripped_strings</span><span class="p">:</span>
- <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">string</span><span class="p">))</span>
- <span class="c"># u&quot;The Dormouse&#39;s story&quot;</span>
- <span class="c"># u&quot;The Dormouse&#39;s story&quot;</span>
- <span class="c"># u&#39;Once upon a time there were three little sisters; and their names were&#39;</span>
- <span class="c"># u&#39;Elsie&#39;</span>
- <span class="c"># u&#39;,&#39;</span>
- <span class="c"># u&#39;Lacie&#39;</span>
- <span class="c"># u&#39;and&#39;</span>
- <span class="c"># u&#39;Tillie&#39;</span>
- <span class="c"># u&#39;;\nand they lived at the bottom of a well.&#39;</span>
- <span class="c"># u&#39;...&#39;</span>
-</pre></div>
-</div>
-<p>Strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of each string is removed.</p>
-</div>
-</div>
-<div class="section" id="id20">
-<h2>Parent nodes<a class="headerlink" href="#id20" title="Permalink to this headline">¶</a></h2>
-<p>Continuing through the document tree, every tag and every string has a parent: the tag that contains it.</p>
-<div class="section" id="parent">
-<h3>.parent<a class="headerlink" href="#parent" title="Permalink to this headline">¶</a></h3>
-<p>You can access an element's parent with the <tt class="docutils literal"><span class="pre">.parent</span></tt> attribute. In the example “Alice” document, the &lt;head&gt; tag is the parent of the &lt;title&gt; tag:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">title_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">title</span>
-<span class="n">title_tag</span>
-<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
-<span class="n">title_tag</span><span class="o">.</span><span class="n">parent</span>
-<span class="c"># &lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;</span>
-</pre></div>
-</div>
-<p>The title string itself has a parent, the &lt;title&gt; tag that contains it:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">title_tag</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">parent</span>
-<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
-</pre></div>
-</div>
-<p>The parent of a top-level tag such as &lt;html&gt; is the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object itself:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">html_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">html</span>
-<span class="nb">type</span><span class="p">(</span><span class="n">html_tag</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
-<span class="c"># &lt;class &#39;bs4.BeautifulSoup&#39;&gt;</span>
-</pre></div>
-</div>
-<p>The <tt class="docutils literal"><span class="pre">.parent</span></tt> of a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object is None:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
-<span class="c"># None</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="parents">
-<h3>.parents<a class="headerlink" href="#parents" title="Permalink to this headline">¶</a></h3>
-<p>You can iterate over all of an element's parents with <tt class="docutils literal"><span class="pre">.parents</span></tt> . This example uses <tt class="docutils literal"><span class="pre">.parents</span></tt> to travel from an &lt;a&gt; tag up to the root of the document:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-<span class="n">link</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
-<span class="k">for</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">link</span><span class="o">.</span><span class="n">parents</span><span class="p">:</span>
-    <span class="k">if</span> <span class="n">parent</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
-        <span class="k">print</span><span class="p">(</span><span class="n">parent</span><span class="p">)</span>
-    <span class="k">else</span><span class="p">:</span>
-        <span class="k">print</span><span class="p">(</span><span class="n">parent</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
-<span class="c"># p</span>
-<span class="c"># body</span>
-<span class="c"># html</span>
-<span class="c"># [document]</span>
-<span class="c"># None</span>
-</pre></div>
-</div>
-</div>
-</div>
-<div class="section" id="id21">
-<h2>Sibling nodes<a class="headerlink" href="#id21" title="Permalink to this headline">¶</a></h2>
-<p>Consider a simple document like this:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">sibling_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;&lt;b&gt;text1&lt;/b&gt;&lt;c&gt;text2&lt;/c&gt;&lt;/a&gt;&quot;</span><span class="p">)</span>
-<span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
-<span class="c"># &lt;html&gt;</span>
-<span class="c">#  &lt;body&gt;</span>
-<span class="c">#   &lt;a&gt;</span>
-<span class="c">#    &lt;b&gt;</span>
-<span class="c">#     text1</span>
-<span class="c">#    &lt;/b&gt;</span>
-<span class="c">#    &lt;c&gt;</span>
-<span class="c">#     text2</span>
-<span class="c">#    &lt;/c&gt;</span>
-<span class="c">#   &lt;/a&gt;</span>
-<span class="c">#  &lt;/body&gt;</span>
-<span class="c"># &lt;/html&gt;</span>
-</pre></div>
-</div>
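-<p>The &lt;b&gt; tag and the &lt;c&gt; tag are at the same level: they are both direct children of the same tag, so they are called siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.</p>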
-<div class="section" id="next-sibling-previous-sibling">
-<h3>.next_sibling and .previous_sibling<a class="headerlink" href="#next-sibling-previous-sibling" title="Permalink to this headline">¶</a></h3>
-<p>You can use <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> and <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt> to navigate between page elements that are on the same level of the parse tree:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">next_sibling</span>
-<span class="c"># &lt;c&gt;text2&lt;/c&gt;</span>
-
-<span class="n">sibling_soup</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">previous_sibling</span>
-<span class="c"># &lt;b&gt;text1&lt;/b&gt;</span>
-</pre></div>
-</div>
-<p>The &lt;b&gt; tag has a <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> , but its <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt> is None, because nothing comes before the &lt;b&gt; tag on the same level of the tree. For the same reason, the &lt;c&gt; tag has a <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt> , but its <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> is None:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">previous_sibling</span><span class="p">)</span>
-<span class="c"># None</span>
-<span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">next_sibling</span><span class="p">)</span>
-<span class="c"># None</span>
-</pre></div>
-</div>
-<p>The strings “text1” and “text2” are not siblings, because they do not have the same parent:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span>
-<span class="c"># u&#39;text1&#39;</span>
-
-<span class="k">print</span><span class="p">(</span><span class="n">sibling_soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">next_sibling</span><span class="p">)</span>
-<span class="c"># None</span>
-</pre></div>
-</div>
-<p>In real documents, the <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> or <tt class="docutils literal"><span class="pre">.previous_sibling</span></tt> of a tag is usually a string or whitespace. Going back to the “Alice” document:</p>
-<div class="highlight-python"><pre>&lt;a href="http://example.com/elsie" class="sister" id="link1"&gt;Elsie&lt;/a&gt;
-&lt;a href="http://example.com/lacie" class="sister" id="link2"&gt;Lacie&lt;/a&gt;
-&lt;a href="http://example.com/tillie" class="sister" id="link3"&gt;Tillie&lt;/a&gt;</pre>
-</div>
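-<p>You might think that the <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> of the first &lt;a&gt; tag would be the second &lt;a&gt; tag. In fact it is a string: the comma and newline that separate the first &lt;a&gt; tag from the second:</p>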
-<div class="highlight-python"><div class="highlight"><pre><span class="n">link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-<span class="n">link</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
-
-<span class="n">link</span><span class="o">.</span><span class="n">next_sibling</span>
-<span class="c"># u&#39;,\n&#39;</span>
-</pre></div>
-</div>
-<p>The second &lt;a&gt; tag is the <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> of the comma:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">link</span><span class="o">.</span><span class="n">next_sibling</span><span class="o">.</span><span class="n">next_sibling</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;</span>
-</pre></div>
-</div>
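When only the next tag matters and the text in between does not, `find_next_sibling()` skips the intervening strings. A minimal sketch, assuming Python 3 and the standard-library `html.parser` backend; the markup is a shortened, illustrative version of the “Alice” links.

```python
from bs4 import BeautifulSoup

markup = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

first_link = soup.a
raw_next = first_link.next_sibling             # the string between the two tags
next_tag = first_link.find_next_sibling("a")   # the next <a> tag, strings skipped

print(repr(raw_next), next_tag["id"])
```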
-</div>
-<div class="section" id="next-siblings-previous-siblings">
-<h3>.next_siblings and .previous_siblings<a class="headerlink" href="#next-siblings-previous-siblings" title="Permalink to this headline">¶</a></h3>
-<p>You can iterate over a tag's siblings with <tt class="docutils literal"><span class="pre">.next_siblings</span></tt> or <tt class="docutils literal"><span class="pre">.previous_siblings</span></tt> :</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">sibling</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">next_siblings</span><span class="p">:</span>
- <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">sibling</span><span class="p">))</span>
- <span class="c"># u&#39;,\n&#39;</span>
- <span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;</span>
- <span class="c"># u&#39; and\n&#39;</span>
- <span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;</span>
- <span class="c"># u&#39;;\nand they lived at the bottom of a well.&#39;</span>
- <span class="c"># None</span>
-
-<span class="k">for</span> <span class="n">sibling</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">&quot;link3&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">previous_siblings</span><span class="p">:</span>
- <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">sibling</span><span class="p">))</span>
- <span class="c"># &#39; and\n&#39;</span>
- <span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;</span>
- <span class="c"># u&#39;,\n&#39;</span>
- <span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
- <span class="c"># u&#39;Once upon a time there were three little sisters; and their names were\n&#39;</span>
-</pre></div>
-</div>
-</div>
-</div>
-<div class="section" id="id22">
-<h2>Going back and forth<a class="headerlink" href="#id22" title="Permalink to this headline">¶</a></h2>
-<p>Take a look at the beginning of the &#8220;Alice&#8221; document:</p>
-<div class="highlight-python"><pre>&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;
-&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</pre>
-</div>
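-<p>An HTML parser turns this string of characters into a series of events: &#8220;open an &lt;html&gt; tag&#8221;, &#8220;open a &lt;head&gt; tag&#8221;, &#8220;open a &lt;title&gt; tag&#8221;, &#8220;add a string&#8221;, &#8220;close the &lt;title&gt; tag&#8221;, &#8220;open a &lt;p&gt; tag&#8221;, and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.</p>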
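-<p>As a rough illustration of that event stream, the sketch below uses the <tt class="docutils literal"><span class="pre">html.parser</span></tt> module from the Python 3 standard library rather than Beautiful Soup itself. The <tt class="docutils literal"><span class="pre">EventLogger</span></tt> class and its <tt class="docutils literal"><span class="pre">events</span></tt> list are hypothetical names introduced here; the only point is the order in which a parser reports tags and strings.</p>

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parse event as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called when the parser meets an opening tag such as <title>.
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        # Called when the parser meets a closing tag such as </title>.
        self.events.append(("close", tag))

    def handle_data(self, data):
        # Called for the text found between tags.
        self.events.append(("string", data))


markup = "<html><head><title>The Dormouse's story</title></head>"
logger = EventLogger()
logger.feed(markup)
logger.close()
events = logger.events

for kind, value in events:
    print(kind, repr(value))
```

-<p>The events arrive in document order, which is the order that <tt class="docutils literal"><span class="pre">.next_element</span></tt> and <tt class="docutils literal"><span class="pre">.previous_element</span></tt> let you retrace.</p>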
-<div class="section" id="next-element-previous-element">
-<h3>.next_element and .previous_element<a class="headerlink" href="#next-element-previous-element" title="Permalink to this headline">¶</a></h3>
-<p>The <tt class="docutils literal"><span class="pre">.next_element</span></tt> attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as <tt class="docutils literal"><span class="pre">.next_sibling</span></tt>, but it usually is not.</p>
-<p>Here is the final &lt;a&gt; tag in the &#8220;Alice&#8221; document. Its <tt class="docutils literal"><span class="pre">.next_sibling</span></tt> is a string: the rest of the sentence <a class="footnote-reference" href="#id83" id="id23">[2]</a> that was interrupted when the parser reached the &lt;a&gt; tag:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">last_a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="nb">id</span><span class="o">=</span><span class="s">&quot;link3&quot;</span><span class="p">)</span>
-<span class="n">last_a_tag</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;</span>
-
-<span class="n">last_a_tag</span><span class="o">.</span><span class="n">next_sibling</span>
-<span class="c"># &#39;; and they lived at the bottom of a well.&#39;</span>
-</pre></div>
-</div>
-<p>But the <tt class="docutils literal"><span class="pre">.next_element</span></tt> of that &lt;a&gt; tag, the thing that was parsed immediately after the &lt;a&gt; tag, is not the rest of that sentence. It is the string &#8220;Tillie&#8221;:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">last_a_tag</span><span class="o">.</span><span class="n">next_element</span>
-<span class="c"># u&#39;Tillie&#39;</span>
-</pre></div>
-</div>
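-<p>That is because in the original markup the string &#8220;Tillie&#8221; appears before the semicolon. The parser encountered the &lt;a&gt; tag, then the string &#8220;Tillie&#8221;, then the closing &lt;/a&gt; tag, then the semicolon and the rest of the sentence. The semicolon is on the same level as the &lt;a&gt; tag, but the string &#8220;Tillie&#8221; was parsed first.</p>
-<p>The <tt class="docutils literal"><span class="pre">.previous_element</span></tt> attribute is the exact opposite of <tt class="docutils literal"><span class="pre">.next_element</span></tt>. It points to whatever was parsed immediately before the current object:</p>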
-<div class="highlight-python"><div class="highlight"><pre><span class="n">last_a_tag</span><span class="o">.</span><span class="n">previous_element</span>
-<span class="c"># u&#39; and\n&#39;</span>
-<span class="n">last_a_tag</span><span class="o">.</span><span class="n">previous_element</span><span class="o">.</span><span class="n">next_element</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="next-elements-previous-elements">
-<h3>.next_elements and .previous_elements<a class="headerlink" href="#next-elements-previous-elements" title="Permalink to this headline">¶</a></h3>
-<p>The <tt class="docutils literal"><span class="pre">.next_elements</span></tt> and <tt class="docutils literal"><span class="pre">.previous_elements</span></tt> iterators let you move forward or backward through the document in the order it was parsed:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">element</span> <span class="ow">in</span> <span class="n">last_a_tag</span><span class="o">.</span><span class="n">next_elements</span><span class="p">:</span>
- <span class="k">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">element</span><span class="p">))</span>
-<span class="c"># u&#39;Tillie&#39;</span>
-<span class="c"># u&#39;;\nand they lived at the bottom of a well.&#39;</span>
-<span class="c"># u&#39;\n\n&#39;</span>
-<span class="c"># &lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
-<span class="c"># u&#39;...&#39;</span>
-<span class="c"># u&#39;\n&#39;</span>
-</pre></div>
-</div>
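-<p>To make the contrast with the sibling iterators concrete, here is a minimal sketch that collects the strings reachable from the middle &lt;a&gt; tag in both ways. It assumes Python 3, the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> backend, and a shortened copy of the story paragraph; the variable names are illustrative only.</p>

```python
from bs4 import BeautifulSoup, NavigableString

# A shortened copy of the story paragraph (an assumption for this sketch).
html_doc = """<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>"""

soup = BeautifulSoup(html_doc, "html.parser")
lacie = soup.find("a", id="link2")

# Parse order: descends into tags, so strings nested inside later tags appear.
strings_in_parse_order = [
    str(el) for el in lacie.next_elements if isinstance(el, NavigableString)
]

# Sibling order: stays on the same level as the <a> tag.
sibling_strings = [
    str(el) for el in lacie.next_siblings if isinstance(el, NavigableString)
]

print(strings_in_parse_order)
print(sibling_strings)
```

-<p>The parse-order list starts with the tag's own string, &#8220;Lacie&#8221;, and includes &#8220;Tillie&#8221;, while the sibling list contains only the text that sits directly between the tags.</p>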
-</div>
-</div>
-</div>
-<div class="section" id="id24">
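-<h1>Searching the tree<a class="headerlink" href="#id24" title="Permalink to this headline">¶</a></h1>
-<p>Beautiful Soup defines many methods for searching the parse tree. This section concentrates on two of them: <tt class="docutils literal"><span class="pre">find()</span></tt> and <tt class="docutils literal"><span class="pre">find_all()</span></tt>. The other methods take similar arguments and are used in much the same way.</p>
-<p>Once again, the &#8220;Alice&#8221; document serves as the example:</p>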
-<div class="highlight-python"><div class="highlight"><pre><span class="n">html_doc</span> <span class="o">=</span> <span class="s">&quot;&quot;&quot;</span>
-<span class="s">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;</span>
-
-<span class="s">&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;</span>
-
-<span class="s">&lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were</span>
-<span class="s">&lt;a href=&quot;http://example.com/elsie&quot; class=&quot;sister&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="s">&lt;a href=&quot;http://example.com/lacie&quot; class=&quot;sister&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and</span>
-<span class="s">&lt;a href=&quot;http://example.com/tillie&quot; class=&quot;sister&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;</span>
-<span class="s">and they lived at the bottom of a well.&lt;/p&gt;</span>
-
-<span class="s">&lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
-<span class="s">&quot;&quot;&quot;</span>
-
-<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
-</pre></div>
-</div>
-<p>By passing a filter to a method such as <tt class="docutils literal"><span class="pre">find_all()</span></tt>, you can pick out the parts of the document you are interested in.</p>
-<div class="section" id="id25">
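-<h2>Kinds of filters<a class="headerlink" href="#id25" title="Permalink to this headline">¶</a></h2>
-<p>Before looking at <tt class="docutils literal"><span class="pre">find_all()</span></tt> in detail, here are the kinds of filters <a class="footnote-reference" href="#id84" id="id26">[3]</a> you can pass to it. These filters appear throughout the search API. You can use them to filter on a tag's name, on its attributes, on the text of a string, or on some combination of these.</p>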
-<div class="section" id="id27">
-<h3>A string<a class="headerlink" href="#id27" title="Permalink to this headline">¶</a></h3>
-<p>The simplest filter is a string. Pass a string to a search method and Beautiful Soup will match against that exact string. This code finds all the &lt;b&gt; tags in the document:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&#39;b&#39;</span><span class="p">)</span>
-<span class="c"># [&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;]</span>
-</pre></div>
-</div>
-<p>If you pass in a byte string, Beautiful Soup will assume it is encoded as UTF-8. You can avoid any decoding mistakes by passing in a Unicode string instead.</p>
-</div>
-<div class="section" id="id28">
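-<h3>A regular expression<a class="headerlink" href="#id28" title="Permalink to this headline">¶</a></h3>
-<p>If you pass in a regular expression object, Beautiful Soup will filter against it using its <tt class="docutils literal"><span class="pre">search()</span></tt> method, so the pattern may match anywhere in the value unless it is anchored. This code finds all the tags whose names start with the letter &#8220;b&#8221;; in this case, the &lt;body&gt; tag and the &lt;b&gt; tag:</p>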
-<div class="highlight-python"><div class="highlight"><pre><span class="kn">import</span> <span class="nn">re</span>
-<span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">&quot;^b&quot;</span><span class="p">)):</span>
- <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
-<span class="c"># body</span>
-<span class="c"># b</span>
-</pre></div>
-</div>
-<p>This code finds all the tags whose names contain the letter &#8220;t&#8221;:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">&quot;t&quot;</span><span class="p">)):</span>
- <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
-<span class="c"># html</span>
-<span class="c"># title</span>
-</pre></div>
-</div>
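-<p>The difference between an anchored and an unanchored pattern can be checked with a small sketch. It assumes Python 3, the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> backend, and a reduced document written just for this example.</p>

```python
import re

from bs4 import BeautifulSoup

# A reduced document used only for this illustration.
markup = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p><b>bold</b></p></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")

# Unanchored: matches any tag name that contains a "t".
names_containing_t = [tag.name for tag in soup.find_all(re.compile("t"))]

# Anchored with ^: matches only tag names that start with a "t".
names_starting_with_t = [tag.name for tag in soup.find_all(re.compile("^t"))]

print(names_containing_t)     # ['html', 'title']
print(names_starting_with_t)  # ['title']
```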
-</div>
-<div class="section" id="id29">
-<h3>A list<a class="headerlink" href="#id29" title="Permalink to this headline">¶</a></h3>
-<p>If you pass in a list, Beautiful Soup will return everything that matches any item in that list. This code finds all the &lt;a&gt; tags and all the &lt;b&gt; tags:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">([</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="s">&quot;b&quot;</span><span class="p">])</span>
-<span class="c"># [&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="true">
-<h3>True<a class="headerlink" href="#true" title="Permalink to this headline">¶</a></h3>
-<p>The value <tt class="docutils literal"><span class="pre">True</span></tt> matches everything it can. This code finds all the tags in the document, but none of the text strings:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="bp">True</span><span class="p">):</span>
- <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
-<span class="c"># html</span>
-<span class="c"># head</span>
-<span class="c"># title</span>
-<span class="c"># body</span>
-<span class="c"># p</span>
-<span class="c"># b</span>
-<span class="c"># p</span>
-<span class="c"># a</span>
-<span class="c"># a</span>
-<span class="c"># a</span>
-<span class="c"># p</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="id30">
-<h3>A function<a class="headerlink" href="#id30" title="Permalink to this headline">¶</a></h3>
-<p>If none of the other filters fit, you can define a function that takes an element as its only argument <a class="footnote-reference" href="#id85" id="id31">[4]</a>. The function should return <tt class="docutils literal"><span class="pre">True</span></tt> if the element matches, and <tt class="docutils literal"><span class="pre">False</span></tt> otherwise.</p>
-<p>Here is a function that returns <tt class="docutils literal"><span class="pre">True</span></tt> if a tag defines the <tt class="docutils literal"><span class="pre">class</span></tt> attribute but does not define the <tt class="docutils literal"><span class="pre">id</span></tt> attribute:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">def</span> <span class="nf">has_class_but_no_id</span><span class="p">(</span><span class="n">tag</span><span class="p">):</span>
- <span class="k">return</span> <span class="n">tag</span><span class="o">.</span><span class="n">has_attr</span><span class="p">(</span><span class="s">&#39;class&#39;</span><span class="p">)</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">tag</span><span class="o">.</span><span class="n">has_attr</span><span class="p">(</span><span class="s">&#39;id&#39;</span><span class="p">)</span>
-</pre></div>
-</div>
-<p>Pass this function into <tt class="docutils literal"><span class="pre">find_all()</span></tt> and you will get all the &lt;p&gt; tags:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">has_class_but_no_id</span><span class="p">)</span>
-<span class="c"># [&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;,</span>
-<span class="c"># &lt;p class=&quot;story&quot;&gt;Once upon a time there were...&lt;/p&gt;,</span>
-<span class="c"># &lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;]</span>
-</pre></div>
-</div>
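-<p>The result contains only the &lt;p&gt; tags. It does not contain the &lt;a&gt; tags, because those tags also define &#8220;id&#8221;. It does not contain &lt;html&gt; or &lt;head&gt;, because those tags do not define &#8220;class&#8221;.</p>
-<p>Here is a function that returns <tt class="docutils literal"><span class="pre">True</span></tt> if a tag is surrounded by string objects:</p>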
-<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">NavigableString</span>
-<span class="k">def</span> <span class="nf">surrounded_by_strings</span><span class="p">(</span><span class="n">tag</span><span class="p">):</span>
- <span class="k">return</span> <span class="p">(</span><span class="nb">isinstance</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">next_element</span><span class="p">,</span> <span class="n">NavigableString</span><span class="p">)</span>
- <span class="ow">and</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">previous_element</span><span class="p">,</span> <span class="n">NavigableString</span><span class="p">))</span>
-
-<span class="k">for</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">surrounded_by_strings</span><span class="p">):</span>
- <span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
-<span class="c"># p</span>
-<span class="c"># a</span>
-<span class="c"># a</span>
-<span class="c"># a</span>
-<span class="c"># p</span>
-</pre></div>
-</div>
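-<p>As one more sketch of a function filter, the example below keeps only the &lt;a&gt; tags that actually carry an <tt class="docutils literal"><span class="pre">href</span></tt> attribute. The function name <tt class="docutils literal"><span class="pre">is_real_link</span></tt> and the small document are made up for this illustration, and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> backend is assumed.</p>

```python
from bs4 import BeautifulSoup

# A small document invented for this sketch; the last <a> has no href.
html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a id="anchor-only">no link here</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")


def is_real_link(tag):
    """Hypothetical filter: True for <a> tags that define an href attribute."""
    return tag.name == "a" and tag.has_attr("href")


# The function is called once per tag; tags for which it returns True are kept.
link_ids = [tag["id"] for tag in soup.find_all(is_real_link)]
print(link_ids)  # ['link1', 'link2']
```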
-<p>Now we are ready to look at the search methods in detail.</p>
-</div>
-</div>
-<div class="section" id="find-all">
-<h2>find_all()<a class="headerlink" href="#find-all" title="Permalink to this headline">¶</a></h2>
-<p>find_all( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
-<p>The <tt class="docutils literal"><span class="pre">find_all()</span></tt> method looks through the descendants of a tag and retrieves every descendant that matches your filters. Here are a few examples:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">,</span> <span class="s">&quot;title&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">&quot;link2&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;]</span>
-
-<span class="kn">import</span> <span class="nn">re</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">&quot;sisters&quot;</span><span class="p">))</span>
-<span class="c"># u&#39;Once upon a time there were three little sisters; and their names were\n&#39;</span>
-</pre></div>
-</div>
-<p>Some of these should look familiar, but others are new. What does it mean to pass in a value for <tt class="docutils literal"><span class="pre">text</span></tt> or <tt class="docutils literal"><span class="pre">id</span></tt>? Why does <tt class="docutils literal"><span class="pre">find_all(&quot;p&quot;,</span> <span class="pre">&quot;title&quot;)</span></tt> find a &lt;p&gt; tag with the CSS class &#8220;title&#8221;? Let's look at the arguments to <tt class="docutils literal"><span class="pre">find_all()</span></tt>.</p>
-<div class="section" id="id32">
-<h3>The name argument<a class="headerlink" href="#id32" title="Permalink to this headline">¶</a></h3>
-<p>Pass in a value for <tt class="docutils literal"><span class="pre">name</span></tt> and Beautiful Soup will only consider tags with that <tt class="docutils literal"><span class="pre">name</span></tt>. Text strings are ignored automatically.</p>
-<p>This is the simplest usage:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
-</pre></div>
-</div>
-<p>Recall that the value of <tt class="docutils literal"><span class="pre">name</span></tt> can be any of the <a class="reference internal" href="#id25">kinds of filters</a>: a string, a regular expression, a list, a function, or the value <tt class="docutils literal"><span class="pre">True</span></tt>.</p>
-</div>
-<div class="section" id="keyword">
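-<h3>The keyword arguments<a class="headerlink" href="#keyword" title="Permalink to this headline">¶</a></h3>
-<p>Any keyword argument that is not one of the built-in search arguments is turned into a filter on the tag attribute of the same name. If you pass in a value for an argument called <tt class="docutils literal"><span class="pre">id</span></tt>, Beautiful Soup will filter against each tag's &#8220;id&#8221; attribute:</p>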
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">&#39;link2&#39;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;]</span>
-</pre></div>
-</div>
-<p>If you pass in a value for <tt class="docutils literal"><span class="pre">href</span></tt>, Beautiful Soup will filter against each tag's &#8220;href&#8221; attribute:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">&quot;elsie&quot;</span><span class="p">))</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
-</pre></div>
-</div>
-<p>You can filter an attribute based on <a class="reference internal" href="#id27">a string</a>, <a class="reference internal" href="#id28">a regular expression</a>, <a class="reference internal" href="#id29">a list</a>, or the value <a class="reference internal" href="#true">True</a>.</p>
-<p>This code finds all the tags that have an <tt class="docutils literal"><span class="pre">id</span></tt> attribute, regardless of what the value of <tt class="docutils literal"><span class="pre">id</span></tt> is:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-</pre></div>
-</div>
-<p>You can filter on several attributes at once by passing in more than one keyword argument:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">&quot;elsie&quot;</span><span class="p">),</span> <span class="nb">id</span><span class="o">=</span><span class="s">&#39;link1&#39;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
-</pre></div>
-</div>
-<p>Some attributes, such as the data-* attributes in HTML5, have names that cannot be used as the names of keyword arguments:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">data_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;div data-foo=&quot;value&quot;&gt;foo!&lt;/div&gt;&#39;</span><span class="p">)</span>
-<span class="n">data_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">data</span><span class="o">-</span><span class="n">foo</span><span class="o">=</span><span class="s">&quot;value&quot;</span><span class="p">)</span>
-<span class="c"># SyntaxError: keyword can&#39;t be an expression</span>
-</pre></div>
-</div>
-<p>You can still search on such attributes by putting them into a dictionary and passing that dictionary to <tt class="docutils literal"><span class="pre">find_all()</span></tt> as the <tt class="docutils literal"><span class="pre">attrs</span></tt> argument:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">data_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">&quot;data-foo&quot;</span><span class="p">:</span> <span class="s">&quot;value&quot;</span><span class="p">})</span>
-<span class="c"># [&lt;div data-foo=&quot;value&quot;&gt;foo!&lt;/div&gt;]</span>
-</pre></div>
-</div>
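-<p>The values in the <tt class="docutils literal"><span class="pre">attrs</span></tt> dictionary can be the same kinds of filters as the keyword arguments. The following sketch, which assumes Python 3, the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> backend, and a two-element document invented for the purpose, tries an exact string, <tt class="docutils literal"><span class="pre">True</span></tt>, and a regular expression.</p>

```python
import re

from bs4 import BeautifulSoup

# Two <div> tags invented for this sketch.
markup = (
    '<div data-foo="value" id="first">foo!</div>'
    '<div data-foo="other" id="second">bar!</div>'
)
data_soup = BeautifulSoup(markup, "html.parser")

# Exact string match on the attribute value.
exact = data_soup.find_all(attrs={"data-foo": "value"})

# True matches any tag that has the attribute at all.
any_value = data_soup.find_all(attrs={"data-foo": True})

# A regular expression is searched for inside the attribute value.
by_pattern = data_soup.find_all(attrs={"data-foo": re.compile("^oth")})

print([tag["id"] for tag in exact])       # ['first']
print([tag["id"] for tag in any_value])   # ['first', 'second']
print([tag["id"] for tag in by_pattern])  # ['second']
```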
-</div>
-<div class="section" id="css">
-<h3>Searching by CSS class<a class="headerlink" href="#css" title="Permalink to this headline">¶</a></h3>
-<p>Searching for tags by CSS class is very useful, but the name of the CSS attribute, <tt class="docutils literal"><span class="pre">class</span></tt>, is a reserved word in Python. Using <tt class="docutils literal"><span class="pre">class</span></tt> as a keyword argument causes a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument <tt class="docutils literal"><span class="pre">class_</span></tt>:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">&quot;sister&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-</pre></div>
-</div>
-<p>As with any keyword argument, you can pass <tt class="docutils literal"><span class="pre">class_</span></tt> a string, a regular expression, a function, or <tt class="docutils literal"><span class="pre">True</span></tt>:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">class_</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">&quot;itl&quot;</span><span class="p">))</span>
-<span class="c"># [&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;]</span>
-
-<span class="k">def</span> <span class="nf">has_six_characters</span><span class="p">(</span><span class="n">css_class</span><span class="p">):</span>
- <span class="k">return</span> <span class="n">css_class</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">css_class</span><span class="p">)</span> <span class="o">==</span> <span class="mi">6</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">class_</span><span class="o">=</span><span class="n">has_six_characters</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-</pre></div>
-</div>
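-<p>Remember that a single tag can have <a class="reference internal" href="#id12">multiple values</a> for its <tt class="docutils literal"><span class="pre">class</span></tt> attribute. When you search for a tag by CSS class, you are matching against each of its CSS classes individually:</p>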
-<div class="highlight-python"><div class="highlight"><pre><span class="n">css_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;p class=&quot;body strikeout&quot;&gt;&lt;/p&gt;&#39;</span><span class="p">)</span>
-<span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">&quot;strikeout&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;p class=&quot;body strikeout&quot;&gt;&lt;/p&gt;]</span>
-
-<span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">&quot;body&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;p class=&quot;body strikeout&quot;&gt;&lt;/p&gt;]</span>
-</pre></div>
-</div>
-<p>You can also search for the exact string value of the <tt class="docutils literal"><span class="pre">class</span></tt> attribute:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">css_soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">,</span> <span class="n">class_</span><span class="o">=</span><span class="s">&quot;body strikeout&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;p class=&quot;body strikeout&quot;&gt;&lt;/p&gt;]</span>
-</pre></div>
-</div>
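-<p>An exact match depends on the order of the class names, so searching for a variant such as <tt class="docutils literal"><span class="pre">&quot;strikeout</span> <span class="pre">body&quot;</span></tt> finds nothing. A CSS class can also be matched through the <tt class="docutils literal"><span class="pre">attrs</span></tt> dictionary, which works in versions of Beautiful Soup that lack the <tt class="docutils literal"><span class="pre">class_</span></tt> shortcut:</p>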
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">&quot;class&quot;</span><span class="p">:</span> <span class="s">&quot;sister&quot;</span><span class="p">})</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-</pre></div>
-</div>
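-<p>The effect of class order can be seen in the short sketch below. It assumes Python 3 and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> backend. The final line uses the <tt class="docutils literal"><span class="pre">select()</span></tt> method with a CSS selector as an order-independent alternative; in recent releases that method relies on the Soup Sieve package being installed, which is an assumption about your environment.</p>

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Same order as in the markup: the exact string value matches.
same_order = css_soup.find_all("p", class_="body strikeout")

# Swapped order: the string no longer equals the attribute value.
swapped = css_soup.find_all("p", class_="strikeout body")

# A CSS selector requires both classes but does not care about their order.
via_selector = css_soup.select("p.strikeout.body")

print(len(same_order))    # 1
print(len(swapped))       # 0
print(len(via_selector))  # 1
```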
-</div>
-<div class="section" id="text">
-<h3>The <tt class="docutils literal"><span class="pre">text</span></tt> argument<a class="headerlink" href="#text" title="Permalink to this headline">¶</a></h3>
-<p>With <tt class="docutils literal"><span class="pre">text</span></tt> you can search for strings in the document instead of tags. As with <tt class="docutils literal"><span class="pre">name</span></tt>, the <tt class="docutils literal"><span class="pre">text</span></tt> argument accepts <a class="reference internal" href="#id27">a string</a>, <a class="reference internal" href="#id28">a regular expression</a>, <a class="reference internal" href="#id29">a list</a>, a function, or the value <a class="reference internal" href="#true">True</a>. Here are some examples:</p>
-<div class="highlight-python"><pre>soup.find_all(text="Elsie")
-# [u'Elsie']
-
-soup.find_all(text=["Tillie", "Elsie", "Lacie"])
-# [u'Elsie', u'Lacie', u'Tillie']
-
-soup.find_all(text=re.compile("Dormouse"))
-# [u"The Dormouse's story", u"The Dormouse's story"]
-
-def is_the_only_string_within_a_tag(s):
- """Return True if this string is the only child of its parent tag."""
- return (s == s.parent.string)
-
-soup.find_all(text=is_the_only_string_within_a_tag)
-# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']</pre>
-</div>
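-<p>Although <tt class="docutils literal"><span class="pre">text</span></tt> is for finding strings, you can combine it with arguments that find tags. Beautiful Soup will then find the tags whose <tt class="docutils literal"><span class="pre">.string</span></tt> matches your value for <tt class="docutils literal"><span class="pre">text</span></tt>. This code finds the &lt;a&gt; tags whose <tt class="docutils literal"><span class="pre">.string</span></tt> is &#8220;Elsie&#8221;:</p>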
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="s">&quot;Elsie&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a href=&quot;http://example.com/elsie&quot; class=&quot;sister&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
-</pre></div>
-</div>
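-<p>The sketch below contrasts the two result types side by side. It is written for Python 3 and uses the argument name <tt class="docutils literal"><span class="pre">string</span></tt>, which newer releases of Beautiful Soup (4.4.0 and later) accept in place of <tt class="docutils literal"><span class="pre">text</span></tt>; on the 4.2.0 release described here you would write <tt class="docutils literal"><span class="pre">text=</span></tt> instead. The shortened document is an assumption made for the example.</p>

```python
import re

from bs4 import BeautifulSoup

# A shortened copy of the story paragraph (an assumption for this sketch).
html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

pattern = re.compile("^L")

# Without a tag name, the search returns the matching strings themselves.
matching_strings = soup.find_all(string=pattern)

# With a tag name, it returns the tags whose .string matches.
matching_tags = soup.find_all("a", string=pattern)

print(matching_strings)                      # ['Lacie']
print([tag["id"] for tag in matching_tags])  # ['link2']
```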
-</div>
-<div class="section" id="limit">
-<h3>The <tt class="docutils literal"><span class="pre">limit</span></tt> argument<a class="headerlink" href="#limit" title="Permalink to this headline">¶</a></h3>
-<p><tt class="docutils literal"><span class="pre">find_all()</span></tt> returns every result that matches your filters, which can take a while if the document tree is large. If you do not need all the results, you can pass in a number for <tt class="docutils literal"><span class="pre">limit</span></tt>. This works like the LIMIT keyword in SQL: Beautiful Soup stops searching and returns as soon as it has found <tt class="docutils literal"><span class="pre">limit</span></tt> results.</p>
-<p>Three tags in the document tree match this search, but only two are returned because of the limit:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;]</span>
-</pre></div>
-</div>
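-<p>The early stop described above can be pictured with a small sketch. The <tt class="docutils literal"><span class="pre">Node</span></tt> class and its methods are hypothetical stand-ins written for this illustration, not Beautiful Soup's implementation; the sketch only assumes that the search walks descendants in document order and stops once enough matches have been collected:</p>

```python
# Hypothetical minimal tree, used only to illustrate the idea of `limit`.
# This is NOT Beautiful Soup's real code.
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def descendants(self):
        # Depth-first walk that yields descendants in document order.
        for child in self.children:
            yield child
            yield from child.descendants()

    def find_all(self, name, limit=None):
        results = []
        for node in self.descendants():
            if node.name == name:
                results.append(node)
                if limit and len(results) >= limit:
                    break  # enough matches: stop walking the tree
        return results


body = Node("body", [Node("p", [Node("a"), Node("a"), Node("a")])])
all_links = body.find_all("a")           # no limit: every match
first_two = body.find_all("a", limit=2)  # stops after the second match
```

-<p>Because the walk is lazy, the third link is never visited when the limit is 2; the saving grows with the size of the tree.</p>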
-</div>
-<div class="section" id="recursive">
-<h3>The <tt class="docutils literal"><span class="pre">recursive</span></tt> argument<a class="headerlink" href="#recursive" title="Permalink to this headline">¶</a></h3>
-<p>When you call <tt class="docutils literal"><span class="pre">find_all()</span></tt> on a tag, Beautiful Soup examines all of that tag's descendants: its children, its children's children, and so on. If you only want to search the tag's direct children, pass in <tt class="docutils literal"><span class="pre">recursive=False</span></tt>.</p>
-<p>Here is a fragment of a document:</p>
-<div class="highlight-python"><pre>&lt;html&gt;
- &lt;head&gt;
- &lt;title&gt;
- The Dormouse's story
- &lt;/title&gt;
- &lt;/head&gt;
-...</pre>
-</div>
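-<p>Here is the same search with the default behavior and with the <tt class="docutils literal"><span class="pre">recursive</span></tt> argument set to False. The &lt;title&gt; tag is beneath &lt;html&gt; but not directly beneath it, so the second search finds nothing:</p>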
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">,</span> <span class="n">recursive</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
-<span class="c"># []</span>
-</pre></div>
-</div>
-</div>
-</div>
-<div class="section" id="find-all-tag">
-<h2>Calling a tag is like calling <tt class="docutils literal"><span class="pre">find_all()</span></tt><a class="headerlink" href="#find-all-tag" title="Permalink to this headline">¶</a></h2>
-<p>Because <tt class="docutils literal"><span class="pre">find_all()</span></tt> is by far the most commonly used search method in Beautiful Soup, there is a shortcut for it. If you treat a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object or a <tt class="docutils literal"><span class="pre">tag</span></tt> object as though it were a function, the result is the same as calling <tt class="docutils literal"><span class="pre">find_all()</span></tt> on that object. These two lines of code are equivalent:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">)</span>
-<span class="n">soup</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">)</span>
-</pre></div>
-</div>
-<p>These two lines are also equivalent:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="find">
-<h2>find()<a class="headerlink" href="#find" title="Permalink to this headline">¶</a></h2>
-<p>find( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
-<p><tt class="docutils literal"><span class="pre">find_all()</span></tt> returns every tag in the document that matches, but sometimes you only want one result. If you know the document has only one &lt;body&gt; tag, scanning the whole document with <tt class="docutils literal"><span class="pre">find_all()</span></tt> is wasteful. Rather than calling <tt class="docutils literal"><span class="pre">find_all</span></tt> with <tt class="docutils literal"><span class="pre">limit=1</span></tt>, use the <tt class="docutils literal"><span class="pre">find()</span></tt> method. These two lines of code are nearly equivalent:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">&#39;title&#39;</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
-<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&#39;title&#39;</span><span class="p">)</span>
-<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
-</pre></div>
-</div>
-<p>The only difference is that <tt class="docutils literal"><span class="pre">find_all()</span></tt> returns a list containing the single result, whereas <tt class="docutils literal"><span class="pre">find()</span></tt> returns the result itself.</p>
-<p>If <tt class="docutils literal"><span class="pre">find_all()</span></tt> finds nothing, it returns an empty list. If <tt class="docutils literal"><span class="pre">find()</span></tt> finds nothing, it returns <tt class="docutils literal"><span class="pre">None</span></tt>:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;nosuchtag&quot;</span><span class="p">))</span>
-<span class="c"># None</span>
-</pre></div>
-</div>
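-<p>The relationship between the two methods can be sketched in a few lines. The <tt class="docutils literal"><span class="pre">Node</span></tt> class below is a hypothetical stand-in written for this illustration, not Beautiful Soup's implementation; it simply expresses <tt class="docutils literal"><span class="pre">find()</span></tt> as a search limited to one result, with <tt class="docutils literal"><span class="pre">None</span></tt> standing in for an empty list:</p>

```python
# Hypothetical minimal tree, used only to illustrate find() vs. find_all().
# This is NOT Beautiful Soup's real code.
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def descendants(self):
        # Depth-first walk that yields descendants in document order.
        for child in self.children:
            yield child
            yield from child.descendants()

    def find_all(self, name, limit=None):
        results = []
        for node in self.descendants():
            if node.name == name:
                results.append(node)
                if limit and len(results) >= limit:
                    break
        return results

    def find(self, name):
        # Same search, limited to one result; unwrap the list or return None.
        matches = self.find_all(name, limit=1)
        return matches[0] if matches else None


title = Node("title")
head = Node("head", [title])
html = Node("html", [head, Node("body")])

found = html.find("title")            # the title node itself, not a list
missing = html.find("nosuchtag")      # None
missing_all = html.find_all("nosuchtag")  # empty list
```

-<p>Callers of the single-result form therefore need a <tt class="docutils literal"><span class="pre">None</span></tt> check before using the result, while callers of the list form can iterate without one.</p>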
-<p><tt class="docutils literal"><span class="pre">soup.head.title</span></tt> is the shortcut described in <a class="reference internal" href="#id17">navigating by tag name</a>. That shortcut works by repeatedly calling <tt class="docutils literal"><span class="pre">find()</span></tt> on the current tag:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">head</span><span class="o">.</span><span class="n">title</span>
-<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;head&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">)</span>
-<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="find-parents-find-parent">
-<h2>find_parents() and find_parent()<a class="headerlink" href="#find-parents-find-parent" title="Permalink to this headline">¶</a></h2>
-<p>find_parents( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#limit">limit</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
-<p>find_parent( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
-<p>We have spent a lot of time on <tt class="docutils literal"><span class="pre">find_all()</span></tt> and <tt class="docutils literal"><span class="pre">find()</span></tt>. The Beautiful Soup API defines ten other search methods. Five of them take essentially the same arguments as <tt class="docutils literal"><span class="pre">find_all()</span></tt>, and the other five take essentially the same arguments as <tt class="docutils literal"><span class="pre">find()</span></tt>. The only difference is which part of the document they search.</p>
-<p>Remember that <tt class="docutils literal"><span class="pre">find_all()</span></tt> and <tt class="docutils literal"><span class="pre">find()</span></tt> work their way down the tree, looking at the current node's children, grandchildren, and so on. <tt class="docutils literal"><span class="pre">find_parents()</span></tt> and <tt class="docutils literal"><span class="pre">find_parent()</span></tt> do the opposite: they work their way up the tree, applying the same kind of filter to the current node's parents. Let's start from a leaf node deep in the document:</p>
-<div class="highlight-python"><pre>a_string = soup.find(text="Lacie")
-a_string
-# u'Lacie'
-
-a_string.find_parents("a")
-# [&lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt;]
-
-a_string.find_parent("p")
-# &lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were
-# &lt;a class="sister" href="http://example.com/elsie" id="link1"&gt;Elsie&lt;/a&gt;,
-# &lt;a class="sister" href="http://example.com/lacie" id="link2"&gt;Lacie&lt;/a&gt; and
-# &lt;a class="sister" href="http://example.com/tillie" id="link3"&gt;Tillie&lt;/a&gt;;
-# and they lived at the bottom of a well.&lt;/p&gt;
-
-a_string.find_parents("p", class_="title")
-# []</pre>
-</div>
-<p>One of the &lt;a&gt; tags in the document is the direct parent of the string in question, so the search finds it. One of the &lt;p&gt; tags is an indirect parent of the string, so the search finds that as well. The &lt;p&gt; tag with the CSS class &quot;title&quot; is not one of this string's parents, so <tt class="docutils literal"><span class="pre">find_parents()</span></tt> cannot find it.</p>
-<p><tt class="docutils literal"><span class="pre">find_parent()</span></tt> and <tt class="docutils literal"><span class="pre">find_parents()</span></tt> may remind you of the <a class="reference internal" href="#parent">.parent</a> and <a class="reference internal" href="#parents">.parents</a> attributes. The connection is very strong: these search methods iterate over <tt class="docutils literal"><span class="pre">.parents</span></tt> and check each parent against the filter.</p>
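-<p>The idea of searching upward by iterating over the parents can be sketched as follows. The <tt class="docutils literal"><span class="pre">Node</span></tt> class is a hypothetical stand-in written for this illustration, not Beautiful Soup's implementation, and it filters on the tag name only:</p>

```python
# Hypothetical minimal node with a parent pointer, used only to illustrate
# how an upward search can be built on a `.parents` iterator.
# This is NOT Beautiful Soup's real code.
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    @property
    def parents(self):
        # Yield the direct parent first, then each ancestor up to the root.
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def find_parents(self, name):
        # Every ancestor that matches, nearest first.
        return [p for p in self.parents if p.name == name]

    def find_parent(self, name):
        # Only the nearest matching ancestor, or None if there is none.
        return next((p for p in self.parents if p.name == name), None)


html = Node("html")
body = Node("body", parent=html)
paragraph = Node("p", parent=body)
link = Node("a", parent=paragraph)
a_string = Node("#text", parent=link)

matching_links = a_string.find_parents("a")
nearest_paragraph = a_string.find_parent("p")
no_title = a_string.find_parent("title")
```

-<p>A tag that is not an ancestor of the starting node is never visited, which is why the paragraph with class &quot;title&quot; cannot be found from the string in the example above.</p>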
-</div>
-<div class="section" id="find-next-siblings-find-next-sibling">
-<h2>find_next_siblings() and find_next_sibling()<a class="headerlink" href="#find-next-siblings-find-next-sibling" title="Permalink to this headline">¶</a></h2>
-<p>find_next_siblings( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#limit">limit</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
-<p>find_next_sibling( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
-<p>These two methods use <a class="reference internal" href="#next-siblings-previous-siblings">.next_siblings</a> to iterate over the siblings of the current tag that come after it in the tree <a class="footnote-reference" href="#id86" id="id33">[5]</a>. <tt class="docutils literal"><span class="pre">find_next_siblings()</span></tt> returns all the following siblings that match, and <tt class="docutils literal"><span class="pre">find_next_sibling()</span></tt> returns only the first one:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">first_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-<span class="n">first_link</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
-
-<span class="n">first_link</span><span class="o">.</span><span class="n">find_next_siblings</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-
-<span class="n">first_story_paragraph</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">,</span> <span class="s">&quot;story&quot;</span><span class="p">)</span>
-<span class="n">first_story_paragraph</span><span class="o">.</span><span class="n">find_next_sibling</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">)</span>
-<span class="c"># &lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="find-previous-siblings-find-previous-sibling">
-<h2>find_previous_siblings() and find_previous_sibling()<a class="headerlink" href="#find-previous-siblings-find-previous-sibling" title="Permalink to this headline">¶</a></h2>
-<p>find_previous_siblings( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#limit">limit</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
-<p>find_previous_sibling( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
-<p>These two methods use <a class="reference internal" href="#next-siblings-previous-siblings">.previous_siblings</a> to iterate over the siblings of the current tag that come before it in the tree <a class="footnote-reference" href="#id86" id="id34">[5]</a>. <tt class="docutils literal"><span class="pre">find_previous_siblings()</span></tt> returns all the preceding siblings that match, and <tt class="docutils literal"><span class="pre">find_previous_sibling()</span></tt> returns only the first one:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">last_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="nb">id</span><span class="o">=</span><span class="s">&quot;link3&quot;</span><span class="p">)</span>
-<span class="n">last_link</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;</span>
-
-<span class="n">last_link</span><span class="o">.</span><span class="n">find_previous_siblings</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
-
-<span class="n">first_story_paragraph</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">,</span> <span class="s">&quot;story&quot;</span><span class="p">)</span>
-<span class="n">first_story_paragraph</span><span class="o">.</span><span class="n">find_previous_sibling</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">)</span>
-<span class="c"># &lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="find-all-next-find-next">
-<h2>find_all_next() and find_next()<a class="headerlink" href="#find-all-next-find-next" title="Permalink to this headline">¶</a></h2>
-<p>find_all_next( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#limit">limit</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
-<p>find_next( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
-<p>These two methods use <a class="reference internal" href="#next-elements-previous-elements">.next_elements</a> to iterate over the tags and strings that come after the current tag in the document <a class="footnote-reference" href="#id86" id="id35">[5]</a>. <tt class="docutils literal"><span class="pre">find_all_next()</span></tt> returns all matches, and <tt class="docutils literal"><span class="pre">find_next()</span></tt> returns only the first match:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">first_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-<span class="n">first_link</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
-
-<span class="n">first_link</span><span class="o">.</span><span class="n">find_all_next</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
-<span class="c"># [u&#39;Elsie&#39;, u&#39;,\n&#39;, u&#39;Lacie&#39;, u&#39; and\n&#39;, u&#39;Tillie&#39;,</span>
-<span class="c"># u&#39;;\nand they lived at the bottom of a well.&#39;, u&#39;\n\n&#39;, u&#39;...&#39;, u&#39;\n&#39;]</span>
-
-<span class="n">first_link</span><span class="o">.</span><span class="n">find_next</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">)</span>
-<span class="c"># &lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
-</pre></div>
-</div>
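-<p>In the first example, the string &quot;Elsie&quot; showed up even though it is contained within the &lt;a&gt; tag we started from. In the second example, the last &lt;p&gt; tag in the document showed up even though it is not in the same part of the tree as the &lt;a&gt; tag we started from. For these methods, all that matters is that an element matches the filter and appears later in the document than the starting element.</p>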
-</div>
-<div class="section" id="find-all-previous-find-previous">
-<h2>find_all_previous() and find_previous()<a class="headerlink" href="#find-all-previous-find-previous" title="Permalink to this headline">¶</a></h2>
-<p>find_all_previous( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#limit">limit</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
-<p>find_previous( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )</p>
-<p>These two methods use <a class="reference internal" href="#next-elements-previous-elements">.previous_elements</a> to iterate over the tags and strings that come before the current node in the document <a class="footnote-reference" href="#id86" id="id36">[5]</a>. <tt class="docutils literal"><span class="pre">find_all_previous()</span></tt> returns all matches, and <tt class="docutils literal"><span class="pre">find_previous()</span></tt> returns only the first match:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">first_link</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-<span class="n">first_link</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;</span>
-
-<span class="n">first_link</span><span class="o">.</span><span class="n">find_all_previous</span><span class="p">(</span><span class="s">&quot;p&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; ...&lt;/p&gt;,</span>
-<span class="c"># &lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;]</span>
-
-<span class="n">first_link</span><span class="o">.</span><span class="n">find_previous</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">)</span>
-<span class="c"># &lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;</span>
-</pre></div>
-</div>
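-<p>The call to <tt class="docutils literal"><span class="pre">find_all_previous(&quot;p&quot;)</span></tt> found the first paragraph in the document (the one with class=&quot;title&quot;), but it also found the second paragraph, the &lt;p&gt; tag that contains the &lt;a&gt; tag we started from. This should not be surprising: the search looks at every tag that appears earlier in the document than the starting &lt;a&gt; tag, and a &lt;p&gt; tag that contains the &lt;a&gt; tag must have been opened before it.</p>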
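-<p>Why a containing tag counts as &quot;previous&quot; is easier to see with a sketch of document order. The <tt class="docutils literal"><span class="pre">Node</span></tt> class and both helper functions are hypothetical stand-ins written for this illustration, not Beautiful Soup's implementation; the sketch assumes that document order means a pre-order walk in which each tag is visited when it opens:</p>

```python
# Hypothetical minimal tree, used only to illustrate document order.
# This is NOT Beautiful Soup's real code.
class Node:
    def __init__(self, name, css_class=None, children=()):
        self.name = name
        self.css_class = css_class
        self.children = list(children)


def document_order(root):
    # Pre-order walk: a tag is visited before anything it contains.
    yield root
    for child in root.children:
        yield from document_order(child)


def find_all_previous(root, start, name):
    # Collect everything that opens before `start`, then report the
    # matches nearest-first, mirroring the order shown in the example.
    seen = []
    for node in document_order(root):
        if node is start:
            break
        seen.append(node)
    return [node for node in reversed(seen) if node.name == name]


first_link = Node("a")
story = Node("p", css_class="story", children=[first_link])
title_paragraph = Node("p", css_class="title")
root = Node("html", children=[
    Node("head", children=[Node("title")]),
    Node("body", children=[title_paragraph, story]),
])

previous_paragraphs = find_all_previous(root, first_link, "p")
```

-<p>The containing paragraph is reported first because it is the closest element that opens before the link.</p>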
-</div>
-<div class="section" id="id37">
-<h2>CSS selectors<a class="headerlink" href="#id37" title="Permalink to this headline">¶</a></h2>
-<p>Beautiful Soup supports most CSS selectors <a class="footnote-reference" href="#id87" id="id38">[6]</a>. Pass a selector string into the <tt class="docutils literal"><span class="pre">.select()</span></tt> method of a <tt class="docutils literal"><span class="pre">Tag</span></tt> or <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object to find tags using CSS selector syntax:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;title&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;p:nth-of-type(3)&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;]</span>
-</pre></div>
-</div>
-<p>Find tags beneath other tags:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;body a&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;html head title&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
-</pre></div>
-</div>
-<p>Find tags directly beneath other tags <a class="footnote-reference" href="#id87" id="id39">[6]</a> :</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;head &gt; title&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;p &gt; a&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;p &gt; a:nth-of-type(2)&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;p &gt; #link1&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;body &gt; a&quot;</span><span class="p">)</span>
-<span class="c"># []</span>
-</pre></div>
-</div>
-<p>Find the siblings of tags:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;#link1 ~ .sister&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;#link1 + .sister&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;]</span>
-</pre></div>
-</div>
-<p>Find tags by CSS class:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;.sister&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;[class~=sister]&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-</pre></div>
-</div>
-<p>Find tags by ID:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;#link1&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&quot;a#link2&quot;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;]</span>
-</pre></div>
-</div>
-<p>Find tags by whether an attribute is present:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&#39;a[href]&#39;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-</pre></div>
-</div>
-<p>Find tags by attribute value:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&#39;a[href=&quot;http://example.com/elsie&quot;]&#39;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&#39;a[href^=&quot;http://example.com/&quot;]&#39;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt;,</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&#39;a[href$=&quot;tillie&quot;]&#39;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;]</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&#39;a[href*=&quot;.com/el&quot;]&#39;</span><span class="p">)</span>
-<span class="c"># [&lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;]</span>
-</pre></div>
-</div>
-<p>Match language codes:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">multilingual_markup</span> <span class="o">=</span> <span class="s">&quot;&quot;&quot;</span>
-<span class="s"> &lt;p lang=&quot;en&quot;&gt;Hello&lt;/p&gt;</span>
-<span class="s"> &lt;p lang=&quot;en-us&quot;&gt;Howdy, y&#39;all&lt;/p&gt;</span>
-<span class="s"> &lt;p lang=&quot;en-gb&quot;&gt;Pip-pip, old fruit&lt;/p&gt;</span>
-<span class="s"> &lt;p lang=&quot;fr&quot;&gt;Bonjour mes amis&lt;/p&gt;</span>
-<span class="s">&quot;&quot;&quot;</span>
-<span class="n">multilingual_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">multilingual_markup</span><span class="p">)</span>
-<span class="n">multilingual_soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s">&#39;p[lang|=en]&#39;</span><span class="p">)</span>
-<span class="c"># [&lt;p lang=&quot;en&quot;&gt;Hello&lt;/p&gt;,</span>
-<span class="c"># &lt;p lang=&quot;en-us&quot;&gt;Howdy, y&#39;all&lt;/p&gt;,</span>
-<span class="c"># &lt;p lang=&quot;en-gb&quot;&gt;Pip-pip, old fruit&lt;/p&gt;]</span>
-</pre></div>
-</div>
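-<p>This is a convenience for users who already know CSS selector syntax. If CSS selectors are all you need, you might as well use <tt class="docutils literal"><span class="pre">lxml</span></tt> directly: it is faster and supports more of the CSS selector syntax. What Beautiful Soup offers is the ability to combine CSS selectors with the rest of its own API.</p>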
-</div>
-</div>
-<div class="section" id="id40">
-<h1>Modifying the tree<a class="headerlink" href="#id40" title="Permalink to this headline">¶</a></h1>
-<p>Beautiful Soup's main strength is searching the parse tree, but it also makes modifying the tree easy.</p>
-<div class="section" id="id41">
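-<h2>Changing tag names and attributes<a class="headerlink" href="#id41" title="Permalink to this headline">¶</a></h2>
-<p>This was covered earlier, in <a class="reference internal" href="#attributes">Attributes</a>, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes:</p>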
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&#39;&lt;b class=&quot;boldest&quot;&gt;Extremely bold&lt;/b&gt;&#39;</span><span class="p">)</span>
-<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
-
-<span class="n">tag</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">&quot;blockquote&quot;</span>
-<span class="n">tag</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#39;verybold&#39;</span>
-<span class="n">tag</span><span class="p">[</span><span class="s">&#39;id&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
-<span class="n">tag</span>
-<span class="c"># &lt;blockquote class=&quot;verybold&quot; id=&quot;1&quot;&gt;Extremely bold&lt;/blockquote&gt;</span>
-
-<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">&#39;class&#39;</span><span class="p">]</span>
-<span class="k">del</span> <span class="n">tag</span><span class="p">[</span><span class="s">&#39;id&#39;</span><span class="p">]</span>
-<span class="n">tag</span>
-<span class="c"># &lt;blockquote&gt;Extremely bold&lt;/blockquote&gt;</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="id42">
-<h2>Modifying .string<a class="headerlink" href="#id42" title="Permalink to this headline">¶</a></h2>
-<p>If you assign to a tag's <tt class="docutils literal"><span class="pre">.string</span></tt> attribute, the tag's previous contents are replaced with the string you give:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
-
-<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-<span class="n">tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">&quot;New link text.&quot;</span>
-<span class="n">tag</span>
-<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;New link text.&lt;/a&gt;</span>
-</pre></div>
-</div>
-<p>Be careful: if the tag contains other tags, assigning to its <tt class="docutils literal"><span class="pre">.string</span></tt> attribute destroys all of its existing contents, child tags included.</p>
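-<p>A minimal sketch of this warning, assuming Python 3 and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt> (both are assumptions, not something the example above specifies). The names <tt class="docutils literal"><span class="pre">had_child_before</span></tt> and <tt class="docutils literal"><span class="pre">has_child_after</span></tt> are illustrative only:</p>

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
# Assumption: the parser is named explicitly so no <html>/<body> wrapper is added.
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

had_child_before = tag.i is not None   # the <i> child exists before the assignment
tag.string = "New link text."          # replaces everything inside <a>
has_child_after = tag.i is not None    # the <i> child has been discarded

print(had_child_before, has_child_after)
print(tag)
```

-<p>If the child tags need to survive, it is safer to edit the individual strings inside the tag rather than assign to <tt class="docutils literal"><span class="pre">.string</span></tt> on the parent.</p>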
-</div>
-<div class="section" id="append">
-<h2>append()<a class="headerlink" href="#append" title="Permalink to this headline">¶</a></h2>
-<p><tt class="docutils literal"><span class="pre">Tag.append()</span></tt> adds content to a tag. It works just like calling <tt class="docutils literal"><span class="pre">.append()</span></tt> on a Python list:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;Foo&lt;/a&gt;&quot;</span><span class="p">)</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">&quot;Bar&quot;</span><span class="p">)</span>
-
-<span class="n">soup</span>
-<span class="c"># &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;a&gt;FooBar&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">contents</span>
-<span class="c"># [u&#39;Foo&#39;, u&#39;Bar&#39;]</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="beautifulsoup-new-string-new-tag">
-<h2>BeautifulSoup.new_string() and .new_tag()<a class="headerlink" href="#beautifulsoup-new-string-new-tag" title="Permalink to this headline">¶</a></h2>
-<p>If you need to add a string to a document, you can pass a Python string to <tt class="docutils literal"><span class="pre">append()</span></tt>, or call the factory method <tt class="docutils literal"><span class="pre">BeautifulSoup.new_string()</span></tt>:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;b&gt;&lt;/b&gt;&quot;</span><span class="p">)</span>
-<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
-<span class="n">tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">&quot;Hello&quot;</span><span class="p">)</span>
-<span class="n">new_string</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_string</span><span class="p">(</span><span class="s">&quot; there&quot;</span><span class="p">)</span>
-<span class="n">tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_string</span><span class="p">)</span>
-<span class="n">tag</span>
-<span class="c"># &lt;b&gt;Hello there&lt;/b&gt;</span>
-<span class="n">tag</span><span class="o">.</span><span class="n">contents</span>
-<span class="c"># [u&#39;Hello&#39;, u&#39; there&#39;]</span>
-</pre></div>
-</div>
-<p>If you want to create a comment, or any other subclass of <tt class="docutils literal"><span class="pre">NavigableString</span></tt>, pass that class as the second argument to <tt class="docutils literal"><span class="pre">new_string()</span></tt>:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">Comment</span>
-<span class="n">new_comment</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_string</span><span class="p">(</span><span class="s">&quot;Nice to see you.&quot;</span><span class="p">,</span> <span class="n">Comment</span><span class="p">)</span>
-<span class="n">tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_comment</span><span class="p">)</span>
-<span class="n">tag</span>
-<span class="c"># &lt;b&gt;Hello there&lt;!--Nice to see you.--&gt;&lt;/b&gt;</span>
-<span class="n">tag</span><span class="o">.</span><span class="n">contents</span>
-<span class="c"># [u&#39;Hello&#39;, u&#39; there&#39;, u&#39;Nice to see you.&#39;]</span>
-</pre></div>
-</div>
-<p>(This is a new feature in Beautiful Soup 4.2.1.)</p>
-<p>The best way to create a whole new tag is to call the factory method <tt class="docutils literal"><span class="pre">BeautifulSoup.new_tag()</span></tt>:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;b&gt;&lt;/b&gt;&quot;</span><span class="p">)</span>
-<span class="n">original_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">b</span>
-
-<span class="n">new_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">,</span> <span class="n">href</span><span class="o">=</span><span class="s">&quot;http://www.example.com&quot;</span><span class="p">)</span>
-<span class="n">original_tag</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_tag</span><span class="p">)</span>
-<span class="n">original_tag</span>
-<span class="c"># &lt;b&gt;&lt;a href=&quot;http://www.example.com&quot;&gt;&lt;/a&gt;&lt;/b&gt;</span>
-
-<span class="n">new_tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">&quot;Link text.&quot;</span>
-<span class="n">original_tag</span>
-<span class="c"># &lt;b&gt;&lt;a href=&quot;http://www.example.com&quot;&gt;Link text.&lt;/a&gt;&lt;/b&gt;</span>
-</pre></div>
-</div>
-<p>Only the first argument, the tag name, is required; all other arguments are optional.</p>
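-<p>As a rough sketch of how <tt class="docutils literal"><span class="pre">new_tag()</span></tt> and <tt class="docutils literal"><span class="pre">append()</span></tt> combine, the following builds a small list from Python data. It assumes Python 3 and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>; the <tt class="docutils literal"><span class="pre">labels</span></tt> list is made up for illustration:</p>

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" keeps the fragment free of <html>/<body> wrappers.
soup = BeautifulSoup("<ul></ul>", "html.parser")
labels = ["one", "two"]  # hypothetical data, not from the original text

for label in labels:
    item = soup.new_tag("li")   # only the tag name is required
    item.string = label         # set the text of the new tag
    soup.ul.append(item)        # attach it as the last child of <ul>

print(soup.ul)
```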
-</div>
-<div class="section" id="insert">
-<h2>insert()<a class="headerlink" href="#insert" title="Permalink to this headline">¶</a></h2>
-<p><tt class="docutils literal"><span class="pre">Tag.insert()</span></tt> is just like <tt class="docutils literal"><span class="pre">Tag.append()</span></tt>, except that the new element does not necessarily go at the end of its parent's <tt class="docutils literal"><span class="pre">.contents</span></tt>. It is inserted at whatever numeric position you specify, just like <tt class="docutils literal"><span class="pre">.insert()</span></tt> on a Python list:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
-<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-
-<span class="n">tag</span><span class="o">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">&quot;but did not endorse &quot;</span><span class="p">)</span>
-<span class="n">tag</span>
-<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;I linked to but did not endorse &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;</span>
-<span class="n">tag</span><span class="o">.</span><span class="n">contents</span>
-<span class="c"># [u&#39;I linked to &#39;, u&#39;but did not endorse &#39;, &lt;i&gt;example.com&lt;/i&gt;]</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="insert-before-insert-after">
-<h2>insert_before() and insert_after()<a class="headerlink" href="#insert-before-insert-after" title="Permalink to this headline">¶</a></h2>
-<p><tt class="docutils literal"><span class="pre">insert_before()</span></tt> inserts a tag or string immediately before the current tag or string in the parse tree:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;b&gt;stop&lt;/b&gt;&quot;</span><span class="p">)</span>
-<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">&quot;i&quot;</span><span class="p">)</span>
-<span class="n">tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">&quot;Don&#39;t&quot;</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">insert_before</span><span class="p">(</span><span class="n">tag</span><span class="p">)</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">b</span>
-<span class="c"># &lt;b&gt;&lt;i&gt;Don&#39;t&lt;/i&gt;stop&lt;/b&gt;</span>
-</pre></div>
-</div>
-<p><tt class="docutils literal"><span class="pre">insert_after()</span></tt> inserts a tag or string immediately after the current tag or string in the parse tree:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">insert_after</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_string</span><span class="p">(</span><span class="s">&quot; ever &quot;</span><span class="p">))</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">b</span>
-<span class="c"># &lt;b&gt;&lt;i&gt;Don&#39;t&lt;/i&gt; ever stop&lt;/b&gt;</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">contents</span>
-<span class="c"># [&lt;i&gt;Don&#39;t&lt;/i&gt;, u&#39; ever &#39;, u&#39;stop&#39;]</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="clear">
-<h2>clear()<a class="headerlink" href="#clear" title="Permalink to this headline">¶</a></h2>
-<p><tt class="docutils literal"><span class="pre">Tag.clear()</span></tt> removes the contents of a tag:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
-<span class="n">tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-
-<span class="n">tag</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
-<span class="n">tag</span>
-<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;&lt;/a&gt;</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="extract">
-<h2>extract()<a class="headerlink" href="#extract" title="Permalink to this headline">¶</a></h2>
-<p><tt class="docutils literal"><span class="pre">PageElement.extract()</span></tt> removes a tag or string from the tree and returns the element that was removed:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
-<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-
-<span class="n">i_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
-
-<span class="n">a_tag</span>
-<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;/a&gt;</span>
-
-<span class="n">i_tag</span>
-<span class="c"># &lt;i&gt;example.com&lt;/i&gt;</span>
-
-<span class="k">print</span><span class="p">(</span><span class="n">i_tag</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
-<span class="c"># None</span>
-</pre></div>
-</div>
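-<p>At this point you effectively have two parse trees: one rooted at the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object you used to parse the document, and one rooted at the tag that was extracted. You can go on to call <tt class="docutils literal"><span class="pre">extract</span></tt> on a child of the extracted element:</p>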
-<div class="highlight-python"><div class="highlight"><pre><span class="n">my_string</span> <span class="o">=</span> <span class="n">i_tag</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
-<span class="n">my_string</span>
-<span class="c"># u&#39;example.com&#39;</span>
-
-<span class="k">print</span><span class="p">(</span><span class="n">my_string</span><span class="o">.</span><span class="n">parent</span><span class="p">)</span>
-<span class="c"># None</span>
-<span class="n">i_tag</span>
-<span class="c"># &lt;i&gt;&lt;/i&gt;</span>
-</pre></div>
-</div>
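-<p>Because <tt class="docutils literal"><span class="pre">extract()</span></tt> returns the removed element, one common use is moving a node to another place in the same tree. The sketch below assumes Python 3 and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>; the markup and the <tt class="docutils literal"><span class="pre">id</span></tt> values are invented for illustration:</p>

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a paragraph that should move from one <div> to another.
markup = '<div id="a"><p>move me</p></div><div id="b"></div>'
soup = BeautifulSoup(markup, "html.parser")

paragraph = soup.find(id="a").p.extract()   # detach <p>; it now has no parent
detached_parent = paragraph.parent          # None while the node is detached

soup.find(id="b").append(paragraph)         # re-attach it under the second <div>

print(soup)
```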
-</div>
-<div class="section" id="decompose">
-<h2>decompose()<a class="headerlink" href="#decompose" title="Permalink to this headline">¶</a></h2>
-<p><tt class="docutils literal"><span class="pre">Tag.decompose()</span></tt> removes a tag from the tree, then completely destroys it and its contents:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
-<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">decompose</span><span class="p">()</span>
-
-<span class="n">a_tag</span>
-<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;/a&gt;</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="replace-with">
-<h2>replace_with()<a class="headerlink" href="#replace-with" title="Permalink to this headline">¶</a></h2>
-<p><tt class="docutils literal"><span class="pre">PageElement.replace_with()</span></tt> removes a tag or string from the tree and replaces it with the tag or string of your choice:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
-<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-
-<span class="n">new_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">&quot;b&quot;</span><span class="p">)</span>
-<span class="n">new_tag</span><span class="o">.</span><span class="n">string</span> <span class="o">=</span> <span class="s">&quot;example.net&quot;</span>
-<span class="n">a_tag</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="n">new_tag</span><span class="p">)</span>
-
-<span class="n">a_tag</span>
-<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;b&gt;example.net&lt;/b&gt;&lt;/a&gt;</span>
-</pre></div>
-</div>
-<p><tt class="docutils literal"><span class="pre">replace_with()</span></tt> returns the tag or string that was replaced, so that you can examine it or add it back to another part of the tree.</p>
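-<p>A short sketch of using that return value, assuming Python 3 and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>. The variable name <tt class="docutils literal"><span class="pre">old_tag</span></tt> is illustrative:</p>

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

new_tag = soup.new_tag("b")
new_tag.string = "example.net"

# replace_with() hands back the element it removed from the tree.
old_tag = soup.a.i.replace_with(new_tag)

print(old_tag)          # the detached <i> tag
print(old_tag.parent)   # it no longer belongs to the tree
print(soup.a)
```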
-</div>
-<div class="section" id="wrap">
-<h2>wrap()<a class="headerlink" href="#wrap" title="Permalink to this headline">¶</a></h2>
-<p><tt class="docutils literal"><span class="pre">PageElement.wrap()</span></tt> wraps an element in the tag you specify <a class="footnote-reference" href="#id89" id="id43">[8]</a>, and returns the new wrapper:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;p&gt;I wish I was bold.&lt;/p&gt;&quot;</span><span class="p">)</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">string</span><span class="o">.</span><span class="n">wrap</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">&quot;b&quot;</span><span class="p">))</span>
-<span class="c"># &lt;b&gt;I wish I was bold.&lt;/b&gt;</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">wrap</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">new_tag</span><span class="p">(</span><span class="s">&quot;div&quot;</span><span class="p">))</span>
-<span class="c"># &lt;div&gt;&lt;p&gt;&lt;b&gt;I wish I was bold.&lt;/b&gt;&lt;/p&gt;&lt;/div&gt;</span>
-</pre></div>
-</div>
-<p>This method is new in Beautiful Soup 4.0.5.</p>
-</div>
-<div class="section" id="unwrap">
-<h2>unwrap()<a class="headerlink" href="#unwrap" title="Permalink to this headline">¶</a></h2>
-<p><tt class="docutils literal"><span class="pre">Tag.unwrap()</span></tt> is the opposite of <tt class="docutils literal"><span class="pre">wrap()</span></tt>. It replaces a tag with whatever is inside that tag, which is useful for stripping out markup:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
-<span class="n">a_tag</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">a</span>
-
-<span class="n">a_tag</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">unwrap</span><span class="p">()</span>
-<span class="n">a_tag</span>
-<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;I linked to example.com&lt;/a&gt;</span>
-</pre></div>
-</div>
-<p>Like <tt class="docutils literal"><span class="pre">replace_with()</span></tt>, <tt class="docutils literal"><span class="pre">unwrap()</span></tt> returns the tag that was removed.</p>
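-<p>The sketch below inspects what <tt class="docutils literal"><span class="pre">unwrap()</span></tt> returns, assuming Python 3 and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>. The name <tt class="docutils literal"><span class="pre">removed</span></tt> is illustrative:</p>

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

# unwrap() moves the children of <i> up into <a> and returns the emptied <i>.
removed = a_tag.i.unwrap()

print(removed)          # the now-empty <i> tag
print(a_tag)
print(a_tag.contents)   # the two strings remain separate nodes
```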
-</div>
-</div>
-<div class="section" id="id44">
-<h1>Output<a class="headerlink" href="#id44" title="Permalink to this headline">¶</a></h1>
-<div class="section" id="id45">
-<h2>Pretty-printing<a class="headerlink" href="#id45" title="Permalink to this headline">¶</a></h2>
-<p>The <tt class="docutils literal"><span class="pre">prettify()</span></tt> method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with each HTML/XML tag on its own line:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">()</span>
-<span class="c"># &#39;&lt;html&gt;\n &lt;head&gt;\n &lt;/head&gt;\n &lt;body&gt;\n &lt;a href=&quot;http://example.com/&quot;&gt;\n...&#39;</span>
-
-<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
-<span class="c"># &lt;html&gt;</span>
-<span class="c"># &lt;head&gt;</span>
-<span class="c"># &lt;/head&gt;</span>
-<span class="c"># &lt;body&gt;</span>
-<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;</span>
-<span class="c"># I linked to</span>
-<span class="c"># &lt;i&gt;</span>
-<span class="c"># example.com</span>
-<span class="c"># &lt;/i&gt;</span>
-<span class="c"># &lt;/a&gt;</span>
-<span class="c"># &lt;/body&gt;</span>
-<span class="c"># &lt;/html&gt;</span>
-</pre></div>
-</div>
-<p>You can call <tt class="docutils literal"><span class="pre">prettify()</span></tt> on the top-level <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object, or on any of its tags:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
-<span class="c"># &lt;a href=&quot;http://example.com/&quot;&gt;</span>
-<span class="c"># I linked to</span>
-<span class="c"># &lt;i&gt;</span>
-<span class="c"># example.com</span>
-<span class="c"># &lt;/i&gt;</span>
-<span class="c"># &lt;/a&gt;</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="id46">
-<h2>Non-pretty printing<a class="headerlink" href="#id46" title="Permalink to this headline">¶</a></h2>
-<p>If you just want a string, with no fancy formatting, you can call Python's built-in <tt class="docutils literal"><span class="pre">unicode()</span></tt> or <tt class="docutils literal"><span class="pre">str()</span></tt> on a <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object or on a <tt class="docutils literal"><span class="pre">Tag</span></tt> within it:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
-<span class="c"># &#39;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;&#39;</span>
-
-<span class="nb">unicode</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">a</span><span class="p">)</span>
-<span class="c"># u&#39;&lt;a href=&quot;http://example.com/&quot;&gt;I linked to &lt;i&gt;example.com&lt;/i&gt;&lt;/a&gt;&#39;</span>
-</pre></div>
-</div>
-<p><tt class="docutils literal"><span class="pre">str()</span></tt> returns a string encoded in UTF-8. See <a class="reference internal" href="#id51">Encodings</a> for other options.</p>
-<p>You can also call <tt class="docutils literal"><span class="pre">encode()</span></tt> to get a bytestring, and <tt class="docutils literal"><span class="pre">decode()</span></tt> to get Unicode.</p>
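-<p>A minimal sketch of the two calls, assuming Python 3 (where <tt class="docutils literal"><span class="pre">str</span></tt> is already Unicode, unlike the Python 2 examples above) and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>. The sample markup is illustrative:</p>

```python
from bs4 import BeautifulSoup

# "\u00e9" is the character é; the parser choice is an assumption.
soup = BeautifulSoup("<p>Sacr\u00e9 bleu!</p>", "html.parser")

as_bytes = soup.encode("utf-8")   # a bytestring in the requested encoding
as_text = soup.decode()           # a Unicode string

print(type(as_bytes).__name__, as_bytes)
print(type(as_text).__name__, as_text)
```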
-</div>
-<div class="section" id="id47">
-<h2>Output formatters<a class="headerlink" href="#id47" title="Permalink to this headline">¶</a></h2>
-<p>If you give Beautiful Soup a document that contains HTML entities such as “&amp;ldquo;”, they are converted to Unicode characters:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&amp;ldquo;Dammit!&amp;rdquo; he said.&quot;</span><span class="p">)</span>
-<span class="nb">unicode</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
-<span class="c"># u&#39;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;\u201cDammit!\u201d he said.&lt;/body&gt;&lt;/html&gt;&#39;</span>
-</pre></div>
-</div>
-<p>If you then convert the document to a string, the Unicode characters are encoded as UTF-8. You do not get the HTML entities back:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="nb">str</span><span class="p">(</span><span class="n">soup</span><span class="p">)</span>
-<span class="c"># &#39;&lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;\xe2\x80\x9cDammit!\xe2\x80\x9d he said.&lt;/body&gt;&lt;/html&gt;&#39;</span>
-</pre></div>
-</div>
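-<p>If entities are wanted in the output, one option is the <tt class="docutils literal"><span class="pre">formatter</span></tt> argument accepted by the output methods. The sketch below assumes Python 3, the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>, and a Beautiful Soup release that supports <tt class="docutils literal"><span class="pre">formatter=&quot;html&quot;</span></tt>; treat it as an illustration rather than a description of the version documented here:</p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")

plain = soup.decode()                          # curly quotes stay as Unicode characters
with_entities = soup.decode(formatter="html")  # Unicode characters become named entities where possible

print(repr(plain))
print(with_entities)
```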
-</div>
-<div class="section" id="get-text">
-<h2>get_text()<a class="headerlink" href="#get-text" title="Permalink to this headline">¶</a></h2>
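-<p>If you only want the text part of a document or tag, you can call the <tt class="docutils literal"><span class="pre">get_text()</span></tt> method. It returns all the text in a document or beneath a tag, including text inside descendant tags, as a single Unicode string:</p>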
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&#39;&lt;a href=&quot;http://example.com/&quot;&gt;</span><span class="se">\n</span><span class="s">I linked to &lt;i&gt;example.com&lt;/i&gt;</span><span class="se">\n</span><span class="s">&lt;/a&gt;&#39;</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span>
-<span class="s">u&#39;</span><span class="se">\n</span><span class="s">I linked to example.com</span><span class="se">\n</span><span class="s">&#39;</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">i</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span>
-<span class="s">u&#39;example.com&#39;</span>
-</pre></div>
-</div>
-<p>You can specify a string to be used to join the bits of text together:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">(</span><span class="s">&quot;|&quot;</span><span class="p">)</span>
-<span class="c"># u&#39;\nI linked to |example.com|\n&#39;</span>
-</pre></div>
-</div>
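-<p>You can also tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">get_text</span><span class="p">(</span><span class="s">&quot;|&quot;</span><span class="p">,</span> <span class="n">strip</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
-<span class="c"># u&#39;I linked to|example.com&#39;</span>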
-</pre></div>
-</div>
-<p>Alternatively, use the <a class="reference internal" href="#strings-stripped-strings">.stripped_strings</a> generator and process the resulting strings yourself:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="p">[</span><span class="n">text</span> <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">stripped_strings</span><span class="p">]</span>
-<span class="c"># [u&#39;I linked to&#39;, u&#39;example.com&#39;]</span>
-</pre></div>
-</div>
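-<p>One way to process the strings by hand is to join them with a separator of your choice. This sketch assumes Python 3 and the built-in <tt class="docutils literal"><span class="pre">html.parser</span></tt>; the single-space separator is an arbitrary choice:</p>

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that are whitespace only.
pieces = list(soup.stripped_strings)
joined = " ".join(pieces)

print(pieces)
print(joined)
```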
-</div>
-</div>
-<div class="section" id="id48">
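-<h1>Specifying the parser to use<a class="headerlink" href="#id48" title="Permalink to this headline">¶</a></h1>
-<p>If you just need to parse some HTML, you can pass the markup to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor and Beautiful Soup will pick a parser for you. You can also pass an additional argument to choose which parser is used.</p>
-<p>The first argument to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor is the markup to parse, given as a string or an open filehandle. The second argument says how you would like the markup parsed. If you omit it, Beautiful Soup picks the best parser installed on the system, ranking them in this order: lxml, html5lib, then Python's built-in parser. You can override this choice by specifying one of the following:</p>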
-<blockquote>
-<div><ul class="simple">
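-<li>The type of markup you want to parse. Currently supported are “html”, “xml”, and “html5”.</li>
-<li>The name of the parser library you want to use. Currently supported are “lxml”, “html5lib”, and “html.parser” (Python's built-in HTML parser).</li>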
-</ul>
-</div></blockquote>
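-<p>The section <a class="reference internal" href="#id9">Installing a parser</a> describes the supported parsers and how to install them.</p>
-<p>If the parser you ask for is not installed, Beautiful Soup picks a different one. At present lxml is the only supported XML parser: if lxml is not installed, asking for an XML parser will not give you one, and asking for “lxml” when creating the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object will not work either.</p>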
-<div class="section" id="id49">
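-<h2>Differences between parsers<a class="headerlink" href="#id49" title="Permalink to this headline">¶</a></h2>
-<p>Beautiful Soup presents the same interface to a number of different parsers, but the parsers themselves differ. Different parsers can build different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here is a short document, parsed as HTML:</p>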
-<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;&lt;b /&gt;&lt;/a&gt;&quot;</span><span class="p">)</span>
-<span class="c"># &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;a&gt;&lt;b&gt;&lt;/b&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;</span>
-</pre></div>
-</div>
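-<p>Since an empty &lt;b /&gt; tag is not valid HTML, the parser turns it into a &lt;b&gt;&lt;/b&gt; tag pair.</p>
-<p>Here is the same document parsed as XML (this requires lxml to be installed). Note that the empty &lt;b /&gt; tag is left alone, and that the document is given an XML declaration instead of being put inside an &lt;html&gt; tag:</p>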
-<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;&lt;b /&gt;&lt;/a&gt;&quot;</span><span class="p">,</span> <span class="s">&quot;xml&quot;</span><span class="p">)</span>
-<span class="c"># &lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;</span>
-<span class="c"># &lt;a&gt;&lt;b/&gt;&lt;/a&gt;</span>
-</pre></div>
-</div>
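-<p>There are also differences between the HTML parsers. If you give Beautiful Soup a perfectly formed HTML document, these differences do not matter: one parser may be faster than another, but they all produce a correct parse tree.</p>
-<p>If the document is not perfectly formed, however, different parsers can give different results. Here is a short, invalid document parsed with lxml's HTML parser. Note that the dangling &lt;/p&gt; tag is simply ignored:</p>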
-<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;&lt;/p&gt;&quot;</span><span class="p">,</span> <span class="s">&quot;lxml&quot;</span><span class="p">)</span>
-<span class="c"># &lt;html&gt;&lt;body&gt;&lt;a&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;</span>
-</pre></div>
-</div>
-<p>Here is the same document parsed with html5lib, which gives a different result:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;&lt;/p&gt;&quot;</span><span class="p">,</span> <span class="s">&quot;html5lib&quot;</span><span class="p">)</span>
-<span class="c"># &lt;html&gt;&lt;head&gt;&lt;/head&gt;&lt;body&gt;&lt;a&gt;&lt;p&gt;&lt;/p&gt;&lt;/a&gt;&lt;/body&gt;&lt;/html&gt;</span>
-</pre></div>
-</div>
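-<p>Instead of ignoring the dangling &lt;/p&gt; tag, html5lib pairs it with an opening &lt;p&gt; tag. It also adds an empty &lt;head&gt; tag to the document.</p>
-<p>Here is the same document parsed with Python's built-in HTML parser:</p>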
-<div class="highlight-python"><div class="highlight"><pre><span class="n">BeautifulSoup</span><span class="p">(</span><span class="s">&quot;&lt;a&gt;&lt;/p&gt;&quot;</span><span class="p">,</span> <span class="s">&quot;html.parser&quot;</span><span class="p">)</span>
-<span class="c"># &lt;a&gt;&lt;/a&gt;</span>
-</pre></div>
-</div>
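-<p>Like lxml <a class="footnote-reference" href="#id88" id="id50">[7]</a>, Python's built-in parser ignores the closing &lt;/p&gt; tag. Unlike html5lib, it makes no attempt to create a well-formed HTML document or to wrap the fragment in a &lt;body&gt; tag. Unlike lxml, it does not even add an &lt;html&gt; tag.</p>
-<p>Because the fragment “&lt;a&gt;&lt;/p&gt;” is invalid, any of these results can be considered &#8220;correct&#8221;. html5lib uses techniques that are part of the HTML5 standard, so it has the best claim to being &#8220;correct&#8221;, but the output of all three parsers is legitimate.</p>
-<p>Differences between parsers can affect the results of your code. If you distribute code that uses <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt>, it is best to state which parser it was written for, to save others unnecessary trouble.</p>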
-</div>
-</div>
-<div class="section" id="id51">
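-<h1>Encodings<a class="headerlink" href="#id51" title="Permalink to this headline">¶</a></h1>
-<p>Every HTML or XML document is written in some specific encoding, such as ASCII or UTF-8, but once Beautiful Soup has parsed it, the document has been converted to Unicode:</p>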
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">&quot;&lt;h1&gt;Sacr</span><span class="se">\xc3\xa9</span><span class="s"> bleu!&lt;/h1&gt;&quot;</span>
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">h1</span>
-<span class="c"># &lt;h1&gt;Sacré bleu!&lt;/h1&gt;</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">h1</span><span class="o">.</span><span class="n">string</span>
-<span class="c"># u&#39;Sacr\xe9 bleu!&#39;</span>
-</pre></div>
-</div>
-<p>It's not magic (although it may look like it). Beautiful Soup uses a sub-library called <a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> to detect a document's encoding and convert it to Unicode. The detected encoding is available as the <tt class="docutils literal"><span class="pre">.original_encoding</span></tt> attribute of the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">original_encoding</span>
-<span class="s">&#39;utf-8&#39;</span>
-</pre></div>
-</div>
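-<p><a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> guesses correctly most of the time, but it is sometimes wrong. Even when it guesses correctly, it may do so only after a byte-by-byte scan of the whole document, which is slow. If you know the document's encoding ahead of time, you can avoid both mistakes and delays by passing it to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor as the <tt class="docutils literal"><span class="pre">from_encoding</span></tt> argument.</p>
-<p>The document below is written in ISO-8859-8. It is so short that Beautiful Soup mistakes it for ISO-8859-7:</p>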
-<div class="highlight-python"><pre>markup = b"&lt;h1&gt;\xed\xe5\xec\xf9&lt;/h1&gt;"
-soup = BeautifulSoup(markup)
-soup.h1
-&lt;h1&gt;νεμω&lt;/h1&gt;
-soup.original_encoding
-'ISO-8859-7'</pre>
-</div>
-<p>Passing the correct encoding as <tt class="docutils literal"><span class="pre">from_encoding</span></tt> fixes this:</p>
-<div class="highlight-python"><pre>soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
-soup.h1
-&lt;h1&gt;םולש&lt;/h1&gt;
-soup.original_encoding
-'iso8859-8'</pre>
-</div>
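-<p>The ambiguity of those four bytes can be reproduced with plain Python, without Beautiful Soup. The following is a minimal sketch (Python 3 syntax, standard library only). The variable names are illustrative, and it shows why the bytes are ambiguous, not how Beautiful Soup performs its detection:</p>

```python
# The same four bytes are valid in both ISO-8859-7 (Greek) and
# ISO-8859-8 (Hebrew), so a very short document gives a detector
# little evidence to choose between them.
raw = b"\xed\xe5\xec\xf9"

as_greek = raw.decode("iso-8859-7")
as_hebrew = raw.decode("iso-8859-8")

# ascii() escapes non-ASCII characters so the output is printable
# on any console, whatever its encoding.
print(ascii(as_greek))
print(ascii(as_hebrew))
```

-<p>Both decodings succeed without error, which is why an explicit <tt class="docutils literal"><span class="pre">from_encoding</span></tt> is the reliable choice when the encoding is known.</p>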
-<p>In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode is to replace some characters with the special Unicode character “REPLACEMENT CHARACTER” (U+FFFD, �) <a class="footnote-reference" href="#id90" id="id52">[9]</a>. If Beautiful Soup has to do this while detecting the encoding, it sets the <tt class="docutils literal"><span class="pre">.contains_replacement_characters</span></tt> attribute to <tt class="docutils literal"><span class="pre">True</span></tt> on the <tt class="docutils literal"><span class="pre">UnicodeDammit</span></tt> or <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object. This tells you that the Unicode version is not an exact representation of the original and that some data was lost. If a document contains � but <tt class="docutils literal"><span class="pre">.contains_replacement_characters</span></tt> is <tt class="docutils literal"><span class="pre">False</span></tt>, the � was present in the original document and does not stand in for missing data.</p>
-<div class="section" id="id53">
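-<h2>Output encoding<a class="headerlink" href="#id53" title="Permalink to this headline">¶</a></h2>
-<p>When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the input was not in UTF-8 to begin with. The input document in the example below is encoded in Latin-1:</p>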
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="n">b</span><span class="s">&#39;&#39;&#39;</span>
-<span class="s">&lt;html&gt;</span>
-<span class="s"> &lt;head&gt;</span>
-<span class="s"> &lt;meta content=&quot;text/html; charset=ISO-Latin-1&quot; http-equiv=&quot;Content-type&quot; /&gt;</span>
-<span class="s"> &lt;/head&gt;</span>
-<span class="s"> &lt;body&gt;</span>
-<span class="s"> &lt;p&gt;Sacr</span><span class="se">\xe9</span><span class="s"> bleu!&lt;/p&gt;</span>
-<span class="s"> &lt;/body&gt;</span>
-<span class="s">&lt;/html&gt;</span>
-<span class="s">&#39;&#39;&#39;</span>
-
-<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
-<span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
-<span class="c"># &lt;html&gt;</span>
-<span class="c"># &lt;head&gt;</span>
-<span class="c"># &lt;meta content=&quot;text/html; charset=utf-8&quot; http-equiv=&quot;Content-type&quot; /&gt;</span>
-<span class="c"># &lt;/head&gt;</span>
-<span class="c"># &lt;body&gt;</span>
-<span class="c"># &lt;p&gt;</span>
-<span class="c"># Sacré bleu!</span>
-<span class="c"># &lt;/p&gt;</span>
-<span class="c"># &lt;/body&gt;</span>
-<span class="c"># &lt;/html&gt;</span>
-</pre></div>
-</div>
-<p>Note that the &lt;meta&gt; tag in the output has been rewritten to reflect that the document is now in UTF-8.</p>
-<p>If you do not want UTF-8, you can pass an encoding to <tt class="docutils literal"><span class="pre">prettify()</span></tt>:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">(</span><span class="s">&quot;latin-1&quot;</span><span class="p">))</span>
-<span class="c"># &lt;html&gt;</span>
-<span class="c"># &lt;head&gt;</span>
-<span class="c"># &lt;meta content=&quot;text/html; charset=latin-1&quot; http-equiv=&quot;Content-type&quot; /&gt;</span>
-<span class="c"># ...</span>
-</pre></div>
-</div>
-<p>You can also call <tt class="docutils literal"><span class="pre">encode()</span></tt> on the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> object or on any element in the soup, just as you would on a Python string:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;latin-1&quot;</span><span class="p">)</span>
-<span class="c"># &#39;&lt;p&gt;Sacr\xe9 bleu!&lt;/p&gt;&#39;</span>
-
-<span class="n">soup</span><span class="o">.</span><span class="n">p</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;utf-8&quot;</span><span class="p">)</span>
-<span class="c"># &#39;&lt;p&gt;Sacr\xc3\xa9 bleu!&lt;/p&gt;&#39;</span>
-</pre></div>
-</div>
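-<p>Any characters that cannot be represented in the chosen encoding are converted into numeric XML character references. The document below contains the Unicode character SNOWMAN:</p>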
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="s">u&quot;&lt;b&gt;</span><span class="se">\N{SNOWMAN}</span><span class="s">&lt;/b&gt;&quot;</span>
-<span class="n">snowman_soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">)</span>
-<span class="n">tag</span> <span class="o">=</span> <span class="n">snowman_soup</span><span class="o">.</span><span class="n">b</span>
-</pre></div>
-</div>
-<p>The SNOWMAN character can be represented in UTF-8 (it looks like ☃), but encodings such as ISO-Latin-1 and ASCII have no representation for it, so in those encodings it is converted to “&amp;#9731;”:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;utf-8&quot;</span><span class="p">))</span>
-<span class="c"># &lt;b&gt;☃&lt;/b&gt;</span>
-
-<span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;latin-1&quot;</span><span class="p">))</span>
-<span class="c"># &lt;b&gt;&amp;#9731;&lt;/b&gt;</span>
-
-<span class="k">print</span><span class="p">(</span><span class="n">tag</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;ascii&quot;</span><span class="p">))</span>
-<span class="c"># &lt;b&gt;&amp;#9731;&lt;/b&gt;</span>
-</pre></div>
-</div>
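-<p>The same kind of substitution can be seen with Python's own <tt class="docutils literal"><span class="pre">str.encode()</span></tt> and its standard <tt class="docutils literal"><span class="pre">xmlcharrefreplace</span></tt> error handler. The following is a plain-Python sketch (Python 3 syntax, no Beautiful Soup involved), offered as an illustration of numeric character references, not as a description of Beautiful Soup's internals:</p>

```python
snowman = "<b>\N{SNOWMAN}</b>"

# UTF-8 can represent the character directly (as three bytes).
utf8_bytes = snowman.encode("utf-8")

# ASCII and Latin-1 cannot, so the standard "xmlcharrefreplace"
# error handler substitutes a numeric character reference.
ascii_bytes = snowman.encode("ascii", errors="xmlcharrefreplace")
latin1_bytes = snowman.encode("latin-1", errors="xmlcharrefreplace")

print(utf8_bytes)
print(ascii_bytes)
print(latin1_bytes)
```

-<p>The number 9731 is the decimal code point of SNOWMAN (U+2603), so a browser reading the reference displays the original character.</p>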
-</div>
-<div class="section" id="unicode-dammit">
-<h2>Unicode, Dammit<a class="headerlink" href="#unicode-dammit" title="Permalink to this headline">¶</a></h2>
-<p>You can use <a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> without using Beautiful Soup. It is useful whenever you have data in an unknown encoding and just want it to become Unicode:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">UnicodeDammit</span>
-<span class="n">dammit</span> <span class="o">=</span> <span class="n">UnicodeDammit</span><span class="p">(</span><span class="s">&quot;Sacr</span><span class="se">\xc3\xa9</span><span class="s"> bleu!&quot;</span><span class="p">)</span>
-<span class="k">print</span><span class="p">(</span><span class="n">dammit</span><span class="o">.</span><span class="n">unicode_markup</span><span class="p">)</span>
-<span class="c"># Sacré bleu!</span>
-<span class="n">dammit</span><span class="o">.</span><span class="n">original_encoding</span>
-<span class="c"># &#39;utf-8&#39;</span>
-</pre></div>
-</div>
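-<p>Unicode, Dammit's guesses become much more accurate if you install the <tt class="docutils literal"><span class="pre">chardet</span></tt> or <tt class="docutils literal"><span class="pre">cchardet</span></tt> Python library. The more data you give it, the more accurate the guess. If you have your own suspicions about what the encoding might be, you can pass them in as a list, and those encodings will be tried first:</p>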
-<div class="highlight-python"><div class="highlight"><pre><span class="n">dammit</span> <span class="o">=</span> <span class="n">UnicodeDammit</span><span class="p">(</span><span class="s">&quot;Sacr</span><span class="se">\xe9</span><span class="s"> bleu!&quot;</span><span class="p">,</span> <span class="p">[</span><span class="s">&quot;latin-1&quot;</span><span class="p">,</span> <span class="s">&quot;iso-8859-1&quot;</span><span class="p">])</span>
-<span class="k">print</span><span class="p">(</span><span class="n">dammit</span><span class="o">.</span><span class="n">unicode_markup</span><span class="p">)</span>
-<span class="c"># Sacré bleu!</span>
-<span class="n">dammit</span><span class="o">.</span><span class="n">original_encoding</span>
-<span class="c"># &#39;latin-1&#39;</span>
-</pre></div>
-</div>
-<p><a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> has two special features that Beautiful Soup itself does not use.</p>
-<div class="section" id="id54">
-<h3>Smart quotes<a class="headerlink" href="#id54" title="Permalink to this headline">¶</a></h3>
-<p>Unicode, Dammit can convert Microsoft smart quotes <a class="footnote-reference" href="#id91" id="id55">[10]</a> to HTML or XML entities:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">markup</span> <span class="o">=</span> <span class="n">b</span><span class="s">&quot;&lt;p&gt;I just </span><span class="se">\x93</span><span class="s">love</span><span class="se">\x94</span><span class="s"> Microsoft Word</span><span class="se">\x92</span><span class="s">s smart quotes&lt;/p&gt;&quot;</span>
-
-<span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">&quot;windows-1252&quot;</span><span class="p">],</span> <span class="n">smart_quotes_to</span><span class="o">=</span><span class="s">&quot;html&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">unicode_markup</span>
-<span class="c"># u&#39;&lt;p&gt;I just &amp;ldquo;love&amp;rdquo; Microsoft Word&amp;rsquo;s smart quotes&lt;/p&gt;&#39;</span>
-
-<span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">&quot;windows-1252&quot;</span><span class="p">],</span> <span class="n">smart_quotes_to</span><span class="o">=</span><span class="s">&quot;xml&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">unicode_markup</span>
-<span class="c"># u&#39;&lt;p&gt;I just &amp;#x201C;love&amp;#x201D; Microsoft Word&amp;#x2019;s smart quotes&lt;/p&gt;&#39;</span>
-</pre></div>
-</div>
-<p>You can also convert smart quotes to ASCII quotes:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">&quot;windows-1252&quot;</span><span class="p">],</span> <span class="n">smart_quotes_to</span><span class="o">=</span><span class="s">&quot;ascii&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">unicode_markup</span>
-<span class="c"># u&#39;&lt;p&gt;I just &quot;love&quot; Microsoft Word\&#39;s smart quotes&lt;/p&gt;&#39;</span>
-</pre></div>
-</div>
-<p>This is a useful feature, but Beautiful Soup does not use it. By default, Beautiful Soup converts smart quotes to Unicode characters, like everything else:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">UnicodeDammit</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="p">[</span><span class="s">&quot;windows-1252&quot;</span><span class="p">])</span><span class="o">.</span><span class="n">unicode_markup</span>
-<span class="c"># u&#39;&lt;p&gt;I just \u201clove\u201d Microsoft Word\u2019s smart quotes&lt;/p&gt;&#39;</span>
-</pre></div>
-</div>
-</div>
-<div class="section" id="id56">
-<h3>Inconsistent encodings<a class="headerlink" href="#id56" title="Permalink to this headline">¶</a></h3>
-<p>Sometimes a document is mostly in UTF-8 but contains Windows-1252 characters such as Microsoft smart quotes <a class="footnote-reference" href="#id91" id="id57">[10]</a>. This can happen when a website includes data from multiple sources. <tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> turns such a document into pure UTF-8. Here is a simple example:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">snowmen</span> <span class="o">=</span> <span class="p">(</span><span class="s">u&quot;</span><span class="se">\N{SNOWMAN}</span><span class="s">&quot;</span> <span class="o">*</span> <span class="mi">3</span><span class="p">)</span>
-<span class="n">quote</span> <span class="o">=</span> <span class="p">(</span><span class="s">u&quot;</span><span class="se">\N{LEFT DOUBLE QUOTATION MARK}</span><span class="s">I like snowmen!</span><span class="se">\N{RIGHT DOUBLE QUOTATION MARK}</span><span class="s">&quot;</span><span class="p">)</span>
-<span class="n">doc</span> <span class="o">=</span> <span class="n">snowmen</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;utf8&quot;</span><span class="p">)</span> <span class="o">+</span> <span class="n">quote</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">&quot;windows_1252&quot;</span><span class="p">)</span>
-</pre></div>
-</div>
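-<p>This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252. You can display the snowmen or the quotes, but not both, because they use different encodings:</p>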
-<div class="highlight-python"><div class="highlight"><pre><span class="k">print</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span>
-<span class="c"># ☃☃☃�I like snowmen!�</span>
-
-<span class="k">print</span><span class="p">(</span><span class="n">doc</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s">&quot;windows-1252&quot;</span><span class="p">))</span>
-<span class="c"># â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”</span>
-</pre></div>
-</div>
-<p>Decoding the document as UTF-8 raises a <tt class="docutils literal"><span class="pre">UnicodeDecodeError</span></tt>, and decoding it as Windows-1252 gives you gibberish. Fortunately, <tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> converts the string to pure UTF-8, so you can decode it and display the snowmen and the quotation marks together:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">new_doc</span> <span class="o">=</span> <span class="n">UnicodeDammit</span><span class="o">.</span><span class="n">detwingle</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span>
-<span class="k">print</span><span class="p">(</span><span class="n">new_doc</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s">&quot;utf8&quot;</span><span class="p">))</span>
-<span class="c"># ☃☃☃“I like snowmen!”</span>
-</pre></div>
-</div>
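-<p>Why neither single codec works on the mixed document can be checked with plain Python. The sketch below (Python 3 syntax, standard library only) rebuilds the same bytes and tries both decodings. It does not call <tt class="docutils literal"><span class="pre">detwingle()</span></tt> itself, and the names <tt class="docutils literal"><span class="pre">utf8_failed</span></tt> and <tt class="docutils literal"><span class="pre">as_cp1252</span></tt> are illustrative:</p>

```python
snowmen = "\N{SNOWMAN}" * 3
quote = ("\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!"
         "\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# Strict UTF-8 decoding fails on the Windows-1252 quote bytes (0x93, 0x94).
try:
    doc.decode("utf8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Windows-1252 decoding succeeds, but each three-byte snowman is
# misread as three unrelated characters (mojibake).
as_cp1252 = doc.decode("windows-1252")

print(utf8_failed)
# ascii() keeps the output printable on any console.
print(ascii(as_cp1252))
```

-<p>The quotation marks survive the Windows-1252 decoding and the snowmen do not, which is the situation <tt class="docutils literal"><span class="pre">detwingle()</span></tt> is meant to repair before the data is parsed.</p>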
-<p><tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> only knows how to handle Windows-1252 embedded in UTF-8, but this is the most common case.</p>
-<p>Be sure to call <tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> on your data before passing it to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> or <tt class="docutils literal"><span class="pre">UnicodeDammit</span></tt> constructor. If you try to parse a UTF-8 document that contains Windows-1252 data, you are likely to get gibberish such as â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”.</p>
-<p><tt class="docutils literal"><span class="pre">UnicodeDammit.detwingle()</span></tt> is new in Beautiful Soup 4.1.0.</p>
-</div>
-</div>
-</div>
-<div class="section" id="id58">
-<h1>Parsing only part of a document<a class="headerlink" href="#id58" title="Permalink to this headline">¶</a></h1>
-<p>Suppose you only want to look at a document's &lt;a&gt; tags. Parsing the entire document just to find them wastes time and memory. It is faster to ignore everything that is not an &lt;a&gt; tag from the start. The <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> class lets you choose which parts of an incoming document are parsed, so that only the content matching the <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> ends up in the tree. Create a <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> and pass it to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor as the <tt class="docutils literal"><span class="pre">parse_only</span></tt> argument.</p>
-<div class="section" id="soupstrainer">
-<h2>SoupStrainer<a class="headerlink" href="#soupstrainer" title="Permalink to this headline">¶</a></h2>
-<p>The <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> class takes the same arguments as a typical search method: <a class="reference internal" href="#id32">name</a>, <a class="reference internal" href="#css">attrs</a>, <a class="reference internal" href="#recursive">recursive</a>, <a class="reference internal" href="#text">text</a>, and <a class="reference internal" href="#keyword">**kwargs</a>. Here are three <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> objects:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">SoupStrainer</span>
-
-<span class="n">only_a_tags</span> <span class="o">=</span> <span class="n">SoupStrainer</span><span class="p">(</span><span class="s">&quot;a&quot;</span><span class="p">)</span>
-
-<span class="n">only_tags_with_id_link2</span> <span class="o">=</span> <span class="n">SoupStrainer</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">&quot;link2&quot;</span><span class="p">)</span>
-
-<span class="k">def</span> <span class="nf">is_short_string</span><span class="p">(</span><span class="n">string</span><span class="p">):</span>
- <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">string</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">10</span>
-
-<span class="n">only_short_strings</span> <span class="o">=</span> <span class="n">SoupStrainer</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="n">is_short_string</span><span class="p">)</span>
-</pre></div>
-</div>
-<p>Here is the “Alice” document once more, to show what it looks like when parsed with each of these three <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> objects:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">html_doc</span> <span class="o">=</span> <span class="s">&quot;&quot;&quot;</span>
-<span class="s">&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse&#39;s story&lt;/title&gt;&lt;/head&gt;</span>
-
-<span class="s">&lt;p class=&quot;title&quot;&gt;&lt;b&gt;The Dormouse&#39;s story&lt;/b&gt;&lt;/p&gt;</span>
-
-<span class="s">&lt;p class=&quot;story&quot;&gt;Once upon a time there were three little sisters; and their names were</span>
-<span class="s">&lt;a href=&quot;http://example.com/elsie&quot; class=&quot;sister&quot; id=&quot;link1&quot;&gt;Elsie&lt;/a&gt;,</span>
-<span class="s">&lt;a href=&quot;http://example.com/lacie&quot; class=&quot;sister&quot; id=&quot;link2&quot;&gt;Lacie&lt;/a&gt; and</span>
-<span class="s">&lt;a href=&quot;http://example.com/tillie&quot; class=&quot;sister&quot; id=&quot;link3&quot;&gt;Tillie&lt;/a&gt;;</span>
-<span class="s">and they lived at the bottom of a well.&lt;/p&gt;</span>
-
-<span class="s">&lt;p class=&quot;story&quot;&gt;...&lt;/p&gt;</span>
-<span class="s">&quot;&quot;&quot;</span>
-
-<span class="k">print</span><span class="p">(</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">,</span> <span class="s">&quot;html.parser&quot;</span><span class="p">,</span> <span class="n">parse_only</span><span class="o">=</span><span class="n">only_a_tags</span><span class="p">)</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/elsie&quot; id=&quot;link1&quot;&gt;</span>
-<span class="c"># Elsie</span>
-<span class="c"># &lt;/a&gt;</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;</span>
-<span class="c"># Lacie</span>
-<span class="c"># &lt;/a&gt;</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/tillie&quot; id=&quot;link3&quot;&gt;</span>
-<span class="c"># Tillie</span>
-<span class="c"># &lt;/a&gt;</span>
-
-<span class="k">print</span><span class="p">(</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">,</span> <span class="s">&quot;html.parser&quot;</span><span class="p">,</span> <span class="n">parse_only</span><span class="o">=</span><span class="n">only_tags_with_id_link2</span><span class="p">)</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
-<span class="c"># &lt;a class=&quot;sister&quot; href=&quot;http://example.com/lacie&quot; id=&quot;link2&quot;&gt;</span>
-<span class="c"># Lacie</span>
-<span class="c"># &lt;/a&gt;</span>
-
-<span class="k">print</span><span class="p">(</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">,</span> <span class="s">&quot;html.parser&quot;</span><span class="p">,</span> <span class="n">parse_only</span><span class="o">=</span><span class="n">only_short_strings</span><span class="p">)</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
-<span class="c"># Elsie</span>
-<span class="c"># ,</span>
-<span class="c"># Lacie</span>
-<span class="c"># and</span>
-<span class="c"># Tillie</span>
-<span class="c"># ...</span>
-<span class="c">#</span>
-</pre></div>
-</div>
-<p>You can also pass a <tt class="docutils literal"><span class="pre">SoupStrainer</span></tt> into any of the methods covered in <a class="reference internal" href="#id24">Searching the tree</a>. This is probably not a common use, but it is worth mentioning:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_doc</span><span class="p">)</span>
-<span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="n">only_short_strings</span><span class="p">)</span>
-<span class="c"># [u&#39;\n\n&#39;, u&#39;\n\n&#39;, u&#39;Elsie&#39;, u&#39;,\n&#39;, u&#39;Lacie&#39;, u&#39; and\n&#39;, u&#39;Tillie&#39;,</span>
-<span class="c"># u&#39;\n\n&#39;, u&#39;...&#39;, u&#39;\n&#39;]</span>
-</pre></div>
-</div>
-</div>
-</div>
-<div class="section" id="id59">
-<h1>Troubleshooting<a class="headerlink" href="#id59" title="Permalink to this headline">¶</a></h1>
-<div class="section" id="id60">
-<h2>Diagnosing code<a class="headerlink" href="#id60" title="Permalink to this headline">¶</a></h2>
-<p>If you are having trouble understanding what Beautiful Soup does to a document, pass the document to the <tt class="docutils literal"><span class="pre">diagnose()</span></tt> function (new in Beautiful Soup 4.2.0). Beautiful Soup prints a report showing how different parsers handle the document and indicates which parser the current setup would use:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4.diagnose</span> <span class="kn">import</span> <span class="n">diagnose</span>
-<span class="n">data</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">&quot;bad.html&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
-<span class="n">diagnose</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
-
-<span class="c"># Diagnostic running on Beautiful Soup 4.2.0</span>
-<span class="c"># Python version 2.7.3 (default, Aug 1 2012, 05:16:07)</span>
-<span class="c"># I noticed that html5lib is not installed. Installing it may help.</span>
-<span class="c"># Found lxml version 2.3.2.0</span>
-<span class="c">#</span>
-<span class="c"># Trying to parse your data with html.parser</span>
-<span class="c"># Here&#39;s what html.parser did with the document:</span>
-<span class="c"># ...</span>
-</pre></div>
-</div>
-<p>The output of <tt class="docutils literal"><span class="pre">diagnose()</span></tt> may be enough to show you the cause of the problem. If it is not, you can paste the output when asking someone else for help.</p>
-</div>
-<div class="section" id="id61">
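-<h2>Errors when parsing a document<a class="headerlink" href="#id61" title="Permalink to this headline">¶</a></h2>
-<p>There are two kinds of parse errors. The first is a crash: you feed a document to Beautiful Soup and it raises an exception, usually <tt class="docutils literal"><span class="pre">HTMLParser.HTMLParseError</span></tt>. The second is unexpected behavior: the parse tree that Beautiful Soup produces looks very different from the document used to create it.</p>
-<p>These problems are almost never caused by Beautiful Soup itself. That is not because Beautiful Soup is amazingly well written, but because it contains no parsing code. The errors come from the underlying parser. If one parser does not handle a document well, the best solution is to try another. See <a class="reference internal" href="#id9">Installing a parser</a> for details.</p>
-<p>The most common parse errors are <tt class="docutils literal"><span class="pre">HTMLParser.HTMLParseError:</span> <span class="pre">malformed</span> <span class="pre">start</span> <span class="pre">tag</span></tt> and <tt class="docutils literal"><span class="pre">HTMLParser.HTMLParseError:</span> <span class="pre">bad</span> <span class="pre">end</span> <span class="pre">tag</span></tt>. Both are raised by Python's built-in HTML parser, and the solution is to <a class="reference internal" href="#id9">install lxml or html5lib</a>.</p>
-<p>The most common kind of unexpected behavior is being unable to find a tag that you can plainly see in the document: <tt class="docutils literal"><span class="pre">find_all()</span></tt> returns [] or <tt class="docutils literal"><span class="pre">find()</span></tt> returns None. This is another common problem with Python's built-in HTML parser, which sometimes skips tags it does not understand. Again, the solution is to <a class="reference internal" href="#id9">install lxml or html5lib</a>.</p>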
-</div>
-<div class="section" id="id62">
-<h2>Version mismatch problems<a class="headerlink" href="#id62" title="Permalink to this headline">¶</a></h2>
-<ul class="simple">
-<li><tt class="docutils literal"><span class="pre">SyntaxError:</span> <span class="pre">Invalid</span> <span class="pre">syntax</span></tt> (on the line <tt class="docutils literal"><span class="pre">ROOT_TAG_NAME</span> <span class="pre">=</span> <span class="pre">u'[document]'</span></tt>): caused by running the Python 2 version of Beautiful Soup under Python 3 without converting the code.</li>
-<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">HTMLParser</span></tt>: caused by running the Python 2 version of Beautiful Soup under Python 3.</li>
-<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">html.parser</span></tt>: caused by running the Python 3 version of Beautiful Soup under Python 2.</li>
-<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">BeautifulSoup</span></tt>: caused by running Beautiful Soup 3 code in an environment where Beautiful Soup 3 is not installed, or by writing Beautiful Soup 4 code without knowing that it must be imported from the <tt class="docutils literal"><span class="pre">bs4</span></tt> package.</li>
-<li><tt class="docutils literal"><span class="pre">ImportError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">bs4</span></tt>: caused by running Beautiful Soup 4 code in an environment where Beautiful Soup 4 is not installed.</li>
-</ul>
-</div>
-<div class="section" id="xml">
-<h2>Parsing XML<a class="headerlink" href="#xml" title="Permalink to this headline">¶</a></h2>
-<p>By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass &#8220;xml&#8221; as the second argument to the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">markup</span><span class="p">,</span> <span class="s">&quot;xml&quot;</span><span class="p">)</span>
-</pre></div>
-</div>
-<p>You will also need to <a class="reference internal" href="#id9">install lxml</a>.</p>
-</div>
-<div class="section" id="id63">
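-<h2>Other parser problems<a class="headerlink" href="#id63" title="Permalink to this headline">¶</a></h2>
-<ul class="simple">
-<li>If the same code gives different results in different environments, it is probably because the two environments use different parsers. For example, one environment may have lxml installed while the other has only html5lib. See <a class="reference internal" href="#id49">Differences between parsers</a> for why this matters. The fix is to specify the parser in the <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> constructor.</li>
-<li>Because HTML tags and attributes are <a class="reference external" href="http://www.w3.org/TR/html5/syntax.html#syntax">case-insensitive</a>, all three HTML parsers convert tag and attribute names to lowercase when processing a document. For example, the markup &lt;TAG&gt;&lt;/TAG&gt; is converted to &lt;tag&gt;&lt;/tag&gt;. If you want to preserve uppercase tag names, you need to <a class="reference internal" href="#xml">parse the document as XML</a>.</li>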
-</ul>
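-<p>The lowercasing can be observed in Python's built-in parser on its own. The following is a small sketch (Python 3 syntax, standard library only, no Beautiful Soup involved). <tt class="docutils literal"><span class="pre">TagNameCollector</span></tt> is a hypothetical helper written for this illustration:</p>

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Hypothetical helper: records tag and attribute names as the
    built-in parser reports them."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only the names.
        self.seen.append((tag, [name for name, _value in attrs]))


collector = TagNameCollector()
collector.feed('<TAG CLASS="Big"></TAG>')
collector.close()

# The parser hands over lowercased tag and attribute names.
print(collector.seen)
```

-<p>By the time a tree builder sees the tag, the original capitalization of the names is already gone, which is why parsing as XML is the way to keep it.</p>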
-</div>
-<div class="section" id="id65">
-<h2>Miscellaneous<a class="headerlink" href="#id65" title="Permalink to this headline">¶</a></h2>
-<ul class="simple">
-<li><tt class="docutils literal"><span class="pre">UnicodeEncodeError:</span> <span class="pre">'charmap'</span> <span class="pre">codec</span> <span class="pre">can't</span> <span class="pre">encode</span> <span class="pre">character</span> <span class="pre">u'\xfoo'</span> <span class="pre">in</span> <span class="pre">position</span> <span class="pre">bar</span></tt> (or almost any other <tt class="docutils literal"><span class="pre">UnicodeEncodeError</span></tt>): this is not a problem with Beautiful Soup. It shows up in two main situations. The first is printing a Unicode character that your console cannot display (see the <a class="reference external" href="http://wiki.Python.org/moin/PrintFails">Python wiki</a>). The second is writing to a file a Unicode character that the file's encoding does not support. In that case, the simplest fix is to encode the string explicitly as UTF-8 with <tt class="docutils literal"><span class="pre">u.encode(&quot;utf8&quot;)</span></tt>.</li>
-<li><tt class="docutils literal"><span class="pre">KeyError:</span> <span class="pre">[attr]</span></tt>: caused by accessing <tt class="docutils literal"><span class="pre">tag['attr']</span></tt> when the tag does not define that attribute. The most common cases are <tt class="docutils literal"><span class="pre">KeyError:</span> <span class="pre">'href'</span></tt> and <tt class="docutils literal"><span class="pre">KeyError:</span> <span class="pre">'class'</span></tt>. If you are not sure that an attribute is defined, use <tt class="docutils literal"><span class="pre">tag.get('attr')</span></tt>, just as you would with a Python dictionary.</li>
-<li><tt class="docutils literal"><span class="pre">AttributeError:</span> <span class="pre">'ResultSet'</span> <span class="pre">object</span> <span class="pre">has</span> <span class="pre">no</span> <span class="pre">attribute</span> <span class="pre">'foo'</span></tt>: usually caused by treating the result of <tt class="docutils literal"><span class="pre">find_all()</span></tt> as a single tag or string. The result is actually a list of tags and strings, a <tt class="docutils literal"><span class="pre">ResultSet</span></tt> object. Iterate over the list and look at the <tt class="docutils literal"><span class="pre">.foo</span></tt> attribute of each item, or use <tt class="docutils literal"><span class="pre">find()</span></tt> if you only want one result.</li>
-<li><tt class="docutils literal"><span class="pre">AttributeError:</span> <span class="pre">'NoneType'</span> <span class="pre">object</span> <span class="pre">has</span> <span class="pre">no</span> <span class="pre">attribute</span> <span class="pre">'foo'</span></tt>: usually caused by calling <tt class="docutils literal"><span class="pre">find()</span></tt> and then immediately accessing the .foo attribute of the result. In this case <tt class="docutils literal"><span class="pre">find()</span></tt> found nothing, so it returned <tt class="docutils literal"><span class="pre">None</span></tt>. You need to work out why <tt class="docutils literal"><span class="pre">find()</span></tt> returned <tt class="docutils literal"><span class="pre">None</span></tt>.</li>
-</ul>
-</div>
-<div class="section" id="id66">
-<h2>如何提高效率<a class="headerlink" href="#id66" title="Permalink to this headline">¶</a></h2>
-<p>Beautiful Soup对文档的解析速度不会比它所依赖的解析器更快,如果对计算时间要求很高或者计算机的时间比程序员的时间更值钱,那么就应该直接使用 <a class="reference external" href="http://lxml.de/">lxml</a> .</p>
-<p>换句话说,还有提高Beautiful Soup效率的办法,使用lxml作为解析器.Beautiful Soup用lxml做解析器比用html5lib或Python内置解析器速度快很多.</p>
-<p>安装 <a class="reference external" href="http://pypi.Python.org/pypi/cchardet/">cchardet</a> 后文档的解码的编码检测会速度更快</p>
-<p><a class="reference internal" href="#id58">解析部分文档</a> 不会节省多少解析时间,但是会节省很多内存,并且搜索时也会变得更快.</p>
-</div>
-</div>
-<div class="section" id="beautiful-soup-3">
-<h1>Beautiful Soup 3<a class="headerlink" href="#beautiful-soup-3" title="Permalink to this headline">¶</a></h1>
-<p>Beautiful Soup 3是上一个发布版本,目前已经停止维护.Beautiful Soup 3库目前已经被几个主要的linux平台添加到源里:</p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">Python-beautifulsoup</span></tt></p>
-<p>在PyPi中分发的包名字是 <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> :</p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">BeautifulSoup</span></tt></p>
-<p><tt class="docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">BeautifulSoup</span></tt></p>
-<p>或通过 <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz">Beautiful Soup 3.2.0源码包</a> 安装</p>
-<p>Beautiful Soup 3的在线文档查看 <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">这里</a> ,当然还有 <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html">中文版</a> ,然后再读本片文档,来对比Beautiful Soup 4中有什新变化.</p>
-<div class="section" id="id70">
-<h2>迁移到BS4<a class="headerlink" href="#id70" title="Permalink to this headline">¶</a></h2>
-<p>只要一个小变动就能让大部分的Beautiful Soup 3代码使用Beautiful Soup 4的库和方法&#8212;-修改 <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> 对象的引入方式:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">BeautifulSoup</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
-</pre></div>
-</div>
-<p>修改为:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
-</pre></div>
-</div>
-<ul class="simple">
-<li>如果代码抛出 <tt class="docutils literal"><span class="pre">ImportError</span></tt> 异常“No module named BeautifulSoup”,原因可能是尝试执行Beautiful Soup 3,但环境中只安装了Beautiful Soup 4库</li>
-<li>如果代码跑出 <tt class="docutils literal"><span class="pre">ImportError</span></tt> 异常“No module named bs4”,原因可能是尝试运行Beautiful Soup 4的代码,但环境中只安装了Beautiful Soup 3.</li>
-</ul>
-<p>虽然BS4兼容绝大部分BS3的功能,但BS3中的大部分方法已经不推荐使用了,就方法按照 <a class="reference external" href="http://www.Python.org/dev/peps/pep-0008/">PEP8标准</a> 重新定义了方法名.很多方法都重新定义了方法名,但只有少数几个方法没有向下兼容.</p>
-<p>上述内容就是BS3迁移到BS4的注意事项</p>
-<div class="section" id="id71">
-<h3>需要的解析器<a class="headerlink" href="#id71" title="Permalink to this headline">¶</a></h3>
-<p>Beautiful Soup 3曾使用Python的 <tt class="docutils literal"><span class="pre">SGMLParser</span></tt> 解析器,这个模块在Python3中已经被移除了.Beautiful Soup 4默认使用系统的 <tt class="docutils literal"><span class="pre">html.parser</span></tt> ,也可以使用lxml或html5lib扩展库代替.查看 <a class="reference internal" href="#id9">安装解析器</a> 章节</p>
-<p>因为 <tt class="docutils literal"><span class="pre">html.parser</span></tt> 解析器与 <tt class="docutils literal"><span class="pre">SGMLParser</span></tt> 解析器不同,它们在处理格式不正确的文档时也会产生不同结果.通常 <tt class="docutils literal"><span class="pre">html.parser</span></tt> 解析器会抛出异常.所以推荐安装扩展库作为解析器.有时 <tt class="docutils literal"><span class="pre">html.parser</span></tt> 解析出的文档树结构与 <tt class="docutils literal"><span class="pre">SGMLParser</span></tt> 的不同.如果发生这种情况,那么需要升级BS3来处理新的文档树.</p>
-</div>
-<div class="section" id="id72">
-<h3>方法名的变化<a class="headerlink" href="#id72" title="Permalink to this headline">¶</a></h3>
-<ul class="simple">
-<li><tt class="docutils literal"><span class="pre">renderContents</span></tt> -&gt; <tt class="docutils literal"><span class="pre">encode_contents</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">replaceWith</span></tt> -&gt; <tt class="docutils literal"><span class="pre">replace_with</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">replaceWithChildren</span></tt> -&gt; <tt class="docutils literal"><span class="pre">unwrap</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">findAll</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_all</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">findAllNext</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_all_next</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">findAllPrevious</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_all_previous</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">findNext</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_next</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">findNextSibling</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_next_sibling</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">findNextSiblings</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_next_siblings</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">findParent</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_parent</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">findParents</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_parents</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">findPrevious</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_previous</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">findPreviousSibling</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_previous_sibling</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">findPreviousSiblings</span></tt> -&gt; <tt class="docutils literal"><span class="pre">find_previous_siblings</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">nextSibling</span></tt> -&gt; <tt class="docutils literal"><span class="pre">next_sibling</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">previousSibling</span></tt> -&gt; <tt class="docutils literal"><span class="pre">previous_sibling</span></tt></li>
-</ul>
-<p>Beautiful Soup构造方法的参数部分也有名字变化:</p>
-<ul class="simple">
-<li><tt class="docutils literal"><span class="pre">BeautifulSoup(parseOnlyThese=...)</span></tt> -&gt; <tt class="docutils literal"><span class="pre">BeautifulSoup(parse_only=...)</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">BeautifulSoup(fromEncoding=...)</span></tt> -&gt; <tt class="docutils literal"><span class="pre">BeautifulSoup(from_encoding=...)</span></tt></li>
-</ul>
-<p>为了适配Python3,修改了一个方法名:</p>
-<ul class="simple">
-<li><tt class="docutils literal"><span class="pre">Tag.has_key()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">Tag.has_attr()</span></tt></li>
-</ul>
-<p>修改了一个属性名,让它看起来更专业点:</p>
-<ul class="simple">
-<li><tt class="docutils literal"><span class="pre">Tag.isSelfClosing</span></tt> -&gt; <tt class="docutils literal"><span class="pre">Tag.is_empty_element</span></tt></li>
-</ul>
-<p>修改了下面3个属性的名字,以免雨Python保留字冲突.这些变动不是向下兼容的,如果在BS3中使用了这些属性,那么在BS4中这些代码无法执行.</p>
-<ul class="simple">
-<li>UnicodeDammit.Unicode -&gt; UnicodeDammit.Unicode_markup``</li>
-<li><tt class="docutils literal"><span class="pre">Tag.next</span></tt> -&gt; <tt class="docutils literal"><span class="pre">Tag.next_element</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">Tag.previous</span></tt> -&gt; <tt class="docutils literal"><span class="pre">Tag.previous_element</span></tt></li>
-</ul>
-</div>
-<div class="section" id="id73">
-<h3>生成器<a class="headerlink" href="#id73" title="Permalink to this headline">¶</a></h3>
-<p>将下列生成器按照PEP8标准重新命名,并转换成对象的属性:</p>
-<ul class="simple">
-<li><tt class="docutils literal"><span class="pre">childGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">children</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">nextGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">next_elements</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">nextSiblingGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">next_siblings</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">previousGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">previous_elements</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">previousSiblingGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">previous_siblings</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">recursiveChildGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">descendants</span></tt></li>
-<li><tt class="docutils literal"><span class="pre">parentGenerator()</span></tt> -&gt; <tt class="docutils literal"><span class="pre">parents</span></tt></li>
-</ul>
-<p>所以迁移到BS4版本时要替换这些代码:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">tag</span><span class="o">.</span><span class="n">parentGenerator</span><span class="p">():</span>
- <span class="o">...</span>
-</pre></div>
-</div>
-<p>替换为:</p>
-<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">tag</span><span class="o">.</span><span class="n">parents</span><span class="p">:</span>
- <span class="o">...</span>
-</pre></div>
-</div>
-<p>(两种调用方法现在都能使用)</p>
-<p>BS3中有的生成器循环结束后会返回 <tt class="docutils literal"><span class="pre">None</span></tt> 然后结束.这是个bug.新版生成器不再返回 <tt class="docutils literal"><span class="pre">None</span></tt> .</p>
-<p>BS4中增加了2个新的生成器, <a class="reference internal" href="#strings-stripped-strings">.strings 和 stripped_strings</a> . <tt class="docutils literal"><span class="pre">.strings</span></tt> 生成器返回NavigableString对象, <tt class="docutils literal"><span class="pre">.stripped_strings</span></tt> 方法返回去除前后空白的Python的string对象.</p>
-</div>
-<div class="section" id="id74">
-<h3>XML<a class="headerlink" href="#id74" title="Permalink to this headline">¶</a></h3>
-<p>BS4中移除了解析XML的 <tt class="docutils literal"><span class="pre">BeautifulStoneSoup</span></tt> 类.如果要解析一段XML文档,使用 <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> 构造方法并在第二个参数设置为“xml”.同时 <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> 构造方法也不再识别 <tt class="docutils literal"><span class="pre">isHTML</span></tt> 参数.</p>
-<p>Beautiful Soup处理XML空标签的方法升级了.旧版本中解析XML时必须指明哪个标签是空标签. 构造方法的 <tt class="docutils literal"><span class="pre">selfClosingTags</span></tt> 参数已经不再使用.新版Beautiful Soup将所有空标签解析为空元素,如果向空元素中添加子节点,那么这个元素就不再是空元素了.</p>
-</div>
-<div class="section" id="id75">
-<h3>实体<a class="headerlink" href="#id75" title="Permalink to this headline">¶</a></h3>
-<p>HTML或XML实体都会被解析成Unicode字符,Beautiful Soup 3版本中有很多处理实体的方法,在新版中都被移除了. <tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> 构造方法也不再接受 <tt class="docutils literal"><span class="pre">smartQuotesTo</span></tt> 或 <tt class="docutils literal"><span class="pre">convertEntities</span></tt> 参数. <a class="reference internal" href="#unicode-dammit">编码自动检测</a> 方法依然有 <tt class="docutils literal"><span class="pre">smart_quotes_to</span></tt> 参数,但是默认会将引号转换成Unicode.内容配置项 <tt class="docutils literal"><span class="pre">HTML_ENTITIES</span></tt> , <tt class="docutils literal"><span class="pre">XML_ENTITIES</span></tt> 和 <tt class="docutils literal"><span class="pre">XHTML_ENTITIES</span></tt> 在新版中被移除.因为它们代表的特性已经不再被支持.</p>
-<p>如果在输出文档时想把Unicode字符转换成HTML实体,而不是输出成UTF-8编码,那就需要用到 <a class="reference internal" href="#id47">输出格式</a> 的方法.</p>
-</div>
-<div class="section" id="id76">
-<h3>迁移杂项<a class="headerlink" href="#id76" title="Permalink to this headline">¶</a></h3>
-<p><a class="reference internal" href="#string">Tag.string</a> 属性现在是一个递归操作.如果A标签只包含了一个B标签,那么A标签的.string属性值与B标签的.string属性值相同.</p>
-<p><a class="reference internal" href="#id12">多值属性</a> 比如 <tt class="docutils literal"><span class="pre">class</span></tt> 属性包含一个他们的值的列表,而不是一个字符串.这可能会影响到如何按照CSS类名哦搜索tag.</p>
-<p>如果使用 <tt class="docutils literal"><span class="pre">find*</span></tt> 方法时同时传入了 <a class="reference internal" href="#text">text 参数</a> 和 <a class="reference internal" href="#id32">name 参数</a> .Beautiful Soup会搜索指定name的tag,并且这个tag的 <a class="reference internal" href="#string">Tag.string</a> 属性包含text参数的内容.结果中不会包含字符串本身.旧版本中Beautiful Soup会忽略掉tag参数,只搜索text参数.</p>
-<p><tt class="docutils literal"><span class="pre">BeautifulSoup</span></tt> 构造方法不再支持 markupMassage 参数.现在由解析器负责文档的解析正确性.</p>
-<p>很少被用到的几个解析器方法在新版中被移除,比如 <tt class="docutils literal"><span class="pre">ICantBelieveItsBeautifulSoup</span></tt> 和 <tt class="docutils literal"><span class="pre">BeautifulSOAP</span></tt> .现在由解析器完全负责如何解释模糊不清的文档标记.</p>
-<p><tt class="docutils literal"><span class="pre">prettify()</span></tt> 方法在新版中返回Unicode字符串,不再返回字节流.</p>
-<p><a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html">BeautifulSoup3 文档</a></p>
-<table class="docutils footnote" frame="void" id="id82" rules="none">
-<colgroup><col class="label" /><col /></colgroup>
-<tbody valign="top">
-<tr><td class="label"><a class="fn-backref" href="#id3">[1]</a></td><td>BeautifulSoup的google讨论组不是很活跃,可能是因为库已经比较完善了吧,但是作者还是会很热心的尽量帮你解决问题的.</td></tr>
-</tbody>
-</table>
-<table class="docutils footnote" frame="void" id="id83" rules="none">
-<colgroup><col class="label" /><col /></colgroup>
-<tbody valign="top">
-<tr><td class="label">[2]</td><td><em>(<a class="fn-backref" href="#id19">1</a>, <a class="fn-backref" href="#id23">2</a>)</em> 文档被解析成树形结构,所以下一步解析过程应该是当前节点的子节点</td></tr>
-</tbody>
-</table>
-<table class="docutils footnote" frame="void" id="id84" rules="none">
-<colgroup><col class="label" /><col /></colgroup>
-<tbody valign="top">
-<tr><td class="label"><a class="fn-backref" href="#id26">[3]</a></td><td>过滤器只能作为搜索文档的参数,或者说应该叫参数类型更为贴切,原文中用了 <tt class="docutils literal"><span class="pre">filter</span></tt> 因此翻译为过滤器</td></tr>
-</tbody>
-</table>
-<table class="docutils footnote" frame="void" id="id85" rules="none">
-<colgroup><col class="label" /><col /></colgroup>
-<tbody valign="top">
-<tr><td class="label"><a class="fn-backref" href="#id31">[4]</a></td><td>元素参数,HTML文档中的一个tag节点,不能是文本节点</td></tr>
-</tbody>
-</table>
-<table class="docutils footnote" frame="void" id="id86" rules="none">
-<colgroup><col class="label" /><col /></colgroup>
-<tbody valign="top">
-<tr><td class="label">[5]</td><td><em>(<a class="fn-backref" href="#id18">1</a>, <a class="fn-backref" href="#id33">2</a>, <a class="fn-backref" href="#id34">3</a>, <a class="fn-backref" href="#id35">4</a>, <a class="fn-backref" href="#id36">5</a>)</em> 采用先序遍历方式</td></tr>
-</tbody>
-</table>
-<table class="docutils footnote" frame="void" id="id87" rules="none">
-<colgroup><col class="label" /><col /></colgroup>
-<tbody valign="top">
-<tr><td class="label">[6]</td><td><em>(<a class="fn-backref" href="#id38">1</a>, <a class="fn-backref" href="#id39">2</a>)</em> CSS选择器是一种单独的文档搜索语法, 参考 <a class="reference external" href="http://www.w3school.com.cn/css/css_selector_type.asp">http://www.w3school.com.cn/css/css_selector_type.asp</a></td></tr>
-</tbody>
-</table>
-<table class="docutils footnote" frame="void" id="id88" rules="none">
-<colgroup><col class="label" /><col /></colgroup>
-<tbody valign="top">
-<tr><td class="label"><a class="fn-backref" href="#id50">[7]</a></td><td>原文写的是 html5lib, 译者觉得这是愿文档的一个笔误</td></tr>
-</tbody>
-</table>
-<table class="docutils footnote" frame="void" id="id89" rules="none">
-<colgroup><col class="label" /><col /></colgroup>
-<tbody valign="top">
-<tr><td class="label"><a class="fn-backref" href="#id43">[8]</a></td><td>wrap含有包装,打包的意思,但是这里的包装不是在外部包装而是将当前tag的内部内容包装在一个tag里.包装原来内容的新tag依然在执行 <a class="reference internal" href="#wrap">wrap()</a> 方法的tag内</td></tr>
-</tbody>
-</table>
-<table class="docutils footnote" frame="void" id="id90" rules="none">
-<colgroup><col class="label" /><col /></colgroup>
-<tbody valign="top">
-<tr><td class="label"><a class="fn-backref" href="#id52">[9]</a></td><td>文档中特殊编码字符被替换成特殊字符(通常是�)的过程是Beautful Soup自动实现的,如果想要多种编码格式的文档被完全转换正确,那么,只好,预先手动处理,统一编码格式</td></tr>
-</tbody>
-</table>
-<table class="docutils footnote" frame="void" id="id91" rules="none">
-<colgroup><col class="label" /><col /></colgroup>
-<tbody valign="top">
-<tr><td class="label">[10]</td><td><em>(<a class="fn-backref" href="#id55">1</a>, <a class="fn-backref" href="#id57">2</a>)</em> 智能引号,常出现在microsoft的word软件中,即在某一段落中按引号出现的顺序每个引号都被自动转换为左引号,或右引号.</td></tr>
-</tbody>
-</table>
-</div>
-</div>
-</div>
-
-
- </div>
- </div>
- </div>
- <div class="sphinxsidebar">
- <div class="sphinxsidebarwrapper">
- <h3><a href="index.html">Table Of Contents</a></h3>
- <ul>
-<li><a class="reference internal" href="#">Beautiful Soup 4.2.0 文档</a><ul>
-<li><a class="reference internal" href="#id1">寻求帮助</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id4">快速开始</a></li>
-<li><a class="reference internal" href="#id5">安装 Beautiful Soup</a><ul>
-<li><a class="reference internal" href="#id8">安装完成后的问题</a></li>
-<li><a class="reference internal" href="#id9">安装解析器</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id10">如何使用</a></li>
-<li><a class="reference internal" href="#id11">对象的种类</a><ul>
-<li><a class="reference internal" href="#tag">Tag</a><ul>
-<li><a class="reference internal" href="#name">Name</a></li>
-<li><a class="reference internal" href="#attributes">Attributes</a><ul>
-<li><a class="reference internal" href="#id12">多值属性</a></li>
-</ul>
-</li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id13">可以遍历的字符串</a></li>
-<li><a class="reference internal" href="#beautifulsoup">BeautifulSoup</a></li>
-<li><a class="reference internal" href="#id14">注释及特殊字符串</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id15">遍历文档树</a><ul>
-<li><a class="reference internal" href="#id16">子节点</a><ul>
-<li><a class="reference internal" href="#id17">tag的名字</a></li>
-<li><a class="reference internal" href="#contents-children">.contents 和 .children</a></li>
-<li><a class="reference internal" href="#descendants">.descendants</a></li>
-<li><a class="reference internal" href="#string">.string</a></li>
-<li><a class="reference internal" href="#strings-stripped-strings">.strings 和 stripped_strings</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id20">父节点</a><ul>
-<li><a class="reference internal" href="#parent">.parent</a></li>
-<li><a class="reference internal" href="#parents">.parents</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id21">兄弟节点</a><ul>
-<li><a class="reference internal" href="#next-sibling-previous-sibling">.next_sibling 和 .previous_sibling</a></li>
-<li><a class="reference internal" href="#next-siblings-previous-siblings">.next_siblings 和 .previous_siblings</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id22">回退和前进</a><ul>
-<li><a class="reference internal" href="#next-element-previous-element">.next_element 和 .previous_element</a></li>
-<li><a class="reference internal" href="#next-elements-previous-elements">.next_elements 和 .previous_elements</a></li>
-</ul>
-</li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id24">搜索文档树</a><ul>
-<li><a class="reference internal" href="#id25">过滤器</a><ul>
-<li><a class="reference internal" href="#id27">字符串</a></li>
-<li><a class="reference internal" href="#id28">正则表达式</a></li>
-<li><a class="reference internal" href="#id29">列表</a></li>
-<li><a class="reference internal" href="#true">True</a></li>
-<li><a class="reference internal" href="#id30">方法</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#find-all">find_all()</a><ul>
-<li><a class="reference internal" href="#id32">name 参数</a></li>
-<li><a class="reference internal" href="#keyword">keyword 参数</a></li>
-<li><a class="reference internal" href="#css">按CSS搜索</a></li>
-<li><a class="reference internal" href="#text"><tt class="docutils literal"><span class="pre">text</span></tt> 参数</a></li>
-<li><a class="reference internal" href="#limit"><tt class="docutils literal"><span class="pre">limit</span></tt> 参数</a></li>
-<li><a class="reference internal" href="#recursive"><tt class="docutils literal"><span class="pre">recursive</span></tt> 参数</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#find-all-tag">像调用 <tt class="docutils literal"><span class="pre">find_all()</span></tt> 一样调用tag</a></li>
-<li><a class="reference internal" href="#find">find()</a></li>
-<li><a class="reference internal" href="#find-parents-find-parent">find_parents() 和 find_parent()</a></li>
-<li><a class="reference internal" href="#find-next-siblings-find-next-sibling">find_next_siblings() 合 find_next_sibling()</a></li>
-<li><a class="reference internal" href="#find-previous-siblings-find-previous-sibling">find_previous_siblings() 和 find_previous_sibling()</a></li>
-<li><a class="reference internal" href="#find-all-next-find-next">find_all_next() 和 find_next()</a></li>
-<li><a class="reference internal" href="#find-all-previous-find-previous">find_all_previous() 和 find_previous()</a></li>
-<li><a class="reference internal" href="#id37">CSS选择器</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id40">修改文档树</a><ul>
-<li><a class="reference internal" href="#id41">修改tag的名称和属性</a></li>
-<li><a class="reference internal" href="#id42">修改 .string</a></li>
-<li><a class="reference internal" href="#append">append()</a></li>
-<li><a class="reference internal" href="#beautifulsoup-new-string-new-tag">BeautifulSoup.new_string() 和 .new_tag()</a></li>
-<li><a class="reference internal" href="#insert">insert()</a></li>
-<li><a class="reference internal" href="#insert-before-insert-after">insert_before() 和 insert_after()</a></li>
-<li><a class="reference internal" href="#clear">clear()</a></li>
-<li><a class="reference internal" href="#extract">extract()</a></li>
-<li><a class="reference internal" href="#decompose">decompose()</a></li>
-<li><a class="reference internal" href="#replace-with">replace_with()</a></li>
-<li><a class="reference internal" href="#wrap">wrap()</a></li>
-<li><a class="reference internal" href="#unwrap">unwrap()</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id44">输出</a><ul>
-<li><a class="reference internal" href="#id45">格式化输出</a></li>
-<li><a class="reference internal" href="#id46">压缩输出</a></li>
-<li><a class="reference internal" href="#id47">输出格式</a></li>
-<li><a class="reference internal" href="#get-text">get_text()</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id48">指定文档解析器</a><ul>
-<li><a class="reference internal" href="#id49">解析器之间的区别</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id51">编码</a><ul>
-<li><a class="reference internal" href="#id53">输出编码</a></li>
-<li><a class="reference internal" href="#unicode-dammit">Unicode, dammit! (靠!)</a><ul>
-<li><a class="reference internal" href="#id54">智能引号</a></li>
-<li><a class="reference internal" href="#id56">矛盾的编码</a></li>
-</ul>
-</li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id58">解析部分文档</a><ul>
-<li><a class="reference internal" href="#soupstrainer">SoupStrainer</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#id59">常见问题</a><ul>
-<li><a class="reference internal" href="#id60">代码诊断</a></li>
-<li><a class="reference internal" href="#id61">文档解析错误</a></li>
-<li><a class="reference internal" href="#id62">版本错误</a></li>
-<li><a class="reference internal" href="#xml">解析成XML</a></li>
-<li><a class="reference internal" href="#id63">解析器的错误</a></li>
-<li><a class="reference internal" href="#id65">杂项错误</a></li>
-<li><a class="reference internal" href="#id66">如何提高效率</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#beautiful-soup-3">Beautiful Soup 3</a><ul>
-<li><a class="reference internal" href="#id70">迁移到BS4</a><ul>
-<li><a class="reference internal" href="#id71">需要的解析器</a></li>
-<li><a class="reference internal" href="#id72">方法名的变化</a></li>
-<li><a class="reference internal" href="#id73">生成器</a></li>
-<li><a class="reference internal" href="#id74">XML</a></li>
-<li><a class="reference internal" href="#id75">实体</a></li>
-<li><a class="reference internal" href="#id76">迁移杂项</a></li>
-</ul>
-</li>
-</ul>
-</li>
-</ul>
-
- <h3>This Page</h3>
- <ul class="this-page-menu">
- <li><a href="_sources/zh.txt"
- rel="nofollow">Show Source</a></li>
- </ul>
-<div id="searchbox" style="display: none">
- <h3>Quick search</h3>
- <form class="search" action="search.html" method="get">
- <input type="text" name="q" />
- <input type="submit" value="Go" />
- <input type="hidden" name="check_keywords" value="yes" />
- <input type="hidden" name="area" value="default" />
- </form>
- <p class="searchtip" style="font-size: 90%">
- Enter search terms or a module, class or function name.
- </p>
-</div>
-<script type="text/javascript">$('#searchbox').show(0);</script>
- </div>
- </div>
- <div class="clearer"></div>
- </div>
- <div class="related">
- <h3>Navigation</h3>
- <ul>
- <li class="right" style="margin-right: 10px">
- <a href="genindex.html" title="General Index"
- >index</a></li>
- <li><a href="index.html">Beautiful Soup 4.2.0 documentation</a> &raquo;</li>
- </ul>
- </div>
- <div class="footer">
- &copy; Copyright 2012, Leonard Richardson.
- Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.2b1.
- </div>
- </body>
-</html>
\ No newline at end of file
diff --git a/doc/source/README b/doc/source/README
new file mode 100644
index 0000000..6fbab84
--- /dev/null
+++ b/doc/source/README
@@ -0,0 +1,23 @@
+Translation credits
+###################
+
+I keep a copy of all translations in this repository so that it's easy
+to host translations on the Beautiful Soup website. These are
+generally not the canonical versions of the translations, though.
+
+doc.html/index.jp.html is a copy of the 2013 Japanese translation hosted at
+http://kondou.com/BS4/. I don't know who to credit for the
+translation.
+
+doc.html/index.kr.html is a copy of the 2012 Korean translation formerly hosted
+at http://coreapython.hosting.paran.com/etc/beautifulsoup4.html. I
+retrieved this copy from the Wayback Machine. I'm not sure who wrote
+the translation but I believe the credit goes to "Johnsonj".
+
+doc.ptbr/source/index.rst is a 2019 Brazilian Portuguese translation by Cezar
+Peixeiro. The version in this repository has been modified from
+https://github.com/czrpxr/BeautifulSoup4-ptbr-translation.
+
+doc.zh/source/index.rst is a 2018 Chinese translation by Deron
+Wang. The version in this repository has been copied from
+https://github.com/DeronW/beautifulsoup.
diff --git a/doc/source/index.rst b/doc/source/index.rst
index ac7409f..08063a5 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -3114,6 +3114,27 @@ You can speed up encoding detection significantly by installing the
the document, but it can save a lot of memory, and it'll make
`searching` the document much faster.
+Translating this documentation
+==============================
+
+New translations of the Beautiful Soup documentation are greatly
+appreciated. Translations should be licensed under the MIT license,
+just like Beautiful Soup and its English documentation are.
+
+There are two ways of getting your translation into the main code base
+and onto the Beautiful Soup website:
+
+1. Create a branch of the Beautiful Soup repository, add your
+ translation, and propose a merge with the main branch, the same
+ as you would do with a proposed change to the source code.
+2. Send a message to the Beautiful Soup discussion group with a link to
+ your translation, or attach your translation to the message.
+
+Use the Chinese or Brazilian Portuguese translations as your model. In
+particular, please translate the ``doc/source/index.rst`` file, rather
+than the HTML version of the documentation. This makes it possible to
+publish the documentation in a variety of formats, not just HTML.
+
Beautiful Soup 3
================