Crawler Notes #crawler

BeautifulSoup

The prettify() method of a BeautifulSoup object returns the parsed document as a string with standard indentation. The title.string attribute of a BeautifulSoup object returns the text of the title node in the HTML.

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
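soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package

print(soup.prettify())    # re-indented markup; the parser also closes the unclosed <body> and <html>
print(soup.title.string)  # text of the <title> node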

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dormouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

Node selectors

Accessing a node name directly as an attribute selects that node element, and its string attribute returns the text inside the node.

print(soup.title)
print(soup.title.string)
print(type(soup.title))
print(soup.head)
print(soup.p)

Output:


<title>The Dormouse's story</title>
The Dormouse's story
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>

Note: when the document contains several nodes with the same name, this form of access matches only the first one, as with the p node above.
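To make the "first match only" behaviour concrete, here is a minimal sketch. The markup is a made-up sample, and it uses the built-in "html.parser" instead of lxml so that it runs without extra packages; both choices are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup containing two <p> nodes.
html = '<div><p class="title">first</p><p class="story">second</p></div>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p             # attribute access returns only the first <p>
all_p = soup.find_all("p")   # find_all returns every <p> in document order

print(first_p)
print(len(all_p))
```

When every matching node is needed, find_all() is the usual choice; attribute access is a shortcut for the first one.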

Extracting information

Getting the name

print(soup.p.name)

Getting attributes

print(soup.p.attrs)
print(soup.p.attrs['name'])

Getting the content

print(soup.p.string)
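The three accessors above have a few edge cases that are easy to miss. The sketch below shows them on a small made-up document (the markup and variable names are illustrative assumptions, and "html.parser" is used so the example needs no extra packages).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one <p> with a single child, one with mixed children.
html = """
<p class="title big" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once <a href="#">link</a> more</p>
"""
soup = BeautifulSoup(html, "html.parser")

p = soup.p
tag_name = p.name        # the node name
name_attr = p["name"]    # shorthand for p.attrs["name"]
classes = p["class"]     # class is multi-valued, so a list comes back
missing = p.get("id")    # .get() returns None instead of raising KeyError
nested_text = p.string   # a single child chain: the text of the inner <b>

second = soup.find_all("p")[1]
mixed = second.string    # None, because the node has several children
full = second.get_text() # concatenated text of all descendants

print(tag_name, name_attr, classes, missing)
print(nested_text, mixed, full)
```

If string comes back as None on a node with mixed content, get_text() is usually what is wanted.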

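Relational selection: direct child nodes (contents)

# soup.p is the first <p>, whose only child is the <b> node, so the second
# <p> (class "story") is selected here to show mixed text and tag children.
print(soup.find_all('p')[1].contents)

Returned result:

['Once upon a time there were three little sisters; and their names were\n', <a class="sister" href="http://.com/elsie" id="link1"><!-- Elsie --></a>, ',\n', <a class="sister" href="http://.com/lacie" id="link2">Lacie</a>, ' and\n', <a class="sister" href="http://.com/tillie" id="link3">Tillie</a>, ';\nand they lived at the bottom of a well.']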

Another way to get the direct child nodes: children

print(soup.p.children)  # an iterator, not a list
for i, child in enumerate(soup.p.children):
    print(i, child)

Descendant nodes: descendants

print(soup.p.descendants)  # a generator over all nested nodes
for i, child in enumerate(soup.p.descendants):
    print(i, child)
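The difference between children and descendants is easiest to see by counting. A minimal sketch on a made-up fragment (markup and variable names are assumptions; "html.parser" avoids the lxml dependency):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: <b> is nested inside <a>, which is nested inside <p>.
html = '<p>Hello <a href="#"><b>bold</b> link</a> world</p>'
soup = BeautifulSoup(html, "html.parser")

children = list(soup.p.children)        # direct children only
descendants = list(soup.p.descendants)  # every nested node, depth first

# Text nodes are included in both; filter on Tag to keep elements only.
descendant_tags = [d.name for d in descendants if isinstance(d, Tag)]

print(len(children), len(descendants), descendant_tags)
```

Here children yields the two text nodes and the a node, while descendants also walks into a and b.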

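Direct parent: parent. A node has exactly one parent but can have many children, so parent returns a single node while children returns an iterator.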

print(soup.p.parent)

Ancestor nodes: parents

for i, parent in enumerate(soup.p.parents):
    print(i, parent)
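Printing whole ancestor nodes gets long quickly, so a common variation is to print only their names. A small sketch, assuming a made-up document and "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical document with a clear nesting chain.
html = "<html><body><div><p>text</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p
parent_name = p.parent.name                   # the direct parent
ancestor_names = [a.name for a in p.parents]  # innermost ancestor first

print(parent_name)
print(ancestor_names)
```

The last entry is the BeautifulSoup object itself, whose name is "[document]".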

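Sibling nodes:

next_sibling: the next sibling node
previous_sibling: the previous sibling node
next_siblings: all following sibling nodes
previous_siblings: all preceding sibling nodes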
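The sibling attributes can be tried out as follows. This is a sketch on made-up markup with no whitespace between the tags; in real pages the whitespace between tags is itself a text node and shows up as a sibling. The ids and variable names are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; tags are adjacent so no whitespace text nodes appear.
html = '<p><a id="a1">one</a><a id="a2">two</a><a id="a3">three</a></p>'
soup = BeautifulSoup(html, "html.parser")

middle = soup.find(id="a2")
nxt = middle.next_sibling        # the node right after
prev = middle.previous_sibling   # the node right before

# The plural forms are generators over all siblings in that direction.
after_first = [s["id"] for s in soup.find(id="a1").next_siblings]
before_last = [s["id"] for s in soup.find(id="a3").previous_siblings]
end = soup.find(id="a3").next_sibling  # None: there is no further sibling

print(nxt["id"], prev["id"], after_first, before_last, end)
```

Note that previous_siblings walks backwards, so the nearest sibling comes first.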

Method selectors

soup.find_all(name = 'ul')

attrs

soup.find_all(attrs={'id': 'list-1'})
# Attributes can also be passed as keyword arguments
soup.find_all(id='list-1')
soup.find_all(class_='element')  # class is a Python keyword, so a trailing underscore is required
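The calls above refer to ul nodes and a list-1 id that are not part of the sample document earlier in the article. The sketch below supplies a small made-up document for them; the markup is an assumption, and "html.parser" is used to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup matching the names used in the calls above.
html = """
<ul class="list" id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
  <li class="element">Jay</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

uls = soup.find_all(name="ul")                    # select by node name
by_attrs = soup.find_all(attrs={"id": "list-1"})  # select by attrs dict
by_kwarg = soup.find_all(id="list-1")             # same query as a keyword
elements = soup.find_all(class_="element")        # select by CSS class

# Every result is a Tag, so the search can be continued from it.
first_list_items = [li.string for li in uls[0].find_all("li")]

print(len(uls), len(elements), first_list_items)
```

Both the attrs dictionary and the keyword form return the same node here; the dictionary form is useful for attribute names that are not valid Python identifiers.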

text

The text parameter matches against the text of nodes. It accepts either a string or a compiled regular expression.

import re
soup.find_all(text=re.compile('link'))
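A short sketch of text matching follows. It uses the string argument, which is the name Beautiful Soup 4.4.0 and later give to the text argument shown above; the markup is a made-up sample and "html.parser" is an assumption to keep the example dependency-free.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with three text nodes.
html = "<a>Hello, this is a link</a><a>Hello, this is a link, too</a><a>other</a>"
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern matches any text node in which the pattern is found.
regex_matches = soup.find_all(string=re.compile("link"))

# A plain string must equal the whole text node.
exact_matches = soup.find_all(string="other")

print(regex_matches)
print(exact_matches)
```

The results are the matching text nodes themselves, not the tags that contain them; the enclosing tag is available through each result's parent attribute.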

find() takes the same arguments as find_all() but returns a single element, the first match, instead of a list.
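One practical difference shows up when nothing matches. A minimal sketch on made-up markup (the node names are assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two matching <li> nodes and no <table>.
html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find(class_="element")  # first match only
missing = soup.find("table")         # no match: None
empty = soup.find_all("table")       # no match: empty list

print(first.string, missing, empty)
```

Because find() can return None, it is worth checking the result before accessing attributes on it.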

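find_parents() and find_parent(): the former returns all matching ancestor nodes, the latter returns the direct parent node.

find_next_siblings() and find_next_sibling(): the former returns all following sibling nodes, the latter returns the first following sibling node.

find_previous_siblings() and find_previous_sibling(): the former returns all preceding sibling nodes, the latter returns the first preceding sibling node.

find_all_next() and find_next(): the former returns all matching nodes that come after the node, the latter returns the first matching node.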
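These methods all start from an existing node rather than from the whole document. The sketch below exercises each pair on a small made-up fragment; the markup, ids, and variable names are assumptions, and a node name is passed to every call to keep the filters explicit.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three <li> siblings inside a <ul>, followed by a <p>.
html = (
    '<div id="outer"><ul>'
    '<li id="a">A</li><li id="b">B</li><li id="c">C</li>'
    "</ul><p>after</p></div>"
)
soup = BeautifulSoup(html, "html.parser")
a, b, c = (soup.find(id=i) for i in "abc")

parent_ul = b.find_parent("ul")                            # nearest matching ancestor
outer_divs = b.find_parents("div")                         # all matching ancestors
next_one = b.find_next_sibling("li")                       # first following sibling
next_all = [t["id"] for t in a.find_next_siblings("li")]   # all following siblings
prev_one = c.find_previous_sibling("li")                   # first preceding sibling
prev_all = [t["id"] for t in c.find_previous_siblings("li")]
later_p = b.find_next("p")                                 # first later match in the document
later_lis = [t["id"] for t in b.find_all_next("li")]       # all later matches

print(parent_ul.name, next_one["id"], prev_one["id"], later_p.string)
print(next_all, prev_all, later_lis)
```

Unlike the sibling variants, find_next() and find_all_next() are not limited to the same parent: the p node found here sits outside the ul.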
