BeautifulSoup is a Python library for parsing HTML and XML; it makes it easy to extract data from web pages. (The following are study notes on 崔庆才's web-scraping book.)
I. Installation
# Install beautifulsoup4
pip install beautifulsoup4
# Install lxml
pip install lxml
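To confirm both packages are importable, a quick sanity check like the following can be run (the printed versions are just placeholders):

import bs4
from lxml import etree

print(bs4.__version__)    # e.g. 4.x.x
print(etree.__version__)  # e.g. 4.x.x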
II. Basic syntax
1. Node selectors: basic usage
html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story>Once upon a time there are three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie -->/a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """
Suppose we want to extract the title node and its text from the html above; the syntax looks like this:
Import and initialize BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
During initialization, non-standard HTML is corrected automatically, for example missing tags are completed, as the short sketch below illustrates.
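A minimal illustration, reusing the html and soup defined above (note the html string is missing its closing </body> and </html> tags); prettify() prints the corrected, re-indented document:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# prettify() re-serializes the parsed tree; the parser has already
# filled in the missing </body> and </html> tags
print(soup.prettify())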
Get the title node and check its type
print(soup.title)
print(type(soup.title))
# Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
What we get back is the title node itself, i.e. the tag together with its text content.
Get the text content of the title node
print(soup.title.string)
# Output:
The Dormouse's story
Other information, such as a node's name and its attributes, is just as easy to obtain.
Get the name of the title node
print(soup.title.name)
# Output:
title
Get all attributes of the p node, or a single attribute
The p node has several attributes, such as class and name; calling attrs returns all of them.
# Get all attributes
print(soup.p.attrs)
# Output:
{'class': ['title'], 'name': 'dromouse'}
# Get a single attribute: method 1
print(soup.p.attrs['name'])
# Output:
dromouse
# Get a single attribute: method 2
print(soup.p['name'])
# Output:
dromouse
# Single-attribute access: note the return type
print(soup.p['class'])
# Output:
['title']
Note that some values come back as a single string and others as a list of strings. The name attribute has a unique value, so a string is returned; a node may carry several classes, so class comes back as a list. Also note that soup.p here refers to the first p node only.
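A small sketch to verify both points, reusing the html defined above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# soup.p is shorthand for the first matching p node in the document
print(soup.p is soup.find_all(name='p')[0])          # True
# name comes back as a plain string, class as a list of strings
print(type(soup.p['name']), type(soup.p['class']))   # <class 'str'> <class 'list'>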
Nested (hierarchical) selection
When nodes are nested inside one another, you can walk down the hierarchy level by level. To select the title node and its content, instead of soup.title as before, you can write soup.head.title:
html = """ <html><head><title>The Dormouse's story</title></head> <body> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)
# Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
2. Node selectors: advanced usage
Parent and ancestor nodes
To get the parent of a node element, call the parent attribute.
html = """ <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="story"> Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"> <span>Elsie</span> </a> </p> <p class="story>...</p> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)
# Output:
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
</p>
Here we selected the parent of the first a node. Its parent is clearly the p node, so the output is the p node together with everything inside it.
To get all ancestor elements, call the parents attribute:
html = """ <html> <body> <p class="story"> <a href="http://example.com/elsie" class="sister" id="link1"> <span>Elsie</span> </a> </p> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))
# Output:
<class 'generator'>
[(0, <p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body>), (2, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>), (3, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>)]
Why do two entries beginning with <html> appear? Because parents walks upward in the order p → body → html → [document], and the final [document] entry (the BeautifulSoup object itself) renders as the whole document.
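A compact way to see that order is to print only the name of each ancestor; a minimal sketch reusing the html above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# Names of the ancestors of the a node, innermost first;
# '[document]' is the name of the BeautifulSoup object itself
print([parent.name for parent in soup.a.parents])
# Output: ['p', 'body', 'html', '[document]']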
Child and descendant nodes
After selecting a node element, call the contents attribute to get its direct children:
html = """ <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="story"> Once upon a time there were three little sisters; and their names were <a href="http://example.com/elise" class="sister" id="link1"> <span>Elise</span> </a> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a> and they lived at the bottom of a well. </p> <p class="story">...</p> """
As the following shows, the result comes back as a list; the p node contains both text and element nodes.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
# Output:
[' Once upon a time there were three little sisters; and their names were ', <a class="sister" href="http://example.com/elise" id="link1">
<span>Elise</span>
</a>, ' ', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ' and they lived at the bottom of a well. ']
The span node, being a grandchild of p, is not listed separately; it appears inside the a element. This shows that contents yields a list of direct children only.
Similarly, we can call the children attribute to get the corresponding result:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
# Output:
<list_iterator object at 0x000000000303F7B8>
0 Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elise" id="link1">
<span>Elise</span>
</a>
2
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 and
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 and they lived at the bottom of a well.
If you also want every descendant node, call the descendants attribute:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
# Output:
<generator object Tag.descendants at 0x000000000301F228>
0 Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elise" id="link1">
<span>Elise</span>
</a>
2
3 <span>Elise</span>
4 Elise
5
6
7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 and
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 and they lived at the bottom of a well.
Iterating over the output shows that the span node is now included: descendants recursively queries all children and yields every descendant node.
Sibling nodes
What if you want to get sibling nodes?
html = """ <html> <body> <p class="story"> Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"> <span>Elsie</span> </a> Hello <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a> and they lived at the bottom of a well. </p> """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))
# Output:
Next Sibling
Hello
Prev Sibling
Once upon a time there were three little sisters; and their names were
Next Siblings [(0, ' Hello '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' and '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, ' and they lived at the bottom of a well. ')]
Prev Siblings [(0, ' Once upon a time there were three little sisters; and their names were ')]
next_sibling and previous_sibling return a node's next and previous sibling respectively, while next_siblings and previous_siblings return all following and all preceding siblings.
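Note that these siblings can be plain text (including whitespace) rather than element nodes. When only element siblings matter, one option is to filter on the Tag type; a minimal sketch reusing the html above:

from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(html, 'lxml')
# Keep only the element siblings that follow the first a node
for sibling in soup.a.next_siblings:
    if isinstance(sibling, Tag):
        print(sibling.name, sibling.string)
# Expected output: the Lacie and Tillie a nodes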
3. Method selectors
find_all(): query all elements that match the criteria
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))
# Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
Querying ul nodes with find_all returns a list of length 2, and each element is of type bs4.element.Tag.
Queries can also be nested; for example, get the text of each li node:
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
# Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar
Besides querying by node name, you can also pass in attributes:
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
# Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
For common attributes such as id and class, you don't have to go through attrs. To query the node whose id is list-1, simply pass id as a keyword argument; class is written as class_ because class is a reserved word in Python. Using the same html, here is the alternative way to query:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
# Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
The text parameter matches a node's text; it accepts either a string or a regular-expression object.
html = '''
<div class="panel">
<div class="panel-body">
<a>Hello, this is a link</a>
<a>Hello, this is a link, too</a>
</div>
</div>
'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))
# Output:
['Hello, this is a link', 'Hello, this is a link, too']
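In Beautiful Soup 4.4.0 and later the same matching argument is also exposed under the name string (text remains as an alias), so the following should behave identically:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'lxml')
# string= is the newer spelling of the text= argument
print(soup.find_all(string=re.compile('link')))
# Output: ['Hello, this is a link', 'Hello, this is a link, too']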
find(): returns a single element, namely the first match
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(type(soup.find(name='ul')))
print(soup.find(class_='list'))
# Output:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
Other query methods
find_parents() and find_parent(): the former returns all ancestor nodes, the latter the direct parent (see the sketch after this list)
find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter the first following sibling
find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter the first preceding sibling
find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter the first matching one
find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter the first matching one
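These work like find() / find_all() but search in a different direction from the current node. A minimal sketch using a small, hypothetical snippet of HTML:

from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')
li = soup.find(class_='element')
# find_parent returns the nearest matching ancestor (here the enclosing ul)
print(li.find_parent(name='ul')['id'])         # list-1
# find_next_sibling returns the first following sibling that matches
print(li.find_next_sibling(name='li').string)  # Bar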
4. CSS selectors
To use CSS selectors, simply call the select() method and pass in the corresponding CSS selector.
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
# Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
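If only the first match is needed, select_one() (available in recent Beautiful Soup versions) returns a single node instead of a list; a minimal sketch reusing the html above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# select_one returns the first node matching the CSS selector, or None if nothing matches
print(soup.select_one('#list-2 .element'))
# Output: <li class="element">Foo</li>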
Nested selection
select() also supports nested selection: for example, first select all ul nodes, then iterate over each of them.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
# Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
As you can see, this prints, for each ul node, the list of li nodes it contains.
Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
# Output:
list-1
list-1
list-2
list-2
As shown, both indexing the node with square brackets and the attribute name, and going through attrs, successfully return the attribute value.
Getting text
To get the text, use the string attribute covered earlier or the get_text() method.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print('Get Text:', li.get_text())
    print('String:', li.string)
# Output:
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
Get Text: Jay
String: Jay
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
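The two are not fully interchangeable: when a node contains more than a single string (the ul nodes here, for example), string returns None, while get_text() concatenates the text of all descendants; a minimal sketch reusing the html above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
ul = soup.find(name='ul')
# The ul node has several children, so .string is ambiguous and returns None
print(ul.string)      # None
# get_text() joins the text of every descendant node
print(ul.get_text())  # Foo, Bar, Jay on separate lines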