Python BeautifulSoup basics: searching the document

The examples below use the following HTML snippet.
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well. </p>
<p class="story">...</p>
</body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')  # the lxml parser must be installed
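I. find_all()

s = soup.find_all(tag_name, id='id', class_='class', attrs={'attr': 'value'}, text='text', recursive=True)  # returns the matching element nodes as a list

info1 = [tag.get_text() for tag in s]  # text of each matched node, as a list; tag.string and tag.text also work, but only on a single tag, not on the list s

info2 = [tag.attrs['attr'] for tag in s]  # attribute value of each matched node, as a list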

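Syntax: find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs). It searches the descendants of the current tag and returns every match as a list. Each element of the list is a bs4.element.Tag object, except when only text is given, in which case the elements are the matching strings. Beautiful Soup 4.4.0 and later also accept the text argument under the name string.
- name: the tag name
- attrs: the tag's attributes
- recursive: True by default, which searches all descendants of the current tag; False searches direct children only
- text: the text inside the tag
- limit: the maximum number of results to return; by default all matches are returned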
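The parameters above can be combined, and because the result is a list, text and attribute values are read per element. The following is a minimal sketch on a hypothetical snippet modelled on the sample document; it assumes the built-in html.parser instead of lxml so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the sample document.
html = (
    '<p class="story">'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    '</p>'
)
# Assumption: html.parser ships with Python; the article itself uses lxml.
soup = BeautifulSoup(html, "html.parser")

# Filter by tag name and class at the same time.
links = soup.find_all("a", class_="sister")

# find_all() returns a list, so read text and attributes element by element.
texts = [a.get_text() for a in links]
hrefs = [a["href"] for a in links]
ids = [a.attrs.get("id") for a in links]

print(texts)  # ['Lacie', 'Tillie']
print(hrefs)  # ['http://example.com/lacie', 'http://example.com/tillie']
print(ids)    # ['link2', 'link3']
```

Using a.attrs.get("id") instead of a["id"] avoids a KeyError when a tag lacks the attribute.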
1. name: the tag name to search for

Pass a single tag name, or a list of tag names.
print(soup.find_all('a'))  # a list containing the 3 a tags
print(soup.find_all(['a', 'b']))  # a list containing the 3 a tags and the b tag
print(type(soup.html.find_all('a')[0]))  # <class 'bs4.element.Tag'>
Pass a regular expression.
import re
print(soup.find_all(re.compile('b')))  # tags whose name contains b: both the body tag and the b tag are returned
print(soup.find_all(re.compile('^a')))  # tags whose name starts with a: the 3 a tags
Pass True to match every tag.
print([tag.name for tag in soup.find_all(True)])  # ['html', 'head', 'title', 'body', 'p', 'b', 'p', 'a', 'a', 'a', 'p']
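If none of the filters above fits, you can define a function and pass it as name. The function must take a single tag as its only argument. Returning True means the tag matches and is included in the results; returning False means it does not match.

For example, to find the tags that have an id attribute, define a function has_id() and pass it to find_all().
def has_id(tag):
    return tag.has_attr('id')

soup.find_all(has_id)  # returns the 3 a tags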
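A filter function can also combine several conditions. The sketch below is an illustration on a hypothetical snippet, with html.parser assumed as the parser; it matches tags that have a class attribute but no id attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the sample document.
html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story"><a class="sister" id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser


def has_class_but_no_id(tag):
    """Match tags that carry a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


matched = [tag.name for tag in soup.find_all(has_class_but_no_id)]
print(matched)  # ['p', 'p']
```

The b tag is skipped because it has no class, and the a tag is skipped because it has an id.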
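2. attrs: the attributes to search for

Any keyword argument that is not one of the built-in parameters of find_all() is treated as an attribute filter, so you can search with attribute_name='value'. Because class is a reserved word in Python, searching by the class attribute requires class_ instead.
print(soup.find_all(id='link1'))
print(soup.find_all(class_='title'))
print(soup.find_all(class_='sister', href=re.compile('elsie')))
Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes. Searching in k=v form raises a SyntaxError (older Python versions report "keyword can't be an expression"; the wording differs between Python versions). In that case, search with attrs={'attribute_name': 'value'}.
data = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
print(data.find_all(attrs={'data-foo': 'value'}))  # [<div data-foo="value">foo!</div>]
# print(data.find_all(data-foo='value'))  # SyntaxError: keyword can't be an expression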
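The attrs dictionary is also needed for an HTML attribute that is literally called name (the sample document has name="dromouse"), because the name keyword of find_all() is already taken by the tag name. A small sketch on a hypothetical snippet, with html.parser assumed as the parser:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet reusing attributes from the sample document.
html = (
    '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
    '<div data-foo="value">foo!</div>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser

# attrs matches the HTML attribute called "name".
by_name_attr = [t.name for t in soup.find_all(attrs={"name": "dromouse"})]

# name= is the tag-name filter, so this looks for <dromouse> tags instead.
by_tag_name = [t.name for t in soup.find_all(name="dromouse")]

# attrs also handles attribute names that are not valid Python identifiers.
by_data_attr = [t.name for t in soup.find_all(attrs={"data-foo": "value"})]

print(by_name_attr)  # ['p']
print(by_tag_name)   # []
print(by_data_attr)  # ['div']
```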
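3. recursive

recursive is True by default, which searches all descendants of the current tag. When it is set to False, only direct children are searched.
print(soup.html.find_all(re.compile('a')))  # every tag whose name contains a: the head tag and the 3 a tags
print(soup.html.find_all(re.compile('a'), recursive=False))  # direct children whose name contains a: only the head tag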
4. text

Searches the text inside tags. The results are the matching strings themselves rather than tags. Like name, text accepts a string, a list of strings, a regular expression, or True.
print(soup.find_all(text='Lacie'))  # ['Lacie']
print(soup.find_all(text=['Lacie', 'Tillie']))  # ['Lacie', 'Tillie']
print(soup.find_all(text=re.compile('and')))  # ['Once upon a time there were three little sisters; and their names were ', ' and ', '; and they lived at the bottom of a well. ']
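Because a text-only search returns strings, getting back to the enclosing tag takes one more step. The sketch below is illustrative only: the snippet is hypothetical, html.parser is assumed as the parser, and the search uses the keyword string, which is the name Beautiful Soup 4.4.0 and later give to the text argument.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical snippet modelled on the sample document.
html = (
    '<p class="story">'
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser

# A text-only search yields strings, not tags.
hits = soup.find_all(string=re.compile("^L"))
is_string = isinstance(hits[0], NavigableString)

# .parent walks from the string back to the tag that contains it.
parent_id = hits[0].parent["id"]

# Combined with a tag name, the search yields the tags whose text matches.
tag_ids = [a["id"] for a in soup.find_all("a", string="Tillie")]

print(hits)       # ['Lacie']
print(is_string)  # True
print(parent_id)  # link2
print(tag_ids)    # ['link3']
```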
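5. limit

When a search returns many results, limit caps how many are returned. If limit is larger than the number of matches, all matches are returned and no error is raised.
print(soup.find_all('a', limit=2))  # returns 2 a tags
print(soup.find_all('a', limit=10))  # no error; returns all 3 a tags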
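II. find()

Syntax: find(name, attrs, recursive, text, **kwargs)

The only difference between find() and find_all() is that find_all() returns a list of every match, while find() returns only the first match, or None when nothing matches.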
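The difference in return values matters most when nothing matches. A minimal sketch on a hypothetical snippet, with html.parser assumed as the parser:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the sample document.
html = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser

# find() gives the same tag as the first element of find_all().
first = soup.find("a")
same = soup.find_all("a", limit=1)[0]
print(first["id"])    # link1
print(first is same)  # True

# No match: find() gives None, find_all() gives an empty list.
missing = soup.find("table")
missing_all = soup.find_all("table")
print(missing)      # None
print(missing_all)  # []

# Guard against None before chaining further calls.
title = soup.find("title")
title_text = title.get_text() if title is not None else ""
```

Calling a method on the result of find() without such a guard raises AttributeError when the tag is absent.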

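III. find_parents() and find_parent()

find_parents(name=None, attrs={}, limit=None, **kwargs): all ancestors of the current node that match

find_parent(name=None, attrs={}, **kwargs): the nearest ancestor of the current node that matches (the direct parent when no filter is given)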
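As an illustration, the sketch below walks upward from a link. The snippet and its div wrapper are hypothetical, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a link nested inside p, div, and body.
html = (
    '<html><body><div id="outer"><p class="story">'
    '<a id="link2">Lacie</a>'
    '</p></div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser
link = soup.find("a", id="link2")

# Matching ancestors are returned nearest first.
ancestors = [parent.name for parent in link.find_parents(["p", "div"])]
print(ancestors)  # ['p', 'div']

# find_parent() stops at the first matching ancestor.
nearest_div_id = link.find_parent("div")["id"]
print(nearest_div_id)  # outer

# No matching ancestor gives None.
no_table = link.find_parent("table")
print(no_table)  # None
```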

IV. find_next_siblings() and find_next_sibling()

find_next_siblings(name=None, attrs={}, text=None, limit=None, **kwargs): all matching siblings that come after the current node

find_next_sibling(name=None, attrs={}, text=None, **kwargs): the first matching sibling that comes after the current node
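A short sketch of both methods on a hypothetical snippet (html.parser assumed as the parser); the b tag is added only to show that non-matching siblings are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: three links and one b tag sharing the same parent.
html = (
    '<p class="story">'
    '<a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and '
    '<b>bold</b>'
    '<a id="link3">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser
first = soup.find("a", id="link1")

# Every later sibling that is an a tag, in document order.
later_ids = [a["id"] for a in first.find_next_siblings("a")]
print(later_ids)  # ['link2', 'link3']

# Only the first later sibling that matches.
next_id = first.find_next_sibling("a")["id"]
print(next_id)  # link2

# The last link has no later a sibling, so the result is None.
after_last = soup.find("a", id="link3").find_next_sibling("a")
print(after_last)  # None
```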

V. find_previous_siblings() and find_previous_sibling()

find_previous_siblings(name=None, attrs={}, text=None, limit=None, **kwargs): all matching siblings that come before the current node

find_previous_sibling(name=None, attrs={}, text=None, **kwargs): the first matching sibling that comes before the current node
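The backward direction returns results nearest first, that is, in reverse document order. A minimal sketch on a hypothetical snippet, with html.parser assumed as the parser:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: three sibling links.
html = (
    '<p class="story">'
    '<a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser
last = soup.find("a", id="link3")

# Earlier siblings, starting with the one closest to the current node.
earlier_ids = [a["id"] for a in last.find_previous_siblings("a")]
print(earlier_ids)  # ['link2', 'link1']

# Only the closest earlier sibling that matches.
previous_id = last.find_previous_sibling("a")["id"]
print(previous_id)  # link2
```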

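VI. find_all_next() and find_next()

find_all_next(name=None, attrs={}, text=None, limit=None, **kwargs): all matching nodes that come after the current node, at any level of the tree

find_next(name=None, attrs={}, text=None, **kwargs): the first matching node that comes after the current node, at any level of the tree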
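To show what "at any level" means in practice, the sketch below contrasts these methods with the sibling search. The snippet is hypothetical and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the three links live at different levels.
html = (
    '<div><p id="p1"><a id="link1">Elsie</a></p>'
    '<p id="p2"><a id="link2">Lacie</a></p></div>'
    '<a id="link3">Tillie</a>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser
first = soup.find("a", id="link1")

# link1 is the only a tag inside its p, so it has no a siblings.
sibling_ids = [a["id"] for a in first.find_next_siblings("a")]
print(sibling_ids)  # []

# find_all_next() follows document order and ignores nesting.
following_ids = [a["id"] for a in first.find_all_next("a")]
print(following_ids)  # ['link2', 'link3']

# find_next() returns just the first such node.
next_p_id = first.find_next("p")["id"]
print(next_p_id)  # p2
```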

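VII. find_all_previous() and find_previous()

find_all_previous(name=None, attrs={}, text=None, limit=None, **kwargs): all matching nodes that come before the current node, at any level of the tree

find_previous(name=None, attrs={}, text=None, **kwargs): the first matching node that comes before the current node, at any level of the tree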
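One detail worth seeing in code: a node's own ancestors were parsed before it, so they count as previous nodes. A minimal sketch on a hypothetical snippet, with html.parser assumed as the parser:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the three links live at different levels.
html = (
    '<div><p id="p1"><a id="link1">Elsie</a></p>'
    '<p id="p2"><a id="link2">Lacie</a></p></div>'
    '<a id="link3">Tillie</a>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser
last = soup.find("a", id="link3")
second = soup.find("a", id="link2")

# Results run backward through the document, nearest first.
earlier_ids = [a["id"] for a in last.find_all_previous("a")]
print(earlier_ids)  # ['link2', 'link1']

previous_p_id = last.find_previous("p")["id"]
print(previous_p_id)  # p2

# The p that contains link2 opened before link2, so it is included too.
enclosing_ids = [p["id"] for p in second.find_all_previous("p")]
print(enclosing_ids)  # ['p2', 'p1']
```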

Note: the arguments of the methods in sections II through VII above are used in the same way as those of find_all().

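VIII. CSS selectors

CSS selectors are used mainly through the select() and select_one() methods. select() returns every match as a list, and each element of the list is a bs4.element.Tag object.

Select a tag by its name with 'tagname', an id with '#id', and a class with '.class'. The three forms can also be combined.

select(selector, namespaces=None, limit=None, **kwargs): all matches

select_one(selector, namespaces=None, **kwargs): the first match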
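The two methods relate to each other much like find_all() and find(). The sketch below is illustrative only, on a hypothetical snippet with html.parser assumed as the parser; it relies on the soupsieve package that recent Beautiful Soup releases install as a dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the sample document.
html = (
    '<p class="story">'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="sister" id="link3">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser

# select() returns a list of every match.
all_ids = [a["id"] for a in soup.select("a.sister")]
print(all_ids)  # ['link1', 'link2', 'link3']

# limit caps the number of results.
limited_ids = [a["id"] for a in soup.select("a.sister", limit=2)]
print(limited_ids)  # ['link1', 'link2']

# select_one() returns the first match, or None when nothing matches.
first_id = soup.select_one("a.sister")["id"]
no_match = soup.select_one("table")
print(first_id)  # link1
print(no_match)  # None
```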

1. Find tags by tag name, id, or class
print(soup.select('title'))  # [<title>The Dormouse's story</title>]
print(soup.select('#link1'))  # [<a class="sister" href="http://example.com/elsie" id="link1"></a>]
print(soup.select('.title'))  # [<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
print(type(soup.select('title')[0]))  # <class 'bs4.element.Tag'>
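2. Find tags by combining tag names, ids, and classes
print(soup.select('p #link1'))  # the descendant of a p tag whose id is link1
print(soup.select('p a'))  # all a tags that are descendants of a p tag
print(soup.select('p > a'))  # a tags that are direct children of a p tag
print(soup.select('p + a'))  # the a tag that is the sibling immediately following a p tag
print(soup.select('p ~ a'))  # all a tags that are siblings following a p tag
# In this example p a and p > a both return the 3 a tags, while p + a and p ~ a return [] because the a tags are children of p, not siblings of it.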
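The four combinators give different results once a document contains both nested and sibling a tags. A minimal sketch on a hypothetical snippet built for that purpose, with html.parser assumed as the parser; ids() is a helper introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a tags as child, grandchild, and siblings of a p tag.
html = (
    '<div>'
    '<p id="intro">Intro <a id="inner">inner</a>'
    '<span><a id="nested">nested</a></span></p>'
    '<a id="adjacent">adjacent</a>'
    '<a id="later">later</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser


def ids(selector):
    """Return the id of every tag matched by the CSS selector."""
    return [tag["id"] for tag in soup.select(selector)]


descendants = ids("p a")         # any depth below p
children = ids("p > a")          # direct children of p only
adjacent = ids("p + a")          # the sibling right after p
following = ids("p ~ a")         # every later sibling of p

print(descendants)  # ['inner', 'nested']
print(children)     # ['inner']
print(adjacent)     # ['adjacent']
print(following)    # ['adjacent', 'later']
```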
3. Find tags by combining tag names, ids, and classes with attributes; an attribute is written in square brackets as attribute_name='value'
print(soup.select('p[class="title"]'))  # [<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
print(soup.select('a[id="link1"]'))  # [<a class="sister" href="http://example.com/elsie" id="link1"></a>]
print(soup.select('p a[id="link2"]'))  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
4. Getting text from select() and select_one() results
print(soup.select('p[class="title"]')[0].string)  # a bs4.element.NavigableString object
print(soup.select('p[class="title"]')[0].get_text())  # a plain str
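The two accessors behave differently when a tag has more than one child. The sketch below shows this on a hypothetical snippet, with html.parser assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the sample document.
html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once upon a time <a id="link2">Lacie</a> lived.</p>'
)
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser

title = soup.select_one("p.title")
story = soup.select_one("p.story")

# A single chain of children: .string reaches the one inner string.
title_string = title.string
print(title_string)  # The Dormouse's story

# Several children: .string is None, get_text() joins all the text.
story_string = story.string
story_text = story.get_text()
print(story_string)  # None
print(story_text)    # Once upon a time Lacie lived.

# select() returns a list, so collect text element by element.
link_texts = [a.get_text() for a in soup.select("p.story a")]
print(link_texts)  # ['Lacie']
```

When the structure of a tag is not known in advance, get_text() is the safer choice because it never returns None.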

Original article: https://www.cnblogs.com/Forever77/p/11434403.html
