03 Using bs4

1. Installation

pip3 install beautifulsoup4

2. Usage

This is the content we will parse:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my_p" class="title">hello<b id="bbb" class="boldest">The Dormouse's story</b>
</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
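and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

# Build the soup object used by every example below.
# 'html.parser' ships with Python; 'lxml' also works if it is installed.
soup = BeautifulSoup(html_doc, 'html.parser')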

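2.1 Traversing the document tree

This means selecting an element directly by its tag name. It is fast, but when several tags share the same name only the first one is returned.

2.1.1 Usage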

head=soup.head
print(head)
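
To make the "first match only" behaviour concrete, here is a minimal sketch. It uses a small stand-in document rather than html_doc, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document (hypothetical, not the html_doc above).
doc = '<body><a id="link1">Elsie</a><a id="link2">Lacie</a></body>'
soup = BeautifulSoup(doc, "html.parser")

first_a = soup.a                          # attribute access: the first <a> only
all_ids = [a["id"] for a in soup.find_all("a")]  # every <a>, in document order

print(first_a["id"])   # link1
print(all_ids)         # ['link1', 'link2']
```

When every matching tag is needed, find_all() (covered in 2.2) is the tool to reach for.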

2.1.2 Getting the tag name

head=soup.head
print(head.name)

2.1.3 Getting tag attributes (important)

p=soup.body.p
# class can hold several values, so it is returned as a list even when there is only one
print(p.attrs)
print(p.attrs.get('class'))
print(p['class'])
print(p.get('class'))
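
The difference between subscript access and .get() matters when an attribute may be missing. A small sketch, assuming a stand-in <p> tag with two classes:

```python
from bs4 import BeautifulSoup

# Stand-in markup (hypothetical) with a multi-valued class attribute.
doc = '<p id="my_p" class="title big">hello</p>'
p = BeautifulSoup(doc, "html.parser").p

classes = p["class"]     # multi-valued attribute -> list of strings
pid = p["id"]            # ordinary attribute -> plain string
absent = p.get("href")   # .get() returns None when the attribute is missing

try:
    p["href"]            # subscript access raises KeyError instead
    raised = False
except KeyError:
    raised = True

print(classes, pid, absent, raised)
```

In scraping code .get() is usually the safer choice, since real pages often omit attributes.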

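2.1.4 Getting tag content

p=soup.body.p
# .text concatenates the text of the tag and all of its descendants
print(p.text)
print(p.string) # the text when p has a single child holding one string, otherwise None
print(p.strings) # a generator object
print(list(p.strings))  # consume the generator: every text fragment under p, one item each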
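
The sketch below puts the three accessors side by side on a stand-in <p> tag, and adds stripped_strings, which is not used in the examples above but is handy when the markup contains stray whitespace.

```python
from bs4 import BeautifulSoup

# Stand-in markup (hypothetical): a text node, a <b>, and a trailing newline.
doc = "<p> hello <b>The Dormouse's story</b>\n</p>"
p = BeautifulSoup(doc, "html.parser").p

whole = p.text                     # all descendant text joined together
single = p.string                  # None: p has three children, not one
inner = p.b.string                 # <b> has exactly one string child
raw = list(p.strings)              # fragments exactly as they appear
clean = list(p.stripped_strings)   # stripped; whitespace-only fragments dropped

print(repr(whole))
print(single, inner)
print(raw)
print(clean)
```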

2.1.5 Nested selection

a=soup.body.a
print(a.get('id'))

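PS: the features below are not used very often.

2.1.6 Child and descendant nodes

print(soup.p.contents) # a list of p's direct children
print(soup.p.children) # an iterator over p's direct children
print(list(soup.p.children)) # the same children, collected into a list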
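
The heading mentions descendants, but the lines above only touch direct children. As a hedged sketch on a stand-in document, .descendants walks the whole subtree, text nodes included:

```python
from bs4 import BeautifulSoup, Tag

# Stand-in markup (hypothetical): <i> is nested two levels below <p>.
doc = "<p>hello<b>bold <i>deep</i></b></p>"
p = BeautifulSoup(doc, "html.parser").p

# Direct children only: the text node 'hello' and the <b> tag.
child_tags = [c.name for c in p.children if isinstance(c, Tag)]

# Every node below p, depth first: 'hello', <b>, 'bold ', <i>, 'deep'.
descendant_tags = [d.name for d in p.descendants if isinstance(d, Tag)]
descendant_count = len(list(p.descendants))

print(child_tags)        # ['b']
print(descendant_tags)   # ['b', 'i']
print(descendant_count)  # 5
```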

2.1.7 Parent and ancestor nodes

print(soup.a.parent) # the parent of the first <a> tag (there is only one parent)
print(soup.p.parent) # the parent of the first <p> tag
print(soup.a.parents) # a generator over all ancestors: parent, grandparent, and so on
print(list(soup.a.parents)) # the ancestors collected into a list
print(len(list(soup.a.parents))) # how many ancestors the <a> tag has
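
Printing whole ancestor tags is noisy, so a sketch that prints only their names may be easier to read. The document here is a stand-in; note that the last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Stand-in markup (hypothetical).
doc = '<html><body><p><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(doc, "html.parser")
a = soup.a

parent_name = a.parent.name                  # the immediate parent
ancestor_names = [t.name for t in a.parents]  # innermost ancestor first

print(parent_name)      # p
print(ancestor_names)   # ['p', 'body', 'html', '[document]']
```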

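2.1.8 Sibling nodes

print(soup.a.next_sibling) # the next sibling
print(soup.a.previous_sibling) # the previous sibling

print(list(soup.a.next_siblings)) # all following siblings (a generator, collected into a list)
print(list(soup.a.previous_siblings)) # all preceding siblings (a generator, collected into a list)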
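
A common surprise is that siblings include text nodes, so next_sibling is often a piece of whitespace or punctuation rather than a tag (in html_doc, the text " and" sits between link2 and link3). A minimal sketch on a stand-in document, using find_next_sibling() to skip to the next tag:

```python
from bs4 import BeautifulSoup

# Stand-in markup (hypothetical): a newline separates the two links.
doc = '<p><a id="link1">Elsie</a>\n<a id="link2">Lacie</a></p>'
first = BeautifulSoup(doc, "html.parser").a

next_node = first.next_sibling             # the newline text node, not a tag
next_tag = first.find_next_sibling("a")    # skips text, returns the next <a>
prev_node = first.previous_sibling         # None: nothing precedes the first <a>

print(repr(next_node))
print(next_tag["id"])
print(prev_node)
```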

PS: the key points of tree traversal are getting the tag name, getting attribute values, and nested selection.

2.2 Searching the document tree

2.2.1 Two search methods

find()  # returns only the first match
find_all() # returns all matches
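
The two methods also differ in what they return when nothing matches, which affects how results should be checked. A short sketch on a stand-in document:

```python
from bs4 import BeautifulSoup

# Stand-in markup (hypothetical).
doc = '<p class="story">a</p><p class="story">b</p>'
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.find("p")            # a single Tag
all_p = soup.find_all("p")          # a list-like ResultSet of Tags
no_table = soup.find("table")       # None when nothing matches
no_tables = soup.find_all("table")  # an empty list when nothing matches

print(first_p.text, len(all_p), no_table, no_tables)
```

Because find() can return None, check the result before chaining attribute access onto it.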

2.2.2 Five kinds of filters

String, regular expression, list, True, function

PS: the string filter is by far the most commonly used.

2.2.2.1 String filter

a=soup.find(name='a')
res=soup.find(id='my_p')
res=soup.find(class_='story')
res=soup.find(href='http://example.com/elsie')

res=soup.find(attrs={'id':'my_p'})
res=soup.find(attrs={'class':'story'})
print(res)
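
Two details are worth a sketch: class_ matches a tag if any one of its classes equals the string, and the attrs dictionary is the way to filter on attribute names that are not valid Python keyword arguments. The data-kind attribute below is a hypothetical example, not part of html_doc.

```python
from bs4 import BeautifulSoup

# Stand-in markup (hypothetical); data-kind is an invented attribute.
doc = (
    '<p class="story intro" data-kind="tale">x</p>'
    '<p class="story">y</p>'
)
soup = BeautifulSoup(doc, "html.parser")

stories = soup.find_all(class_="story")         # both <p> tags carry this class
intro = soup.find(class_="intro")               # only the first one
tale = soup.find(attrs={"data-kind": "tale"})   # hyphenated name needs attrs

print(len(stories), intro.text, tale.text)
```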

2.2.2.2 Regular-expression filter

import re
re_b=re.compile('^b')
res=soup.find(name=re_b)
res=soup.find_all(name=re_b)
res=soup.find_all(id=re.compile('^l'))
print(res)
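
The '^' anchor in the patterns above is doing real work: Beautiful Soup tests a compiled pattern with its search() method, so an unanchored pattern matches anywhere in the name. A sketch on a stand-in document:

```python
import re
from bs4 import BeautifulSoup

# Stand-in markup (hypothetical).
doc = "<body><b>x</b><table></table></body>"
soup = BeautifulSoup(doc, "html.parser")

# Anchored: only names that start with 'b'.
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]

# Unanchored: any name that contains 'b', so 'table' matches as well.
contains_b = [t.name for t in soup.find_all(re.compile("b"))]

print(starts_with_b)  # ['body', 'b']
print(contains_b)     # ['body', 'b', 'table']
```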

2.2.2.3 List

res=soup.find_all(name=['body','b'])
res=soup.find_all(class_=['sister','title'])
print(res)

2.2.2.4 True and False

res=soup.find_all(name=True)
res=soup.find_all(id=True)
res=soup.find_all(id=False)
res=soup.find_all(href=True)
print(res)

2.2.2.5 Function

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))
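
A function can also be passed for a single attribute, in which case it receives the attribute value rather than the tag. The sketch below runs both styles on a stand-in document; the lambda guards against None because tags without the attribute may be passed in as None.

```python
from bs4 import BeautifulSoup

# Stand-in markup (hypothetical), shaped like part of html_doc.
doc = (
    '<p id="my_p" class="title">t</p>'
    '<p class="story">s'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a></p>'
)
soup = BeautifulSoup(doc, "html.parser")

def has_class_but_no_id(tag):
    # Called once per tag; return True to keep it.
    return tag.has_attr("class") and not tag.has_attr("id")

by_tag = soup.find_all(has_class_but_no_id)

# Attribute-level function: called with the href value (or None).
by_href = soup.find_all(href=lambda v: v is not None and v.endswith("/elsie"))

print([t["class"] for t in by_tag])   # [['story']]
print([t["id"] for t in by_href])     # ['link1']
```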

2.2.3 limit (cap the number of results)

res=soup.find_all(name=True,limit=1)
print(res)
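
recursive (True by default: search all descendants; False: search direct children only)

res=soup.body.find_all(name='b',recursive=False)  # []: <b> is nested inside <p>, not directly under <body>
res=soup.body.find_all(name='p',recursive=False)  # the three <p> tags directly under <body>
res=soup.body.find_all(name='b',recursive=True)   # finds the nested <b>
print(res)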
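
Since each assignment above overwrites res, only the last result is printed. The following sketch keeps the results apart on a stand-in document so the effect of limit and recursive can be compared directly.

```python
from bs4 import BeautifulSoup

# Stand-in markup (hypothetical): <b> is a grandchild of <body>.
doc = "<body><p>one<b>bold</b></p><p>two</p><p>three</p></body>"
body = BeautifulSoup(doc, "html.parser").body

limited = body.find_all("p", limit=2)            # stops after two matches
direct_b = body.find_all("b", recursive=False)   # [] because <b> is not a direct child
deep_b = body.find_all("b")                      # recursive=True is the default
direct_p = body.find_all("p", recursive=False)   # all three <p> are direct children

print(len(limited), direct_b, len(deep_b), len(direct_p))
```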

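2.2.4 CSS selectors

# CSS selector reference: https://www.w3school.com.cn/cssref/css_selectors.asp
ret=soup.select('#my_p')
ret=soup.select('body p')  # all descendants; commonly used
ret=soup.select('body>p')  # direct children only; commonly used
ret=soup.select('body>p')[0].text  # text of the first direct child <p>
ret=soup.select('body>p')[1].find('a')  # results are Tags, so find() works; index 1 because the first <p> has no <a>
print(ret)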
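
The difference between the descendant and child combinators is easiest to see when a <p> is wrapped in another element. The sketch below uses a stand-in document with such a wrapper, and also shows select_one(), which returns a single tag or None instead of a list.

```python
from bs4 import BeautifulSoup

# Stand-in markup (hypothetical): one <p> inside a <div>, one directly in <body>.
doc = (
    '<body><div><p class="story">'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '</p></div><p id="my_p">hello</p></body>'
)
soup = BeautifulSoup(doc, "html.parser")

any_depth = soup.select("body p")            # both <p> tags
children_only = soup.select("body > p")      # only the <p> directly under <body>
first_sister = soup.select_one("a.sister")   # a single Tag, or None
by_suffix = soup.select('a[href$="elsie"]')  # attribute value ends with 'elsie'
nothing = soup.select_one("table")           # None when nothing matches

print(len(any_depth), children_only[0]["id"], first_sister["href"])
print(len(by_suffix), nothing)
```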

Original post: https://www.cnblogs.com/bailongcaptain/p/13440589.html