Python Web Scraping: Basic Functions and Framework of the BeautifulSoup Library

Installation:

On Windows: open cmd with "Run as administrator" and run
```
pip install beautifulsoup4
```
Understanding the Beautiful Soup library:

Beautiful Soup parsers:

Basic elements of the Beautiful Soup library:

Traversing HTML content with the bs4 library:

Downward traversal:
```
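from bs4 import BeautifulSoup

# BeautifulSoup parses markup, not URLs: pass the page source as a string
html = "<html><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Iterate over the direct children
for child in soup.body.children:
    print(child)

# Iterate over all descendants
for child in soup.body.descendants:
    print(child)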
```
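To make the difference between the two properties concrete, here is a minimal sketch. The one-line `html` string is a made-up sample, not the page used elsewhere in this post, and it contains no line breaks so that no whitespace text nodes appear.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample markup, kept on one line to avoid whitespace text nodes.
html = "<html><body><p>Hello <b>world</b></p><a href='#'>link</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .children yields only the direct children of <body>.
child_names = [child.name for child in soup.body.children]

# .descendants walks the whole subtree depth-first, tags and text nodes alike.
descendants = list(soup.body.descendants)
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]
descendant_texts = [str(node) for node in descendants if isinstance(node, NavigableString)]

print(child_names)        # ['p', 'a']
print(descendant_tags)    # ['p', 'b', 'a']
print(descendant_texts)   # ['Hello ', 'world', 'link']
```

Both properties are lazy iterators, so wrap them in `list()` when you need a length or an index.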
Upward traversal of the tag tree:

Sibling traversal of the tag tree:
```
# Iterate over the following siblings (note the plural: next_siblings)
for sibling in soup.a.next_siblings:
    print(sibling)


# Iterate over the preceding siblings
for sibling in soup.a.previous_siblings:
    print(sibling)
```
Summary:

Function calls:
```
soup = BeautifulSoup(open("index.html"), "html.parser")
# Opens index.html in the current directory and parses it with the built-in parser
```
soup.prettify() returns the whole HTML document as an indented string that mirrors the DOM tree; pass the result to print() to display it.
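A small sketch of that behaviour, using a made-up one-line document instead of a file on disk: `prettify()` only builds the string, and printing is a separate step.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for index.html.
html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()        # returns a str, one tag or text node per line
print(type(pretty).__name__)    # str
print(pretty)                   # the indented document
```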

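Parsing the BeautifulSoup object

To extract the content you want from the HTML, I have summarized three approaches:

1) Using Tag objects

As noted above, BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. This is similar in spirit to the Gson library on Android. The node objects fall into 4 types: Tag, NavigableString, BeautifulSoup, Comment.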
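The four types can be observed directly on a tiny document. This is only an illustrative sketch: the `html` string is invented for the purpose, and the class names are imported from `bs4.element`.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample: one tag holding a text node and an HTML comment.
html = "<p class='title'>Hello<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p
text_node, comment_node = p.contents   # the two children of <p>

print(type(soup).__name__)          # BeautifulSoup
print(type(p).__name__)             # Tag
print(type(text_node).__name__)     # NavigableString
print(type(comment_node).__name__)  # Comment

# BeautifulSoup is a subclass of Tag, and Comment is a subclass of NavigableString.
print(isinstance(soup, Tag))                      # True
print(isinstance(comment_node, NavigableString))  # True
```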

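A Tag object can be thought of as a tag in the HTML, which should give you a rough idea of what it is. The examples below look at Tag objects more closely. They all assume the document printed by prettify().
- Example 1
Get the head tag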
```
print(soup.head)
# Output:
<head><title>The Dormouse's story</title></head>
```
- Example 2
Get the title tag
```
print(soup.title)
# Output:
<title>The Dormouse's story</title>
```
- Example 3
Get the p tag
```
print(soup.p)
# Output:
The Dormouse's story
```
If the document contains several tags with the requested name, attribute access such as soup.p returns only the first matching tag.
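The "first match only" behaviour is easy to confirm with a sketch like the following. The two-paragraph `html` string is a made-up sample; `find_all()` is used here only for comparison.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two <p> tags.
html = "<body><p>first</p><p>second</p></body>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p               # attribute access: first match only
all_p = soup.find_all("p")     # every match, in document order
missing = soup.table           # no such tag in the sample

print(first_p.get_text())               # first
print([p.get_text() for p in all_p])    # ['first', 'second']
print(missing)                          # None
```

Because a missing tag yields `None`, chained access such as `soup.table.tr` raises `AttributeError` when the first tag is absent, so it is worth checking for `None` first.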

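Objects usually have properties, and Tag objects are no exception. Two of them are especially important: name and attrs.

name
The name property is the tag name of the Tag object. There is one special case: the name of the soup object itself is [document].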
```
print(soup.name)
print(soup.head.name)
# Output:
[document]
head
```
attrs
The attrs property holds the HTML attributes of the Tag object as a dictionary.
```
print(soup.p.attrs)
# Output:
{'class': ['title'], 'name': 'dromouse'}
```
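Individual attributes can also be read without going through `attrs`. The sketch below rebuilds a `<p>` tag with the two attributes shown in the output above; the exact markup is an assumption, since the original page is not reproduced here.

```python
from bs4 import BeautifulSoup

# Assumed markup, reconstructed from the attrs dictionary shown above.
html = '<p class="title" name="dromouse">The Dormouse\'s story</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.attrs)       # {'class': ['title'], 'name': 'dromouse'}
print(p["class"])    # ['title']  (class is multi-valued, so it is a list)
print(p["name"])     # dromouse   (the HTML attribute called "name")
print(p.name)        # p          (the tag name, not the attribute)
print(p.get("id"))   # None       (.get() avoids a KeyError for missing attributes)
```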
The other three object types are briefly introduced as well:
- NavigableString
Put simply, it is the text inside a Tag object.
```
print(soup.title.string)
# Output:
The Dormouse's story
```
- BeautifulSoup
A BeautifulSoup object represents the entire content of a document. Most of the time it can be treated as a Tag object; it is a special kind of Tag.
```
print(type(soup.name))
print(soup.name)
print(soup.attrs)
# Output (Python 3; Python 2 prints <type 'unicode'> on the first line):
<class 'str'>
[document]
{}  # an empty dict
```
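- Comment
A Comment object is a special type of NavigableString. An HTML page may contain comments and other special strings that we usually do not want, so it is best to check the type before using a string. For example: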
```
from bs4.element import Comment

if isinstance(soup.a.string, Comment):
    ...  # handle the comment separately, e.g. print it or skip it
```
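As a fuller sketch of that check, the snippet below skips links whose only content is a comment. The two-link `html` string is invented for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample: the first link contains only an HTML comment.
html = (
    '<a href="http://example.com/elsie"><!-- Elsie --></a>'
    '<a href="http://example.com/lacie">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")

visible = []
for a in soup.find_all("a"):
    s = a.string
    if isinstance(s, Comment):
        continue              # comment text is not real page content
    visible.append(str(s))

print(visible)   # ['Lacie']
```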
2) Using filters

The filter is the find_all() function, which returns everything that matches the given conditions as a list. Its signature is:
```
find_all(name, attrs, recursive, string, limit, **kwargs)
```
The name parameter can be written in several ways:
- (1) A tag name
```
print(soup.find_all('p'))
# Output:
[The Dormouse's story, Once upon a time there were three little sisters; and their names were]
```
- (2) A regular expression
```
import re

# Match every tag whose name starts with "p"
print(soup.find_all(re.compile('^p')))
# Output:
[The Dormouse's story, Once upon a time there were three little sisters; and their names were]
```
- (3) A list
 If the argument is a list, a tag matches when its name equals any element of the list. The code below makes this clear.
```
print(soup.find_all(['p', 'a']))
# Output:
[The Dormouse's story, Once upon a time there were three little sisters; and their names were, <a href="http://example.com/elsie" class="sister" id="link1"></a>]
```
The attrs parameter can also be used as a filter condition, and the limit parameter caps the number of results returned.
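A short sketch of `attrs`, keyword filters, and `limit` working together. The three-link `html` string is a made-up sample loosely modelled on the links used in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with three sibling links.
html = (
    '<p class="story">'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    '<a class="sister" id="link3" href="http://example.com/tillie">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all("a", attrs={"class": "sister"})    # filter via the attrs dict
by_kwarg = soup.find_all("a", id="link2")                   # filter via a keyword argument
limited = soup.find_all("a", class_="sister", limit=2)      # stop after two matches

print([a["id"] for a in by_attrs])   # ['link1', 'link2', 'link3']
print([a["id"] for a in by_kwarg])   # ['link2']
print([a["id"] for a in limited])    # ['link1', 'link2']
```

`class` is a reserved word in Python, which is why the keyword form is spelled `class_`.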

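3) Using CSS selectors

This approach finds Tags using CSS syntax as the matching rule. It also relies on a single function, select(), which likewise returns a list. Its usage is shown below, again based on the document printed by prettify():
- (1) Find by tag name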
```
print(soup.select('head'))
# Output:
[<head><title>The Dormouse's story</title></head>]
```
- (2) Find by id
```
print(soup.select('#link1'))
# Output:
[<a href="http://example.com/elsie" class="sister" id="link1"></a>]
```
- (3) Find by class
```
print(soup.select('.sister'))
# Output:
[<a href="http://example.com/elsie" class="sister" id="link1"></a>]
```
- (4) Find by attribute
```
print(soup.select('p[name=dromouse]'))
# Output:
[The Dormouse's story]
```
```
print(soup.select('p[class=title]'))
# Output:
[The Dormouse's story]
```
- (5) Combined selectors
```
print(soup.select("body p"))
# Output:
[The Dormouse's story,
Once upon a time there were three little sisters; and their names were]
```
```
print(soup.select("p > a"))
# Output:
[<a href="http://example.com/elsie" class="sister" id="link1"></a>]
```
```
print(soup.select("p > .sister"))
# Output:
[<a href="http://example.com/elsie" class="sister" id="link1"></a>]
```
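The selector forms above can be combined, and `select_one()` is handy when only the first match is needed. The sketch below uses an invented `html` string that mimics the attributes seen in this post; it is not the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample reusing the class, name and id values from the examples above.
html = (
    '<body>'
    '<p class="title" name="dromouse">The Dormouse\'s story</p>'
    '<p class="story">'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    '</p>'
    '</body>'
)
soup = BeautifulSoup(html, "html.parser")

links = soup.select("p.story > a.sister")   # tag, class and child combinator together
hrefs = [a["href"] for a in links]
first = soup.select_one("#link1")           # a single Tag, or None when nothing matches
nothing = soup.select("#nope")              # select() returns an empty list on no match

print(hrefs)              # ['http://example.com/elsie', 'http://example.com/lacie']
print(first.get_text())   # Elsie
print(nothing)            # []
```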
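5 Navigating parent, child, and sibling relationships

We can now obtain node objects, but sometimes we also need the content of a node's parent or children. How do we do that? It requires traversing the parse tree.

(1) Get child nodes
Use the .children property, which returns all direct children of the current node. Note that it returns an iterator rather than a list.

(2) Get all descendants
Use the .descendants property, which returns an iterator over all descendants.

(3) Get the parent node
The .parent property returns the direct parent of the current node.

(4) Get all ancestors
The .parents property returns an iterator over all ancestors of the current node, from the nearest parent up to the document root.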
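A minimal sketch of upward navigation, assuming a small made-up document with a `<b>` tag nested three levels deep:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: <b> sits inside <p>, <body> and <html>.
html = "<html><body><p><b>text</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
b = soup.b

parent_name = b.parent.name                              # the direct parent only
ancestor_names = [ancestor.name for ancestor in b.parents]  # every ancestor, nearest first

print(parent_name)      # p
print(ancestor_names)   # ['p', 'body', 'html', '[document]']
print(soup.parent)      # None: the BeautifulSoup object is the root
```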

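(5) Get sibling nodes
Siblings are nodes on the same level as the current node. The .next_sibling property returns the next sibling of the node and .previous_sibling returns the previous one. If no such node exists, the result is None.

Note: in real HTML, a tag's .next_sibling and .previous_sibling are often strings made of whitespace, because the whitespace and line breaks between tags also count as nodes. The result may therefore be blank text or a newline.

(6) Get all sibling nodes
The .next_siblings and .previous_siblings properties iterate over the siblings that follow or precede the current node.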
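The whitespace caveat is the one that most often surprises people, so here is a small sketch of it. The multi-line `html` string is a made-up sample; filtering with `isinstance(..., Tag)` and calling `find_next_sibling()` are two possible ways to skip the whitespace nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample: the line breaks between the links become text nodes.
html = """<p>
<a id="link1">Elsie</a>
<a id="link2">Lacie</a>
<a id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.a

print(repr(first.next_sibling))   # '\n' : a whitespace string, not the next <a>

# Keep only real tags when iterating over the following siblings.
tag_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
print(tag_ids)                    # ['link2', 'link3']

# find_next_sibling() skips text nodes and returns the next matching tag.
next_link = first.find_next_sibling("a")
print(next_link["id"])            # link2
```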

Original source: https://www.cnblogs.com/Romantic-Chopin/p/12451039.html
