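13. CSS Selectors: BeautifulSoup4

(1) Like lxml, Beautiful Soup is an HTML/XML parser; its main job is likewise to parse HTML/XML and extract data from it.

(2) lxml only traverses part of the document, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree. Its time and memory overhead is therefore much higher, so it performs worse than lxml.

(3) Beautiful Soup makes parsing HTML fairly simple and has a very user-friendly API. It supports CSS selectors and the HTML parser in the Python standard library, as well as lxml's HTML and XML parsers.

Installation: `pip install beautifulsoup4`

Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/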

Scraping tool	Speed	Ease of use	Installation
Regular expressions	Fastest	Hard	None (built in)
BeautifulSoup	Slow	Easiest	Easy
lxml	Fast	Easy	Moderate

1. Example

from bs4 import BeautifulSoup

html = """
    <html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object
soup = BeautifulSoup(html, 'lxml')

# Alternatively, create the object from a local HTML file
# soup = BeautifulSoup(open('index.html'), 'lxml')

# Pretty-print the contents of the soup object
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

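If no parser is specified explicitly, Beautiful Soup falls back to the best HTML parser available on the system (here, 'lxml'). Running the same code on another system, or in a different virtual environment, may pick a different parser and therefore behave differently.

To avoid this, specify the lxml parser explicitly with `soup = BeautifulSoup(html, 'lxml')`.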
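
As a minimal sketch of why the parser choice matters, the snippet below names the standard-library `html.parser` explicitly. It assumes only that `beautifulsoup4` is installed; the fragment is a made-up example, not part of the article's sample document. With `html.parser` a bare fragment is kept as it is, whereas lxml would wrap it in `<html>` and `<body>`.

```python
from bs4 import BeautifulSoup

fragment = "<p>Hello</p>"  # hypothetical fragment, not from the article

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(fragment, "html.parser")

print(soup)       # <p>Hello</p>
print(soup.body)  # None, because html.parser adds no <body> wrapper
```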

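2. The four kinds of objects

Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds: (1) Tag, (2) NavigableString, (3) BeautifulSoup, (4) Comment.

2.1 Tag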

<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

A Tag is an HTML tag (such as `title`, `head`, `a`, and `p` in the snippet above) together with the content it encloses.

from bs4 import BeautifulSoup

html = """
    <html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object
soup = BeautifulSoup(html, 'lxml')

# Alternatively, create the object from a local HTML file
# soup = BeautifulSoup(open('index.html'), 'lxml')

# Pretty-print the contents of the soup object
# print(soup.prettify())

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.head)
# <head><title>The Dormouse's story</title></head>

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(type(soup.a))
# <class 'bs4.element.Tag'>


# Accessing soup.<tag name> returns that tag and its contents; the result is a bs4.element.Tag.
# This only finds the first matching tag in the whole document.



# A Tag has two important attributes: name and attrs

print(soup.name)
# [document]
# The soup object itself is special: its name is [document]

print(soup.head.name)
# head
# For any other tag, the value is the tag's own name

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# All attributes of the p tag, returned as a dict

print(soup.p['class'])
# ['title']  the value of the attribute
# Equivalent to the get() call below when the attribute exists
print(soup.p.get('class'))
# ['title']  the value of the attribute

soup.p['class'] = 'newClass'
# Modify the class attribute of the p tag
print(soup.p)
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

del soup.p['class']  # An attribute can also be deleted
print(soup.p)
# <p name="dromouse"><b>The Dormouse's story</b></p>
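
The attribute access shown above can be sketched on a smaller, hypothetical tag. The markup and variable names here are assumptions made for illustration; the point is that `class` comes back as a list, other attributes come back as plain strings, and `get()` returns `None` for a missing attribute where square brackets would raise `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, parsed with the standard-library parser.
soup = BeautifulSoup('<p class="body strikeout" id="intro">text</p>', "html.parser")
p = soup.p

classes = p["class"]      # multi-valued attribute: a list of class names
ident = p["id"]           # single-valued attribute: a plain string
missing = p.get("title")  # get() returns None instead of raising KeyError
has_id = p.has_attr("id")

print(classes, ident, missing, has_id)
# ['body', 'strikeout'] intro None True
```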

2.2 NavigableString

The text inside a tag can be obtained with .string

print(soup.p.string)
# The Dormouse's story

print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>
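
A short sketch of how a NavigableString relates to an ordinary Python string, using a one-tag document chosen for illustration. It shows that the object behaves like `str`, that `str()` yields a plain copy, and that the text in the tree is changed by replacing the node rather than editing it in place.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>The Dormouse's story</b>", "html.parser")
text = soup.b.string

print(isinstance(text, str))  # True: NavigableString subclasses str
plain = str(text)             # plain copy that no longer references the tree
print(type(plain).__name__)   # str

# A string in the tree cannot be edited in place, but it can be replaced.
soup.b.string.replace_with("A new story")
print(soup.b)                 # <b>A new story</b>
```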

2.3 BeautifulSoup

A BeautifulSoup object represents the document as a whole. It can be treated as a special Tag object, so its name and attributes can be read in the same way.

print(type(soup.name))
# <class 'str'>

print(soup.name)
# [document]

print(soup.attrs)
# {}    the document itself has no attributes

2.4 Comment

A Comment object is a special type of NavigableString; when printed, its content does not include the comment delimiters.

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.a.string)
#  Elsie

print(type(soup.a.string))
# <class 'bs4.element.Comment'>

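Note the difference between Comment and NavigableString. When a tag's .string is an HTML comment, the comment delimiters are dropped and only the inner text is returned, as a Comment object; when it is plain text, that text is returned as a NavigableString object.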
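
Because a comment prints just like ordinary text, it can be worth checking the type before using `.string`. The following is one possible way to do that, on a shortened version of the sample links; the `kinds` list is a name introduced here for illustration.

```python
from bs4 import BeautifulSoup, Comment

html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

kinds = []
for a in soup.find_all("a"):
    s = a.string
    # Comment is a subclass of NavigableString, so test for Comment first.
    kind = "comment" if isinstance(s, Comment) else "text"
    kinds.append((a["id"], kind, str(s)))

print(kinds)
# [('link1', 'comment', ' Elsie '), ('link2', 'text', 'Lacie')]
```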

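3. Traversing the document tree

3.1 Direct children: the `.contents` and `.children` attributes

(1) The `.contents` attribute

A Tag's `.contents` attribute returns the tag's child nodes as a list.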

print(soup.body.contents)
# .contents returns the tag's child nodes as a list
"""
['\n', <p class="title" name="dromouse"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, '\n', <p class="story">...</p>, '\n']
"""
# The result is a list, so individual elements can be picked by index
print(soup.body.contents[1])
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

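(2) The `.children` attribute

A Tag's `.children` attribute returns an iterator over the same direct children rather than a list (a list_iterator in the run shown here).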

print(soup.body.children)
# <list_iterator object at 0x7f55adea9d68>

for child in soup.body.children:
    print(child)
# Output
"""


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>


"""

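The blank lines in the output above are whitespace strings that sit between the tags. A minimal sketch of skipping them, assuming a small hypothetical `<body>` and keeping only children that are `Tag` instances:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

children = list(soup.body.children)
tags_only = [c for c in children if isinstance(c, Tag)]

print(len(children))                      # 5: three newline strings and two <p> tags
print([t.get_text() for t in tags_only])  # ['one', 'two']
```
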
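3.2 All descendants: the `.descendants` attribute

`.contents` and `.children` only cover a Tag's direct children. `.descendants` walks all of a Tag's descendants recursively; like `.children`, it returns a lazy iterable (here, a generator) rather than a list.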

print(soup.descendants)
# <generator object descendants at 0x7f98e70050f8>

for child in soup.descendants:
    print(child)
"""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...


"""

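To make the difference from `.children` easier to see than in the long output above, here is a small sketch on a hypothetical `<div>`. Tags are shown by name and strings by their text; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p><b>bold</b> text</p></div>", "html.parser")
div = soup.div

direct = [c.name for c in div.children]
everything = [d.name if isinstance(d, Tag) else str(d) for d in div.descendants]

print(direct)      # ['p']
print(everything)  # ['p', 'b', 'bold', ' text']
```
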
3.3 Node content: the `.string` attribute

If a tag has exactly one child and that child is a NavigableString, .string returns that child. If a tag's only child is another tag, .string also works and gives the same result as that child's .string.

print(soup.head.string)
# The Dormouse's story
print(soup.title.string)
# The Dormouse's story
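
The rule above also implies a case the article does not show: when a tag has more than one child, `.string` has nothing unambiguous to return and gives `None`. A brief sketch, using a made-up paragraph, with `.strings` and `get_text()` as alternatives:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")

print(soup.b.string)         # one
print(soup.p.string)         # None: <p> has two children
print(list(soup.p.strings))  # ['one', 'two']
print(soup.p.get_text(" "))  # one two
```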

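4. Searching the document tree

4.1 find_all(name,attrs,recursive,text,**kwargs)

4.1.1 The name parameter

The name parameter finds all tags whose name is name; string nodes are skipped automatically.

(1) Passing a string

When a string is passed to a search method, Beautiful Soup looks for tags whose name matches that string exactly.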

print(type(soup.find_all('p')))
# <class 'bs4.element.ResultSet'>

print(soup.find_all('p'))
"""
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
"""

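(2) Passing a regular expression

When a compiled regular expression is passed, Beautiful Soup filters tag names with the pattern's search() method, so an unanchored pattern matches anywhere in the name.

# Find all tags whose name starts with b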
import re
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)
# body
# b
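
The `^` anchor in the pattern above matters. As a small sketch on a hypothetical document, an unanchored pattern also picks up tag names that merely contain the letter somewhere:

```python
import re
from bs4 import BeautifulSoup

html = "<html><head><title>x</title></head><body><b>y</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

anchored = [t.name for t in soup.find_all(re.compile("^b"))]
unanchored = [t.name for t in soup.find_all(re.compile("t"))]

print(anchored)    # ['body', 'b']
print(unanchored)  # ['html', 'title']
```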

(3) Passing a list

When a list is passed, Beautiful Soup returns everything that matches any element of the list.

print(soup.find_all(["a",'p']))
"""
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]
"""

4.1.2 Keyword arguments

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
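
Beyond `id`, a few other keyword forms are commonly useful. The sketch below uses a shortened, hypothetical set of links; `class_` (with the trailing underscore, since `class` is reserved in Python), `attrs`, and `find()` are part of the Beautiful Soup API, while the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = (
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="other">Bob</a>'
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")               # exact attribute value
by_class = soup.find_all("a", class_="sister")  # class is a Python keyword, hence class_
with_id = soup.find_all(id=True)                # any tag that has an id attribute
by_attrs = soup.find_all(attrs={"class": "other"})
first = soup.find("a")                          # find() returns only the first match

print(len(by_id), len(by_class), len(with_id), len(by_attrs), first["id"])
# 1 2 2 1 link1
```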

4.1.3 The text parameter

The text parameter searches the strings in the document. Like name, it accepts a string, a regular expression, or a list.

import re
print(soup.find_all(text='Tillie'))

print(soup.find_all(text=["Tillie","Elsie","Lacie"]))

print(soup.find_all(text=re.compile("Dormouse")))
"""
['Tillie']
['Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
"""

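In Beautiful Soup 4.4.0 and later the same argument is also available under the name `string`. A minimal sketch on a hypothetical paragraph, assuming such a version is installed: on its own the argument returns the matching text nodes, and combined with a tag name it returns the tags whose `.string` matches.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

strings = soup.find_all(string="Tillie")    # matching text nodes
tags = soup.find_all("a", string="Tillie")  # <a> tags whose .string matches

print(strings)                  # ['Tillie']
print([t["id"] for t in tags])  # ['link3']
```
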
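4.2 CSS selectors

As in CSS, a tag name is written bare, a class name is prefixed with `.`, and an id is prefixed with `#`.

Use soup.select(); the result is list-like (a ResultSet, as the type() call in 4.2.6 shows).

4.2.1 Selecting by tag name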

print(soup.select("title"))
# [<title>The Dormouse's story</title>]
print(soup.select("a"))
"""
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""
print(soup.select('b'))
# [<b>The Dormouse's story</b>]

4.2.2 Selecting by class name

print(soup.select(".sister"))
"""
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""

4.2.3 Selecting by id

print(soup.select("#link1"))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

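4.2.4 Combined selectors

Combined selectors work the same way as in a CSS file, where tag names, class names, and ids are combined; parts that refer to different nodes are separated by a space.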

print(soup.select("p #link1"))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

To select direct children only, separate the parts with `>`.

print(soup.select("head > title"))
#[<title>The Dormouse's story</title>]
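
The difference between a space and `>` is easiest to see when both kinds of match exist in the same document. The markup below is a hypothetical example built for that purpose, not the article's sample:

```python
from bs4 import BeautifulSoup

html = (
    '<div id="box">'
    '<p class="story"><a id="link1">Elsie</a></p>'
    '<a id="link2">Lacie</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

descendants = [a["id"] for a in soup.select("div a")]  # any depth below <div>
children = [a["id"] for a in soup.select("div > a")]   # direct children only
same_node = [a["id"] for a in soup.select("p.story > a#link1")]

print(descendants)  # ['link1', 'link2']
print(children)     # ['link2']
print(same_node)    # ['link1']
```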

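4.2.5 Selecting by attribute

A selector can also include an attribute condition, written in square brackets. The attribute belongs to the same node as the tag, so there must be no space between them; with a space the selector refers to a descendant instead and will not match.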

print(soup.select('a[class="sister"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

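Attribute conditions can likewise be combined with the selectors above: parts on different nodes are separated by a space, parts on the same node are not.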

print(soup.select('p a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
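
Attribute conditions are not limited to exact equality. As a sketch, the standard CSS operators `^=` (starts with), `$=` (ends with), and `*=` (contains), plus a bare `[attr]` presence test, can be tried on a few hypothetical links (the `other.test` URL is invented for the example):

```python
from bs4 import BeautifulSoup

html = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/tillie" id="link3">Tillie</a>'
    '<a href="https://other.test/page" id="ext">Other</a>'
)
soup = BeautifulSoup(html, "html.parser")

prefix = [a["id"] for a in soup.select('a[href^="http://example.com/"]')]
suffix = [a["id"] for a in soup.select('a[href$="tillie"]')]
contains = [a["id"] for a in soup.select('a[href*="other"]')]
present = [a["id"] for a in soup.select("a[id]")]

print(prefix)    # ['link1', 'link3']
print(suffix)    # ['link3']
print(contains)  # ['ext']
print(present)   # ['link1', 'link3', 'ext']
```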

4.2.6 Getting the content

select() always returns a list-like result, so iterate over it (or index into it) and call get_text() on each element to get its text.

soup = BeautifulSoup(html,'lxml')
print(type(soup.select("title")))
# <class 'bs4.element.ResultSet'>
print(soup.select('title')[0])
# <title>The Dormouse's story</title>
print(soup.select("title")[0].get_text())
# The Dormouse's story

for title in soup.select("title"):
    print(title.get_text())
# The Dormouse's story
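
In scraping code the text is usually collected together with an attribute such as `href`. A possible sketch, on two hypothetical links, that also uses `select_one()` for the single-result case (it returns `None` when nothing matches):

```python
from bs4 import BeautifulSoup

html = (
    '<a href="http://example.com/lacie" class="sister">Lacie</a>'
    '<a href="http://example.com/tillie" class="sister"> Tillie </a>'
)
soup = BeautifulSoup(html, "html.parser")

# Pair each link's text (whitespace stripped) with its href attribute.
links = [(a.get_text(strip=True), a["href"]) for a in soup.select("a.sister")]
first = soup.select_one("a.sister")  # first match only
nothing = soup.select_one("span")    # no match

print(links)
# [('Lacie', 'http://example.com/lacie'), ('Tillie', 'http://example.com/tillie')]
print(first.get_text(), nothing)
# Lacie None
```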

Original article: https://www.cnblogs.com/nuochengze/p/12863045.html