
12_Python_Parsing Libraries_Using BeautifulSoup

1. Installation

pip3 install beautifulsoup4
pip3 install lxml    # needed for the 'lxml' parser used in the examples below

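Parsers supported by Beautiful Soup

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses documents the way a browser does; produces valid HTML5	Slow; requires an external Python package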
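
To make the choice of parser concrete, here is a minimal sketch that prefers lxml and falls back to the built-in parser when lxml is not installed. The helper name make_soup is hypothetical and not part of Beautiful Soup; FeatureNotFound is the exception Beautiful Soup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: use lxml if available, else Python's built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

Different parsers can build different trees from the same malformed markup, so it is usually safer to name the parser explicitly than to rely on whichever one happens to be installed.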

2. Basic usage

2.1 Finding elements

html = """<div><title>I am an HTML file</title><p>I am a p tag</p></div>"""

# Import the BeautifulSoup class
from bs4 import BeautifulSoup
# Parse the HTML string
soup = BeautifulSoup(html, 'lxml')

# 1 Get the title tag
title = soup.title
print(title)
# Output: <title>I am an HTML file</title>

# 2 Get the (first) p tag
p = soup.p
print(p)
# Output: <p>I am a p tag</p>
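
Attribute-style access such as soup.p returns only the first matching tag, and returns None when no such tag exists. The short sketch below illustrates both cases; the markup is made up for illustration and the built-in html.parser is assumed so that no extra package is needed.

```python
from bs4 import BeautifulSoup

doc = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.p      # only the first <p> in document order
missing = soup.h1     # there is no <h1> in the document

print(first_p)
print(missing)
```

Because a missing tag yields None, chaining further attribute access onto it (for example soup.h1.string) raises AttributeError.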

2.2 Getting attributes

html = ''' <p name = "hello">I am a p tag</p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# 1 Index the tag with the attribute name in square brackets
name_1 = soup.p['name']
print(name_1)
# Output: hello

# 2 Look the attribute up in the attrs dictionary
name_2 = soup.p.attrs['name']
print(name_2)
# Output: hello
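
Square-bracket access raises KeyError when the attribute is absent. As a hedged illustration (the markup and variable names are invented here), the sketch below uses Tag.get() for optional attributes and shows that class is returned as a list, because it is a multi-valued attribute.

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<p name="hello" class="intro lead">text</p>', "html.parser").p

name = tag.get("name")              # value of an attribute that exists
missing = tag.get("id")             # absent attribute: None instead of KeyError
fallback = tag.get("id", "no-id")   # absent attribute with an explicit default
classes = tag["class"]              # multi-valued attribute: a list of class names

print(name, missing, fallback, classes)
```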

2.3 Getting text content

html = ''' <p name = "hello">I am a p tag</p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# 1 Use the string attribute to get the text of the p tag
text_1 = soup.p.string
print(text_1)
# Output: I am a p tag

# 2 Use the get_text() method to get the text of the p tag
text_2 = soup.p.get_text()
print(text_2)
# Output: I am a p tag
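
The two approaches agree only when the tag contains a single string. A minimal sketch of the difference, using invented markup and the built-in html.parser: when a tag has more than one child, string is None, while get_text() concatenates the text of all descendants.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

p_string = soup.p.string      # <p> has two children (a string and <b>), so this is None
p_text = soup.p.get_text()    # text of all descendants joined together
b_string = soup.b.string      # <b> has exactly one string child

print(p_string, p_text, b_string)
```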

2.4 Nested selection

html = ''' <div><p name = "hello">I am a p tag</p></div>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# 1 Use the string attribute to get the text of the p tag inside the div tag
text_1 = soup.div.p.string
print(text_1)
# Output: I am a p tag

# 2 Use the get_text() method to get the text of the p tag inside the div tag
text_2 = soup.div.p.get_text()
print(text_2)
# Output: I am a p tag

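3. Navigating related nodes

3.1 Child nodes

contents: gets all direct child nodes and returns them as a list.
children: gets the same direct child nodes, but returns an iterator that can be consumed with a for loop.

html = """
    <ul>
        <li>Fruit menu
            <p class='banner'>Banana
                <a>Small banana</a></p></li></ul>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

'''contents: gets all direct child nodes and returns them as a list.'''
result_01 = soup.ul.contents
print(result_01)
# Output: a two-item list holding the whitespace text node that precedes <li>, followed by the <li> tag


'''children: gets the direct child nodes as an iterator that can be consumed with a for loop.'''

result_02 = soup.ul.children
for i in result_02:
    print(i)
# Output: the whitespace text node first, then the <li> tag:
# <li>Fruit menu
#             <p class="banner">Banana
#                 <a>Small banana</a></p></li>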
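
Whitespace between tags is itself a child node, which is why contents and children often contain more items than expected. The following sketch (invented markup, built-in html.parser assumed) keeps only the Tag children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = "<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>"
soup = BeautifulSoup(html_doc, "html.parser")

all_children = soup.ul.contents   # whitespace strings and <li> tags interleaved
tag_children = [c for c in soup.ul.children if isinstance(c, Tag)]
tag_texts = [c.get_text() for c in tag_children]

print(len(all_children), tag_texts)
```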

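3.2 Descendant nodes

The descendants attribute gets all descendant nodes. It also returns a generator, and it walks the child nodes recursively, so every nested node is visited.

html = """
    <ul>
        <li>Fruit menu
            <p class='banner'>Banana
                <a>Small banana</a></p></li></ul>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

'''descendants recursively yields every child, grandchild, and deeper node.'''

result = soup.ul.descendants
for i in result:
    print(i)

# Output (after the leading whitespace-only text node):
"""
<li>Fruit menu
            <p class="banner">Banana
                <a>Small banana</a></p></li>
Fruit menu
            
<p class="banner">Banana
                <a>Small banana</a></p>
Banana
                
<a>Small banana</a>
Small banana
"""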
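
To contrast children with descendants more compactly, the sketch below collects tag names at both levels. The markup is a whitespace-free variant made up for illustration, and the built-in html.parser is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<ul><li>menu<p>banana<a>small</a></p></li></ul>", "html.parser")

# Direct children only: one level below <ul>.
child_names = [n.name for n in soup.ul.children if isinstance(n, Tag)]

# All levels below <ul>, in document order.
descendant_names = [n.name for n in soup.ul.descendants if isinstance(n, Tag)]

# descendants also yields the text nodes, not just the tags.
texts = [str(n) for n in soup.ul.descendants if not isinstance(n, Tag)]

print(child_names, descendant_names, texts)
```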

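3.3 Parent and ancestor nodes

The parent attribute gets the direct parent node.
The parents attribute recursively yields all ancestors of an element.

html = """
    <ul>
        <li>Fruit menu
            <p class='banner'>Banana
                <a>Small banana</a></p></li></ul>"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

'''parent gets the direct parent node'''
print(soup.a.parent)
# Output:
'''<p class="banner">Banana
                <a>Small banana</a></p>'''

'''parents recursively yields all ancestors of the element'''
result = soup.a.parents
for i in result:
    print(i)
# Prints, in order: the <p>, <li>, <ul>, <body> and <html> tags, then the whole document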
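
Printing every ancestor is verbose, so a hedged sketch that lists only their names may be easier to read. It uses invented markup and the built-in html.parser, which does not add html or body wrappers; the last ancestor is the BeautifulSoup object itself, whose name is [document].

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li><p><a>x</a></p></li></ul>", "html.parser")

direct_parent = soup.a.parent.name
ancestor_names = [node.name for node in soup.a.parents]

print(direct_parent, ancestor_names)
```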

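3.4 Sibling nodes

next_sibling: the node immediately after the current node at the same level.
previous_sibling: the node immediately before the current node at the same level.
next_siblings: an iterator over all siblings that follow the current node.
previous_siblings: an iterator over all siblings that precede the current node.

A sibling can be a text node (including whitespace) as well as a tag.

html = """
    <ul>
        <li>Fruit menu
            <p class='banner'>Banana</p><p class='apple'>Apple
                <a>Small banana</a></p></li></ul>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

'''next_sibling gets the next sibling node'''
result_1 = soup.p.next_sibling
print(result_1)

'''next_siblings iterates over all following sibling nodes'''
result_2 = soup.p.next_siblings
for i in result_2:
    print(i)

'''previous_sibling gets the previous sibling node'''
result_3 = soup.p.previous_sibling
print(result_3)

'''previous_siblings iterates over all preceding sibling nodes'''
result_4 = soup.p.previous_siblings
for i in result_4:
    print(i)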
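
A common surprise with the sibling attributes is that the nearest sibling is often a whitespace string rather than a tag. The sketch below, on invented markup with the built-in html.parser, contrasts next_sibling with find_next_sibling(), which skips to the next matching tag.

```python
from bs4 import BeautifulSoup

html_doc = "<div>\n<p id='a'>A</p>\n<p id='b'>B</p>\n</div>"
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find("p", id="a")

raw_next = first.next_sibling             # the newline between the two <p> tags
tag_next = first.find_next_sibling("p")   # the next <p> tag, skipping text nodes

print(repr(raw_next), tag_next)
```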

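4. The find_all() method

find_all(): as the name suggests, returns all elements that match the given conditions.

find_all(name, attrs, recursive, string, limit, **kwargs)

In Beautiful Soup versions before 4.4.0 the string argument is named text.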
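
The recursive and limit parameters are not demonstrated in the sections that follow, so here is a minimal sketch of them. The markup is invented for illustration and the built-in html.parser is assumed.

```python
from bs4 import BeautifulSoup

html_doc = "<div><p>1</p><p>2</p><section><p>3</p></section></div>"
soup = BeautifulSoup(html_doc, "html.parser")

all_p = soup.find_all("p")                          # every <p>, at any depth
first_two = soup.find_all("p", limit=2)             # stop after two matches
direct = soup.div.find_all("p", recursive=False)    # only direct children of <div>

print(len(all_p), len(first_two), len(direct))
```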

4.1 name

Query elements by tag name, for example:

html = """
<a  class="apple" id="link1">Apple</a>
<a  class="banana" id="link2">Banana<span>Emperor banana</span></a>
<p  class="cole" id="link3">Cola</p>

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

'''Find elements by tag name'''
a = soup.find_all(name="a")
print(a)
# Output: [<a class="apple" id="link1">Apple</a>, <a class="banana" id="link2">Banana<span>Emperor banana</span></a>]


'''Nested query by tag name'''
for a in soup.find_all(name="a"):
    span = a.find_all(name='span')
    print(span)
# Output:
# []
# [<span>Emperor banana</span>]
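
The name argument is not limited to a single string. As a hedged sketch on invented markup (built-in html.parser assumed), it also accepts a list of tag names or a compiled regular expression that is matched against each tag name.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><a>x</a><b>y</b><p>z</p></body>", "html.parser")

# A list matches any of the listed tag names.
by_list = [t.name for t in soup.find_all(["a", "p"])]

# A compiled pattern is searched against each tag name.
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]

print(by_list, by_regex)
```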

4.2 attrs={'key':'value'}

Query elements by their attributes, for example:

html = """
    <p class="apple" id="item_1">Apple</p>
    <p class="coffee" id="item_2">Coffee</p>
    <p class="cole" id="item_3">Cola</p>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

'''Find by id value'''

id_1 = soup.find_all(attrs={'id': 'item_1'})
id_2 = soup.find_all(id="item_1")           # id can also be passed directly as a keyword argument

'''Find by class value'''

class_1 = soup.find_all(attrs={'class': 'coffee'})
class_2 = soup.find_all(class_="coffee")    # class is a reserved word in Python, so the keyword argument is spelled class_

[Note]

1 The attrs argument must be a dictionary.

2 The result is returned as a list.

3 When class is passed as a keyword argument, it needs a trailing underscore (class_).
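
Two details of attribute matching are worth a small sketch, written here on invented markup with the built-in html.parser: attribute names that are not valid Python identifiers (such as data-role) have to go through the attrs dictionary, and class_ matches a tag if any one of its classes equals the given value.

```python
from bs4 import BeautifulSoup

html_doc = '<div data-role="menu" class="nav main">x</div><div class="nav">y</div>'
soup = BeautifulSoup(html_doc, "html.parser")

# data-role cannot be written as a keyword argument, so use attrs.
menus = soup.find_all(attrs={"data-role": "menu"})

# class_ is compared against each individual class of a tag.
navs = soup.find_all(class_="nav")
mains = soup.find_all(class_="main")

print(len(menus), len(navs), len(mains))
```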

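4.3 string (named text before Beautiful Soup 4.4.0)

The string argument matches the text of nodes. It accepts either a plain string, which must match the whole text, or a compiled regular expression. The result contains the matching text nodes rather than the tags that hold them. For example:

html = """
<p class="apple" id="item_1">green apple</p>
<p class="coffee" id="item_2">black coffee</p>
<p class="cole" id="item_3">cola</p>"""

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

'''Find text nodes that contain "apple"'''

apple = re.compile('apple')
result = soup.find_all(string=apple)
print(result)
# Output: ['green apple']

'''Find text nodes whose text is exactly "cola"'''

result = soup.find_all(string="cola")
print(result)
# Output: ['cola']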
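
Because a text-only search returns strings, getting back to the enclosing tag takes one more step. The sketch below shows two ways to do it, assuming Beautiful Soup 4.4.0 or later (where the argument is called string), invented markup, and the built-in html.parser.

```python
import re

from bs4 import BeautifulSoup

html_doc = '<p id="a">green apple</p><p id="b">cola</p>'
soup = BeautifulSoup(html_doc, "html.parser")

pattern = re.compile("apple")

strings = soup.find_all(string=pattern)      # the matching text nodes
parents = [s.parent for s in strings]        # step up from each string to its tag
tags = soup.find_all("p", string=pattern)    # or filter <p> tags by their text directly

print(strings, parents, tags)
```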

5. The find() method

find() takes the same arguments as find_all(). The only difference is that find() returns a single element, the first match, instead of a list.

html = """<p class="item">apple</p><p class="item">cola</p>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

'''1. find(): returns a single element, the first match.'''

item = soup.find(class_='item')
print(item)
# Output: <p class="item">apple</p>

'''2. find_all(): returns a list of all matching elements.'''

items = soup.find_all(class_='item')
print(items)
# Output: [<p class="item">apple</p>, <p class="item">cola</p>]
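
The two methods also differ when nothing matches: find() returns None, whereas find_all() returns an empty list. A minimal sketch of guarding against the None case, on invented markup with the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="item">apple</p>', "html.parser")

hit = soup.find("p", class_="item")
miss = soup.find("span")            # no <span> in the document, so None
miss_all = soup.find_all("span")    # no matches, so an empty list

# Check for None before touching attributes or methods of the result.
miss_text = miss.get_text() if miss is not None else ""

print(hit, miss, miss_all, repr(miss_text))
```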

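6. Other search methods

- find_parents() and find_parent(): the former returns all ancestor nodes, the latter returns the nearest one (the direct parent when no filter is given).

- find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter returns the first following sibling.

- find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter returns the first preceding sibling.

- find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter returns the first one.

- find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter returns the first one.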
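
The methods listed above can be tried out on a small list. This is only an illustrative sketch: the markup and ids are invented, and the built-in html.parser is assumed.

```python
from bs4 import BeautifulSoup

html_doc = "<ul><li id='a'>A</li><li id='b'>B</li><li id='c'>C</li></ul>"
soup = BeautifulSoup(html_doc, "html.parser")

b = soup.find("li", id="b")

parent_name = b.find_parent("ul").name
next_id = b.find_next_sibling("li")["id"]
prev_id = b.find_previous_sibling("li")["id"]
following_ids = [li["id"] for li in b.find_next_siblings("li")]
next_li_id = b.find_next("li")["id"]                             # first <li> after b in document order
earlier_ids = [li["id"] for li in b.find_all_previous("li")]     # <li> tags that come before b

print(parent_name, next_id, prev_id, following_ids, next_li_id, earlier_ids)
```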

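7. CSS selectors

To find elements with a CSS selector, call the select() method and pass in the selector. It returns a list of all matching elements. For example:

7.1 Basic use of select()

html = '''
<p class="drink">Drinks set
    <a class="drink" id="coffee">Coffee</a>
     <a class="drink" id="milk">Milk</a></p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

'''1. Find the element with id=coffee inside an element with class=drink'''
print(soup.select('.drink #coffee'))
# Output: [<a class="drink" id="coffee">Coffee</a>]

'''2. Find the element with id=coffee'''
print(soup.select('#coffee'))
# Output: [<a class="drink" id="coffee">Coffee</a>]

'''3. Find all a tags nested inside a p tag'''
print(soup.select('p a'))
# Output: [<a class="drink" id="coffee">Coffee</a>, <a class="drink" id="milk">Milk</a>]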
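
A few more selector forms, shown as a hedged sketch on a compact variant of the markup above (built-in html.parser assumed): the child combinator, an attribute selector, a tag-plus-class selector, and select_one(), which returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="drink">Set'
    '<a class="drink" id="coffee">Coffee</a>'
    '<a class="drink" id="milk">Milk</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

children = soup.select("p > a")               # <a> tags that are direct children of <p>
by_attr = soup.select('a[id="coffee"]')       # attribute selector
by_class = soup.select("a.drink")             # tag name combined with a class
milk = soup.select_one("#milk")               # single element rather than a list
nothing = soup.select_one("#tea")             # no match

print(len(children), len(by_attr), len(by_class), milk, nothing)
```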

7.2 Getting element attributes

html = '''
<p class="drink">Drinks set
    <a class="drink" id="coffee">Coffee</a>
     <a class="drink" id="milk">Milk</a></p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

'''Select all a tags'''
a = soup.select('a')

'''Get the id attribute of each a tag'''
for i in a:
    result = i['id']        # index the tag with the attribute name
    result = i.attrs['id']  # or look the value up in the attrs dictionary
    print(result)

# Output:
'''
coffee
milk
'''
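
When some of the selected elements may lack the attribute, indexing raises KeyError. The sketch below shows two ways around that, on invented markup with the built-in html.parser: Tag.get() returns None for a missing attribute, and an attribute selector restricts the selection to elements that have it.

```python
from bs4 import BeautifulSoup

html_doc = '<a id="coffee" href="/coffee">Coffee</a><a id="milk">Milk</a>'
soup = BeautifulSoup(html_doc, "html.parser")

# get() tolerates the missing href on the second link.
hrefs = [a.get("href") for a in soup.select("a")]

# a[href] selects only the links that carry an href attribute.
with_href = [a["href"] for a in soup.select("a[href]")]

print(hrefs, with_href)
```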

7.3 Getting element text

The text of an element can be read in three ways:

the string attribute
the get_text() method
the text attribute

html = '''
<p class="drink">Drinks set
    <a class="drink" id="coffee">Coffee</a>
     <a class="drink" id="milk">Milk</a></p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

'''Select the element with id=coffee'''
coffee = soup.select('#coffee')

for i in coffee:
    result_1 = i.text
    result_2 = i.string
    result_3 = i.get_text()
    print(result_1,result_2,result_3)

# Output: Coffee Coffee Coffee
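
The three approaches give the same result here only because the a tag holds a single string. For the enclosing p tag they differ, as the following sketch suggests (same markup, built-in html.parser assumed). stripped_strings and the separator and strip arguments of get_text() are convenient for cleaning up the whitespace.

```python
from bs4 import BeautifulSoup

html_doc = '''
<p class="drink">Drinks set
    <a class="drink" id="coffee">Coffee</a>
     <a class="drink" id="milk">Milk</a></p>'''
soup = BeautifulSoup(html_doc, "html.parser")

p = soup.select_one("p.drink")

p_string = p.string                       # several children, so None
pieces = list(p.stripped_strings)         # each text fragment, stripped, blanks skipped
joined = p.get_text(" ", strip=True)      # fragments stripped and joined with a space

print(p_string, pieces, joined)
```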

References:

静觅 (Jingmi) » [Python3网络爬虫开发实战] 4.2 Using Beautiful Soup
CSS selector reference: http://www.w3school.com.cn/cssref/css_selectors.asp

Original article: https://www.cnblogs.com/jasontang369/p/9616026.html