
Using Beautiful Soup

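1. Introduction

　　In short, Beautiful Soup is a Python library for parsing HTML and XML that makes it easy to extract data from web pages. It provides a small set of simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape.

　　Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding.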
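As a rough illustration of stating the original encoding, the sketch below parses GBK-encoded bytes that carry no charset declaration. The sample markup is made up for this example, and the standard-library parser 'html.parser' is used here only so that the snippet needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK bytes with no <meta charset> declaration,
# so nothing in the document says how to decode it.
raw = "<p>你好，世界</p>".encode("gbk")

# from_encoding states the original encoding of the bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string          # already a Unicode str
utf8_bytes = soup.p.encode()  # encode() produces UTF-8 bytes by default

print(text)
print(utf8_bytes.decode("utf-8"))
```

If the bytes were passed without from_encoding, Beautiful Soup would have to guess the encoding, and the guess may be wrong for short documents like this one.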

2. Preparation

Installing Beautiful Soup

a. Related links

　　Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

　　Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh

　　PyPI: https://pypi.python.org/pypi/beautifulsoup4

b. Installing with pip3

　　pip3 install beautifulsoup4

c. Installing from a wheel

　　Download the wheel (.whl) file from PyPI,

　　then install the downloaded wheel file with pip.
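A quick way to confirm that the installation worked is to import the package from Python. This is only a sanity check, not part of the original steps. It also reports whether lxml is importable, since the 'lxml' parser used in the examples below comes from a separate package (pip3 install lxml).

```python
import importlib.util

import bs4

# Report the installed Beautiful Soup version.
print("beautifulsoup4", bs4.__version__)

# lxml is a separate package; check whether it can be imported.
has_lxml = importlib.util.find_spec("lxml") is not None
print("lxml available:", has_lxml)
```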

3. Using Beautiful Soup

1. Basic usage

from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Beautiful Suop</title>
</head>
<body>
<p class="title" name="dromouse"><b>The story</b></p>
<p class="story" >once upon a time there were three title sisters;and their name were
<a href="http://example.com/elsie" class="sister" id="link1">Elise</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

The output is as follows:

<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   The Beautiful Suop
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The story
   </b>
  </p>
  <p class="story">
   once upon a time there were three title sisters;and their name were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elise
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
    and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Beautiful Suop

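　　The code first declares a variable html that holds an HTML string. Note that it is not a complete HTML string: the body and html nodes are never closed. We pass it as the first argument to the BeautifulSoup constructor; the second argument is the parser type (lxml here). That completes the initialization of the BeautifulSoup object, which is assigned to the variable soup. From there we can call the methods and attributes of soup to parse the HTML.

　　First we call prettify(), which outputs the parsed document as a string with standard indentation. Note that the output includes the closing body and html tags: Beautiful Soup automatically corrects non-standard HTML. This correction is not done by prettify(); it already happened when the object was initialized.

　　Then we call soup.title.string, which prints the text content of the title node. In other words, soup.title selects the title node of the document, and its string attribute returns the text inside it.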
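To see that the repair happens at parse time rather than in prettify(), the sketch below serializes the tree with str() and never calls prettify(). The markup is a made-up minimal case, and 'html.parser' is swapped in for lxml so the snippet depends only on bs4; that parser closes the tags that were left open but, unlike lxml, does not add html or body tags that were never written.

```python
from bs4 import BeautifulSoup

# The html, body and p tags are opened but never closed.
broken = "<html><body><p>Hello"

soup = BeautifulSoup(broken, "html.parser")

# str() serializes the tree as it is; the closing tags are already there.
repaired = str(soup)
print(repaired)
print(soup.p.string)
```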

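2. Node selectors

Accessing a node by its tag name as an attribute selects that element, and calling string on the result returns its text. This kind of selection is very fast; it is a good choice when you want a single node whose position in the tree is clear.

　　♦ Selecting elements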

from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Beautiful Suop</title>
</head>
<body>
<p class="title" name="dromouse"><b>The story</b></p>
<p class="story" >once upon a time there were three title sisters;and their name were
<a href="http://example.com/elsie" class="sister" id="link1">Elise</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

The output is as follows:

<title>The Beautiful Suop</title>
<class 'bs4.element.Tag'>
The Beautiful Suop
<head>
<meta charset="utf-8"/>
<title>The Beautiful Suop</title>
</head>
<p class="title" name="dromouse"><b>The story</b></p>

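　　This reuses the sample HTML from before. The first line prints the selected title node, including its tags and its text. The second prints its type, <class 'bs4.element.Tag'>, which is an important data structure in Beautiful Soup. The third prints the text of the node through the string attribute.

　　Next we select the head node and the p node. Selecting p prints only the first p node: when several nodes share a name, this style of selection matches the first one and ignores the rest.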
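The first-match behaviour can be checked with a small sketch. The two-paragraph markup is invented for this example, and 'html.parser' stands in for lxml to keep the snippet dependency-free. It also shows what happens when the tag does not exist at all.

```python
from bs4 import BeautifulSoup

html = '<p class="title">first</p><p class="story">second</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.p               # shorthand for soup.find("p")
all_p = soup.find_all("p")   # every p node, in document order
missing = soup.table         # there is no table tag in this document

print(first)
print(len(all_p))
print(missing)
```

Because a missing tag gives None, chaining such as soup.table.string raises AttributeError; checking for None first avoids that.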

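　　♦ Extracting information

　　　　How do we get the name of a node and the values of its attributes?

　　(1) Getting the name

　　Use the name attribute to get the name of a node: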

print(soup.title.name)

Output:

title

　　(2) Getting attributes

　　A node can have several attributes, such as id and class. Once a node is selected, attrs returns all of them:

print(soup.p.attrs)
Output:
{'class': ['title'], 'name': 'dromouse'}

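　　As you can see, attrs returns a dictionary that maps every attribute name to its value. To get the name attribute, look it up by key with attrs['name']. There is also a shorter form: index the node element directly with the attribute name: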

print(soup.p['name'])
print(soup.p['class'])

Output:
dromouse
['title']

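　　Note that some results are strings and others are lists. The name attribute holds a single value, so a single string comes back; class can hold several values, so a list comes back. Check which one you have in real code.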
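One way to cope with the string-or-list difference is sketched below. The a node and its attribute values are invented for the example, and 'html.parser' is used instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

html = '<a id="link1" class="sister external" href="http://example.com/elsie">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

node_id = a["id"]                       # single-valued attribute: a str
classes = a["class"]                    # multi-valued attribute: a list of str
class_text = " ".join(classes)          # join the list to get one string back
title = a.get("title")                  # get() returns None instead of raising KeyError
id_as_list = a.get_attribute_list("id") # always a list, whatever the attribute

print(node_id, classes, class_text, title, id_as_list)
```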

　　(3) Getting content

　　Use the string attribute to get the text content:

print(soup.p.string)

Output:
The story

Here soup.p is the first p node.
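The string attribute works in the example above because the first p node has exactly one child. A short sketch with made-up markup (parsed with 'html.parser' to keep it self-contained) shows what happens when a node has more than one child:

```python
from bs4 import BeautifulSoup

html = "<p><b>The story</b></p><p>Hello <b>world</b></p>"
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

# One child, which itself holds one string: string reaches through to it.
print(first_p.string)
# Two children (a text node and a b node): string cannot pick one.
print(second_p.string)
# get_text() concatenates every string inside the node.
print(second_p.get_text())
```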

　　♦ Nested selection

　　In the examples above every step returns a bs4.element.Tag, so we can keep selecting nodes from the result:

print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

　　Output:

<title>The Beautiful Suop</title>
<class 'bs4.element.Tag'>
The Beautiful Suop

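　　♦ Relational selection

　　Select a node first, then use it as the starting point to select its parent, children, siblings, and so on.

(1) Children and descendants

　　Use the contents attribute to get the direct children: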

from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Beautiful Suop</title>
</head>
<body>
<p class="story" >once upon a time there were three title sisters;and their name were
<a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elise</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

Output:

['once upon a time there were three title sisters;and their name were
', 
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>, '
', 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '
', 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';
    
and they lived at the bottom of a well.
']

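　　The p node contains both text nodes and element nodes, so the result is a list.

　　children yields the same nodes, but it returns an iterator rather than a list: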

print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

Output:

<list_iterator object at 0x0000016B477884A8>
0 once upon a time there were three title sisters;and their name were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 ;
    and they lived at the bottom of a well.

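　　The descendants attribute returns a generator over all descendant nodes. Iterating over soup.p.descendants in the same way gives the output below, which now includes the span node and every text node nested inside the children: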

<generator object descendants at 0x0000029DA472D9E8>
0 once upon a time there were three title sisters;and their name were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>
2 

3 <span>Elise</span>
4 Elise
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 

10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 ;
    and they lived at the bottom of a well.
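The difference between direct children and descendants can be sketched on a smaller tree. The markup here is a shortened, invented variant of the sample, and 'html.parser' replaces lxml so that only bs4 is required.

```python
from bs4 import BeautifulSoup, Tag

html = "<p>intro<a><span>Elsie</span></a>outro</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

children = list(p.children)        # direct children only
descendants = list(p.descendants)  # children, grandchildren, and so on

print(len(children), len(descendants))

# Keep only element nodes; text nodes are not Tag instances.
tag_names = [node.name for node in descendants if isinstance(node, Tag)]
print(tag_names)
```

The three children are the text "intro", the a node and the text "outro"; descendants adds the span node and the text inside it.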

(2) Parent and ancestor nodes

　　Use parent to get the direct parent of a node:

print(soup.a.parent)

　　Output:

<p class="story">once upon a time there were three title sisters;and their name were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>

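　　The direct parent of the a node is clearly the p node, so the whole p node is printed.

　　Use parents to get all ancestor nodes. It returns a generator; wrapping it in enumerate() and list() shows each ancestor of the a node together with its index: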

print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))

　　Output:

<class 'generator'>
[(0, <p class="story">once upon a time there were three title sisters;and their name were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>), 

(1, <body>
<p class="story">once upon a time there were three title sisters;and their name were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>), 

(2, <html lang="en">
<head>
<meta charset="utf-8"/>
<title>The Beautiful Suop</title>
</head>
<body>
<p class="story">once upon a time there were three title sisters;and their name were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body></html>), 

(3, <html lang="en">
<head>
<meta charset="utf-8"/>
<title>The Beautiful Suop</title>
</head>
<body>
<p class="story">once upon a time there were three title sisters;and their name were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body></html>)]
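The last two entries in the output above look identical because the final ancestor is the BeautifulSoup object itself, which serializes to the same markup as the html node. Printing only the names makes the chain easier to read; the sketch uses a minimal, invented document and 'html.parser' rather than lxml.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upwards from the a node and collect the name of each ancestor.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# The last ancestor is the BeautifulSoup object, named "[document]".
last = list(soup.a.parents)[-1]
print(last is soup)
```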

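(3) Sibling nodes

　　Siblings are nodes at the same level. next_sibling and previous_sibling return the next and the previous sibling of a node; next_siblings and previous_siblings return generators over all following and all preceding siblings.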

from bs4 import BeautifulSoup

html = """
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The Beautiful Suop</title>
</head>
<body>
<p class="story" >once upon a time there were three title sisters;and their name were

<a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elise</span>
</a>
hello
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
"""
soup = BeautifulSoup(html, 'lxml')

print("Next Sibling:", soup.a.next_sibling)
print("Prev Sibling:", soup.a.previous_sibling)
print("Next Siblings:", list(soup.a.next_siblings))
print("Prev Siblings:", list(soup.a.previous_siblings))

　　Output:

Next Sibling: 
hello

Prev Sibling: once upon a time there were three title sisters;and their name were


Next Siblings: ['
hello
', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '
', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';
    and they lived at the bottom of a well.
']
Prev Siblings: ['once upon a time there were three title sisters;and their name were

']
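As the output shows, text nodes (including bare line breaks) count as siblings. When only element siblings are wanted, find_next_sibling() and find_previous_sibling() with a tag name skip the text. A minimal sketch, with invented markup and 'html.parser' in place of lxml:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

plain_next = first.next_sibling              # the line break between the two tags
next_a = first.find_next_sibling("a")        # the next sibling that is an a node
prev_a = first.find_previous_sibling("a")    # nothing comes before the first a

print(repr(plain_next))
print(next_a["id"])
print(prev_a)
```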

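(4) Extracting information

　　For a single node, call string, attrs, and similar attributes directly to get its text and attributes. For a generator of several nodes, convert it to a list, pick the node you want, and then call string, attrs, and so on to get the text and attributes of that node.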

from bs4 import BeautifulSoup

html = """
<html lang="en">
<body>
<p class="story" >once upon a time there were three title sisters;and their name were
    <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elise</span>
    </a>
</p>
"""
soup = BeautifulSoup(html, 'lxml')

print(soup.a.next_sibling.string)
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])

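　　Output (the first print emits only a line break, because the node after the a node is a whitespace-only text node):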

<p class="story">once upon a time there were three title sisters;and their name were
    <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elise</span>
</a>
</p>
['story']

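3. Method selectors

　　♦ find_all()

　　find_all() returns every element that matches the given conditions. You pass it tag names, attributes, or text to match, which makes it very powerful.

　　find_all(name, attrs, recursive, text, **kwargs)

　　(1) name

　　　Query elements by node name: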

from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
        
    </div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

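　　Calling find_all() with name='ul' finds all ul nodes and returns them as a list whose elements are all of type bs4.element.Tag. Because each element is a Tag, we can continue with a nested query for the li nodes inside each ul: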

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))

　　Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

　　Then iterate over each li and get its text:

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)

　　Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

　　(2) attrs

　　Query by the attributes passed in:

print(soup.find_all(attrs={'id': 'list-1'}))

　　Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

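　　Common attributes such as id and class can be passed directly as keyword arguments, without attrs. Because class is a Python keyword, the argument needs a trailing underscore: class_='element'.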

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

　　Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, 
<li class="element">Bar</li>, 
<li class="element">Jay</li>, 
<li class="element">Foo</li>, 
<li class="element">Bar</li>]
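Two details of attribute matching are worth a short sketch: class_ matches a node if any one of its classes is equal to the value, and attribute names that are not valid Python identifiers have to go through attrs. The data-kind attribute and the trimmed markup below are invented for the illustration, and 'html.parser' is used in place of lxml.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1" data-kind="main"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any single class on a node, so both ul nodes qualify.
by_class = [ul["id"] for ul in soup.find_all(class_="list")]
# A tag name combined with a keyword filter narrows the match.
small = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]
# data-kind cannot be written as a keyword argument, so use attrs.
by_data = [ul["id"] for ul in soup.find_all(attrs={"data-kind": "main"})]

print(by_class, small, by_data)
```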

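　　(3) text

　　The text argument matches the text of nodes. It accepts either a string or a compiled regular expression (Beautiful Soup 4.4.0 and later also accept it under the name string):

import re

print(soup.find_all(text=re.compile('F')))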

　　Output:

['Foo', 'Foo']
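The remaining find_all() arguments can be sketched in the same way. The example assumes Beautiful Soup 4.4.0 or later, where the text filter is spelled string; it also shows limit, which find_all() accepts although it is not listed in the signature above. The markup is a condensed version of the sample, parsed with 'html.parser'.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
    '<ul id="list-2"><li>Foo</li><li>Bar</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

# string with a regular expression returns the matching text nodes.
texts = soup.find_all(string=re.compile("^F"))
# A tag name plus string returns the tags whose text matches.
bar_items = soup.find_all("li", string="Bar")
# limit stops the search after the given number of matches.
first_two = [li.string for li in soup.find_all("li", limit=2)]
# recursive=False looks only at the direct children of the node.
top_level = soup.find_all("ul", recursive=False)
nested_only = soup.find_all("li", recursive=False)

print(texts, len(bar_items), first_two, len(top_level), nested_only)
```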

　　♦ find()

　　find() returns a single element: the first one that matches.

print(soup.find(name='ul'))
print(soup.find(class_='list'))

　　Output:

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

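　　There are many similar methods:

　　find_parent(): returns the nearest matching ancestor (the direct parent when no filter is given)

　　find_parents(): returns all matching ancestor nodes

　　find_next_sibling(): returns the first matching sibling after the node

　　find_next_siblings(): returns all matching siblings after the node

　　find_previous_sibling(): returns the first matching sibling before the node

　　find_previous_siblings(): returns all matching siblings before the node

　　find_next(): returns the first matching node after the node

　　find_all_next(): returns all matching nodes after the node

　　find_previous(): returns the first matching node before the node

　　find_all_previous(): returns all matching nodes before the node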
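A few of these methods in action, starting from the middle li node. This is a sketch on a single invented list, parsed with 'html.parser' rather than lxml so that it needs only bs4.

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")

parent_id = bar.find_parent("ul")["id"]
next_text = bar.find_next_sibling("li").string
prev_text = bar.find_previous_sibling("li").string
earlier = [li.string for li in bar.find_all_previous("li")]
no_match = soup.find("table")  # find() gives None when nothing matches

print(parent_id, next_text, prev_text, earlier, no_match)
```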

4. CSS selectors

　　To use CSS selectors, call select() and pass in the appropriate CSS selector:

print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))

　　Output:

[<div class="panel-heading">
<h4>hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
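Beyond descendant selectors, select() also understands child combinators and attribute selectors, and select_one() returns only the first match. The sketch below uses a trimmed, invented version of the panel markup and 'html.parser'; it assumes a Beautiful Soup release that bundles CSS support through the soupsieve package (4.7 or later).

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("#list-1 .element")   # first match, or None
small = soup.select("ul.list-small > li")     # class selector plus child combinator
by_attr = soup.select('ul[id="list-2"] li')   # attribute selector
nothing = soup.select_one("table")            # no table in the document

print(first.string, [li.string for li in small], [li.string for li in by_attr], nothing)
```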

　　♦ Nested selection

　　Iterate over each ul node and select the li nodes inside it:

for ul in soup.select('ul'):
    print(ul.select('li'))

　　Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

　　♦ Getting attributes

for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

　　Output:

list-1
list-1
list-2
list-2

　　♦ Getting text

　　To get the text, you can use get_text() in addition to string:

# Get the text of each li node
for li in soup.select('li'):
    print(li.get_text())
    print(li.string)

　　Output:

Foo
Foo
Bar
Bar
Jay
Jay
Foo
Foo
Bar
Bar
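get_text() also takes a separator and a strip flag, which helps when the text is split across nested tags or padded with whitespace. A small sketch with invented markup, parsed with 'html.parser':

```python
from bs4 import BeautifulSoup

html = "<li>\n  Foo <b>Bar</b>\n</li>"
soup = BeautifulSoup(html, "html.parser")
li = soup.li

raw = li.get_text()                    # every string, whitespace included
clean = li.get_text(" ", strip=True)   # strip each string, join with a space
pieces = list(li.stripped_strings)     # the stripped strings one by one

print(repr(raw))
print(clean)
print(pieces)
```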

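　　The lxml parser is the recommended choice.

　　Node selectors (selecting by tag name) offer only limited filtering, but they are fast.

　　Use find() and find_all() to match a single element or several elements.

　　If you are familiar with CSS, use select() for matching.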


Original article: https://www.cnblogs.com/zivli/p/10845856.html