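The Beautiful Soup Library

Beautiful Soup: the "delicious soup"

An excellent third-party Python library

It parses HTML and XML documents and extracts information from them.

Beautiful Soup accepts whatever markup you give it and parses it into a tree that you can navigate.

How it works: treat any document you hand it as a pot of soup, then let the library cook it.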

1. Installation

pip3 install beautifulsoup4

An HTML page is information wrapped in tags, and the tags are delimited by angle brackets.

>>> import requests
>>> r=requests.get("https://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo=r.text

>>> from bs4 import BeautifulSoup  # bs4 is the package name of the beautifulsoup4 library; import the BeautifulSoup class from it

# The soup variable represents the parsed demo page

>>> soup = BeautifulSoup(demo, "html.parser")  # The first argument is the HTML to parse: a literal string such as 'data', or any variable. The second argument names the parser; here "html.parser" parses demo as HTML.

>>> print(soup.prettify())

Beautiful Soup has successfully parsed the demo page.
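The same steps can be tried without a network connection. The following is a minimal sketch, assuming a short hand-written HTML string (`html_doc` is a hypothetical stand-in for `r.text`, not the real demo page):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.text; this is not the real demo page.
html_doc = (
    "<html><head><title>Demo</title></head>"
    '<body><p>Hello <a href="http://example.com/">link</a></p></body></html>'
)

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string   # text inside <title>
first_link = soup.a["href"]      # href attribute of the first <a> tag
pretty = soup.prettify()         # one tag or string per line

print(title_text)
print(first_link)
print(pretty)
```

Once the string is parsed, every later step works on `soup` and no longer touches the original text.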

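2. Basic elements of the Beautiful Soup library

Importing the library

The Beautiful Soup library is also called the beautifulsoup4 library or the bs4 library.

from bs4 import BeautifulSoup  (imports the BeautifulSoup class from bs4)

import bs4  (imports the whole package, which is useful for checking the types it defines)

Beautiful Soup parses HTML and XML documents. Each document corresponds one-to-one to a tag tree, and the BeautifulSoup class turns that tag tree (which you can think of as a string) into a BeautifulSoup object. In practice the three can be treated as equivalent: HTML document <-> tag tree <-> BeautifulSoup object.

The BeautifulSoup class turns the tag tree into a variable, so operating on that variable is operating on the tag tree.

Put simply, a BeautifulSoup object stands for the entire content of one HTML/XML document.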
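The equivalence between document, tag tree and object can be checked directly. This is a small sketch, assuming a hypothetical one-line document (`markup`) that is already well formed:

```python
from bs4 import BeautifulSoup

markup = "<html><body><p>Hello</p></body></html>"  # hypothetical sample document
soup = BeautifulSoup(markup, "html.parser")

# The whole document is represented by a single BeautifulSoup object.
is_soup = isinstance(soup, BeautifulSoup)

# Serializing the object gives the markup back for this simple input.
round_trip = str(soup)

print(is_soup, round_trip == markup)
```

For messier input the serialized text can differ from the original, because the parser repairs and normalizes the markup.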

Beautiful Soup parsers

Parser                          Usage                               Requirement

Python's built-in HTML parser   BeautifulSoup(mk, 'html.parser')    install the bs4 library

lxml's HTML parser              BeautifulSoup(mk, 'lxml')           pip install lxml

lxml's XML parser               BeautifulSoup(mk, 'xml')            pip install lxml

html5lib's parser               BeautifulSoup(mk, 'html5lib')       pip install html5lib
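Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser. The sketch below shows one way to do that; `make_soup` and its `preferred` argument are hypothetical helper names, not part of bs4:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Try each parser name in order and return (soup, parser_name)."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, parser_name = make_soup("<p>hi</p>")
print(parser_name, soup.p.string)
```

bs4 raises `FeatureNotFound` when the requested parser is missing, so the loop simply moves on to the next name.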

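Basic elements of the BeautifulSoup class

Element            Description

Tag                A tag, the basic unit of information; it opens with <> and closes with </>

Name               The tag's name; the name of <p>...</p> is 'p'. Syntax: <tag>.name

Attributes         The tag's attributes, organized as a dictionary. Syntax: <tag>.attrs; a tag with no attributes returns an empty dictionary

NavigableString    The non-attribute string inside a tag, that is, the text between <> and </>. Syntax: <tag>.string

Comment            A comment inside a tag; Comment is a special kind of NavigableString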

Look at the page's title:

>>> soup.title
<title>This is a python demo page</title>

>>> tag = soup.a  # the page has several <a> tags; soup.a returns only the first one
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

>>> soup.a.name  # the name of the <a> tag, as a string
'a'

>>> soup.a.parent.name  # the name of the <a> tag's parent tag
'p'

>>> soup.a.parent.parent.name
'body'

>>> tag=soup.a
>>> tag.attrs  # the tag's attributes, as a dictionary
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
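The name and attribute lookups above can be collected into one small script. This is a sketch on a hypothetical snippet (`html_doc`), chosen so that it also shows a multi-valued class and a tag without attributes:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not the demo page.
html_doc = '<p><a href="http://example.com/a" class="py1 big" id="link1">First</a><b>bold</b></p>'
soup = BeautifulSoup(html_doc, "html.parser")

tag = soup.a
name = tag.name              # 'a'
href = tag["href"]           # shorthand for tag.attrs["href"]
classes = tag["class"]       # class is multi-valued, so it comes back as a list
missing = tag.get("title")   # .get() returns None instead of raising KeyError
no_attrs = soup.b.attrs      # a tag without attributes has an empty dict

print(name, href, classes, missing, no_attrs)
```

Using `tag.get(...)` is a safer choice than `tag[...]` when an attribute may be absent.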

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'

>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'  # the <b> tag is not printed: .string reaches through a single nested child tag to the text inside it
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>  # a type defined by bs4

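>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")  # <!-- marks the start of a comment
>>> newsoup.b.string  # .string drops the comment markers, so check the type when comments must be told apart from ordinary text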
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
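Since `.string` returns a comment without its `<!-- -->` markers, the type check is the only way to tell it from ordinary text. A minimal sketch, where `visible_text` is a hypothetical helper name:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser"
)


def visible_text(tag):
    """Return tag.string as a plain str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


b_text = visible_text(newsoup.b)   # the <b> tag holds only a comment
p_text = visible_text(newsoup.p)   # the <p> tag holds ordinary text

print(b_text, p_text)
```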

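3. Traversing HTML content with bs4

Basic HTML structure

Downward traversal:

  Attribute        Description

  .contents        List of child nodes; stores all of <tag>'s children in a list

  .children        Iterator over child nodes; like .contents, used to loop over the children

  .descendants     Iterator over all descendant nodes, used to loop over every descendant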

>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents  # a tag's children include string nodes as well as tag nodes; a newline such as '\n' is also a child of <body>
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>

# Iterate over the child nodes
for child in soup.body.children:
    print(child)
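The difference between `.children` and `.descendants` is easier to see on a small tree. The sketch below assumes a hypothetical two-paragraph document and separates tag nodes from string nodes with `isinstance`:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = "<body><p><b>bold</b></p><p>plain</p></body>"  # hypothetical sample
soup = BeautifulSoup(html_doc, "html.parser")

# .children stops at the first level below <body>.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# .descendants walks the whole subtree in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]
descendant_strings = [str(d) for d in soup.body.descendants if not isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
print(descendant_strings)
```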

Upward traversal:

  Attribute        Description

  .parent          The node's parent tag

  .parents         Iterator over the node's ancestor tags, used to loop over the ancestors

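>>> soup = BeautifulSoup(demo, 'html.parser')
>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
body
html
[document]

# Iterating over a tag's ancestors eventually reaches the soup object itself, whose .name is '[document]'. The soup object has no parent of its own (soup.parent is None), so code that walks upward should check for None before reading .name.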
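The ancestor chain and the `None` at the top can be confirmed on a small document. A minimal sketch, assuming a hypothetical one-link page:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p><a href="#">x</a></p></body></html>', "html.parser")

# Names of every ancestor of the first <a>, from nearest to farthest.
ancestor_names = [parent.name for parent in soup.a.parents]

# The BeautifulSoup object sits at the top of the tree and has no parent.
top = soup.parent

print(ancestor_names)
print(top)
```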

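Sibling traversal:

  Attribute             Description

  .next_sibling         The next sibling node in HTML document order

  .previous_sibling     The previous sibling node in HTML document order

  .next_siblings        Iterator over all following sibling nodes in HTML document order

  .previous_siblings    Iterator over all preceding sibling nodes in HTML document order

# The next sibling of the <a> tag is the string ' and '. Although the tree is organized by tags, the NavigableString objects between tags are nodes too, so a node's siblings and children may be NavigableString objects. Do not assume that sibling traversal always returns a tag.
>>> soup.a.next_sibling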
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: '
>>> soup.a.previous_sibling.previous_sibling  # None, so nothing is printed
>>> soup.a.parent
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

# Iterate over the following siblings
>>> for sibling in soup.a.next_siblings:
...     print(sibling)
...
 and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.
# Iterate over the preceding siblings
>>> for sibling in soup.a.previous_siblings:
...     print(sibling)
...
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

>>>
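Because siblings can be strings, code that wants only tags has to filter them. The sketch below uses a hypothetical paragraph with two links and shows two options: an `isinstance` filter and bs4's `find_next_sibling`:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = '<p>Intro <a id="link1">One</a> and <a id="link2">Two</a>.</p>'  # hypothetical
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.a

# All following siblings: the string ' and ', the second <a>, and the string '.'.
following = list(first.next_siblings)

# Keep only the tag nodes.
following_tags = [s for s in following if isinstance(s, Tag)]

# find_next_sibling skips string nodes and returns the next matching tag.
next_tag = first.find_next_sibling("a")

print(following)
print(following_tags)
print(next_tag)
```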

4. HTML formatting and encoding with bs4

>>> soup.prettify()  # a newline is added after every tag
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())  # every tag and its content is shown on its own line
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>>

The prettify method adds newlines around the tags and content of HTML text. It can also be called on an individual tag.

>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
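To compare the compact and the prettified form of a single tag, the following sketch uses a hypothetical one-link snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a href='#'>Basic Python</a></p>", "html.parser")

compact = str(soup.a)         # the tag serialized on one line
pretty = soup.a.prettify()    # the same tag, one item per line

# Strip the indentation so the line structure is easy to compare.
pretty_lines = [line.strip() for line in pretty.splitlines()]

print(compact)
print(pretty)
```

Note that `prettify()` changes whitespace inside the document, so it is best kept for display rather than for producing markup that will be parsed again.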

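bs4 converts any HTML file or string it reads to Unicode and writes output as UTF-8 by default. UTF-8 is an internationally used standard encoding that handles Chinese and other non-English languages well. Because Python 3.x strings are Unicode and its default source encoding is UTF-8, parsing with bs4 causes no encoding trouble.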

>>> soup = BeautifulSoup("<p>中文</p>", "html.parser")
>>> soup.p.string
'中文'
>>> print(soup.p.prettify())
<p>
 中文
</p>
>>>
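Encoding matters mostly when the input is bytes rather than a Python string. The sketch below assumes a hypothetical page delivered as GBK-encoded bytes and passes the encoding explicitly through the `from_encoding` argument:

```python
from bs4 import BeautifulSoup

# Hypothetical input: the same paragraph, but as GBK-encoded bytes.
raw = "<p>中文</p>".encode("gbk")

# Tell bs4 which encoding the bytes use; the tree itself holds Unicode strings.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
text = soup.p.string

# Serialize the tree back to bytes, this time as UTF-8.
utf8_bytes = soup.encode("utf-8")

print(text)
print(utf8_bytes)
```

When `from_encoding` is omitted, bs4 tries to detect the encoding itself, which can guess wrong on short documents.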

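Summary: Beautiful Soup is a library for parsing HTML and XML documents. Import the class with from bs4 import BeautifulSoup, give it a document and a parser, and it returns a BeautifulSoup object that you use to extract information and traverse the document.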

Original article: https://www.cnblogs.com/suitcases/p/11200898.html