
Python third-party library BeautifulSoup4: documentation study notes (6)

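Output

Pretty-printing: the prettify() method formats the Beautiful Soup parse tree and returns it as a Unicode string, with each XML/HTML tag and each string on its own line.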

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

soup = BeautifulSoup(markup,"html.parser")

print(soup.prettify())

"""
<a href="http://example.com/">
 I linked to
 <i>
  example.com
 </i>
</a>
"""

The example above calls prettify() on the soup object; it can also be called on any tag node, for example:

a_tag = soup.a

print(a_tag.prettify())

"""
<a href="http://example.com/">
 I linked to
 <i>
  example.com
 </i>
</a>
"""

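Compact output

If you only want the result as a string and don't care about formatting, call str() on a BeautifulSoup object or a tag object. Note that unicode() exists only in Python 2.x; in Python 3 the distinction between str() and unicode() is gone, and str() already returns Unicode text.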

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

soup = BeautifulSoup(markup,"html.parser")

print(str(soup))

# <a href="http://example.com/">I linked to <i>example.com</i></a>
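As a small sketch of the compact-output options (the variable names here are only illustrative), str() gives a Unicode string, while encode() gives a bytestring in the encoding you name:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

# str() returns the markup as a Unicode string with no added whitespace.
as_text = str(soup)

# encode() returns the same markup as bytes in the requested encoding.
as_bytes = soup.encode("utf-8")

# The same calls work on an individual tag.
inner_text = str(soup.i)

print(as_text)
print(as_bytes)
print(inner_text)
```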

Output formatting

Beautiful Soup converts HTML entities in the input, such as &ldquo;, into the corresponding Unicode characters:

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")

print(str(soup))
# “Dammit!” he said.
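If you want entities back in the output, the formatter argument controls this. A minimal sketch, assuming the built-in "minimal" and "html" formatter names and a made-up sample string:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacr&eacute; bleu! 1 &lt; 2</p>", "html.parser")

# "minimal" (the default) escapes only what is needed for valid markup.
minimal = soup.p.decode(formatter="minimal")

# "html" also converts Unicode characters to named HTML entities where possible.
as_html = soup.p.decode(formatter="html")

print(minimal)  # <p>Sacré bleu! 1 &lt; 2</p>
print(as_html)  # <p>Sacr&eacute; bleu! 1 &lt; 2</p>
```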

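The get_text() method

To get the text contained in a tag, call get_text(). It returns all the text in the tag, including the text of its descendant tags, as a single Unicode string.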

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'

soup = BeautifulSoup(markup,"html.parser")

print(soup.get_text())

# I linked to example.com
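get_text() also accepts a separator and a strip flag, which help when the raw text carries stray newlines. A short sketch using the same markup:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Without arguments, the surrounding newlines are kept.
raw = soup.get_text()

# A separator is placed between the individual text pieces.
joined = soup.get_text("|")

# strip=True trims whitespace from each piece before joining.
stripped = soup.get_text("|", strip=True)

# stripped_strings yields the trimmed pieces one by one.
pieces = list(soup.stripped_strings)

print(repr(raw))       # '\nI linked to example.com\n'
print(repr(joined))    # '\nI linked to |example.com|\n'
print(repr(stripped))  # 'I linked to|example.com'
print(pieces)          # ['I linked to', 'example.com']
```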

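Specifying the parser

If no parser is specified, BeautifulSoup picks the best parser it can find among those installed. Because the set of installed parsers differs from system to system, the same code is not guaranteed to produce the same result everywhere, so it is best to name the parser explicitly. The first argument to BeautifulSoup() is the document to parse, as a string or an open file handle; the second argument specifies how the document should be parsed.

Markup types that can be requested: html, xml, html5
Parsers that can be named: lxml, html5lib, html.parser

Differences between parsers (the lxml library must be installed first)

Parsing as HTML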

soup = BeautifulSoup('<a><b /></a>', "lxml")

print(soup)

# <html><body><a><b></b></a></body></html>

The same document parsed as XML

soup = BeautifulSoup('<a><b /></a>',"xml")

print(soup)

# <?xml version="1.0" encoding="utf-8"?>

# <a><b/></a>

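HTML parsers also differ from one another. If the HTML document is well-formed, the parsers differ only in speed and all return the correct document tree. If the document is not well-formed, different parsers may return different results. In the example below, lxml parses a malformed document and the dangling </p> tag is simply ignored: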

soup = BeautifulSoup("<a></p>", "lxml")

print(soup)

# <html><body><a></a></body></html>

Parsing the same document with html5lib gives a different result

soup = BeautifulSoup("<a></p>","html5lib")

print(soup)

# <html><head></head><body><a><p></p></a></body></html>

Parsing the same document with Python's built-in parser

soup = BeautifulSoup("<a></p>","html.parser")

print(soup)

# <a></a>
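To compare the parsers side by side on one machine, something like the following sketch can help. The helper name parse_with_each is hypothetical (not part of bs4), and parsers that are not installed are simply reported as None:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse the same markup with each named parser (hypothetical helper)."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            results[name] = None
    return results


results = parse_with_each("<a></p>")
for name, output in results.items():
    print(f"{name:12} {output}")
```

Only html.parser ships with Python, so the other two rows depend on what is installed.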

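Encodings

Any HTML or XML document parsed by BeautifulSoup is converted to Unicode. The encoding that was auto-detected is recorded in the .original_encoding attribute of the BeautifulSoup object, and the from_encoding argument can be passed when creating the BeautifulSoup object to specify the encoding explicitly.

Output encoding

When Beautiful Soup writes out a document, the output is UTF-8 by default, regardless of the encoding of the input document. Note that the charset declared in the meta tag below is rewritten accordingly.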
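A minimal sketch of from_encoding and .original_encoding, assuming a short byte string that is known to be ISO-8859-8 (Hebrew); short inputs like this are exactly where auto-detection tends to guess wrong:

```python
from bs4 import BeautifulSoup

# Four bytes that spell a Hebrew word in ISO-8859-8.
markup = b"<h1>\xed\xe5\xec\xf9</h1>"

# Tell Beautiful Soup the encoding instead of letting it guess.
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")

# The encoding that was actually used to decode the document.
print(soup.original_encoding)

# The string is now ordinary Unicode text.
print(soup.h1.string)
```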

markup = b'''
<html>
 <head>
    <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
 </head>
 <body>
    <p>Sacr\xe9 bleu!</p>
 </body>
</html>
'''
soup = BeautifulSoup(markup,"html5lib")

print(soup.prettify())

"""
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
 </head>
 <body>
  <p>
   Sacré bleu!
  </p>
 </body>
</html>
"""

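If you don't want UTF-8 output, pass the desired encoding to prettify().

Comparing objects for equality

Beautiful Soup considers two NavigableString or Tag objects equal when they represent the same HTML or XML markup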
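A short sketch of choosing the output encoding; passing an encoding makes prettify() and encode() return bytes rather than str. The sample strings are illustrative only:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacré bleu!</p>", "html.parser")

# prettify() with an encoding returns a bytestring in that encoding.
pretty_latin1 = soup.prettify("latin-1")

# encode() works the same way on any tag.
latin1 = soup.p.encode("latin-1")  # b'<p>Sacr\xe9 bleu!</p>'
utf8 = soup.p.encode("utf-8")      # b'<p>Sacr\xc3\xa9 bleu!</p>'

# Characters the target encoding cannot represent become numeric entities.
snowman = BeautifulSoup("<b>\N{SNOWMAN}</b>", "html.parser")
ascii_bytes = snowman.b.encode("ascii")  # b'<b>&#9731;</b>'

print(pretty_latin1)
print(latin1, utf8, ascii_bytes)
```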

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"

soup = BeautifulSoup(markup, 'html.parser')

first_b, second_b = soup.find_all('b')

print(first_b == second_b)

# True

To check strictly whether two variables refer to the very same object, use is

print(first_b is second_b)

# False
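To make the rule concrete, here is a small sketch (with made-up markup) showing that equality depends on the tag name, attributes, and contents, and not on position in the document:

```python
from bs4 import BeautifulSoup

markup = '<p><b>pizza</b> <b>pizza</b> <b class="x">pizza</b> <b>pasta</b></p>'
soup = BeautifulSoup(markup, "html.parser")
plain, twin, with_class, other_text = soup.find_all("b")

same_markup = plain == twin        # identical markup, different position
diff_attrs = plain == with_class   # attributes differ
diff_text = plain == other_text    # contents differ
same_object = plain is twin        # two distinct objects in the tree

print(same_markup, diff_attrs, diff_text, same_object)
# True False False False
```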

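Parsing only part of a document

Parsing the whole document just to find the <a> tags in it wastes memory and time. It is faster to ignore everything other than <a> tags from the start.

The SoupStrainer class lets you define which parts of the document are of interest, so the whole document does not have to be parsed before searching; only the parts matched by the SoupStrainer are parsed.

Create a SoupStrainer object and pass it to the BeautifulSoup constructor as the parse_only argument.

SoupStrainer objects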

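from bs4 import BeautifulSoup, SoupStrainer

# Three kinds of SoupStrainer objects
only_a_tags = SoupStrainer("a")
only_tags_with_id_link2 = SoupStrainer(id="link2")
def is_short_string(string):
    # Guard against None before measuring the length
    return string is not None and len(string) < 10

# Pass the function as the string argument
only_short_strings = SoupStrainer(string=is_short_string)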
html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
# <a class="sister" href="http://example.com/elsie" id="link1">
#  Elsie
# </a>
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>
# <a class="sister" href="http://example.com/tillie" id="link3">
#  Tillie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
# Elsie
# ,
# Lacie
# and
# Tillie
# ...
#
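A typical use is pulling out just the links. The sketch below uses a shortened, made-up version of the sample document; note that, per the Beautiful Soup documentation, parse_only is not supported by the html5lib parser:

```python
from bs4 import BeautifulSoup, SoupStrainer

html = (
    '<p class="story">Once upon a time: '
    '<a href="http://example.com/elsie" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" id="link2">Lacie</a></p>'
)

# Only <a> tags (and their contents) are kept in the tree.
links_only = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))

hrefs = [a["href"] for a in links_only.find_all("a")]
print(hrefs)

# Everything else, including the enclosing <p>, was never added to the tree.
print(links_only.find("p"))
```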

See the Beautiful Soup 4.4.0 documentation


Original article: https://www.cnblogs.com/pufa/p/15810389.html