Python third-party library BeautifulSoup4: documentation study notes (3)

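Traversing the document tree

After bs parses an HTML or XML document, the result is a document tree. The top-level node behaves like a tag and contains many child nodes, each of which is either a string or another tag. The sample document below is used to show how to traverse this tree.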
```
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
```
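Child nodes

A child node can be a string or a tag. bs provides many attributes for accessing and iterating over child nodes, but a string has no children of its own, so it cannot be traversed any further.

Navigating by tag name

For example, to get the first a tag in the sample document above, soup.a is enough. If that node has children of its own, as the a tag here has a string child, they can be reached by chaining: soup.a.string returns the string child of the first a tag under the soup object.

Note: accessing a node with the . syntax only returns the first matching tag found in the document.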
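
A minimal sketch of tag-name navigation, using a small stand-in document rather than the sample above (the div wrapper and the variable names are assumptions made for illustration). It shows that attribute access returns the first match, can be chained, and gives None when no such tag exists:
```python
from bs4 import BeautifulSoup

# Stand-in document, not the article's full example.
doc = '<div><a id="link1">Elsie</a><a id="link2">Lacie</a></div>'
soup = BeautifulSoup(doc, "html.parser")

first_link = soup.a        # first <a> found anywhere in the tree
link_text = soup.a.string  # the single string child of that <a>
chained = soup.div.a       # attribute access can be chained
missing = soup.table       # there is no <table> in this document

print(first_link["id"])       # link1
print(link_text)              # Elsie
print(chained is first_link)  # True
print(missing)                # None
```
Because a missing tag gives None, chaining further (for example soup.table.string) would raise AttributeError, so it is worth checking for None first.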

To get all a tags in the document, use the find_all() method of the BeautifulSoup object, which returns the matches as a list:
```
list_a = soup.find_all('a')
print(list_a)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
Traversing with .contents and .children

The .contents attribute returns a tag's child nodes as a list, for example:
```
list_tag = soup.a.contents
print(list_tag)

# ['Elsie']
```
As noted earlier, the BeautifulSoup object itself represents the whole document, so it has child nodes too:
```
list_soup = soup.contents
print(list_soup[0].name)

# html
```
The .children attribute lets you iterate over a tag's child nodes in a for loop:
```
a_tag = soup.a
for i in a_tag.children:
    print(i)

# Elsie
```
.descendants attribute

The .descendants attribute recursively iterates over every descendant of a tag, that is, its children, their children, and so on:
```
for i in soup.head.descendants:
    print(i)

# <title>The Dormouse's story</title>
# The Dormouse's story
```
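To make the difference between direct children and descendants concrete, here is a small sketch on a stand-in document (the markup and variable names are assumptions for illustration). It separates tags from strings with isinstance, which avoids relying on any attribute of the string objects. The div has one direct child but four descendants:
```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Stand-in document: one <p> child that itself holds a string and a <b> tag.
soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")
div = soup.div

direct_children = list(div.children)     # same nodes as div.contents
all_descendants = list(div.descendants)  # walks the whole subtree

tag_names = [node.name for node in all_descendants if isinstance(node, Tag)]
strings = [str(node) for node in all_descendants if not isinstance(node, Tag)]

print(len(direct_children))  # 1
print(len(all_descendants))  # 4
print(tag_names)             # ['p', 'b']
print(strings)               # ['Hello ', 'world']
```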
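.string attribute

If a tag has exactly one child and that child is a NavigableString, .string returns it. If the only child is another tag, .string returns that tag's .string instead, for example: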
```
title_tag = soup.title
print(title_tag.string)
# The Dormouse's story

head_tag = soup.head
print(head_tag)
print(head_tag.string)
# <head><title>The Dormouse's story</title></head>
# The Dormouse's story
```
Note: if a tag has more than one child node (whitespace and newline strings count too), .string returns None.
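
A short sketch of that pitfall, using two stand-in snippets that differ only in whitespace (the variable names are assumptions for illustration). get_text(strip=True) is shown as one possible way to still get the text out when .string is None:
```python
from bs4 import BeautifulSoup

# Same tags, but the second snippet has newlines around <b>.
single = BeautifulSoup("<p><b>only</b></p>", "html.parser")
mixed = BeautifulSoup("<p>\n<b>only</b>\n</p>", "html.parser")

single_string = single.p.string  # <p> has one child, so .string recurses into <b>
mixed_string = mixed.p.string    # <p> has three children: '\n', <b>, '\n'
mixed_text = mixed.p.get_text(strip=True)

print(single_string)          # only
print(mixed_string)           # None
print(len(mixed.p.contents))  # 3
print(mixed_text)             # only
```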

.strings and stripped_strings

If a tag contains multiple strings, use .strings to iterate over them:
```
for stri in soup.strings:
    print(repr(stri))
```
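The section title also names stripped_strings. As a hedged sketch on a stand-in document (markup and variable names are assumptions), the comparison below shows that .strings yields every string including whitespace-only ones, while .stripped_strings strips each string and skips the ones that end up empty:
```python
from bs4 import BeautifulSoup

# Stand-in document with newlines between tags and padding inside the first <p>.
doc = "<div>\n<p> first </p>\n<p>second</p>\n</div>"
soup = BeautifulSoup(doc, "html.parser")

raw = [str(s) for s in soup.strings]
clean = list(soup.stripped_strings)

print(raw)    # ['\n', ' first ', '\n', 'second', '\n']
print(clean)  # ['first', 'second']
```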
Parent nodes

Every tag and every string has a parent node.

.parent

The .parent attribute of a tag or string returns its parent node, for example:
```
tag = soup.title
print(tag.parent)
# <head><title>The Dormouse's story</title></head>
print(soup.parent)
# None
```
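The same attribute works on strings and on the top-level tag. The sketch below uses a stand-in document (an assumption for illustration) to show that a string's parent is the tag that contains it, and that the parent of the html tag is the BeautifulSoup object, whose name is "[document]":
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")

title_text = soup.title.string          # the string 'T'
string_parent = title_text.parent.name  # the tag that contains the string
top_parent = soup.html.parent           # parent of the top-level tag

print(string_parent)       # title
print(top_parent is soup)  # True
print(top_parent.name)     # [document]
print(soup.parent)         # None
```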
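.parents

As the name suggests, .parents iterates over all ancestors of a tag or string, from its direct parent up to the BeautifulSoup object, whose own parent is None: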
```
tag = soup.a
for ele in tag.parents:
    print(ele.name)

# p
# body
# html
# [document]
```
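One possible use of .parents is to describe where a node sits in the tree. The helper path_to below is hypothetical (it is not part of bs4) and the document is a stand-in; it simply joins the ancestor names from the root downwards:
```python
from bs4 import BeautifulSoup


def path_to(node):
    """Hypothetical helper: ancestor names from the root down to the parent."""
    names = [ancestor.name for ancestor in node.parents]
    return " > ".join(reversed(names))


soup = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")

link_path = path_to(soup.a)         # ancestors of the <a> tag
text_path = path_to(soup.a.string)  # ancestors of the string inside it

print(link_path)  # [document] > html > body > p
print(text_path)  # [document] > html > body > p > a
```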
Sibling nodes

Child nodes that share the same parent, and therefore sit at the same level, are called sibling nodes.

.next_sibling and .previous_sibling
```
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

print(sibling_soup.b.next_sibling)
print(sibling_soup.b.previous_sibling)
print(sibling_soup.c.previous_sibling)
print(sibling_soup.c.next_sibling)

# <c>text2</c>
# None
# <b>text1</b>
# None
```
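In documents that contain line breaks between tags, the next sibling of a tag is often a whitespace string rather than the next tag. The sketch below uses a stand-in snippet (an assumption for illustration) to show this, together with find_next_sibling(), which skips ahead to the next matching tag:
```python
from bs4 import BeautifulSoup

# Two <a> tags separated by a newline.
doc = '<p><a id="one">1</a>\n<a id="two">2</a></p>'
soup = BeautifulSoup(doc, "html.parser")
first = soup.a

raw_sibling = first.next_sibling             # the newline string between the tags
tag_sibling = first.find_next_sibling("a")   # the next <a> tag

print(repr(raw_sibling))  # '\n'
print(tag_sibling["id"])  # two
```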
.next_siblings and .previous_siblings

These attributes iterate over the siblings that come after or before the current node, for example:
```
for sibling in soup.a.next_siblings:
    print(repr(sibling))

# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# ';\nand they lived at the bottom of a well.'
```
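Since these iterators yield strings as well as tags, a common need is to keep only the tags. A minimal sketch on a stand-in snippet (markup and variable names are assumptions) that also shows .previous_siblings starting from the nearest sibling and moving backwards:
```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = '<p><a id="l1">A</a>, <a id="l2">B</a> and <a id="l3">C</a>;</p>'
soup = BeautifulSoup(doc, "html.parser")
middle = soup.find("a", id="l2")

# Keep only Tag siblings; the strings ', ', ' and ' and ';' are skipped.
later_ids = [s["id"] for s in middle.next_siblings if isinstance(s, Tag)]
earlier_ids = [s["id"] for s in middle.previous_siblings if isinstance(s, Tag)]

# The first item from .previous_siblings is the closest preceding sibling.
nearest_before = str(next(iter(middle.previous_siblings)))

print(later_ids)             # ['l3']
print(earlier_ids)           # ['l1']
print(repr(nearest_before))  # ', '
```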
Going back and forth

Consider this HTML document:
```
<html>
 <head>...</head>
 <body>
 <a>...</a>
 </body>
</html>
```
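When bs parses this document, it handles it as a series of events: open an <html> tag, open a <head> tag, add the contents of head, close the <head> tag, then open a <body> tag, and so on.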
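
That parse order can be observed by following .next_element from the first tag until it runs out. This is only a sketch on a compact stand-in document (an assumption; it has no whitespace between tags so the sequence stays short):
```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = "<html><head><title>T</title></head><body><a>link</a></body></html>"
soup = BeautifulSoup(doc, "html.parser")

order = []
node = soup.html
while node is not None:
    # Record the tag name for tags and the text itself for strings.
    order.append(node.name if isinstance(node, Tag) else str(node))
    node = node.next_element

print(order)  # ['html', 'head', 'title', 'T', 'body', 'a', 'link']
```
Each object appears at the point where the parser first met it, so a tag always comes before its own contents.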

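.next_element and .previous_element

.next_element points to the object (a tag or a string) that was parsed immediately after the current one. The result is sometimes the same as .next_sibling, but most of the time it is different.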
```
last_a_tag = soup.find("a", id="link3")
print(repr(last_a_tag.next_element))
print(repr(last_a_tag.next_sibling))

# 'Tillie'
# ';\nand they lived at the bottom of a well.'
```
.previous_element is the opposite of .next_element: it points to the object that was parsed immediately before the current one.
```
last_a_tag = soup.find("a", id="link3")
print(repr(last_a_tag.previous_element))
print(repr(last_a_tag.previous_element.next_element))

# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
.next_elements and .previous_elements

.next_elements and .previous_elements iterate over the objects that were parsed after or before the current element:
```
last_a_tag = soup.find("a", id="link3")
for element in last_a_tag.next_elements:
    print(repr(element))

# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n'
# <p class="story">...</p>
# '...'
# '\n'
```
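As with siblings, the element iterators mix tags and strings. The sketch below, on a stand-in document (an assumption for illustration), filters .next_elements down to tags and also uses find_next(), which returns the first later element that matches:
```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = "<div><p>one <b>bold</b></p><p>two</p></div>"
soup = BeautifulSoup(doc, "html.parser")
first_p = soup.p

# Everything parsed after the first <p>: 'one ', <b>, 'bold', <p>, 'two'.
later_tags = [n.name for n in first_p.next_elements if isinstance(n, Tag)]
next_p = first_p.find_next("p")

print(later_tags)     # ['b', 'p']
print(next_p.string)  # two
```
Note that the b tag shows up in the result even though it sits inside first_p: .next_elements follows parse order, so it walks into a tag's own contents before moving on.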

Original article: https://www.cnblogs.com/pufa/p/15541859.html
