Learning the BeautifulSoup Scraping Library (Part 3)

Traversing the document tree:

1. Finding child nodes

.contents

A tag's .contents attribute returns the tag's direct child nodes as a list.

print soup.body.contents

print type(soup.body.contents)

Output:

[u' ', The Dormouse's story, u' ', Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well., u' ', ..., u' ']

<type 'list'>
[Finished in 0.2s]
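To make the list behaviour concrete, here is a minimal sketch. It assumes Python 3 (so print is a function, unlike the Python 2 snippets in this post), the bs4 package, and the standard-library html.parser. The HTML string is a small hypothetical snippet, not the document parsed above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
html = "<body><p>one</p><p>two</p>tail</body>"
soup = BeautifulSoup(html, "html.parser")

children = soup.body.contents    # a plain Python list of direct children
print(type(children))            # <class 'list'>
print(len(children))             # 3
print(children[0])               # <p>one</p>
print(repr(children[2]))         # 'tail' (text nodes are children too)
```

Because the result is an ordinary list, indexing, slicing, and len() all work on it directly.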

.children

It does not return a list, but we can iterate over it to get all the child nodes.

If we print it, we can see that it returns a list iterator object:

print soup.body.children

<listiterator object at 0x0294DE90>

How do we get at its contents? Just loop over it:

for child in soup.body.children:

    print child

Output:

The Dormouse's story

Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.

...

[Finished in 0.2s]
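The difference between an iterator and a list can be sketched as follows, assuming Python 3, bs4, and a small hypothetical HTML string. The exact iterator type may differ between bs4 versions, so the sketch only relies on it being iterable once.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
html = "<body><p>one</p><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")

kids = soup.body.children
# An iterator cannot be indexed; loop over it or wrap it in list().
names = [child.name for child in kids]
print(names)                 # ['p', 'p']

# The iterator is now exhausted, so a second pass yields nothing.
leftover = list(kids)
print(leftover)              # []
```

If the children are needed more than once, use .contents or call list() on .children up front.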

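2. All descendant nodes

.descendants

The .contents and .children attributes only include a tag's direct children. The .descendants attribute iterates recursively over all of a tag's descendants. As with .children, we need to loop over it to get the contents.

for child in soup.descendants:
    print child

Running this prints every node: first the outermost html tag, then the head tag and everything inside it, peeled off one level at a time, and so on.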
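A short comparison of direct children against all descendants, as a sketch under the assumption of Python 3, bs4, and a hypothetical nested snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet used only for illustration.
html = "<div><p>Hello <b>world</b></p></div>"
soup = BeautifulSoup(html, "html.parser")

direct = list(soup.div.children)       # only the <p> tag
nested = list(soup.div.descendants)    # every node below <div>, in document order

print(len(direct))   # 1
print(len(nested))   # 4
for node in nested:
    kind = "tag" if isinstance(node, Tag) else "text"
    print(kind, repr(str(node)))
```

The loop visits the p tag, then its text 'Hello ', then the b tag, then the text 'world': each tag appears before the nodes nested inside it.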

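3. Node content

.string

If a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one child tag, .string returns the text inside that innermost tag as well.

If a tag contains multiple child nodes, it cannot tell which child's content .string should refer to, so .string is None.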

print soup.head.string
print soup.title.string
print soup.body.string

#The Dormouse's story
#The Dormouse's story
#None
[Finished in 0.2s]
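The three cases for .string can be reproduced on a small hypothetical snippet. This sketch assumes Python 3 and bs4 with the standard-library parser.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <p> has a single child tag,
# the second <p> has two child nodes (text plus a tag).
html = "<div><p><b>only</b></p><p>a<i>b</i></p></div>"
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("p")

print(first.b.string)    # only  (tag containing just text)
print(first.string)      # only  (single child tag, so .string looks inside it)
print(second.string)     # None  (two children, so the result is ambiguous)
```

When .string is None but the text is still wanted, iterating over .strings or calling get_text() is the usual alternative.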

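4. Multiple strings

.strings

This attribute yields every string inside a tag, but you have to loop over it to get them:

for string in soup.strings:

    print repr(string)

.stripped_strings

The output may contain a lot of spaces or blank lines. .stripped_strings strips leading and trailing whitespace from each string and skips strings that consist only of whitespace:

for string in soup.stripped_strings:
    print repr(string)

Output: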

u"The Dormouse's story"
u"The Dormouse's story"
u'Once upon a time there were three little sisters; and their names were'
u','
u'Lacie'
u'and'
u'Tillie'
u'; and they lived at the bottom of a well.'
u'...'
[Finished in 0.2s]

5. Parent node

.parent

print soup.p.parent.name

print soup.head.title.string.parent.name

#body

#title
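The same lookups can be tried on a small hypothetical document, including what happens at the top of the tree. The sketch assumes Python 3 and bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
html = "<html><head><title>T</title></head><body><p>x</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

p_parent = soup.p.parent.name                  # tag -> enclosing tag
text_parent = soup.title.string.parent.name    # text node -> tag holding it
top_parent = soup.html.parent.name             # top-level tag -> the soup object

print(p_parent)      # body
print(text_parent)   # title
print(top_parent)    # [document]
print(soup.parent)   # None (the soup object itself has no parent)
```

Strings have parents too: the parent of a text node is the tag that directly contains it.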

6. Sibling nodes, next and previous elements, and so on are omitted here.

Original article: https://www.cnblogs.com/yu2000/p/6847039.html