Python crawlers: the BeautifulSoup HTML parser


BeautifulSoup is a library for parsing, traversing, and maintaining the "tag tree" of an HTML document.

Installation

sudo pip install beautifulsoup4

Usage

# coding: UTF-8
import requests

url = "http://www.baidu.com"
r = requests.get(url)
# Use the encoding detected from the content so the text decodes correctly
r.encoding = r.apparent_encoding
print(r.text)

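Result: (screenshot in the original post)

We have written the code above before: it fetches the source of the Baidu home page. Now we will use that source to learn how to use the beautifulsoup library.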

soup.prettify()

Beautifies (formats) the source code.

# coding: UTF-8
import requests
from bs4 import BeautifulSoup

url = "http://www.baidu.com"
r = requests.get(url)
r.encoding = r.apparent_encoding
# Save the page source in the variable s
s = r.text
soup = BeautifulSoup(s, "html.parser")
s = soup.prettify()
print(s)

Result: (it really does look much better)
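The effect of prettify() can also be seen without a network request. A minimal sketch, assuming a small hypothetical HTML string (not from the original post) in place of the Baidu source:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line sample markup, standing in for a downloaded page.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
pretty = soup.prettify()

# The single input line is spread over several indented lines.
print(len(html.splitlines()), "line in,", len(pretty.splitlines()), "lines out")
print(pretty)
```

Each tag and each piece of text ends up on its own line, indented according to its depth in the tag tree.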

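from bs4 import BeautifulSoup

Imports the BeautifulSoup class.

The code builds a BeautifulSoup object named soup. Its arguments are the page source and "html.parser", which tells BeautifulSoup to parse the input as HTML with Python's built-in parser.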

soup=BeautifulSoup(s,"html.parser")

The code above builds the BeautifulSoup object from source code held in a string. You can also create the object directly from a local HTML file, like this:

soup=BeautifulSoup(open("a.html"),"html.parser")
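Both construction styles can be compared side by side. A minimal sketch, assuming a hypothetical HTML snippet and a temporary file standing in for a.html (neither comes from the original post); the with block is used here so the file is closed after parsing:

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample markup, used instead of a real page.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 1) Build the soup from a string.
soup_from_string = BeautifulSoup(html, "html.parser")

# 2) Build the soup from a local file (a temporary file stands in for a.html).
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(html)
    path = tmp.name

with open(path, encoding="utf-8") as f:
    soup_from_file = BeautifulSoup(f, "html.parser")
os.remove(path)

print(soup_from_string.title.string)  # Demo
print(soup_from_file.title.string)    # Demo
```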

Basic elements

Examples

title
The title tag.

print(soup.title)

Result: (screenshot in the original post)

a
The link tag.

print(soup.a)

Result: (screenshot in the original post)

Tip: when the document contains several tags with that name, only the first one is returned.
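The first-match behaviour can be checked on a small document. A minimal sketch, assuming hypothetical sample markup; find_all() is not covered in the original post and is shown here only as the usual way to get every match:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two links.
html = '<p><a href="/first">First</a> <a href="/second">Second</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a             # attribute access: only the first <a> tag
all_links = soup.find_all("a")  # every <a> tag, in document order

print(first_link["href"])                     # /first
print([link["href"] for link in all_links])   # ['/first', '/second']
```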
name
The name of the tag.

print(soup.a.name)

parent
Returns the tag's parent tag, which is a bs4.element.Tag object.

print(soup.a.parent.name)
print(soup.a.parent.parent.name)
print(soup.a.parent.parent.parent.parent.parent.name)
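Chained parent lookups are easier to follow on a document whose nesting is known. A minimal sketch, assuming a hypothetical four-level snippet rather than the Baidu page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: html > body > div > a
html = "<html><body><div><a href='#'>link</a></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.a.parent.name)                # div
print(soup.a.parent.parent.name)         # body
print(soup.a.parent.parent.parent.name)  # html

# The parent of <html> is the BeautifulSoup object itself,
# and that object has no parent at all.
print(soup.html.parent.name)  # [document]
print(soup.parent)            # None
```

Because the top of the tree has parent None, a chain that climbs too far raises AttributeError on the next .name access, so long chains are only safe when the nesting depth is known.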

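attrs
Returns the tag's attributes as a dictionary.

print(soup.a.attrs)

To get a single value from the dictionary, use:

print(soup.a.attrs["class"])

class is one of the dictionary's keys, and this returns the corresponding value.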
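A short sketch of working with the attribute dictionary, assuming a hypothetical link tag (the class names and URL are made up for illustration). Note that class can hold several values, so BeautifulSoup returns it as a list, while href comes back as a plain string:

```python
from bs4 import BeautifulSoup

# Hypothetical link with two classes and an href.
html = '<a class="mnav primary" href="http://example.com/news">News</a>'
soup = BeautifulSoup(html, "html.parser")

attrs = soup.a.attrs
print(isinstance(attrs, dict))  # True
print(attrs["class"])           # ['mnav', 'primary']
print(attrs["href"])            # http://example.com/news

# A missing key raises KeyError with attrs[...]; Tag.get() returns None instead.
print(soup.a.get("id"))         # None
```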

print(soup.a.attrs["href"])

This gets the link.

Tip: in Python you can use type() to get the type of a variable.

string
Gets the string between the tag's opening and closing angle-bracket tags.

print(soup.a.string)

The result is an object of type bs4.element.NavigableString.

For comparison, the original post includes a screenshot of the a tag at this point.
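The type() tip can be combined with string to see the difference between a tag and its text. A minimal sketch, assuming hypothetical sample markup; the last line illustrates that string is None when a tag has more than one child:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one simple link, and a div with two children.
html = '<p><a href="#">News</a></p><div><b>one</b><b>two</b></div>'
soup = BeautifulSoup(html, "html.parser")

text = soup.a.string
print(text)            # News
print(type(text))      # <class 'bs4.element.NavigableString'>
print(type(soup.a))    # <class 'bs4.element.Tag'>

# NavigableString is a str subclass, so ordinary string methods work.
print(text.upper())    # NEWS

# With several children there is no single string to return.
print(soup.div.string)  # None
```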

Recap

Traversing the HTML

contents

A list of the tag's child nodes (type list).

soup = BeautifulSoup("<body><p>1111</p><p>2222</p></body>", "html.parser")
print(soup.body.contents)

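To get one element of the list:

# Named items rather than list to avoid shadowing the built-in list type
items = soup.body.contents
print(items[1])

Tip: indexes start at 0, so the elements are items[0], items[1], and so on.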

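children

Returns the tag's child nodes as an iterator rather than a list (the original Python 2 run reported the type as listiterator).

for child in soup.body.children:
    print(child)

This loops over the child nodes.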
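The practical difference between contents and children is that a list can be indexed and reused, while an iterator can only be consumed once. A minimal sketch using the same two-paragraph document as above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>1111</p><p>2222</p></body>", "html.parser")

# contents is a real list: it has a length and supports indexing.
contents = soup.body.contents
print(len(contents))   # 2
print(contents[1])     # <p>2222</p>

# children is an iterator: loop over it (or pass it to list()) to read it.
children = soup.body.children
texts = [child.string for child in children]
print(texts)           # ['1111', '2222']

# After one full pass the same iterator is exhausted.
print(list(children))  # []
```

To walk the children a second time, read soup.body.children again to get a fresh iterator.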

parent

The node's parent tag.

parents

An iterator over the node's ancestor tags.
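Iterating over parents walks from the nearest ancestor up to the top of the tree. A minimal sketch, assuming a hypothetical nested snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: html > body > div > a
html = "<html><body><div><a href='#'>link</a></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Collect the name of every ancestor of the first <a> tag.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['div', 'body', 'html', '[document]']
```

The last entry, [document], is the BeautifulSoup object itself, which sits above the html tag.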

Sibling traversal

(between tags that share the same parent node)
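The original post does not name the attributes for this kind of traversal. A minimal sketch using the sibling attributes that Beautifulsoup provides (next_sibling, previous_sibling, and the iterator next_siblings), on a hypothetical three-paragraph document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<body><p>1111</p><p>2222</p><p>3333</p></body>", "html.parser"
)

first, middle, last = soup.body.contents

print(middle.previous_sibling)  # <p>1111</p>
print(middle.next_sibling)      # <p>3333</p>

# next_siblings iterates over every later node under the same parent.
print([str(node) for node in first.next_siblings])

# At either end of the sibling list there is nothing further to return.
print(first.previous_sibling)   # None
print(last.next_sibling)        # None
```

In real pages the whitespace between tags is itself a text node, so a sibling is often a NavigableString such as a newline rather than the next tag.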

Summary