Python BeautifulSoup

Fetching the HTML

from urllib.request import urlopen

html = urlopen('http://example.com')

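Getting a BeautifulSoup object (the complete DOM tree)

from bs4 import BeautifulSoup

# Outdated: no parser is named, so bs4 warns and picks whichever parser is installed.
# Kept commented out because the response can only be read once.
# bsObj = BeautifulSoup(html.read())
# Current practice: name the parser explicitly (the third argument is optional)
bsObj = BeautifulSoup(html, 'html.parser', from_encoding='utf8')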
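
For experimenting without a network request, the same constructor also accepts a plain string. The following is a minimal sketch, assuming bs4 is installed; the markup and the name SAMPLE_HTML are made up for illustration and are not part of the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used in place of a downloaded page
SAMPLE_HTML = """
<html>
  <head><title>Demo page</title></head>
  <body><h1>Hello</h1><p>First paragraph</p></body>
</html>
"""

# Naming the parser explicitly keeps the result the same on every machine
bsObj = BeautifulSoup(SAMPLE_HTML, 'html.parser')

print(type(bsObj).__name__)    # BeautifulSoup
print(bsObj.title.get_text())  # Demo page
```

The resulting bsObj behaves like the one built from urlopen, so later snippets can be tried against it.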

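Getting a Tag object (attribute access can be chained as deep as needed)

# A tag that does not exist comes back as None, and chaining further from None
# raises AttributeError, so wrap chained access in try (or check for None first)
body = bsObj.body  # a Tag holds everything between its opening and closing tags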
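
The advice about guarding tag access can be sketched as follows. This is one possible approach, not the only one; the helper name first_heading_text and both sample documents are hypothetical.

```python
from bs4 import BeautifulSoup

def first_heading_text(soup):
    """Return the text of body > h1, or None if a tag along the way is missing."""
    try:
        # A missing tag comes back as None, so the next attribute
        # access on it raises AttributeError
        return soup.body.h1.get_text()
    except AttributeError:
        return None

with_h1 = BeautifulSoup('<html><body><h1>Title</h1></body></html>', 'html.parser')
without_h1 = BeautifulSoup('<html><body><p>No heading</p></body></html>', 'html.parser')

print(first_heading_text(with_h1))     # Title
print(first_heading_text(without_h1))  # None
```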

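find and find_all (available on both kinds of object above)

findAll(tag, attributes, recursive, text, limit, keywords)
# tag: the tag name to match
# attributes: a dictionary of attributes to match
# recursive: search all descendants (the default) or only direct children
# text: match on content with text='keyword', no need for the first two arguments (bs4 4.4.0 and later call this argument string)
# limit: return only the first n matches
# keywords: attribute filters as keyword arguments; a redundant convenience, since findAll(id="text") is equivalent to bsObj.findAll("", {"id":"text"})
find(tag, attributes, recursive, text, keywords)
# find returns only the first match, so it has no limit; findAll is the older alias of find_all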
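
To see what each parameter does, here is a small sketch run against made-up markup. It assumes bs4 4.4.0 or later (for the string argument); the HTML and variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<div id="outer">
  <p class="note">one</p>
  <section><p class="note">two</p></section>
  <p id="text">three</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('div')

all_p = outer.find_all('p')                      # searches every descendant
direct_p = outer.find_all('p', recursive=False)  # direct children only
first_two = outer.find_all('p', limit=2)         # stops after two matches
by_keyword = soup.find_all(id='text')            # keyword-argument form
by_attrs = soup.find_all(attrs={'id': 'text'})   # dictionary form
by_string = soup.find_all(string='two')          # returns text nodes, not tags

print([p.get_text() for p in all_p])     # ['one', 'two', 'three']
print([p.get_text() for p in direct_p])  # ['one', 'three']
print(by_keyword == by_attrs)            # True
```

Note that matching on string returns the text nodes themselves; use each node's parent attribute to get back to the enclosing tag.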

Finding tags (when an attribute name is a Python keyword, such as class, use class_ or the string key 'class')

# Matches <span class="red">
nameList = bsObj.find_all(name='span', attrs={'class': 'red'})
# The result is a bs4 ResultSet (a list subclass), for example
# [<span class="red"> middle text 1 </span>, <span class="red"> middle text 2 </span>]
for name in nameList:
    print(name.get_text())  # the text between the tags
    print('-------------------')
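
The two spellings for the class attribute can be compared side by side. A minimal sketch with invented markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = ('<p><span class="red">first</span>'
        '<span class="green">skip</span>'
        '<span class="red">second</span></p>')
soup = BeautifulSoup(html, 'html.parser')

# class is a reserved word in Python, so bs4 accepts class_ as a keyword argument
via_keyword = soup.find_all('span', class_='red')
# The dictionary form uses the plain attribute name as a string key
via_dict = soup.find_all('span', attrs={'class': 'red'})

texts = [span.get_text() for span in via_keyword]
print(texts)                    # ['first', 'second']
print(via_keyword == via_dict)  # True
```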

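Children, descendants, and siblings (using find makes code more robust; this section only illustrates the family relationships)

# Children: direct children only
for child in bsObj.children:
    print(child)

# Descendants: every nested node
for descendant in bsObj.descendants:
    print(descendant)

# Siblings: next_sibling and previous_sibling give one node; next_siblings and previous_siblings give all of them
for next_sibling in bsObj.body.h1.next_siblings:
    print(next_sibling)

# Parents: parent gives one node, parents gives every ancestor
for parent in bsObj.body.h1.parents:
    print('===========')
    print(parent)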
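
The difference between these iterators is easier to see on a tiny document. The sketch below uses made-up markup and filters out text nodes with isinstance so that only tag names are collected.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup for illustration
html = '<html><body><h1>Title</h1><p>one</p><p>two <b>bold</b></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')
body = soup.body

# children: direct children only
child_names = [c.name for c in body.children if isinstance(c, Tag)]
# descendants: every nested node (text nodes are skipped here)
descendant_names = [d.name for d in body.descendants if isinstance(d, Tag)]
# next_siblings: nodes that follow h1 at the same level
sibling_names = [s.name for s in body.h1.next_siblings if isinstance(s, Tag)]
# parents: walk upward until the document root
parent_names = [p.name for p in body.h1.parents]

print(child_names)       # ['h1', 'p', 'p']
print(descendant_names)  # ['h1', 'p', 'p', 'b']
print(sibling_names)     # ['p', 'p']
print(parent_names[:2])  # ['body', 'html']
```

The last entry yielded by parents is the BeautifulSoup object itself, which sits above the html tag.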

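Regular expressions (a compiled regex can be passed wherever a filter argument is accepted: tag name, attribute value, or text)

import re

# A regex in the first position is tested against tag names, not against raw markup,
# so the class filter goes in the attribute dictionary
bsObj.find_all(re.compile('^span$'), {'class': 'red'})
# Dots are escaped so that they match literally
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})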
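
Both uses of a regex, as an attribute filter and as a tag-name filter, can be checked against a small fragment. The markup below is invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<body>
  <img src="../img/gifts/img1.jpg">
  <img src="../img/gifts/logo.jpg">
  <img src="../img/gifts/img2.jpg">
  <b>bold</b>
</body>
"""
soup = BeautifulSoup(html, 'html.parser')

# Regex on an attribute value; the dots are escaped so they match literally
pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')
gift_images = soup.find_all('img', {'src': pattern})
sources = [img['src'] for img in gift_images]

# Regex as the first argument is tested against tag names
b_tags = [tag.name for tag in soup.find_all(re.compile('^b'))]

print(sources)  # ['../img/gifts/img1.jpg', '../img/gifts/img2.jpg']
print(b_tags)   # ['body', 'b']
```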

Getting a Tag's attributes

myTag.attrs  # a dictionary of every attribute on the tag, e.g. myTag.attrs['src']
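
A short sketch of the common ways to read attributes, using a made-up anchor tag. Multi-valued attributes such as class come back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
soup = BeautifulSoup('<a id="home" class="nav main" href="/index.html">Home</a>',
                     'html.parser')
myTag = soup.a

print(myTag.attrs['class'])  # ['nav', 'main']
print(myTag['href'])         # /index.html  (shorthand for myTag.attrs['href'])
print(myTag.get('title'))    # None  (get avoids a KeyError for a missing attribute)
```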

Original article: https://www.cnblogs.com/cenzhongman/p/7357508.html