Python Web Crawling and Information Extraction [BeautifulSoup] (Unit 4)


1 Introduction

from bs4 import BeautifulSoup

soup=BeautifulSoup(data,'html.parser')

2 Basic Elements

Beautiful Soup is a library for parsing, traversing, and maintaining a tag tree.

<p> ... </p>    a tag pair (an opening tag and its closing tag)

A tag has a name and may carry attributes.

The library is referred to as beautifulsoup4 or bs4.

from bs4 import BeautifulSoup

import bs4

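The BeautifulSoup class

An HTML document (a string), its tag tree, and a BeautifulSoup object correspond to one another: the HTML string is converted into a BeautifulSoup object.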

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

Note: parsers (4 kinds)

html.parser    requires only the bs4 library

lxml           pip install lxml

xml            same as above (pip install lxml)

html5lib       pip install html5lib
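
The parser is chosen by the second argument to BeautifulSoup. A minimal sketch, assuming only the built-in html.parser is available; the sample markup is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<html><body><p class='title'>Hello</p></body></html>"

# "html.parser" ships with Python, so nothing beyond bs4 is needed.
soup = BeautifulSoup(html, "html.parser")
print(soup.p.string)  # Hello

# The other parsers are selected the same way once they are installed:
# BeautifulSoup(html, "lxml"), BeautifulSoup(html, "xml"),
# BeautifulSoup(html, "html5lib")
```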

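Basic elements of the BeautifulSoup class

Tag              a tag, opened with <> and closed with </>

Name             the tag's name; for <p>...</p> the name is 'p'. Format: <tag>.name

Attributes       the tag's attributes, organized as a dictionary. Format: <tag>.attrs

NavigableString  the non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string

For example, soup.a.string returns the string inside the first <a> tag.

Comment          the comment part of a string inside a tag

.string also returns this type when the tag's only content is a comment.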
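
The five element types can be inspected directly. The sketch below uses made-up markup (the URL and id are placeholders) and is only meant to show which type each accessor returns:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: one <p> with text, one <a> holding only a comment.
html = ('<p class="title"><b>The demo page</b></p>'
        '<a href="http://example.com/1" id="link1"><!--a comment--></a>')
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                 # Tag
print(tag.name)              # p
print(tag.attrs)             # {'class': ['title']}

text = tag.string            # NavigableString
print(text)                  # The demo page

comment = soup.a.string      # Comment (a subclass of NavigableString)
print(type(comment).__name__)  # Comment

# Because a Comment prints like ordinary text, check the type
# before treating .string as visible page content.
is_visible_text = not isinstance(comment, Comment)
```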

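3 Traversing the Tag Tree

.contents       list of the tag's child nodes

.children       iterator over the child nodes

.descendants    iterator over all descendant nodes

Child nodes include not only tags but also string nodes (NavigableString), such as '\n'.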

soup.body.contents
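
A small sketch of downward traversal, using made-up markup. It shows that the newline strings between tags count as child nodes, which is why the tag filter is needed:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with newlines between the tags.
html = "<body>\n<p>one</p>\n<p><b>two</b></p>\n</body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

contents = body.contents
print(len(contents))  # 5: three '\n' strings plus two <p> tags

# .children iterates over the same direct children as .contents.
child_tags = [c.name for c in body.children if isinstance(c, Tag)]
print(child_tags)  # ['p', 'p']

# .descendants walks the whole subtree, depth first.
descendant_tags = [d.name for d in body.descendants if isinstance(d, Tag)]
print(descendant_tags)  # ['p', 'p', 'b']
```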

.parent         the node's parent tag

.parents        iterator over the node's ancestors
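
Upward traversal can be sketched the same way. The markup is made up; note that the last ancestor is the BeautifulSoup object itself, whose name is '[document]' and whose parent is None:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = "<html><body><p><b>text</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

b = soup.b
print(b.parent.name)  # p

# Walk from the nearest ancestor up to the document root.
ancestors = [parent.name for parent in b.parents]
print(ancestors)  # ['p', 'body', 'html', '[document]']

print(soup.parent)  # None
```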

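Sibling traversal (nodes are returned according to their order in the HTML text)

Sibling traversal only takes place among nodes under the same parent tag.

.next_sibling        the next sibling node

.previous_sibling    the previous sibling node

.next_siblings       iterator over all following siblings

.previous_siblings   iterator over all preceding siblings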
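
A sketch of sibling traversal on made-up markup. A sibling is not necessarily a tag: here the node right after the first <a> is a plain string. The backward iterator starts from the nearest sibling, so it yields nodes in reverse document order:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: three tags and one string under the same <p>.
html = "<p><a id='1'>A</a> and <a id='2'>B</a><b>C</b></p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.a
print(repr(first.next_sibling))   # ' and '
print(first.previous_sibling)     # None, since nothing precedes the first <a>

following_tags = [s.name for s in first.next_siblings if isinstance(s, Tag)]
print(following_tags)  # ['a', 'b']

last = soup.b
preceding_ids = [s["id"] for s in last.previous_siblings if isinstance(s, Tag)]
print(preceding_ids)  # ['2', '1']
```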

4 Displaying HTML Content with bs4

from bs4 import BeautifulSoup

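soup = BeautifulSoup(demo, "html.parser")  # load the document with the chosen parser

soup.prettify()  # soup is a BeautifulSoup object, used to parse or traverse the HTML

# The prettify() method is very handy:

# it puts each tag and string on its own line.

print(soup.prettify())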
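
To see the effect, compare the one-line input with the prettified output. A minimal sketch, assuming a made-up `demo` string in place of a downloaded page:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document standing in for a fetched page.
demo = "<html><body><p>Hello</p></body></html>"
soup = BeautifulSoup(demo, "html.parser")

pretty = soup.prettify()
print(pretty)

# prettify() also works on a single tag.
pretty_p = soup.p.prettify()
print(pretty_p)
```

The input contains no line breaks, while the prettified string spreads the tags and text over several indented lines.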


Original article: https://www.cnblogs.com/sfzyk/p/6516683.html
