Topics covered:
1. BeautifulSoup in detail
2. PyQuery overview
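1. Installing the BeautifulSoup library for Python 3: pip3 install beautifulsoup4
2. BeautifulSoup is a flexible, convenient, and efficient HTML parsing library. It supports several parsers and lets you extract information from a web page without writing regular expressions.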
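3. Basic usage:
html = """
<html><head><title>The Dormouse's story</title>
"""
from bs4 import BeautifulSoup
# 'lxml' selects the lxml parser, which is installed separately (pip3 install lxml)
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # pretty-printed document; the parser fills in the missing closing tags
print(soup.title.string)  # text inside <title>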
4. Selecting elements with tag selectors
html = """
<html><head><title>The Dormouse's story</title>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.head)
print(soup.p)  # returns only the first <p> tag
Getting the tag name
print(soup.title.name)
Getting attributes
print(soup.p.attrs['name'])
print(soup.p['name'])  # shorthand for the line above
Getting the text content
print(soup.p.string)
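Child nodes: .contents returns the direct children as a list
print(soup.p.contents)
Child nodes: .children returns an iterator over the same direct children, and enumerate() pairs each one with an index. To walk all descendants, iterate over .descendants instead.
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)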
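The difference between direct children and descendants is easier to see side by side. The following is a minimal sketch, assuming a made-up one-line document (not the sample HTML above) and the standard-library 'html.parser' instead of 'lxml' so that it runs without extra packages.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for this illustration.
html = '<p class="title"><b>The Dormouse\'s story</b><i>end</i></p>'

# html.parser ships with Python; the notes above use 'lxml' instead.
soup = BeautifulSoup(html, 'html.parser')

# Direct children only: the <b> tag and the <i> tag.
children = list(soup.p.children)

# Every node below <p>: <b>, its text, <i>, its text.
descendants = list(soup.p.descendants)

print(len(children), len(descendants))
print(soup.p.contents == children)
```

Both .children and .descendants are lazy, which is why they are wrapped in list() here before taking their length.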
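Parent node (this example and the ones below assume the document contains an <a> tag; the sample HTML above does not)
print(soup.a.parent)
Ancestor nodes
print(list(enumerate(soup.a.parents)))
Sibling nodes
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))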
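Since the sample HTML in these notes has no <a> tag, here is a self-contained sketch of the parent, ancestor, and sibling lookups. The markup and the ids link1 to link3 are made up for illustration, and 'html.parser' is used in place of 'lxml' so that no <html> or <body> wrapper is added around the fragment.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; written without whitespace between tags
# so that the siblings are all tags rather than text nodes.
html = ('<div id="box"><p class="story">'
        '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
        '</p></div>')

soup = BeautifulSoup(html, 'html.parser')
second = soup.find('a', id='link2')

parent_name = second.parent.name                           # immediate parent
ancestor_names = [tag.name for tag in second.parents]      # parent, grandparent, ..., document
next_ids = [tag['id'] for tag in second.next_siblings]     # siblings after this tag
previous_ids = [tag['id'] for tag in second.previous_siblings]  # siblings before this tag

print(parent_name, ancestor_names, next_ids, previous_ids)
```

In real pages the sibling iterators usually also yield whitespace text nodes between tags, so indexing them with ['id'] as done here only works because the sample markup is compact.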
5. Standard selectors
find_all(name, attrs, recursive, text, **kwargs) searches the document by tag name, attributes, or text content.
name
print(soup.find_all('ul'))           # all <ul> tags
print(type(soup.find_all('ul')[0]))  # each result is a bs4.element.Tag
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))         # searches can be nested on each result
attrs
print(soup.find_all(attrs={"id": "list-1"}))
print(soup.find_all(attrs={"name": "elements"}))
print(soup.find_all(id="list-1"))        # keyword shorthand for the attrs form
print(soup.find_all(class_="element"))   # class is a Python keyword, so use class_
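find(name, attrs, recursive, text, **kwargs) returns only the first matching element (or None if nothing matches), whereas find_all returns all matching elements.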
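The find_all and find calls above refer to a document that is not shown in these notes. The sketch below supplies one so the calls can be run end to end; the markup is hypothetical, chosen only to match the ids and class names used above, and 'html.parser' stands in for 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup matching the ids and classes used in the notes.
html = '''
<div class="panel">
  <ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Jay</li>
  </ul>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

all_lists = soup.find_all('ul')                            # by tag name
by_id = soup.find_all(id='list-1')                         # by attribute keyword
by_name_attr = soup.find_all(attrs={'name': 'elements'})   # by attrs dict
items = soup.find_all(class_='element')                    # by CSS class
item_texts = [li.get_text() for li in items]

first_item = soup.find('li')    # first match only
missing = soup.find('table')    # no match gives None, not an empty list

print(len(all_lists), item_texts, first_item.get_text(), missing)
```

The attrs dict is needed for the HTML attribute called name, because the keyword name is already taken by the first parameter of find_all (the tag name).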
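6. CSS selectors
Pass a CSS selector straight to select() to make the selection.
Prefix a class name with "." and separate nested selectors with a space.
Tag names need no prefix.
Prefix an id with "#".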
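The three selector rules above can be sketched as follows. The markup is hypothetical, and 'html.parser' is used instead of 'lxml'; select() itself is the BeautifulSoup method named in the notes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for this illustration.
html = '''
<div class="panel">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Jay</li>
  </ul>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

nested = soup.select('.panel .list li')   # class selectors, nested via spaces
by_tag = soup.select('ul li')             # plain tag names, no prefix
by_id = soup.select('#list-2 .element')   # id selector followed by a class

# select() returns Tag objects, so attributes and text are read as usual.
nested_text = [li.get_text() for li in nested]
by_id_text = [li.get_text() for li in by_id]
list_ids = [ul['id'] for ul in soup.select('ul')]

print(nested_text, by_id_text, list_ids)
```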
7. The PyQuery library: an HTML parsing library
Installation: pip3 install pyquery
8. Initializing pyquery
Initializing from a string
html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link5.html">second item</a></li>
<li class="item-0"><a href="link5.html">third item</a></li>
<li class="item-1"><a href="link5.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))  # pass a CSS selector to select every <li> element
Initializing from a URL
doc = pq(url='http://www.baidu.com')  # fetches the page over the network
print(doc('head'))
Initializing from a file
doc = pq(filename='demo.html')  # reads a local HTML file
print(doc('li'))
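9. Basic CSS selectors
doc = pq(html)
# The selector must be passed as a string. This assumes html contains an element with id "container".
print(doc('#container .list li'))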
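Getting text
doc = pq(html)
a = doc('.item-0.active a')  # <a> inside an element that has both classes item-0 and active
print(a)
print(a.text())  # text content only, without the tags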
10. DOM manipulation
addClass, removeClass
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')  # drop the class from the selected element
print(li)
li.addClass('active')     # add it back
print(li)
attr, css
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')       # set the name attribute
print(li)
li.css('font-size', '14px')   # set an inline style
print(li)
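remove
doc = pq(html)
wrap = doc(".wrap")       # assumes html contains an element with class "wrap"
print(wrap.text())
wrap.find('p').remove()   # delete the <p> descendants from the tree
print(wrap.text())        # the text of the removed <p> is gone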