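Parsing with XPath (lxml)
XPath is a language for locating information in XML documents; it can be used to traverse elements and attributes. HTML is not strictly a subset of XML, but it has the same tree structure of elements and attributes, and lxml can parse it into the same kind of tree, so XPath works for finding content in HTML as well.
I. Install the lxml module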
pip install lxml
Usage:
1. Build an etree object from the HTML you want to parse.
2. Call the etree object's xpath method with an XPath expression to extract the data.
Simple example:
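The two steps above can be sketched as follows. This is a minimal illustration, assuming a small made-up HTML snippet (html_text is not from the original); it uses etree.HTML, whose parser tolerates markup that is not well-formed XML.

```python
from lxml import etree

# Step 1: build an element tree from the markup.
# etree.HTML() uses lxml's forgiving HTML parser, so the unclosed <li> tags are repaired.
html_text = "<ul><li>first<li>second</ul>"  # made-up sample input
tree = etree.HTML(html_text)

# Step 2: call xpath() with an expression; node-selecting expressions return a list.
items = tree.xpath("//li/text()")
print(items)  # ['first', 'second']
```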
from lxml import etree

xml = '''
<book>
    <id>1</id>
    <name>野花遍地香</name>
    <price>1.23</price>
    <nick>臭豆腐</nick>
    <author>
        <nick id="10086">周大强</nick>
        <nick id="10010">周芷若</nick>
        <nick class="joy">周杰伦</nick>
        <nick class="jolin">蔡依林</nick>
        <div>
            <nick>热了</nick>
        </div>
        <span>
            <nick>热了哦</nick>
        </span>
    </author>
    <partner>
        <nick id="ppc">胖胖陈</nick>
        <nick id="ppbc">胖胖不陈</nick>
    </partner>
</book>
'''
tree = etree.XML(xml)

res = tree.xpath('/book/name/text()')  # text() returns the text content
print(res)  # ['野花遍地香']

res = tree.xpath('/book/author/nick/text()')  # / selects direct children only
print(res)  # ['周大强', '周芷若', '周杰伦', '蔡依林']

res = tree.xpath('/book/author//nick/text()')  # // selects descendants at any depth
print(res)  # ['周大强', '周芷若', '周杰伦', '蔡依林', '热了', '热了哦']

res = tree.xpath('/book/author/*/nick/text()')  # * matches any single element
print(res)  # ['热了', '热了哦']
Example 2:
Suppose we have an HTML file named 1.html:
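The sample document also carries id and class attributes that the queries above do not use. As a small sketch on a trimmed copy of the same XML, attribute predicates can filter elements, and @name can return the attribute values themselves:

```python
from lxml import etree

# Trimmed copy of the sample document, keeping only the elements with attributes.
xml = """
<book>
    <author>
        <nick id="10086">周大强</nick>
        <nick id="10010">周芷若</nick>
        <nick class="joy">周杰伦</nick>
        <nick class="jolin">蔡依林</nick>
    </author>
    <partner>
        <nick id="ppc">胖胖陈</nick>
    </partner>
</book>
"""
tree = etree.XML(xml)

# [@class="joy"] keeps only elements whose class attribute equals "joy".
by_class = tree.xpath('/book/author/nick[@class="joy"]/text()')
# [@id] keeps elements that have an id attribute, whatever its value.
with_id = tree.xpath('/book/author/nick[@id]/text()')
# Ending the path with @id returns the attribute values instead of elements.
ids = tree.xpath('//nick/@id')

print(by_class)  # ['周杰伦']
print(with_id)   # ['周大强', '周芷若']
print(ids)       # ['10086', '10010', 'ppc']
```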
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>Title</title>
</head>
<body>
    <ul>
        <li><a href="http://www.baidu.com">百度</a></li>
        <li><a href="http://www.google.com">谷歌</a></li>
        <li><a href="http://www.sogou.com">搜狗</a></li>
    </ul>
    <ol>
        <li><a href="feiji">飞机</a></li>
        <li><a href="dapao">大炮</a></li>
        <li><a href="huoche">火车</a></li>
    </ol>
    <div class="job">李嘉诚</div>
    <div class="common">胡辣汤</div>
</body>
</html>
It can be parsed as follows:
from lxml import etree

# etree.parse() uses the XML parser by default. That works here because 1.html
# is well-formed; for real-world HTML, pass etree.HTMLParser() as the second argument.
tree = etree.parse('1.html')

result = tree.xpath('/html/body/ul/li/a/text()')
print(result)  # ['百度', '谷歌', '搜狗']

result = tree.xpath('/html/body/ul/li[2]/a/text()')  # XPath positions start at 1
print(result)  # ['谷歌']

result = tree.xpath('/html/body/ol/li/a[@href="dapao"]/text()')  # [@attr="value"] filters by attribute
print(result)  # ['大炮']

ol_li_list = tree.xpath('/html/body/ol/li')
for li in ol_li_list:
    res = li.xpath('./a/text()')  # relative query: keep searching inside this li
    print(res)  # ['飞机'], then ['大炮'], then ['火车']
    res2 = li.xpath('./a/@href')  # attribute value: @attribute
    print(res2)  # ['feiji'], then ['dapao'], then ['huoche']

print(tree.xpath('/html/body/ul/li/a/@href'))
# ['http://www.baidu.com', 'http://www.google.com', 'http://www.sogou.com']
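Because text() and @href each return a list, pairing a link's text with its address takes a little extra work. One possible approach, sketched here on an inline snippet (html_text and the third, href-less link are made up for illustration), is to select the a elements and read both values from each element:

```python
from lxml import etree

# Inline sample so the sketch runs without 1.html; the last <a> has no href on purpose.
html_text = """
<ul>
    <li><a href="http://www.baidu.com">百度</a></li>
    <li><a href="http://www.google.com">谷歌</a></li>
    <li><a>no link</a></li>
</ul>
"""
tree = etree.HTML(html_text)

links = []
for a in tree.xpath("//ul/li/a"):
    text = a.xpath("string(.)").strip()  # string(.) returns one string, not a list
    href = a.get("href")                 # None when the attribute is missing
    links.append((text, href))

print(links)
```

Using a.get() avoids indexing into an empty list when an attribute is absent.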
Example 3: scraping listings from Zhubajie (zbj.com)
import requests
from lxml import etree

url = 'https://beijing.zbj.com/search/f/?type=new&kw=前端开发'
resp = requests.get(url)

# Parse the response body
html = etree.HTML(resp.text)

# One div per service provider. These positional paths depend on the page
# layout and will stop matching if the site changes its markup.
divs = html.xpath('/html/body/div[6]/div/div/div[2]/div[4]/div[1]/div')
for div in divs:
    price = div.xpath("./div/div/a/div[2]/div[1]/span[1]/text()")
    title = div.xpath("./div/div/a/div[2]/div[2]/p/text()")
    print(price, title)
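Long positional paths like the ones above tend to be fragile. As a hedged sketch of an alternative, the extraction can be moved into a function that selects by attribute and tolerates missing fields. The class names (card, price, title), the helper names and the SAMPLE markup are all hypothetical; they are not the real structure of zbj.com, which would need to be inspected in the browser first.

```python
from lxml import etree


def first_text(node, expr):
    """Return the first stripped text match for expr, or None if nothing matches."""
    matches = node.xpath(expr)
    return matches[0].strip() if matches else None


def parse_listings(html_text):
    """Extract (price, title) pairs from markup shaped like SAMPLE below."""
    tree = etree.HTML(html_text)
    results = []
    for card in tree.xpath('//div[@class="card"]'):
        price = first_text(card, './/span[@class="price"]/text()')
        title = first_text(card, './/p[@class="title"]/text()')
        results.append((price, title))
    return results


# Hypothetical markup for illustration only; not the real page structure.
SAMPLE = """
<div class="list">
  <div class="card"><span class="price">100</span><p class="title">Landing page</p></div>
  <div class="card"><p class="title">No price shown</p></div>
</div>
"""
listings = parse_listings(SAMPLE)
print(listings)  # [('100', 'Landing page'), (None, 'No price shown')]
```

With a live page, resp.text from the request above would be passed to parse_listings in place of SAMPLE.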