  • Python: Parsing with XPath and Its Applications

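    1. Introduction to the xpath parsing library:

    # Introduction to the xpath parsing library:
    	Regular expressions were used earlier for data parsing, but writing one that matches precisely is hard, and a mistake in the expression yields wrong data.
    	A web page consists of three parts: HTML, CSS, and JavaScript. HTML tags are nested hierarchically, forming the DOM tree, so when extracting target data you can locate a tag by its position in the page hierarchy and then read its text or attributes.

    # How the xpath parsing library parses data:
    1. Locate node tags according to the page's DOM tree
    2. Get the text content or attribute values of the node tags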
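The two steps above can be sketched on a small in-memory page. The HTML fragment below is hypothetical and exists only for illustration; `etree.HTML` and `xpath` are the same lxml calls used throughout this post.

```python
from lxml import etree

# Hypothetical HTML fragment, used only for illustration.
html = """
<html><body>
  <div id="main">
    <a href="/post/1">First post</a>
    <a href="/post/2">Second post</a>
  </div>
</body></html>
"""

tree = etree.HTML(html)

# Step 1: locate node tags by walking the DOM tree.
links = tree.xpath('//div[@id="main"]/a')

# Step 2: read each node's text and attribute value.
pairs = [(a.text, a.get("href")) for a in links]
print(pairs)
```

Here the elements are located first and then queried through the element API (`.text`, `.get`). The same values can also be requested directly in the expression with `text()` and `@href`.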
    
    # Installing xpath and a first try --> steps:
    1. Install: pip install lxml
    2. Use the requests module to scrape the popular titles from Qiushibaike (糗事百科):
        
    import requests 
    from lxml import etree 
    
    url = 'https://www.qiushibaike.com/' 
    headers = { "User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36' }
    res = requests.get(url=url, headers=headers) 
    # Instantiate the etree object
    tree = etree.HTML(res.text) 
    # Parse the data
    title_lst = tree.xpath('//ul/li/div/a/text()') 
    for item in title_lst: 
        print(item) 
        
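    3. Steps for using xpath:
    from lxml import etree
    tree = etree.HTML(res.text)    # parse HTML text fetched from the network
    # Alternative, for reference only: parse a local HTML file (the class is HTMLParser, not HTMLParse)
    # tree = etree.parse('./test.html', etree.HTMLParser())
    tag_or_attr = tree.xpath('xpath expression')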
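As a minimal sketch of the two entry points, the snippet below parses the same markup once from a string and once from a file-like object. The HTML string is made up for the example, and `StringIO` stands in for a local file so the snippet runs without any file on disk.

```python
from io import StringIO
from lxml import etree

# Hypothetical markup, standing in for res.text or a local file.
html = "<html><head><title>Demo</title></head><body><p>hi</p></body></html>"

# etree.HTML takes a string and returns the root element.
root = etree.HTML(html)

# etree.parse takes a path or file-like object plus a parser
# and returns an element tree.
doc = etree.parse(StringIO(html), etree.HTMLParser())

# Both objects expose the same xpath() method.
title_from_string = root.xpath('//title/text()')
title_from_file = doc.xpath('//title/text()')
print(title_from_string, title_from_file)
```

Either way the result of `xpath()` is a list, so an empty list means nothing matched.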
    

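    2. xpath syntax

    # xpath syntax:
    1. Common rules:
    	1. nodename: select nodes by name
    	2. //: select descendant nodes of the current node (at the start of an expression, nodes anywhere in the document)
    	3. /: select direct children of the current node
    	4. nodename[@attribute="..."]: locate a tag by attribute, e.g. '//div[@class="ui-main"]'
    	5. @attributename: get an attribute value
    	6. text(): get the text
    2. Attribute matching, two cases: multiple attributes & one attribute with multiple values
    	2.1 Multiple attributes, example: tree.xpath('//div[@class="item" and @name="test"]/text()')
    	2.2 One attribute with multiple values, example: tree.xpath('//div[contains(@class, "dc")]/text()')
    3. Selecting by order:
    	3.1 Index: starts at 1, not 0 (easy to forget)
    	3.2 last() function: the last node
    	3.3 position() function: the node's position, combined with > or < to select a range, e.g. position()<4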
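The rules above can be tried against a small fragment. The markup below is hypothetical and was written to exercise each rule once; the expressions follow the examples in the list.

```python
from lxml import etree

# Hypothetical fragment covering the attribute and ordering rules.
html = """
<div class="item" name="test">both attributes</div>
<div class="item">class only</div>
<div class="dc box">two class values</div>
<ul>
  <li>one</li><li>two</li><li>three</li><li>four</li>
</ul>
"""
tree = etree.HTML(html)

# Multiple attributes: both conditions must hold.
multi_attr = tree.xpath('//div[@class="item" and @name="test"]/text()')

# One attribute with several values: contains() does a substring match,
# so it finds "dc" inside "dc box" where @class="dc" would not match.
multi_value = tree.xpath('//div[contains(@class, "dc")]/text()')

# Ordering: indexes start at 1.
first = tree.xpath('//ul/li[1]/text()')
last = tree.xpath('//ul/li[last()]/text()')
first_two = tree.xpath('//ul/li[position()<3]/text()')

print(multi_attr, multi_value, first, last, first_two)
```

Because `contains()` is a plain substring test, it would also match a class such as "abcdc", so it is best used when the class names on the page are known not to overlap.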
    

    3. xpath code demo

    from lxml import etree
    
    # 1. Instantiate an etree object
    # tree = etree.HTML('text data')    # parse content fetched directly from the network
    # reel = etree.parse('file path', etree.HTMLParser())   # parse a local HTML file
    reel = etree.parse('./test.html',etree.HTMLParser())   # parse a local HTML file
    
    # 2. Call xpath expressions to locate tags and get their attributes and text
    # 2.1 Locate by node name
    
    title = reel.xpath('//title/text()')   # this xpath query returns a list
    # print(title)
    
    # 3. Locate the tag with id "007" and get its direct text
    div_oo7 = reel.xpath('//div[@id="007"]/text()')
    # print(div_oo7)
    
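    # //text() also collects the text of all descendant nodes; the id is quoted so it is compared as a string
    div_008 = reel.xpath('//div[@id="007"]//text()')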
    # print(div_008)
    
    # 4. Get a node's attribute value
    a_tag = reel.xpath('//a/@href')
    # print(a_tag)
    
    # 5. Multi-attribute matching and single-attribute multi-value matching
    # Multi-attribute matching
    div_009 = reel.xpath('//div[@class="c1" and @name="laoda"]/text()')
    # print(div_009)
    
    # Single-attribute multi-value matching
    div_010 = reel.xpath('//div[contains(@class,"c3")]/text()')
    # print(div_010)
    
    # 6. Selecting by order
    div_011 = reel.xpath('//div[@class="divtag"]/ul/li/text()')
    # print(div_011)
    
    div_012 = reel.xpath('//div[@class="divtag"]/ul/li[4]/text()')
    # print(div_012)
    
    # div_013 = reel.xpath('//div[@class="divtag"]/ul/li[last()-1]/text()')
    # print(div_013)
    
    div_014 = reel.xpath('//div[@class="divtag"]/ul/li[position()<4]/text()')
    print(div_014)
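The demo above reads `./test.html`, whose contents are not shown in the post. To try the expressions, one possible stand-in is sketched below. The page contents are entirely hypothetical, chosen so that each query has something to match, and the file is written to a temporary directory rather than the working directory.

```python
import os
import tempfile
from lxml import etree

# Hypothetical page contents; the original test.html is not shown in the post.
SAMPLE = """<html>
<head><title>xpath test</title></head>
<body>
  <div id="007">direct text<span>nested text</span></div>
  <div class="c1" name="laoda">boss</div>
  <div class="c1 c3">multi</div>
  <div class="divtag">
    <ul><li>1</li><li>2</li><li>3</li><li>4</li><li>5</li></ul>
  </div>
  <a href="https://example.com/">link</a>
</body>
</html>"""

# Write the sample to a temporary file and parse it as a local HTML file.
path = os.path.join(tempfile.mkdtemp(), "test.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(SAMPLE)

reel = etree.parse(path, etree.HTMLParser())

title = reel.xpath('//title/text()')
direct = reel.xpath('//div[@id="007"]/text()')      # direct text only
all_text = reel.xpath('//div[@id="007"]//text()')   # includes descendants
hrefs = reel.xpath('//a/@href')
fourth = reel.xpath('//div[@class="divtag"]/ul/li[4]/text()')
second_last = reel.xpath('//div[@class="divtag"]/ul/li[last()-1]/text()')
first_three = reel.xpath('//div[@class="divtag"]/ul/li[position()<4]/text()')
print(title, direct, all_text, hrefs, fourth, second_last, first_three)
```

With five list items, `li[4]` and `li[last()-1]` select the same node, which shows that indexing counts from 1.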
    
    

    4. Douban example

    import requests
    from lxml import etree
    url = 'https://movie.douban.com/chart'
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
    }
    res = requests.get(url=url,headers=headers)
    tree = etree.HTML(res.text)
    ret = tree.xpath('//div[@class="pl2"]')
    for i in ret:
        title = i.xpath('./a//text()')
        title_full = ''
        for j in title:
            c = j.replace('\n','').replace(' ','')
            title_full += c
        author = i.xpath('./p//text()')
        pj = i.xpath('./div/span[2]/text()')
        pf = i.xpath('./div/span[3]/text()')
        print(title_full)
        print(author[0])
        print(pj[0])
        print(pf[0])
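Indexing with `[0]` raises `IndexError` when an entry lacks a field, for example an item with no rating. A more defensive variant of the loop is sketched below. The helpers `clean` and `first` are hypothetical names introduced here, and the markup is a made-up fragment that only mimics the structure the XPath expressions expect, not the real Douban page.

```python
from lxml import etree

def clean(parts):
    """Join text fragments and drop all whitespace.

    This is slightly broader than the two replace() calls in the loop
    above, since it also removes tabs and other whitespace.
    """
    return ''.join(''.join(parts).split())

def first(nodes, default=''):
    """Return the first xpath result, or a default when nothing matched."""
    return nodes[0] if nodes else default

# Hypothetical markup shaped like the entries the expressions target.
html = """
<div class="pl2">
  <a href="#">Movie
     / <span>Alias</span></a>
  <p class="pl">2019 / Director</p>
  <div class="star"><span class="s1"></span><span class="rating_nums">8.5</span><span class="pl">(1000 ratings)</span></div>
</div>
<div class="pl2"><a href="#">Untitled</a></div>
"""
tree = etree.HTML(html)

rows = []
for item in tree.xpath('//div[@class="pl2"]'):
    rows.append({
        'title': clean(item.xpath('./a//text()')),
        'info': first(item.xpath('./p//text()')),
        'rating': first(item.xpath('./div/span[2]/text()'), 'N/A'),
        'votes': first(item.xpath('./div/span[3]/text()'), 'N/A'),
    })

for row in rows:
    print(row)
```

The second entry has no `p` or rating block, so it falls back to the defaults instead of stopping the loop.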
    
    
  • Original article: https://www.cnblogs.com/xinzaiyuan/p/12382200.html