  • Data Parsing

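    ## I. Data Parsing
    
    ### 1. XPath parsing (usable from any language used for scraping)
    
    #### (1) Installation
    
    ```
    pip install lxml
    ```
    
    #### (2) How parsing works
    
    ```
    - Fetch the page source.
    - Instantiate an etree object and load the page source into it.
    - Call the object's xpath method to locate the target tags (the xpath method must be given an XPath expression, which both locates the tags and captures their content).
    ```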
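The three steps above can be sketched as follows. This is a minimal illustration, not the author's code: the inline `page_source` string is a made-up stand-in for a page downloaded with `requests`, so the sketch runs without network access.

```python
from lxml import etree

# Step 1: obtain the page source. A hypothetical inline string stands in
# for the text of a downloaded page.
page_source = """
<html><body>
  <div class="song"><p>first</p><p>second</p></div>
</body></html>
"""

# Step 2: instantiate an etree object and load the page source into it.
tree = etree.HTML(page_source)

# Step 3: call xpath with an XPath expression to locate tags and capture content.
paragraphs = tree.xpath('//div[@class="song"]/p/text()')
print(paragraphs)
```

Note that `xpath` returns a list even when only one node matches, so a single value is usually read with `[0]`.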
    
    #### (3) XPath syntax (the return value is a list)
    
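    ```
    Attribute-based location
        /  is like the CSS child combinator > (at the start of an expression it always begins from the root node)
        // is like the CSS descendant combinator ' ' (a space)
        @  denotes an attribute
        Example: //div[@class="song"]
    Index-based location (indexes start at 1)
        //ul/li[2]
    Logical operators
        //a[@href='' and @class='du']    and
        //a[@href='' or @class='du']     or (| is not valid inside a predicate; it joins two whole path expressions)
    Fuzzy matching
        //div[contains(@class,'ng')]
        //div[starts-with(@class,'ng')]
    Getting text
        //div/text()     text that is a direct child of the div
        //div//text()    text at any depth under the div (returned as a list)
    Getting attributes
        //div/@href
    ```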
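To see each form of the syntax in action, the sketch below runs the expressions against a small hypothetical HTML fragment. The markup, class names and href values are invented for illustration only.

```python
from lxml import etree

# Hypothetical markup, made up only to exercise the expressions.
html = """
<div class="song" href="/top">lead <span>nested</span></div>
<div class="tang"><ul><li>one</li><li>two</li><li>three</li></ul></div>
<a href="/a" class="du">both</a>
<a href="/b" class="du">class only</a>
<a href="/c" class="other">neither</a>
"""
tree = etree.HTML(html)

# Attribute-based location: returns a list of matching elements.
by_attribute = tree.xpath('//div[@class="song"]')

# Index-based location: indexes start at 1, so li[2] is the second item.
second_item = tree.xpath('//ul/li[2]/text()')

# Logical operators inside a predicate use the keywords and / or.
both = tree.xpath('//a[@href="/a" and @class="du"]/text()')
either = tree.xpath('//a[@href="/a" or @class="du"]/text()')

# Fuzzy matching on an attribute value.
contains_ng = tree.xpath('//div[contains(@class,"ng")]/@class')
starts_ta = tree.xpath('//div[starts-with(@class,"ta")]/@class')

# /text() takes direct child text only; //text() takes text at any depth.
direct_text = tree.xpath('//div[@class="song"]/text()')
all_text = tree.xpath('//div[@class="song"]//text()')

# Attribute values are read with /@name.
href_value = tree.xpath('//div/@href')

print(second_item, both, either, contains_ng, starts_ta)
print(direct_text, all_text, href_value)
```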
    
    #### (4) Case studies
    
    ##### Case 1: Scraping second-hand housing listings from 58.com
    
    ```python
    import os
    import requests
    from lxml import etree
    url='https://bj.58.com/changping/ershoufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d30000c-0000-1cc0-306c-511ad17612b3&ClickID=1'
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
    }
    # Download the page source and load it into an etree object
    origin_data=requests.get(url=url,headers=headers).text
    tree=etree.HTML(origin_data)
    # | unions two paths: the title text plus all text under the third div of each listing
    title_price_list=tree.xpath('//ul[@class="house-list-wrap"]/li/div[2]/h2/a/text() | //ul[@class="house-list-wrap"]/li/div[3]//text()')
    # Create the output folder if it is missing; the with block closes the file on exit
    os.makedirs('./文件夹1',exist_ok=True)
    with open('./文件夹1/fangyuan.txt','w',encoding='utf-8') as f:
        for title_price in title_price_list:
            f.write(title_price)
    print("over")
    ```
    
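    ###### *Note: distinguish whether the XPath runs against the whole page source or a local fragment*
    
    ```
    Whole page source
        tree.xpath('//ul...')
    Local fragment (an element returned by an earlier xpath call)
        element.xpath('./ul...') # starts with .
    ```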
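The difference matters because an expression starting with `//` searches the whole document even when it is called on a sub-element. A minimal sketch, using a hypothetical two-list fragment:

```python
from lxml import etree

# Hypothetical markup with two separate lists.
html = """
<ul id="a"><li>a1</li><li>a2</li></ul>
<ul id="b"><li>b1</li></ul>
"""
tree = etree.HTML(html)

# Take the first list as the local fragment.
first_ul = tree.xpath('//ul[@id="a"]')[0]

# Starting with . keeps the search inside first_ul.
relative = first_ul.xpath('./li/text()')

# Starting with // searches from the document root, even though the call
# is made on first_ul.
absolute = first_ul.xpath('//li/text()')

print(relative)
print(absolute)
```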
    
    ##### Testing whether an XPath expression is correct
    
    ###### Method 1: xpath.crx (XPath browser extension)
    
    ```
    Open the browser's More tools > Extensions page
    Turn on developer mode
    Drag xpath.crx into the browser
    Shortcut to launch the XPath extension: ctrl+shift+x
    Purpose: testing whether an XPath expression is correct
    ```
    
    ###### Method 2: built into the browser
    
    The browser's developer tools can also evaluate XPath, for example through the search box of the Elements panel in Chrome.
    
    
    
    ##### Case 2: Scraping images from the 4K picture site (pic.netbian.com)
    
    ```python
    import os
    import requests
    from lxml import etree
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
    }
    page_num=int(input("Number of pages to scrape: "))
    os.makedirs('./文件夹2',exist_ok=True)
    for page in range(1,page_num+1):
        # The first page is index.html; later pages follow the index_N.html pattern
        if page==1:
            url='http://pic.netbian.com/4kyingshi/index.html'
        else:
            url='http://pic.netbian.com/4kyingshi/index_%d.html' % page
        origin_data=requests.get(url=url,headers=headers).text
        tree=etree.HTML(origin_data)
        a_list=tree.xpath('//ul[@class="clearfix"]/li/a')
        for a in a_list:
            # XPath relative to the current <a> element
            name=a.xpath('./b/text()')[0]
            # Repair the garbled title: undo the ISO-8859-1 decoding, then decode as GBK
            name=name.encode('iso-8859-1').decode('gbk')
            img_url='http://pic.netbian.com'+a.xpath('./img/@src')[0]
            picture=requests.get(url=img_url,headers=headers).content
            picture_name='./文件夹2/'+name+'.jpg'
            # The with block closes the file, so no explicit close() is needed
            with open(picture_name,'wb') as f:
                f.write(picture)
    print('over!!!')
    ```
    
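    ###### Garbled Chinese text
    
    ```
    Method 1: set the response encoding before reading .text
        response.encoding='gbk'
    Method 2: re-encode the garbled string and decode it with the page's real encoding
        name=name.encode('iso-8859-1').decode('utf-8')   # use 'gbk' for a GBK page
    ```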
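Method 2 works because decoding bytes as ISO-8859-1 never loses information: every byte maps to exactly one character, so the original bytes can be recovered and decoded again with the right codec. The sketch below reproduces the problem offline with a made-up sample string; it assumes the page is really GBK and was wrongly decoded as ISO-8859-1.

```python
# Hypothetical sample title, standing in for text scraped from a GBK page.
original = '风景壁纸'

# What the server actually sends: GBK-encoded bytes.
raw_bytes = original.encode('gbk')

# What the scraper sees when those bytes are decoded with the wrong codec.
garbled = raw_bytes.decode('iso-8859-1')

# Method 2: turn the garbled text back into the original bytes,
# then decode with the page's real encoding.
repaired = garbled.encode('iso-8859-1').decode('gbk')

print(repr(garbled))
print(repaired)
```

The codec passed to `decode` has to match the page's real encoding, which is why Case 2 uses `'gbk'` while other sites need `'utf-8'`.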
    
    ###### Where the data comes from
    
    ```
    etree.HTML()  # parses an HTML string, e.g. data fetched over the network
    etree.parse() # parses a local file
    ```
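A small sketch of the two entry points side by side. The HTML string and the temporary file are hypothetical stand-ins for a network response and a saved page. Passing `etree.HTMLParser()` to `etree.parse` is an addition here, so that the local file is parsed as HTML rather than strict XML.

```python
import os
import tempfile
from lxml import etree

# Hypothetical page content.
html = '<html><body><p class="msg">hello</p></body></html>'

# etree.HTML parses a string (for example the .text of a response)
# and returns the root element.
from_string = etree.HTML(html)
string_text = from_string.xpath('//p[@class="msg"]/text()')

# etree.parse reads from a file path and returns an ElementTree.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(html)
    path = f.name
try:
    from_file = etree.parse(path, etree.HTMLParser())
    file_text = from_file.xpath('//p[@class="msg"]/text()')
finally:
    os.remove(path)  # clean up the temporary file

print(string_text, file_text)
```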
    
    
    
    ##### Case 3: Scraping images from jandan.net
    
    ```python
    import base64
    import os
    import requests
    from lxml import etree
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
    }
    url='http://jandan.net/ooxx'
    origin_data=requests.get(url=url,headers=headers).text
    tree=etree.HTML(origin_data)
    # Each image has a span that holds its base64-encoded address
    span_list=tree.xpath('//span[@class="img-hash"]/text()')
    os.makedirs('./文件夹3',exist_ok=True)
    for span in span_list:
        # Decode the string and prepend the scheme to get the image URL
        src='http:'+base64.b64decode(span).decode("utf-8")
        picture_data=requests.get(url=src,headers=headers).content
        # Use the last path segment of the URL as the file name
        name='./文件夹3/'+src.split("/")[-1]
        with open(name,'wb') as f:
            f.write(picture_data)
    print('over!!!')
    ```
    
    
    
    ###### Anti-scraping mechanism 3: base64
    
    In the response data every image has the same src, and each image comes with a span tag holding an encoded string. The page also contains a function named jandan_load_img, so the guess is that this function turns the encoded string into the image address.
    
    Search globally for this function.
    
    The function turns out to use jdtPGUg7oYxbEGFASovweZE267FFvm5aYz.
    
    Search globally for jdtPGUg7oYxbEGFASovweZE267FFvm5aYz.
    
    That function ends with a call to base64_decode.
    
    So the conclusion is that base64-decoding the string yields the image address.
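The decoding step can be tried offline. In the sketch below the helper name `decode_img_hash` and the sample address are hypothetical: a real hash would come from the page's span tags, and the sample is built by encoding a made-up protocol-relative address so the round trip can be checked.

```python
import base64

def decode_img_hash(img_hash: str) -> str:
    """Base64-decode a hash string and prepend the scheme, as Case 3 does."""
    return 'http:' + base64.b64decode(img_hash).decode('utf-8')

# Build a sample hash from a made-up address (not a real jandan.net value).
sample_path = '//example.com/images/demo.jpg'
sample_hash = base64.b64encode(sample_path.encode('utf-8')).decode('ascii')

url = decode_img_hash(sample_hash)
print(sample_hash)
print(url)
```

Base64 is an encoding rather than encryption, so no key is needed once the scheme has been identified.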
    
    
    
    ##### Case 4: Scraping résumé templates from sc.chinaz.com (站长素材)
    
    ```python
    import os
    import random
    import requests
    from lxml import etree
    headers={
        'Connection':'close',
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
    }
    url='http://sc.chinaz.com/jianli/free.html'
    origin_data=requests.get(url=url,headers=headers).text
    tree=etree.HTML(origin_data)
    # Links to each template's detail page
    src_list=tree.xpath('//div[@id="main"]/div/div/a/@href')
    os.makedirs('./文件夹4',exist_ok=True)
    for src in src_list:
        filename='./文件夹4/'+src.split('/')[-1].split('.')[0]+'.rar'
        print(filename)
        down_page_data=requests.get(url=src,headers=headers).text
        down_tree=etree.HTML(down_page_data)
        # The detail page offers several download links; pick one at random
        down_list=down_tree.xpath('//div[@id="down"]/div[2]/ul/li/a/@href')
        res=random.choice(down_list)
        print(res)
        jianli=requests.get(url=res,headers=headers).content
        with open(filename,'wb') as f:
            f.write(jianli)
    print('over!!!')
    ```
    
    
    
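    ###### Anti-scraping mechanism 4: Connection
    
    Typical error
    
    ```
    HTTPConnectionPool(host:xx) Max retries exceeded with url
    ```
    
    Causes
    
    ```
    1. The client must establish a TCP connection with the server before transferring data. To save overhead the default is keep-alive, meaning one connection is reused for several transfers. If connections are not released for a long time, the connection pool fills up, no new connection object can be created, and the request cannot be sent.
    2. The IP has been banned.
    3. Requests are sent too frequently.
    ```
    
    Fixes
    
    ```
    1. Set the Connection request header to close so the connection is dropped after each successful request.
    2. Switch to a different request IP.
    3. Call sleep between requests to space them out.
    ```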
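Fixes 1 and 3 can be combined in one small helper. This is a sketch under stated assumptions: `fetch_all` and `fake_fetch` are hypothetical names, and the `fetch` callable is injected so the example runs without network access.

```python
import time

# Fix 1: ask the server to close the connection after each response.
HEADERS = {
    'Connection': 'close',
    'User-Agent': 'Mozilla/5.0',
}

def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url, headers) for each URL, sleeping between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # Fix 3: space the requests out
        results.append(fetch(url, HEADERS))
    return results

# A stand-in for a real download function, used only for demonstration.
def fake_fetch(url, headers):
    return (url, headers['Connection'])

pages = fetch_all(['http://example.com/1', 'http://example.com/2'],
                  fake_fetch, delay=0.01)
print(pages)
```

With `requests`, the stand-in could be replaced by something like `lambda url, headers: requests.get(url, headers=headers).content`.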
    
    
    
    ##### Case 5: Parsing all city names
    
    ```python
    import os
    import requests
    from lxml import etree
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
    }
    url='https://www.aqistudy.cn/historydata/'
    origin_data=requests.get(url=url,headers=headers).text
    tree=etree.HTML(origin_data)
    # First block: its heading text plus the city links listed under it
    hot_list=tree.xpath('//div[@class="row"]/div/div[1]/div/text() | //div[@class="row"]/div/div[1]/div[@class="bottom"]/ul[@class="unstyled"]/li/a/text()')
    # Second block: its heading text plus every text node inside its lists
    common_list=tree.xpath('//div[@class="row"]/div/div[2]/div[1]/text() | //div[@class="row"]/div/div[2]/div[2]/ul//text()')
    os.makedirs('./文件夹1',exist_ok=True)
    with open('./文件夹1/city.txt','w',encoding='utf-8') as f:
        for hot in hot_list:
            f.write(hot.strip())
        for common in common_list:
            f.write(common.strip())
    print('over!!!')
    ```
    
    
    
    ##### Case 6: Lazy-loaded images, wedding photos from sc.chinaz.com (站长素材)
    
    ```python
    import os
    import requests
    from lxml import etree
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
    }
    url='http://sc.chinaz.com/tupian/hunsha.html'
    origin_data=requests.get(url=url,headers=headers).text
    tree=etree.HTML(origin_data)
    div_list=tree.xpath('//div[@id="container"]/div')
    os.makedirs('./文件夹1',exist_ok=True)
    
    for div in div_list:
        # Repair the garbled title: undo the ISO-8859-1 decoding, then decode as UTF-8
        title=div.xpath('./p/a/text()')[0].encode('iso-8859-1').decode('utf-8')
        name='./文件夹1/'+title+'.jpg'
        # Link to the photo's detail page
        photo_url=div.xpath('./div/a/@href')[0]
        
        detail_data=requests.get(url=photo_url,headers=headers).text
        detail_tree=etree.HTML(detail_data)
        # Address of the image itself on the detail page
        url_it=detail_tree.xpath('//div[@class="imga"]/a/img/@src')[0]
    
        photo_data=requests.get(url=url_it,headers=headers).content
        with open(name,'wb') as f:
            f.write(photo_data)
        
    print('over!!!')
    ```
    
    ###### Anti-scraping mechanism 5: proxy IPs
    
    Usage
    
    ```python
    import random
    import requests
    # Pool of proxies; each key matches the scheme of the target URL (https)
    proxie=[{'https':'116.197.134.153:80'},{'https':'103.224.100.43:8080'},{'https':'222.74.237.246:808'}]
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
    }
    url='https://www.baidu.com/s?wd=ip'
    # Route the request through a randomly chosen proxy
    origin_data=requests.get(url=url,headers=headers,proxies=random.choice(proxie)).text
    
    # Save the result page, which shows the IP the server saw
    with open('./ip.html','w',encoding='utf-8') as f:
        f.write(origin_data)
        
    print('over!!!')
    ```
    
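    Common proxy sites
    
    ```
    www.goubanjia.com
    Kuaidaili (快代理)
    Xici Daili (西祠代理)
    ```
    
    Proxy basics
    
    ```
    Transparent: the server knows a proxy is in use and knows the real IP
    Anonymous: the server knows a proxy is in use but does not know the real IP
    Elite (high anonymity): the server does not know a proxy is in use, let alone the real IP
    ```
    
    *Note: the proxy's type must match the protocol (scheme) of the request URL*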
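One way to respect the note above is to choose the proxy by the scheme of the target URL. The sketch below is an illustration only: `pick_proxy` and `PROXY_POOL` are hypothetical names, and the addresses are placeholders from the documentation range 192.0.2.0/24, not working proxies.

```python
import random
from urllib.parse import urlparse

# Placeholder proxies grouped by the scheme they are meant for.
PROXY_POOL = {
    'http': ['192.0.2.10:8080', '192.0.2.11:8080'],
    'https': ['192.0.2.20:8080'],
}

def pick_proxy(url, pool=PROXY_POOL):
    """Return a proxies dict whose key matches the scheme of url."""
    scheme = urlparse(url).scheme
    candidates = pool.get(scheme)
    if not candidates:
        raise ValueError('no proxy available for scheme %r' % scheme)
    return {scheme: random.choice(candidates)}

proxies = pick_proxy('https://www.baidu.com/s?wd=ip')
print(proxies)
```

The returned dict has the shape expected by the `proxies` argument of `requests.get`.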
    
    *Download movies from https://www.55xia.com*
    
    *Order: dynamic loading, URL encryption, element*

  • Original article: https://www.cnblogs.com/shanghongyun/p/10482432.html