  • Extracting data with Selectors, part 1: XPath selectors

    1. What is XPath?

    XPath, the XML Path Language, is a language for addressing parts of an XML document. XPath itself follows the W3C standard.
    An XML document (HTML can be treated the same way) is a tree made up of nodes. For example, here is a fragment of HTML crawled from the web (a few example XPath paths into this tree follow the snippet):

    <div class="post-114638 post type-post status-publish format-standard hentry category-it-tech tag-linux odd" id="post-114638">
    
        <!-- BEGIN .entry-header -->
        <div class="entry-header">
            <h1>能从远程获得乐趣的 Linux 命令</h1>
        </div>
    
        <div class="entry-meta">
            <p class="entry-meta-hide-on-mobile">
                2019/01/13 &middot;  <a href="http://blog.jobbole.com/category/it-tech/" rel="category tag">IT技术</a>
                &middot;  <a href="http://blog.jobbole.com/tag/linux/">Linux</a>
            </p>
        </div>
        <!-- END .entry-meta -->
    
        <div class="entry"></div>
        <div class="textwidget"></div>
    </div>
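
    A few example XPath paths into this tree, for orientation (a sketch added here, not from the original post):

    //div[@class="entry-header"]/h1                      selects the <h1> element node
    //div[@class="entry-header"]/h1/text()               selects the text node inside it
    //p[@class="entry-meta-hide-on-mobile"]/a/@href      selects the href attribute of each <a> link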
    
    2. Debugging XPath with the shell provided by Scrapy
    Building a Selector object yourself

    There are several ways to build a Selector object; here we only introduce one that is simple and convenient for debugging and learning XPath.

    1. Create a Selector object:
    In [1]: from scrapy.selector import Selector
    In [2]: body = "<book><author>Tom John</author></book>"
    In [3]: selector = Selector(text=body)
    In [4]: selector
    Out[4]: <Selector xpath=None data='<html><body><book><author>Tom John</auth'>
    
    2. Select & extract data:
    In [5]: selector.xpath('//book/author/text()')
    Out[5]: [<Selector xpath='//book/author/text()' data='Tom John'>]
    
    In [40]: selector.xpath('string(//author)')
    Out[40]: [<Selector xpath='string(//author)' data='Tom John'>]
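
    To get plain strings out of these results, extract() / extract_first() can be chained on (a quick sketch in the same session; the In/Out numbers are illustrative):

    In [6]: selector.xpath('//book/author/text()').extract()
    Out[6]: ['Tom John']

    In [7]: selector.xpath('//book/author/text()').extract_first()
    Out[7]: 'Tom John'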
    

    Using regular expressions (the response object below comes from the Scrapy shell session shown in the next part):

    >>> response.xpath('//*[@id="post-114638"]/div[3]/div[5]/span[2]/text()').re('\d*')
    ['', '1', '', '', '', '']
    >>> response.xpath('//*[@id="post-114638"]/div[3]/div[5]/span[2]/text()')
    [<Selector xpath='//*[@id="post-114638"]/div[3]/div[5]/span[2]/text()' data=' 1 收藏'>]
    >>> response.xpath('//*[@id="post-114638"]/div[3]/div[5]/span[2]/text()').re('\d+')
    ['1']
    >>> response.xpath('//*[@id="post-114638"]/div[3]/div[5]/span[2]/text()').re('\d+')[0]
    '1'
    >>> response.xpath('//*[@id="post-114638"]/div[3]/div[5]/span[2]/text()').re('.*(\d+).*')[0]
    '1'
    >>> response.xpath('//*[@id="post-114638"]/div[3]/div[5]/span[2]/text()').re('.*(\d+).*').group(1)
    Traceback (most recent call last):
      File "<console>", line 1, in <module>
    AttributeError: 'list' object has no attribute 'group'
    >>>
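
    As the traceback shows, .re() returns a plain list, not a match object, so group() is not available. If only the first match is needed, the SelectorList helper re_first() avoids the manual indexing (a sketch in the same session):

    >>> response.xpath('//*[@id="post-114638"]/div[3]/div[5]/span[2]/text()').re_first('\d+')
    '1'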
    
    Using the shell provided by Scrapy

    Use the shell provided by Scrapy to debug: scrapy shell http://blog.jobbole.com/114638/

    (Py3_spider) D:\SpiderProject\spider_pjt1>scrapy shell http://blog.jobbole.com/114638/
    2019-01-31 10:37:25 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: spider_pjt1)
    2019-01-31 10:37:25 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17763-SP0
    2019-01-31 10:37:25 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'spider_pjt1', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'spider_pjt1.spiders', 'SPIDER_MODULES': ['spider_pjt1.spiders']}
    2019-01-31 10:37:25 [scrapy.extensions.telnet] INFO: Telnet Password: 4f8f06a70c3e7ec1
    2019-01-31 10:37:25 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole']
    2019-01-31 10:37:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2019-01-31 10:37:26 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2019-01-31 10:37:26 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2019-01-31 10:37:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2019-01-31 10:37:26 [scrapy.core.engine] INFO: Spider opened
    2019-01-31 10:37:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://blog.jobbole.com/114638/> (referer: None)
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x00000228574EFB70>
    [s]   item       {}
    [s]   request    <GET http://blog.jobbole.com/114638/>
    [s]   response   <200 http://blog.jobbole.com/114638/>
    [s]   settings   <scrapy.settings.Settings object at 0x00000228574EFA90>
    [s]   spider     <JobboleSpider 'jobbole' at 0x22857795d68>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser
    >>>
    

    Now you can debug right here, for example:

    >>> title = response.xpath('//*[@id="post-114638"]/div[1]/h1')
    >>> title
    [<Selector xpath='//*[@id="post-114638"]/div[1]/h1' data='<h1>能从远程获得乐趣的 Linux 命令</h1>'>]
    >>>
    >>> title.extract()
    ['<h1>能从远程获得乐趣的 Linux 命令</h1>']
    >>> title.extract()[0]
    '<h1>能从远程获得乐趣的 Linux 命令</h1>'
    >>>
    

    Since xpath() returns Selector objects, you can chain further operations on them:

    >>> title.xpath('//div[@class="entry-header"]/h1/text()')
    [<Selector xpath='//div[@class="entry-header"]/h1/text()' data='能从远程获得乐趣的 Linux 命令'>]
    >>> title.xpath('//div[@class="entry-header"]/h1/text()').extract()
    ['能从远程获得乐趣的 Linux 命令']
    >>> title.xpath('//div[@class="entry-header"]/h1/text()').extract()[0]
    '能从远程获得乐趣的 Linux 命令'
    >>>
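
    Note that a path starting with // always searches from the document root, even when called on a sub-Selector such as title. To search only inside the selected node, use a relative path starting with .// (a sketch, assuming the same page):

    >>> title.xpath('.//text()').extract()[0]
    '能从远程获得乐趣的 Linux 命令'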
    

    Note that text() only returns the element's own text nodes; when a child tag such as <a> is encountered, the text is split into separate nodes, so taking extract()[0] yields only the part before the first child tag. For example:

    The HTML code:

    <div class="post-114638 post type-post status-publish format-standard hentry category-it-tech tag-linux odd" id="post-114638">
    
        <!-- BEGIN .entry-header -->
        <div class="entry-header">
            <h1>能从远程获得乐趣的 Linux 命令</h1>
        </div>
    
        <div class="entry-meta">
            <p class="entry-meta-hide-on-mobile">
                2019/01/13 &middot;  <a href="http://blog.jobbole.com/category/it-tech/" rel="category tag">IT技术</a>
                &middot;  <a href="http://blog.jobbole.com/tag/linux/">Linux</a>
            </p>
        </div>
        <!-- END .entry-meta -->
    
        <div class="entry"></div>
        <div class="textwidget"></div>
    </div>
    

    Getting the text with text():

    >>> response.xpath('//*[@id="post-114638"]/div[2]/p').extract()[0]
    '<p class="entry-meta-hide-on-mobile">
    
                2019/01/13 ·  <a href="http://blog.jobbole.com/category/it-tech/" rel="category tag">IT技术</a>
                
                
    
                
                 ·  <a href="http://blog.jobbole.com/tag/linux/">Linux</a>
                
    </p>'
    >>>
    >>> response.xpath('//*[@id="post-114638"]/div[2]/p/text()').extract()[0]
    '
    
                2019/01/13 ·  '
    >>>
    >>> response.xpath('//*[@id="post-114638"]/div[2]/p/text()').extract()[0].replace('·','').strip()
    '2019/01/13'
    >>>
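
    If the full text content of the <p> is wanted as a single string rather than just its first text node, the XPath string() function shown earlier can be applied to the element (a sketch; the exact whitespace in the result depends on the page):

    >>> response.xpath('string(//*[@id="post-114638"]/div[2]/p)').extract_first()
    # -> one string containing '2019/01/13', 'IT技术' and 'Linux' along with the surrounding whitespace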
    

    Extension

    Let's analyze the code in the following two files:

    This is the file generated automatically by scrapy genspider jobbole blog.jobbole.com (spider_pjt1\spider_pjt1\spiders\jobbole.py):

    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class JobboleSpider(scrapy.Spider):
        name = 'jobbole'
        allowed_domains = ['blog.jobbole.com']
        start_urls = ['http://blog.jobbole.com/']
    
        def parse(self, response):
            pass
    

    This class inherits from scrapy.Spider; here is a snippet of scrapy.Spider's code (Envs\Py3_spider\Lib\site-packages\scrapy\spiders\__init__.py):

        def start_requests(self):
            cls = self.__class__
            if method_is_overridden(cls, Spider, 'make_requests_from_url'):
                warnings.warn(
                    "Spider.make_requests_from_url method is deprecated; it "
                    "won't be called in future Scrapy releases. Please "
                    "override Spider.start_requests method instead (see %s.%s)." % (
                        cls.__module__, cls.__name__
                    ),
                )
                for url in self.start_urls:
                    yield self.make_requests_from_url(url)
            else:
                for url in self.start_urls:
                    yield Request(url, dont_filter=True)
    
        def make_requests_from_url(self, url):
            """ This method is deprecated. """
            return Request(url, dont_filter=True)
    

    After the Scrapy downloader (DOWNLOADER) finishes downloading, execution comes back and parse() is called; the response parameter in parse(self, response) is similar to the response object in Django.
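
    A minimal sketch (illustrative only, not code from this project) of what such a callback does with the response it receives:

    def parse(self, response):
        # the downloaded Response carries the url, status, headers and body...
        print(response.url, response.status)
        # ...plus Scrapy's selector helpers layered on top of it
        page_title = response.xpath('//title/text()').extract_first()
        yield {'title': page_title}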

    PyCharm has no built-in Scrapy run configuration template, so we can define our own main.py to invoke the command line for debugging; it uses a built-in function provided by Scrapy (scrapy.cmdline.execute) that runs a Scrapy command. The main.py file (spider_pjt1\main.py) is shown below, in section 3.

    3. Debugging in PyCharm and other IDEs

    We take PyCharm as the example; other IDEs are similar.

    Preamble:
    The command to start a spider is scrapy crawl spider_name, where spider_name matches the name attribute of JobboleSpider; make sure you run the command from the directory that contains scrapy.cfg.

    (Py3_spider) D:\SpiderProject>cd spider_pjt1
    (Py3_spider) D:\SpiderProject\spider_pjt1>scrapy crawl jobbole
    ...
    ModuleNotFoundError: No module named 'win32api'
    

    It complains that the win32api module is missing.
    Install pypiwin32; this problem generally only appears in a Windows environment.
    Alternatively, use the Douban mirror: pip install -i https://pypi.douban.com/simple pypiwin32

    (Py3_spider) D:\SpiderProject\spider_pjt1>pip install pypiwin32
    Collecting pypiwin32
      Downloading https://files.pythonhosted.org/packages/d0/1b/2f292bbd742e369a100c91faa0483172cd91a1a422a6692055ac920946c5/pypiwin32-223-py3-none-any.whl
    Collecting pywin32>=223 (from pypiwin32)
      Downloading https://files.pythonhosted.org/packages/a3/8a/eada1e7990202cd27e58eca2a278c344fef190759bbdc8f8f0eb6abeca9c/pywin32-224-cp37-cp37m-win_amd64.whl (9.0MB)
        100% |████████████████████████████████| 9.1MB 32kB/s
    Installing collected packages: pywin32, pypiwin32
    Successfully installed pypiwin32-223 pywin32-224
    

    Now the spider can be started normally:

    (Py3_spider) D:\SpiderProject\spider_pjt1>scrapy crawl jobbole
    ...
    2019-01-31 08:13:48 [scrapy.core.engine] INFO: Spider closed (finished)
    
    (Py3_spider) D:\SpiderProject\spider_pjt1>
    

    Back to the main topic:
    Our main.py code is as follows:

    # -*- coding: utf-8 -*-
    # @Author  : One Fine
    # @File    : main.py
    
    from scrapy.cmdline import execute
    import sys
    import os
    
    # Add the project path to sys.path so the scrapy command runs inside this project
    # os.path.abspath(__file__) returns the absolute path of the current file
    # os.path.dirname(os.path.abspath(__file__)) returns the directory containing this file, i.e. the project path
    
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    
    # Call execute() to run the command; it takes a list of command-line arguments
    execute(["scrapy", "crawl", "jobbole"])
    

    Next, set the ROBOTSTXT_OBEY parameter in settings.py to False, so that Scrapy does not obey the site's robots.txt while crawling and does not filter out URLs disallowed by the robots protocol.

    ROBOTSTXT_OBEY = False
    

    Now set a breakpoint inside the parse method of spider_pjt1\spider_pjt1\spiders\jobbole.py and you can debug the Scrapy project from main.py.

    Note: what F12 (the browser developer tools) shows is the structure after the page has fully loaded, which may differ from clicking "view page source"; the page source is the markup as returned by the HTTP request, which is what Scrapy actually downloads (view(response) in the shell shows exactly that).


    Below we use XPath to extract the page data:

    # Note: this is a method of JobboleSpider in jobbole.py and requires `import re` at the top of that file.
    def parse_detail(self, response):
        # get the title
        # //*[@id="post-112614"]/div[1]/h1/text() would also fetch the value inside the tag
        title = response.xpath('//*[@class="entry-header"]/h1/text()').extract()[0]
        # print('title', title)
        # re1_selector = response.xpath('//div[@class="entry_header"]/h1/text()')
        # get the publication date
        # to get the bare string, use extract()[0].strip().replace("·", "").strip()
        create_date = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace("·","").strip()
        # get the number of upvotes
        praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]
        # get the bookmark text; it contains both the count and the word '收藏'
        fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0].strip()
        match_re = re.match('.*?(\d+).*', fav_nums)
        if match_re:
            # get the bookmark count
            fav_nums = int(match_re.group(1))
        else:
            fav_nums = 0
        # get the number of comments
        comment_nums = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/a[2]/text()').extract()[0].strip()
        match_re = re.match('.*?(\d+).*', comment_nums)
        if match_re:
            # get the comment count
            comment_nums = int(match_re.group(1))
        else:
            comment_nums = 0
        # get the article's category tags
        tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
        tag_list = [element for element in tag_list if not element.strip().endswith('评论')]
        tag = ','.join(tag_list)
        content = response.xpath('//*[@class="entry"]').extract()[0]
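
    As a sketch of a natural continuation (not part of the original post), the extracted fields would typically be packed into a dict or a scrapy.Item and yielded from the callback:

        yield {
            'title': title,
            'create_date': create_date,
            'praise_nums': praise_nums,
            'fav_nums': fav_nums,
            'comment_nums': comment_nums,
            'tags': tag,
            'content': content,
        }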
    
