  • [Rollo's Python Journey] Learning Scrapy Selectors

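    Selectors

    When you scrape web pages, the most common task is extracting data from the HTML source. Several existing libraries can do this:

    • BeautifulSoup is a very popular web-parsing library among programmers. It builds a Python object from the structure of the HTML code and handles bad markup reasonably well, but it has one drawback: it is slow.
    • lxml is an XML parsing library (it also parses HTML) with a Pythonic API based on ElementTree. (lxml itself is not part of the Python standard library.)

    Scrapy has its own mechanism for extracting data. These objects are called selectors because they "select" certain parts of the HTML document through XPath or CSS expressions.

    Scrapy selectors are built on the lxml library, which means they are very similar to it in speed and parsing accuracy.

    A Scrapy selector is an instance of the Selector class, constructed from either text or a TextResponse object. It automatically chooses the best parsing rules (XML vs HTML) based on the input type: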

    >>> from scrapy.selector import Selector
    >>> from scrapy.http import HtmlResponse

    Use scrapy shell to open a page and practice with selectors (the PyCharm terminal works for this):

    scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

    HTML source:

    <html>
     <head>
      <base href='http://example.com/' />
      <title>Example website</title>
     </head>
     <body>
      <div id='images'>
       <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
       <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
       <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
       <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
       <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
      </div>
     </body>
    </html>

    1.1.1 Select the title with XPath. A selector query returns a list (a SelectorList).

    In [1]: response.selector.xpath('//title/text()')
    Out[1]: [<Selector xpath='//title/text()' data='Example website'>]
    
    In [2]: response.selector.xpath('//title/text()').extract
    Out[2]: <bound method SelectorList.getall of [<Selector xpath='//title/text()' data='Example website'>]>
    
    In [3]: response.selector.xpath('//title/text()').extract_first()
    Out[3]: 'Example website'

    Select the title with CSS.

    In [4]: response.selector.css('title::text')
    Out[4]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]
    
    In [5]: response.selector.css('title::text').extract
    Out[5]: <bound method SelectorList.getall of [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]>
    
    In [6]: response.selector.css('title::text').extract_first
    Out[6]: <bound method SelectorList.get of [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]>
    
    In [7]: response.selector.css('title::text').extract_first()
    Out[7]: 'Example website'

    Likewise, Scrapy gives the response a built-in selector attribute, so you can skip typing .selector and call response.xpath or response.css directly.

      Using XPath:

    In [8]: response.xpath('//title/text()')
    Out[8]: [<Selector xpath='//title/text()' data='Example website'>]
    
    In [9]: response.xpath('//title/text()').extract
    Out[9]: <bound method SelectorList.getall of [<Selector xpath='//title/text()' data='Example website'>]>
    
    In [10]: response.xpath('//title/text()').extract_first()
    Out[10]: 'Example website'

      Using CSS:

    In [11]: response.css('title::text')
    Out[11]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]
    
    In [12]: response.css('title::text').extract()
    Out[12]: ['Example website']
    
    In [13]: response.css('title::text').extract
    Out[13]: <bound method SelectorList.getall of [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]>

    Note: extract_first() accepts a default value that is returned when nothing matches: extract_first(default=...)

    1.1.2 Select text

    >>> response.xpath('//base/@href').extract()
    [u'http://example.com/']
    
    >>> response.css('base::attr(href)').extract()
    [u'http://example.com/']
    
    >>> response.xpath('//a[contains(@href, "image")]/@href').extract()
    [u'image1.html',
     u'image2.html',
     u'image3.html',
     u'image4.html',
     u'image5.html']
    
    >>> response.css('a[href*=image]::attr(href)').extract()
    [u'image1.html',
     u'image2.html',
     u'image3.html',
     u'image4.html',
     u'image5.html']
    
    >>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
    [u'image1_thumb.jpg',
     u'image2_thumb.jpg',
     u'image3_thumb.jpg',
     u'image4_thumb.jpg',
     u'image5_thumb.jpg']
    
    >>> response.css('a[href*=image] img::attr(src)').extract()
    [u'image1_thumb.jpg',
     u'image2_thumb.jpg',
     u'image3_thumb.jpg',
     u'image4_thumb.jpg',
     u'image5_thumb.jpg']

    1.1.3 You can extract with regular expressions (RE):

    In [18]: response.css('a::text').re('Name:(.*)')
    Out[18]: 
    [' My image 1 ',
     ' My image 2 ',
     ' My image 3 ',
     ' My image 4 ',
     ' My image 5 ']
    
    In [19]: response.css('a::text').re_first('Name:(.*)')
    Out[19]: ' My image 1 '
    
    In [21]: response.xpath('//a/text()')
    Out[21]: 
    [<Selector xpath='//a/text()' data='Name: My image 1 '>,
     <Selector xpath='//a/text()' data='Name: My image 2 '>,
     <Selector xpath='//a/text()' data='Name: My image 3 '>,
     <Selector xpath='//a/text()' data='Name: My image 4 '>,
     <Selector xpath='//a/text()' data='Name: My image 5 '>]
    
    In [22]: response.xpath('//a/text()').extract
    Out[22]: <bound method SelectorList.getall of [<Selector xpath='//a/text()' data='Name: My image 1 '>, <Selector xpath='//a/text()' data='Name: My image 2 '>, <Selector xpath='//a/text()' data='Name: My image 3 '>, <Selector xpath='//a/text()' data='Name: My image 4 '>, <Selector xpath='//a/text()' data='Name: My image 5 '>]>
    
    In [23]: response.xpath('//a/text()').extract()
    Out[23]: 
    ['Name: My image 1 ',
     'Name: My image 2 ',
     'Name: My image 3 ',
     'Name: My image 4 ',
     'Name: My image 5 ']
    
    In [26]: response.xpath('//a/text()').re('Name:(.*)')
    Out[26]: 
    [' My image 1 ',
     ' My image 2 ',
     ' My image 3 ',
     ' My image 4 ',
     ' My image 5 ']
    
    In [27]: response.xpath('//a/text()').re_first('Name:(.*)')
    Out[27]: ' My image 1 '

    1.1.4 You can extract attributes:

    # Extract URLs with XPath
    
    In [28]: response.xpath('//base/@href').extract()
    Out[28]: ['http://example.com/']
    
    In [29]: response.xpath('//a/@href').extract()
    Out[29]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
    
    # Extract URLs with CSS
    
    In [32]: response.css('a::attr(href)').extract_first()
    Out[32]: 'image1.html'
    
    In [33]: response.css('a::attr(href)').extract()
    Out[33]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
    
    # XPath: URLs whose href attribute contains "image"
    
    In [42]: response.xpath('//a[contains(@href,"image")]/@href')
    Out[42]: 
    [<Selector xpath='//a[contains(@href,"image")]/@href' data='image1.html'>,
     <Selector xpath='//a[contains(@href,"image")]/@href' data='image2.html'>,
     <Selector xpath='//a[contains(@href,"image")]/@href' data='image3.html'>,
     <Selector xpath='//a[contains(@href,"image")]/@href' data='image4.html'>,
     <Selector xpath='//a[contains(@href,"image")]/@href' data='image5.html'>]
    
    In [43]: response.xpath('//a[contains(@href,"image")]/@href').extract()
    Out[43]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
    
    In [44]: response.xpath('//a[contains(@href,"image")]/@href').extract_first()
    Out[44]: 'image1.html'
    
    # CSS: URLs whose href attribute contains "image"
    
    In [45]: response.css('a[href*=image]::attr(href)')
    Out[45]: 
    [<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image1.html'>,
     <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image2.html'>,
     <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image3.html'>,
     <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image4.html'>,
     <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image5.html'>]
    
    
    In [46]: response.css('a[href*=image]::attr(href)').extract()
    Out[46]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
    
    In [47]: response.css('a[href*=image]::attr(href)').extract_first()
    Out[47]: 'image1.html'
    
    In [48]: response.xpath('//a[contains(@href,"image")]/img/@src').extract()
    Out[48]: 
    ['image1_thumb.jpg',
     'image2_thumb.jpg',
     'image3_thumb.jpg',
     'image4_thumb.jpg',
     'image5_thumb.jpg']
    
    In [50]: response.css('a[href*=image] img::attr(src)').extract()
    Out[50]: 
    ['image1_thumb.jpg',
     'image2_thumb.jpg',
     'image3_thumb.jpg',
     'image4_thumb.jpg',
     'image5_thumb.jpg']

    CSS: use ::text to get text and ::attr(name) to get an attribute.

    XPath: use text() to get text and @name (the attribute name) to get an attribute.

    In [53]: response.css('a img::attr(src)').extract()
    Out[53]: 
    ['image1_thumb.jpg',
     'image2_thumb.jpg',
     'image3_thumb.jpg',
     'image4_thumb.jpg',
     'image5_thumb.jpg']
    
    
    In [55]: response.xpath('//a/img/@src').extract()
    Out[55]: 
    ['image1_thumb.jpg',
     'image2_thumb.jpg',
     'image3_thumb.jpg',
     'image4_thumb.jpg',
     'image5_thumb.jpg']
  • Original post: https://www.cnblogs.com/rollost/p/10917172.html