  • Extracting data with Selector, part 2: CSS selectors

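    CSS selectors

    CSS stands for Cascading Style Sheets; its selectors are a syntax for locating parts of an HTML document.
    CSS selector syntax is somewhat simpler than XPath, but also less powerful. In fact, when we call a Selector's css() method, the Python library cssselect translates the CSS selector expression into an XPath expression internally, and the Selector object's xpath() method is then called with it.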

    Skipping the basic syntax, we go straight to the shell debugging tool that Scrapy provides, using the page http://blog.jobbole.com/114638/ as an example:

    (Py3_spider) D:\SpiderProject\spider_pjt1>scrapy shell http://blog.jobbole.com/114638/
    ...
    >>>
    >>> response.css('.entry-header h1')
    [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' entry-header ')]/descendant-or-self::*/h1" data='<h1>能从远程获得乐趣的 Linux 命令</h1>'>]
    >>> response.css('.entry-header h1').extract()
    ['<h1>能从远程获得乐趣的 Linux 命令</h1>']
    >>> response.css('.entry-header h1::text').extract()
    ['能从远程获得乐趣的 Linux 命令']
    >>> response.css('.entry-header h1::text').extract()[0]
    '能从远程获得乐趣的 Linux 命令'
    >>>
    
    Extracting the publication date:

    Using a regular expression

    >>> response.css('p.entry-meta-hide-on-mobile::text').extract()[0]
    '
    
                2019/01/13 ·  '
    >>> response.css('p.entry-meta-hide-on-mobile::text').re(r'\d{4}/\d{2}/\d{2}')[0]
    '2019/01/13'
    >>>
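
The date pattern relies on `\d`, which matches a single digit; without the backslash, `d` matches only the literal letter. Here is a minimal standard-library sketch of the same match, run on a whitespace-padded sample string modeled on the output above. It assumes that, for a pattern without groups, the selector's `.re()` behaves much like `re.findall`.

```python
import re

# Sample text modeled on what ::text returns for the metadata paragraph.
raw = '\n\n                2019/01/13 ·  '

# A raw string keeps the backslashes intact for the regex engine.
dates = re.findall(r'\d{4}/\d{2}/\d{2}', raw)
print(dates[0])  # prints 2019/01/13
```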
    
    Next, extracting the number of upvotes:
    >>> response.css('[id="114638votetotal"]')
    [<Selector xpath="descendant-or-self::*[@id = '114638votetotal']" data='<h10 id="114638votetotal">1</h10>'>]
    >>> response.css('[id="114638votetotal"]').extract()[0]
    '<h10 id="114638votetotal">1</h10>'
    >>> response.css('[id="114638votetotal"]::text').extract()[0]
    '1'
    >>>
    ...or alternatively:
    >>> response.css('.vote-post-up h10::text').extract()[0]
    '1'
    >>>
    
    Extracting the number of bookmarks:
    >>> response.css('span[data-book-type]::text').extract()[0]
    ' 1 收藏'
    >>> response.css('span[data-book-type]::text').re(r"\d+")[0]
    '1'
    >>>
    ...or alternatively:
    >>> response.css('span.bookmark-btn::text').re(r"\d+")[0]
    '1'
    >>>
    
    Extracting the number of comments:
    >>> response.css('a[href="#article-comment"] span::text').re(r"\d*")[0]
    ''
    >>> response.css('a[href="#article-comment"] span::text').extract()[0]
    '  评论'
    >>>
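
The empty result above is expected: the extracted text contains no number, and a `*` quantifier is allowed to match zero characters, so the first match is an empty string. The following is a minimal standard-library sketch of a helper that turns such text into an integer and falls back to 0 when no digits are present; the name `parse_count` is hypothetical and not part of Scrapy.

```python
import re

def parse_count(text: str, default: int = 0) -> int:
    """Return the first run of digits in text as an int, else default."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else default

print(parse_count(' 1 收藏'))  # bookmark text from the page -> 1
print(parse_count('  评论'))   # comment text without a number -> 0
```

Using `\d+` rather than `\d*` means a match is found only when at least one digit is present, which makes the missing-count case explicit.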
    
    Extracting the article body:
    >>> response.css('div.entry').extract()[0]
    ... (output omitted because it is too long)
    >>>
    
    Extracting the article's category:
    >>> response.css('.entry-meta a[rel]::text').extract()[0]
    'IT技术'
    >>>
    
  • Original article: https://www.cnblogs.com/onefine/p/10499361.html