zoukankan      html  css  js  c++  java
  • 第七篇 css选择器实现字段解析

    CSS选择器的作用实际和xpath的一样,都是为了定位具体的元素 

    举例我要爬取下面这个页面的标题

    In [20]: title = response.css(".entry-header h1")
    
    In [21]: title
    Out[21]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' entry-header ')]/descendant-or-self::*/h1" data='<h1>谷歌用两年时间研究了 180 个团队,发现高效团队有这五个特征</h1>'>]
    
    In [22]: title = response.css(".entry-header h1").extract()
    
    In [23]: title
    Out[23]: ['<h1>谷歌用两年时间研究了 180 个团队,发现高效团队有这五个特征</h1>']
    
    In [24]: ##可以使用css的::text取到内容
    
    In [25]: title = response.css(".entry-header h1::text").extract()
    
    In [26]: title
    Out[26]: ['谷歌用两年时间研究了 180 个团队,发现高效团队有这五个特征']

    获取文章创建日期:

    In [38]: date_text = response.css(".entry-meta-hide-on-mobile").extract()
    
    In [39]: date_text
    Out[39]: ['<p class="entry-meta-hide-on-mobile">
    
                2017/08/23 ·  <a href="http://blog.jobbole.com/category/career/" rel="category tag">职场</a>
                
                                · <a href="#article-comment"> 7 评论 </a>
                
    
                
                 ·  <a href="http://blog.jobbole.com/tag/google/">Google</a>, <a href="http://blog.jobbole.com/tag/%e5%9b%a2%e9%98%9f/">团队</a>
                
    </p>']
    
    In [40]: date_text = response.css(".entry-meta-hide-on-mobile::text").extract()
    
    In [41]: date_text
    Out[41]: 
    ['
    
                2017/08/23 ·  ',
     '
                
                                · ',
     '
                
    
                
                 ·  ',
     ', ',
     '
                
    ']
    
    In [42]: date_text = response.css(".entry-meta-hide-on-mobile::text").extract()[
        ...: 0]
    
    In [43]: date_text
    Out[43]: '
    
                2017/08/23 ·  '
    
    In [44]: date_text = response.css(".entry-meta-hide-on-mobile::text").extract()[
        ...: 0].strip()
    
    In [45]: date_text
    Out[45]: '2017/08/23 ·'
    
    In [46]: date_text = response.css(".entry-meta-hide-on-mobile::text").extract()[
        ...: 0].strip().replace("·","").strip()
    
    In [47]: date_text
    Out[47]: '2017/08/23'

     获取评论数

    In [49]: comment_num = response.css("a[href='#article-comment']")
    
    In [50]: comment_num
    Out[50]: 
    [<Selector xpath="descendant-or-self::a[@href = '#article-comment']" data='<a href="#article-comment"> 7 评论 </a>'>,
     <Selector xpath="descendant-or-self::a[@href = '#article-comment']" data='<a href="#article-comment"><span class="'>]
    
    In [51]: comment_num = response.css("a[href='#article-comment'] span::text").ext
        ...: ract()
    
    In [52]: comment_num
    Out[52]: [' 7 评论']
    
    In [53]: comment_num = response.css("a[href='#article-comment'] span::text").ext
        ...: ract().strip()
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-53-18ae8761867f> in <module>()
    ----> 1 comment_num = response.css("a[href='#article-comment'] span::text").extract().strip()
    
    AttributeError: 'list' object has no attribute 'strip'
    
    In [54]: comment_num = response.css("a[href='#article-comment'] span::text").ext
        ...: ract()[0]
    
    In [55]: comment_num
    Out[55]: ' 7 评论'
    
    In [56]: 
    View Code

     PS:css选择器里,不同标签使用空格隔开

  • 相关阅读:
    python excel导入到数据库
    ubuntu14.04修改mysql默认编码
    python 向MySQL里插入中文数据
    hbase框架原理
    hive框架原理
    Hadoop的MapReduce模型基本原理
    机器学习模型效果评价
    spark架构原理
    Hadoop架构原理
    特征工程
  • 原文地址:https://www.cnblogs.com/laonicc/p/7581463.html
Copyright © 2011-2022 走看看