CSS选择器的作用实际和xpath的一样,都是为了定位具体的元素
举例我要爬取下面这个页面的标题
In [20]: title = response.css(".entry-header h1") In [21]: title Out[21]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' entry-header ')]/descendant-or-self::*/h1" data='<h1>谷歌用两年时间研究了 180 个团队,发现高效团队有这五个特征</h1>'>] In [22]: title = response.css(".entry-header h1").extract() In [23]: title Out[23]: ['<h1>谷歌用两年时间研究了 180 个团队,发现高效团队有这五个特征</h1>'] In [24]: ##可以使用css的::text取到内容 In [25]: title = response.css(".entry-header h1::text").extract() In [26]: title Out[26]: ['谷歌用两年时间研究了 180 个团队,发现高效团队有这五个特征']
获取文章创建日期:
In [38]: date_text = response.css(".entry-meta-hide-on-mobile").extract() In [39]: date_text Out[39]: ['<p class="entry-meta-hide-on-mobile"> 2017/08/23 · <a href="http://blog.jobbole.com/category/career/" rel="category tag">职场</a> · <a href="#article-comment"> 7 评论 </a> · <a href="http://blog.jobbole.com/tag/google/">Google</a>, <a href="http://blog.jobbole.com/tag/%e5%9b%a2%e9%98%9f/">团队</a> </p>'] In [40]: date_text = response.css(".entry-meta-hide-on-mobile::text").extract() In [41]: date_text Out[41]: [' 2017/08/23 · ', ' · ', ' · ', ', ', ' '] In [42]: date_text = response.css(".entry-meta-hide-on-mobile::text").extract()[ ...: 0] In [43]: date_text Out[43]: ' 2017/08/23 · ' In [44]: date_text = response.css(".entry-meta-hide-on-mobile::text").extract()[ ...: 0].strip() In [45]: date_text Out[45]: '2017/08/23 ·' In [46]: date_text = response.css(".entry-meta-hide-on-mobile::text").extract()[ ...: 0].strip().replace("·","").strip() In [47]: date_text Out[47]: '2017/08/23'
获取评论数
In [49]: comment_num = response.css("a[href='#article-comment']") In [50]: comment_num Out[50]: [<Selector xpath="descendant-or-self::a[@href = '#article-comment']" data='<a href="#article-comment"> 7 评论 </a>'>, <Selector xpath="descendant-or-self::a[@href = '#article-comment']" data='<a href="#article-comment"><span class="'>] In [51]: comment_num = response.css("a[href='#article-comment'] span::text").ext ...: ract() In [52]: comment_num Out[52]: [' 7 评论'] In [53]: comment_num = response.css("a[href='#article-comment'] span::text").ext ...: ract().strip() --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-53-18ae8761867f> in <module>() ----> 1 comment_num = response.css("a[href='#article-comment'] span::text").extract().strip() AttributeError: 'list' object has no attribute 'strip' In [54]: comment_num = response.css("a[href='#article-comment'] span::text").ext ...: ract()[0] In [55]: comment_num Out[55]: ' 7 评论' In [56]:
PS:css选择器里,不同标签使用空格隔开