  • Using re with XPath in Scrapy

    Method 1:

    Example: here I crawl the site "http://www.simple-style.com/page/1".

    $ scrapy shell http://www.simple-style.com/page/1

    Once in the interactive shell, I want to find every src attribute on the current page:

    >>> response.xpath('//@src').extract()
    ['http://www.simple-style.com/wp-includes/js/jquery/jquery.js?ver=1.12.4',
     'http://www.simple-style.com/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.4.1',
     'http://www.simple-style.com/wp-content/plugins/to-top/public/js/to-top-public.js?ver=1.0',
     'http://www.simple-style.com/wp-content/uploads/2017/03/simple-logo.gif',
     '//v.qq.com/iframe/player.html?vid=e0386mjreck&tiny=0&auto=0',
     'http://www.simple-style.com/wp-content/uploads/2017/03/END_OF_LOVE_MICHAL_NAROZNY_001.jpg',
     'http://www.simple-style.com/wp-content/uploads/2017/03/ali_bosworth_01.jpg',
     'http://www.simple-style.com/wp-content/uploads/2017/03/xiaoxuan_01.jpg',
     'http://www.simple-style.com/wp-content/uploads/2017/03/the_warehouse_hotel_01.jpg',
     'http://www.simple-style.com/wp-content/uploads/2017/02/ahndraya_parlato_01.jpg',
     'http://www.simple-style.com/wp-content/uploads/2016/07/inner_self_04.jpg',
     'http://www.simple-style.com/wp-content/uploads/2016/07/Yuanghua-Chen-01.jpg',
     'http://www.simple-style.com/wp-content/uploads/2016/07/01-alicephoebelou.jpg',
     'http://www.simple-style.com/wp-content/uploads/2016/06/02-Tim_Gao_Photography_Invisible_Theatre_17.jpg',
     'http://www.simple-style.com/wp-content/uploads/2016/05/4.png',
     'http://www.simple-style.com/wp-content/uploads/2016/05/01-Remona.jpg',
     'http://www.simple-style.com/wp-content/uploads/2016/05/Nbr-h000-1.jpg',
     'http://www.simple-style.com/wp-content/uploads/2016/04/0501.jpg',
     'http://www.simple-style.com/wp-content/uploads/2016/04/01.jpg',
     'http://www.simple-style.com/wp-content/plugins/smartideo/static/smartideo.js?ver=2.2.5',
     'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/skip-link-focus-fix.js?ver=1.0',
     'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/navigation.js?ver=1.0',
     'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/global.js?ver=1.0',
     'http://www.simple-style.com/wp-content/themes/twentyseventeen/assets/js/jquery.scrollTo.js?ver=2.1.2',
     'http://www.simple-style.com/wp-includes/js/wp-embed.min.js?ver=4.7.3']

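    That returns a lot of src values. To keep only the JPGs uploaded under "/2017/03", use a regular expression.

    Call re() directly on the object returned by xpath(), without extract(). re() itself returns a list of strings. If you call extract() first you get a plain Python list, which has no re() method, so the call raises an error.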

    >>> response.xpath('//@src').re(r'.*/2017/03/.*\.jpg')
    ['http://www.simple-style.com/wp-content/uploads/2017/03/END_OF_LOVE_MICHAL_NAROZNY_001.jpg',
     'http://www.simple-style.com/wp-content/uploads/2017/03/ali_bosworth_01.jpg',
     'http://www.simple-style.com/wp-content/uploads/2017/03/xiaoxuan_01.jpg',
     'http://www.simple-style.com/wp-content/uploads/2017/03/the_warehouse_hotel_01.jpg']
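
    In a pattern like this one, a "." placed before "jpg" only means a literal dot when it is escaped; unescaped, it matches any character. The sketch below uses only the standard re module to show the difference. The helper re_all is a hypothetical, rough stand-in for what re() does on a list of selectors, and the "previewjpg" URL is invented for illustration.

```python
import re

srcs = [
    'http://www.simple-style.com/wp-content/uploads/2017/03/xiaoxuan_01.jpg',
    'http://www.simple-style.com/wp-content/uploads/2017/03/previewjpg',  # hypothetical URL
    'http://www.simple-style.com/wp-content/uploads/2016/05/4.png',
]

def re_all(pattern, strings):
    """Hypothetical helper: collect every regex match from every string."""
    return [m for s in strings for m in re.findall(pattern, s)]

# Unescaped dot: any character may precede "jpg", so "previewjpg" slips through
loose = re_all(r'.*/2017/03/.*.jpg', srcs)

# Escaped dot: only a real ".jpg" suffix matches
strict = re_all(r'.*/2017/03/.*\.jpg', srcs)

print(len(loose), len(strict))
```

    On the page in this example both patterns happen to give the same result, but the escaped form is the safer habit.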

    Method 2:

    # Regex selector: re:test() inside an XPath predicate
    from scrapy.selector import Selector
    from scrapy.http import HtmlResponse

    html = """<!DOCTYPE html>
    <html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <li class="item-"><a href="link.html">first item</a></li>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
    </body>
    </html>
    """
    # Build a response by hand so the example runs without any network access
    response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
    # Raw string so that \d reaches the regex engine unchanged;
    # \d* also accepts zero digits, so class "item-" matches as well
    ret = Selector(response=response).xpath(r'//li[re:test(@class, "item-\d*")]//@href').extract()
    print(ret)  # ['link.html', 'link1.html', 'link2.html']
  • Original post: https://www.cnblogs.com/Garvey/p/6697162.html