  • [Python crawler] Scrapy basics 5: applying a regex after xpath() and similar selectors

    Suppose we want to debug a web page, for example https://g.widora.cn/

    The Scrapy shell does not require a project environment:

    scrapy shell https://g.widora.cn/

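    This is similar to pressing F12 on a page in the browser: the available objects are all listed, and response is the one used most often.

    (earlier output omitted)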
    
    2020-05-08 21:07:18 [asyncio] DEBUG: Using selector: KqueueSelector
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x1118626d0>
    [s]   item       {}
    [s]   request    <GET https://g.widora.cn/>
    [s]   response   <200 https://g.widora.cn/>
    [s]   settings   <scrapy.settings.Settings object at 0x111bd7890>
    [s]   spider     <DefaultSpider 'default' at 0x112103250>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser
    2020-05-08 21:07:18 [asyncio] DEBUG: Using selector: KqueueSelector

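    Finding a group number: the selector lists returned by xpath() and similar methods support .re(), but the values returned by extract(), get(), etc. are plain strings (or lists of strings) and do not.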

    In [1]: response.xpath("/html/body/div/div[5]/p/a").extract()                   
    
    Out[1]: ['<a target="_blank" href="//shang.qq.com/wpa/qunwpa?idkey=f65cb90612db81ef9bee771440adb40c004933a18b7c0466a279486936aedc79" src="title=" style="color:#00a1d6">G.widora.cn 群(1031687050)</a>']
    
    In [2]: response.xpath("/html/body/div/div[5]/p/a/text()").extract()            
    
    Out[2]: ['G.widora.cn 群(1031687050)']
    
    In [3]: response.xpath("/html/body/div/div[5]/p/a/text()")                      
    
    Out[3]: [<Selector xpath='/html/body/div/div[5]/p/a/text()' data='G.widora.cn 群(1031687050)'>]
    
    In [4]: response.xpath("/html/body/div/div[5]/p/a/text()").re(r'\d+')
    
    Out[4]: ['1031687050']
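
If the text has already been pulled out with extract() or get(), the result is an ordinary Python list or string, so the regex has to go through the standard re module instead. The following is a minimal sketch using only the standard library. The `extracted` list is a hard-coded copy of Out[2] rather than a live response, and the variable names are illustrative assumptions.

```python
import re

# Hard-coded copy of what .extract() returned in Out[2]; no network needed.
extracted = ['G.widora.cn 群(1031687050)']

# A plain list has no .re() method, so run re.findall on each string instead.
digits = [match for text in extracted for match in re.findall(r'\d+', text)]
print(digits)

# The backslash matters: 'd+' matches the literal letter "d", not digits.
letters = re.findall(r'd+', extracted[0])
print(letters)
```

Writing the pattern as a raw string (r'\d+') keeps the backslash from being interpreted by Python before the regex engine sees it.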

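    Typing these expressions in the terminal is tedious; it is easier to get the XPath working in the browser first and then write the code.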

     

  • Original article: https://www.cnblogs.com/hightech/p/12853158.html