  • LinkExtractor

    Launch scrapy shell to send a request to a page:

    wljdeMacBook-Pro:~ wlj$ scrapy shell "http://www.bjrbj.gov.cn/mzhd/detail_29974.htm"
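
    Once the shell is open, you can also load a different URL without restarting it, using the fetch() shortcut the shell provides (it appears in the shortcut list of the demo output below), e.g.:

    >>> fetch("http://www.bjrbj.gov.cn/mzhd/detail_29974.htm")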

    Inspecting the response:

    response.body    # raw response content, as bytes
    response.text    # response content decoded to str
    response.url     # URL of the response
    >>> response.url
    'http://www.bjrbj.gov.cn/mzhd/detail_29974.htm'
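
    A quick type check in the shell shows the difference between the two content attributes: body is raw bytes, text is the decoded string:

    >>> type(response.body), type(response.text)
    (<class 'bytes'>, <class 'str'>)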

    Import LinkExtractor, which extracts matching links from the entire HTML document:

    from scrapy.linkextractors import LinkExtractor

    
    
    >>> from scrapy.linkextractors import LinkExtractor
    >>> response.xpath('//div[@class="xx_neirong"]/h1/text()').extract()[0]
    '北京社保开户流程是怎么个流程'
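
    Indexing into extract() raises IndexError when nothing matches; extract_first() returns None instead, which is safer (newer Scrapy versions also offer .get() as an alias):

    >>> response.xpath('//div[@class="xx_neirong"]/h1/text()').extract_first()
    '北京社保开户流程是怎么个流程'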


    Demo: extracting pagination links from the Tencent HR listing page
    wljdeMacBook-Pro:Desktop wlj$ scrapy shell "http://hr.tencent.com/position.php?"
    2018-06-21 21:12:40 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
    2018-06-21 21:12:40 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.5 (default, Apr 25 2018, 14:23:58) - [GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
    2018-06-21 21:12:40 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
    2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.memusage.MemoryUsage']
    2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2018-06-21 21:12:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2018-06-21 21:12:40 [scrapy.core.engine] INFO: Spider opened
    2018-06-21 21:12:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://hr.tencent.com/position.php> from <GET http://hr.tencent.com/position.php>
    2018-06-21 21:12:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://hr.tencent.com/position.php> (referer: None)
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x107617c18>
    [s]   item       {}
    [s]   request    <GET http://hr.tencent.com/position.php>
    [s]   response   <200 https://hr.tencent.com/position.php>
    [s]   settings   <scrapy.settings.Settings object at 0x10840e748>
    [s]   spider     <DefaultSpider 'default' at 0x1086c6ba8>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser
    >>> response.url
    'https://hr.tencent.com/position.php'
    >>> from scrapy.linkextractors import LinkExtractor
    >>> link_list = LinkExtractor(allow=(r"start=\d+",))
    >>> link_list.extract_links(response)
    [Link(url='https://hr.tencent.com/position.php?&start=10#a', text='2', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=20#a', text='3', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=30#a', text='4', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=40#a', text='5', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=50#a', text='6', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=60#a', text='7', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=70#a', text='...', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=3800#a', text='381', fragment='', nofollow=False)]
    >>>
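
    The Link objects above carry each matched URL and its anchor text. In a real project, a LinkExtractor is usually not called by hand like this but wired into a CrawlSpider through Rule objects. Below is a minimal sketch of that pattern; the spider name and the parse_item logic are illustrative assumptions, not taken from the transcript:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class TencentSpider(CrawlSpider):
        name = 'tencent'                      # illustrative name
        allowed_domains = ['hr.tencent.com']
        start_urls = ['https://hr.tencent.com/position.php']

        rules = (
            # Follow every pagination link matching the regex and pass
            # each downloaded page to parse_item
            Rule(LinkExtractor(allow=(r'start=\d+',)), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # Placeholder extraction: record which listing page was crawled
            yield {'page_url': response.url}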

