  • Scrapy crawlers

    Basic commands

    Global commands (usable anywhere)

    • scrapy fetch (fetch a given web page directly)
    • scrapy runspider (run a spider; the spider does not have to belong to a project)
    • scrapy settings (show the settings)
    • scrapy shell (enter the interactive shell)
    D:\>scrapy shell
    2018-12-12 19:25:33 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
    2018-12-12 19:25:33 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j  20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.17134-SP0
    2018-12-12 19:25:33 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
    2018-12-12 19:25:33 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole']
    2018-12-12 19:25:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2018-12-12 19:25:33 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2018-12-12 19:25:33 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2018-12-12 19:25:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x000002A069AFB9B0>
    [s]   item       {}
    [s]   settings   <scrapy.settings.Settings object at 0x000002A069AFB940>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser
    In [1]: print("hehe")
    hehe
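
    The shell is most useful for testing selectors interactively. A minimal sketch of a typical session (example.com is only a placeholder URL, not from the original transcript):

    In [2]: fetch("http://example.com")                        # download the page and rebuild the local `response` object
    In [3]: response.status                                    # HTTP status code of the fetched page
    In [4]: response.xpath("//title/text()").extract_first()   # try an XPath expression against the response
    In [5]: response.css("a::attr(href)").extract()            # try a CSS selector; returns a list of strings
    In [6]: view(response)                                     # open the downloaded page in the default browser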
    
    • scrapy version (show the version)
    D:\>scrapy version
    Scrapy 1.5.1
    
    D:\>
    
    • scrapy startproject <project name> (create a project)
    D:\>scrapy startproject spiders
    New Scrapy project 'spiders', using template directory 'e:\development\python\lib\site-packages\scrapy\templates\project', created in:
        D:\spiders
    
    You can start your first spider with:
        cd spiders
        scrapy genspider example example.com
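
    For reference, startproject generates the standard project skeleton; for the 'spiders' project created above it looks roughly like this (file names are the Scrapy 1.5 defaults):

    spiders/
        scrapy.cfg            # deploy configuration
        spiders/              # the project's Python package
            __init__.py
            items.py          # item definitions
            middlewares.py    # spider / downloader middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/          # directory holding the spider modules
                __init__.py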
    
    • scrapy bench (benchmark this machine's crawling performance)
    D:\>scrapy bench
    2018-12-12 19:24:10 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
    2018-12-12 19:24:10 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j  20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.17134-SP0
    2018-12-12 19:24:10 [scrapy.crawler] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
    2018-12-12 19:24:11 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.closespider.CloseSpider',
     'scrapy.extensions.logstats.LogStats']
    2018-12-12 19:24:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2018-12-12 19:24:11 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2018-12-12 19:24:11 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2018-12-12 19:24:11 [scrapy.core.engine] INFO: Spider opened
    2018-12-12 19:24:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-12-12 19:24:12 [scrapy.extensions.logstats] INFO: Crawled 69 pages (at 4140 pages/min), scraped 0 items (at 0 items/min)
    2018-12-12 19:24:13 [scrapy.extensions.logstats] INFO: Crawled 150 pages (at 4860 pages/min), scraped 0 items (at 0 items/min)
    2018-12-12 19:24:14 [scrapy.extensions.logstats] INFO: Crawled 214 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
    2018-12-12 19:24:15 [scrapy.extensions.logstats] INFO: Crawled 278 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
    2018-12-12 19:24:16 [scrapy.extensions.logstats] INFO: Crawled 334 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
    2018-12-12 19:24:17 [scrapy.extensions.logstats] INFO: Crawled 382 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
    2018-12-12 19:24:18 [scrapy.extensions.logstats] INFO: Crawled 430 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
    2018-12-12 19:24:19 [scrapy.extensions.logstats] INFO: Crawled 478 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
    2018-12-12 19:24:20 [scrapy.extensions.logstats] INFO: Crawled 526 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
    2018-12-12 19:24:21 [scrapy.extensions.logstats] INFO: Crawled 574 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
    2018-12-12 19:24:21 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
    2018-12-12 19:24:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 263306,
     'downloader/request_count': 590,
     'downloader/request_method_count/GET': 590,
     'downloader/response_bytes': 1815754,
     'downloader/response_count': 590,
     'downloader/response_status_count/200': 590,
     'finish_reason': 'closespider_timeout',
     'finish_time': datetime.datetime(2018, 12, 12, 11, 24, 22, 225496),
     'log_count/INFO': 17,
     'request_depth_max': 20,
     'response_received_count': 590,
     'scheduler/dequeued': 590,
     'scheduler/dequeued/memory': 590,
     'scheduler/enqueued': 11801,
     'scheduler/enqueued/memory': 11801,
     'start_time': datetime.datetime(2018, 12, 12, 11, 24, 11, 368009)}
    2018-12-12 19:24:22 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)
    
    D:\>
    

    Project commands (only usable from inside a project directory)

    • scrapy list (list the project's existing spiders)
    D:\>cd he
    D:\he>scrapy list
    tianshan
    
    • scrapy genspider -l (list the available spider templates)
    D:\he>scrapy genspider -l
    Available templates:
      basic
      crawl
      csvfeed
      xmlfeed
    
    • scrapy genspider -t <template> <spider name> <domain> (create a spider; note that you must be inside a Scrapy project)
    D:\>cd spiders
    
    D:\spiders>scrapy genspider -t basic bd baidu.com
    Created spider 'bd' using template 'basic' in module:
      spiders.spiders.bd
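
    The generated bd.py (the 'basic' template) looks roughly like this — a skeleton you then fill in:

    # -*- coding: utf-8 -*-
    import scrapy


    class BdSpider(scrapy.Spider):
        name = 'bd'                         # the name used with `scrapy crawl bd`
        allowed_domains = ['baidu.com']     # off-site requests are filtered out
        start_urls = ['http://baidu.com/']  # initial URLs handed to the scheduler

        def parse(self, response):
            pass                            # parsing / extraction code goes here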
    
    • scrapy crawl <spider name> (run that spider)
    D:\spiders>scrapy crawl bd
    2018-12-12 19:21:12 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: spiders)
    2018-12-12 19:21:12 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j  20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.17134-SP0
    2018-12-12 19:21:12 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'spiders', 'NEWSPIDER_MODULE': 'spiders.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['spiders.spiders']}
    2018-12-12 19:21:12 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.logstats.LogStats']
    2018-12-12 19:21:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2018-12-12 19:21:13 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2018-12-12 19:21:13 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2018-12-12 19:21:13 [scrapy.core.engine] INFO: Spider opened
    2018-12-12 19:21:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-12-12 19:21:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2018-12-12 19:21:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://baidu.com/robots.txt> (failed 1 times): DNS lookup failed: no results for hostname lookup: baidu.com.
    2018-12-12 19:21:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://baidu.com/robots.txt> (failed 2 times): DNS lookup failed: no results for hostname lookup: baidu.com.
    2018-12-12 19:21:13 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://baidu.com/robots.txt> (failed 3 times): DNS lookup failed: no results for hostname lookup: baidu.com.
    2018-12-12 19:21:13 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://baidu.com/robots.txt>: DNS lookup failed: no results for hostname lookup: baidu.com.
    Traceback (most recent call last):
      File "e:developmentpythonlibsite-packages	wistedinternetdefer.py", line 1416, in _inlineCallbacks
        result = result.throwExceptionIntoGenerator(g)
      File "e:developmentpythonlibsite-packages	wistedpythonfailure.py", line 491, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "e:developmentpythonlibsite-packagesscrapycoredownloadermiddleware.py", line 43, in process_request
        defer.returnValue((yield download_func(request=request,spider=spider)))
      File "e:developmentpythonlibsite-packages	wistedinternetdefer.py", line 654, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "e:developmentpythonlibsite-packages	wistedinternetendpoints.py", line 975, in startConnectionAttempts
        "no results for hostname lookup: {}".format(self._hostStr)
    twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: baidu.com.
    2018-12-12 19:21:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://baidu.com/> (failed 1 times): DNS lookup failed: no results for hostname lookup: baidu.com.
    2018-12-12 19:21:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://baidu.com/> (failed 2 times): DNS lookup failed: no results for hostname lookup: baidu.com.
    2018-12-12 19:21:13 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://baidu.com/> (failed 3 times): DNS lookup failed: no results for hostname lookup: baidu.com.
    2018-12-12 19:21:13 [scrapy.core.scraper] ERROR: Error downloading <GET http://baidu.com/>
    Traceback (most recent call last):
      File "e:developmentpythonlibsite-packages	wistedinternetdefer.py", line 1416, in _inlineCallbacks
        result = result.throwExceptionIntoGenerator(g)
      File "e:developmentpythonlibsite-packages	wistedpythonfailure.py", line 491, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "e:developmentpythonlibsite-packagesscrapycoredownloadermiddleware.py", line 43, in process_request
        defer.returnValue((yield download_func(request=request,spider=spider)))
      File "e:developmentpythonlibsite-packages	wistedinternetdefer.py", line 654, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "e:developmentpythonlibsite-packages	wistedinternetendpoints.py", line 975, in startConnectionAttempts
        "no results for hostname lookup: {}".format(self._hostStr)
    twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: baidu.com.
    2018-12-12 19:21:13 [scrapy.core.engine] INFO: Closing spider (finished)
    2018-12-12 19:21:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 6,
     'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 6,
     'downloader/request_bytes': 1278,
     'downloader/request_count': 6,
     'downloader/request_method_count/GET': 6,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2018, 12, 12, 11, 21, 13, 651125),
     'log_count/DEBUG': 7,
     'log_count/ERROR': 2,
     'log_count/INFO': 7,
     'retry/count': 4,
     'retry/max_reached': 2,
     'retry/reason_count/twisted.internet.error.DNSLookupError': 4,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'start_time': datetime.datetime(2018, 12, 12, 11, 21, 13, 187234)}
    2018-12-12 19:21:13 [scrapy.core.engine] INFO: Spider closed (finished)
    
    D:\spiders>
    
    • scrapy edit <spider name> (open a spider's code directly in the editor)

    Main Scrapy files

    items.py

    Defines the data fields to be scraped
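
    A minimal items.py sketch (the field names title and link are only placeholders, not from the original post):

    import scrapy


    class ExampleItem(scrapy.Item):
        # one Field per piece of data you want to scrape
        title = scrapy.Field()
        link = scrapy.Field()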

    spider.py

    Parses the pages, extracts the data, returns items to the pipelines, and returns URLs to the scheduler
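
    A minimal spider sketch covering both sides of that responsibility: it yields items to the pipelines and yields new requests back to the scheduler (the project name, ExampleItem, and the selectors are placeholders):

    import scrapy
    from myproject.items import ExampleItem   # hypothetical project / item names


    class ExampleSpider(scrapy.Spider):
        name = 'example'
        start_urls = ['http://example.com/']

        def parse(self, response):
            # extract data and hand items to the pipelines
            for row in response.css('div.entry'):
                item = ExampleItem()
                item['title'] = row.css('h2::text').extract_first()
                item['link'] = row.css('a::attr(href)').extract_first()
                yield item

            # hand follow-up URLs back to the scheduler
            next_page = response.css('a.next::attr(href)').extract_first()
            if next_page:
                yield response.follow(next_page, callback=self.parse)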

    pipelines.py

    Post-processing after crawling; stores the scraped data
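
    A minimal pipeline sketch that stores each item as one line of JSON (the class and file names are arbitrary):

    import json


    class ExamplePipeline(object):
        def open_spider(self, spider):
            self.file = open('items.jl', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
            return item    # always return the item so later pipelines can run

        def close_spider(self, spider):
            self.file.close()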

    settings.py

    The project settings file
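
    The settings most relevant to the workflow below are pipeline registration and the robots.txt switch; a typical fragment (the class path is a placeholder):

    # settings.py
    ROBOTSTXT_OBEY = True    # default in new projects; set to False to ignore robots.txt

    ITEM_PIPELINES = {
        # uncommented and pointed at the real pipeline class; lower number = runs earlier
        'myproject.pipelines.ExamplePipeline': 300,
    }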

    Steps for writing a focused crawler

    Writing the item

    First define the data fields to be scraped in the item

    Writing the spider

    • Import the item class and instantiate it
    • Extract the data from the page and store it in the item
    • Return the item to the item pipeline

    Writing the settings

    • Uncomment the ITEM_PIPELINES setting in settings.py and change it to point at the real pipeline class name

    Writing the pipeline

    Stores the scraped data

  • Original article: https://www.cnblogs.com/c-aha/p/10110438.html