  • scrapy -- distributed crawling

    14.3 Distributed Crawling with scrapy-redis
    Having seen how scrapy-redis works, we now learn to run a distributed crawl with scrapy + scrapy-redis.
    14.3.1 Setting Up the Environment
    First we set up the environment for the scrapy-redis distributed crawler. We have three Linux hosts:

    Cloud server (A): 116.29.35.201 (Redis Server)
    Cloud server (B): 123.59.45.155
    Local machine (C): 1.13.41.127

    Install scrapy and scrapy-redis on all three hosts:
    $ pip install scrapy
    $ pip install scrapy-redis
    Choose one of the cloud servers to host a Redis database shared by all the crawlers. The steps are as follows:
    Step 01: Install redis-server on the cloud server.
    Step 02: In the Redis configuration file, change the server's bind address so the database can be reached by all the crawlers.
    Step 03: Start (or restart) the Redis server.

    Log in to cloud server (A) and carry out these steps in bash:
    116.29.35.201$ sudo apt-get install redis-server
    116.29.35.201$ sudo vi /etc/redis/redis.conf
    ...
    # bind 127.0.0.1
    bind 0.0.0.0
    ...
    116.29.35.201$ sudo service redis-server restart
    Finally, verify from each of the three hosts that the Redis database on cloud server (A) is reachable:
    $ redis-cli -h 116.29.35.201 ping
    PONG
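    The same check can be made from Python with the redis package, which the crawlers will rely on via scrapy-redis anyway (a quick sketch; the host address is cloud server (A) above):

    import redis

    # ping the shared Redis server; ping() returns True when it is reachable
    r = redis.StrictRedis(host='116.29.35.201', port=6379)
    print(r.ping())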
    The Scrapy distributed crawling environment is now in place.
    14.3.2 Hands-On Project
    The core topic of this chapter is distributed crawling, so this project will not rehash familiar steps such as analyzing pages and writing the Spider. We can pick any project from an earlier chapter and convert it into a distributed version; here we use the toscrape_book project from Chapter 8 (which scrapes book information from books.toscrape.com) as the example.
    Copy the toscrape_book project to create the new project toscrape_book_distributed:
    $ cp -r toscrape_book toscrape_book_distributed
    $ cd toscrape_book_distributed
    Add the scrapy-redis settings to the project's settings.py:
    # Required options
    # =================================================================
    # The Redis database shared by all the crawlers
    # (on cloud server 116.29.35.201)
    REDIS_URL = 'redis://116.29.35.201:6379'

    # Replace Scrapy's own scheduler with the scrapy_redis scheduler
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    # Use scrapy_redis's RFPDupeFilter as the duplicate filter
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    # Enable scrapy_redis's RedisPipeline to collect the scraped items
    # in the Redis database
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 300,
    }

    # Optional options
    # =================================================================
    # Keep or clear the request queue and the dedup set in Redis after
    # the crawler stops (True: keep, False: clear; default is False)
    SCHEDULER_PERSIST = True
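    A couple more scrapy-redis settings are worth knowing. The sketch below shows what I understand to be their default values; verify against the defaults of your installed scrapy-redis version before relying on them:

    # Key of the Redis list that RedisSpider reads start URLs from
    # (the pattern expands to '<spider_name>:start_urls')
    REDIS_START_URLS_KEY = '%(name)s:start_urls'

    # Queue class used by the scheduler; scrapy_redis.queue also
    # provides FifoQueue and LifoQueue as alternatives
    SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'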
    Converting the single-machine Spider into a distributed Spider takes only the following small changes:
    from scrapy_redis.spiders import RedisSpider

    # 1. Change the base class
    # class BooksSpider(spider.Spider):
    class BooksSpider(RedisSpider):
        ...

    # 2. Comment out start_urls
    #start_urls = ['http://books.toscrape.com/']
        ...
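    Putting the changes together, the converted Spider looks roughly like this (a sketch only: the parse callbacks are exactly the ones written for the Chapter 8 project and are elided here):

    from scrapy_redis.spiders import RedisSpider

    class BooksSpider(RedisSpider):
        name = 'books'

        # no start_urls attribute: start URLs are read from the
        # Redis list 'books:start_urls' instead

        def parse(self, response):
            # same page-parsing code as in the single-machine version
            ...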
    These changes address the question of how to set the start URLs when there are multiple crawlers. The explanation:
    ● In a distributed crawl, every host runs identical code. If we kept the single-machine Spider, the Spider on every host would define its start URLs via the start_urls attribute, and the Request objects built from those start URLs would have dont_filter set to True, bypassing the duplicate filter. As many duplicate requests as there are crawlers would therefore be forced into the Redis request queue, which could lead to duplicate data being scraped.
    ● To solve this, scrapy-redis provides a new Spider base class, RedisSpider. RedisSpider overrides the start_requests method to fetch start URLs from a specific Redis list and build Request objects from them (with dont_filter=False). The key of that list can be set in the configuration file (REDIS_START_URLS_KEY) and defaults to <spider_name>:start_urls. In a distributed crawl, after starting all the crawlers, the user manually pushes the start URLs onto this list with a Redis command; only one of the crawlers obtains each start URL, so the corresponding request exists exactly once and no duplicates arise. A simplified sketch of this mechanism follows this list.
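    Conceptually, what RedisSpider does when it needs start URLs is close to the following (a simplified illustration of the idea, not the library's actual implementation):

    import redis

    r = redis.StrictRedis(host='116.29.35.201', port=6379)

    # Pop one start URL from the shared list. lpop is atomic, so even
    # with several crawlers polling the same list, each URL is handed
    # to exactly one of them.
    data = r.lpop('books:start_urls')
    if data is not None:
        url = data.decode('utf-8')
        # the spider then yields a Request for this url with
        # dont_filter=False, so it passes the shared duplicate filter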
    At this point the distributed version of the project is complete. Distribute the code to each host:

    $ scp -r toscrape_book_distributed liushuo@116.29.35.201:~/scrapy_b
    $ scp -r toscrape_book_distributed liushuo@123.59.45.155:~/scrapy_b
    Run the crawler on each of the three hosts with the same command:
    $ scrapy crawl books
    2017-05-14 17:56:42 [scrapy.utils.log] INFO: Scrapy 1.3.3 start
    2017-05-14 17:56:42 [scrapy.utils.log] INFO: Overridden setting
    'scrapy_redis.dupefilter.RFPDupeFilter', 'FEED_EXPORT_FIELDS': ['up
    'review_rating', 'review_num'], 'SCHEDULER': 'scrapy_redis.schedule
    'toscrape_book', 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'toscr
    'SPIDER_MODULES': ['toscrape_book.spiders']}
    2017-05-14 17:56:42 [scrapy.middleware] INFO: Enabled extension
    ['scrapy.extensions.logstats.LogStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.corestats.CoreStats']
    2017-05-14 17:56:42 [books] INFO: Reading start URLs from redis
    16, encoding: utf-8
    2017-05-14 17:56:42 [scrapy.middleware] INFO: Enabled downloade
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutM
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMid
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionM
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2017-05-14 17:56:42 [scrapy.middleware] INFO: Enabled spider mi
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2017-05-14 17:56:42 [scrapy.middleware] INFO: Enabled item pipe
    ['scrapy_redis.pipelines.RedisPipeline']
    2017-05-14 17:56:42 [scrapy.core.engine] INFO: Spider opened
    2017-05-14 17:56:42 [scrapy.extensions.logstats] INFO: Crawled
    items (at 0 items/min)
    2017-05-14 17:56:42 [scrapy.extensions.telnet] DEBUG: Telnet c
    ...the crawler blocks here, waiting...
    Because both the start-URL list and the request queue in Redis are empty at startup, all three crawlers pause and wait. Use the Redis client on any host to set the start URL:
    $ redis-cli -h 116.29.35.201
    116.29.35.201:6379> lpush books:start_urls 'http://books.toscrape.com/'
    (integer) 1
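    The start URL can equally be pushed from Python with redis-py instead of redis-cli (a minimal sketch doing the same lpush):

    import redis

    r = redis.StrictRedis(host='116.29.35.201', port=6379)
    # same effect as the redis-cli lpush above
    r.lpush('books:start_urls', 'http://books.toscrape.com/')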
    Shortly afterwards, one of the crawlers (cloud server A in this run) obtained the URL from the start-URL list; its log shows:
    2017-05-14 17:57:18 [books] DEBUG: Read 1 requests from 'books:star
    The Request this crawler builds from the start URL eventually lands in the Redis request queue, and the crawlers begin working one after another. Each crawler's log then shows messages like the following:
    2017-05-14 18:00:42 [scrapy.core.scraper] DEBUG: Scraped from
    http://books.toscrape.com/catalogue/arena_587/index.html>
    {'name': 'Arena',
     'price': '£21.36',
     'review_num': '0',
     'review_rating': 'Four',
     'stock': '11',
     'upc': '2c34f9432069b52b'}
    2017-05-14 18:00:42 [scrapy.core.engine] DEBUG: Crawled (200)
    http://books.toscrape.com/catalogue/page-21.html)
    2017-05-14 18:00:42 [scrapy.core.scraper] DEBUG: Scraped from
    http://books.toscrape.com/catalogue/adultery_586/index.html>
    {'name': 'Adultery',
     'price': '£20.88',
     'review_num': '0',
     'review_rating': 'Five',
     'stock': '11',
     'upc': 'bb967277222e689c'}
    2017-05-14 18:00:42 [scrapy.core.engine] DEBUG: Crawled (200)
    http://books.toscrape.com/catalogue/page-21.html)
    2017-05-14 18:00:42 [scrapy.core.scraper] DEBUG: Scraped from
    http://books.toscrape.com/catalogue/a-mothers-reckoning-living-in-t
    l>
    {'name': "A Mother's Reckoning: Living in the Aftermath of Tra
     'price': '£19.53',
     'review_num': '0',
     'review_rating': 'Three',
     'stock': '11',
     'upc': '2b69dec0193511d9'}
    2017-05-14 18:00:43 [scrapy.core.scraper] DEBUG: Scraped from
    http://books.toscrape.com/catalogue/112263_583/index.html>
    {'name': '11/22/63',
     'price': '£48.48',
     'review_num': '0',
     'review_rating': 'Three',
     'stock': '11',
     'upc': 'a9d7b75461084a26'}
    2017-05-14 18:00:43 [scrapy.core.engine] DEBUG: Crawled (200)
    http://books.toscrape.com/catalogue/page-21.html)
    2017-05-14 18:00:43 [scrapy.core.scraper] DEBUG: Scraped from
    http://books.toscrape.com/catalogue/10-happier-how-i-tamed-the-voic
    losing-my-edge-and-found-self-help-that-actually-works_582/index.ht
    {'name': '10% Happier: How I Tamed the Voice in My Head, Reduc
     'Without Losing My Edge, and Found Self-Help That Actu
     'price': '£24.57',
     'review_num': '0',
     'review_rating': 'Two',
     'stock': '10',
     'upc': '34669b2e9d407d3a'}
    Once the crawl has finished completely, inspect the scraped data in Redis:
    116.29.35.201:6379> keys *
    1) "books:items"
    2) "books:dupefilter"
    116.29.35.201:6379> llen books:items
    (integer) 1000
    116.29.35.201:6379> LRANGE books:items 0 4
    1) "{"stock": "22", "review_num": "0", "upc": "a897 the Attic", "review_rating": "Three", "price": "\u00a351.7
    2) "{"stock": "20", "review_num": "0", "upc": "e00e Objects", "review_rating": "Four", "price": "\u00a347.82"
    3) "{"stock": "20", "review_num": "0", "upc": "90fa Velvet", "review_rating": "One", "price": "\u00a353.74"}"
    4) "{"stock": "20", "review_num": "0", "upc": "6957 "Soumission", "review_rating": "One", "price": "\u00a350.
    5) "{"stock": "19", "review_num": "0", "upc": "2597 Little Secrets of Getting Your Dream Job", "review_rating": "Fo
    116.29.35.201:6379> LRANGE books:items -5 -1
    1) "{"name": "Shameless", "price": "\u00a358.35", "r "c068c013d6921fea", "review_num": "0", "stock": "1"}"
    2) "{"stock": "1", "review_num": "0", "upc": "19fec Devotion (The Regency Spies of London #1)", "review_rating": "F
    3) "{"stock": "1", "review_num": "0", "upc": "f684a (Women's Murder Club #1)", "review_rating": "One", "price":
    4) "{"stock": "1", "review_num": "0", "upc": "228ba to See Before You Die", "review_rating": "Five", "price": "
    5) "{"name": "Girl in the Blue Coat", "price": "\u00a3 "upc": "41fc5dce044f16f5", "review_num": "0", "stock": "
    As shown above, we successfully scraped 1000 items (from each crawler's final log stats: crawler A scraped 514 items, crawler B 123, and crawler C 363). Each item is stored as JSON in a Redis list. When you need the data, a Python program can read the items back out of Redis; a skeleton follows:
    import redis
    import json

    ITEM_KEY = 'books:items'

    def process_item(item):
        # add your data-handling code here
        ...

    def main():
        r = redis.StrictRedis(host='116.29.35.201', port=6379)
        for _ in range(r.llen(ITEM_KEY)):
            data = r.lpop(ITEM_KEY)
            item = json.loads(data.decode('utf8'))
            process_item(item)

    if __name__ == '__main__':
        main()
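    As an illustration, a minimal process_item that appends each book to a CSV file might look like this (the field names match the items scraped above; the output file name books.csv is just an example):

    import csv

    FIELDS = ['upc', 'name', 'price', 'stock', 'review_rating', 'review_num']

    def process_item(item):
        # append one row per item; extrasaction='ignore' skips any
        # unexpected keys (writing a header row is left to the caller)
        with open('books.csv', 'a', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction='ignore')
            writer.writerow(item)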
    This completes the distributed crawling project.
