Distributed Crawling
Installation:

```bash
pip3 install scrapy-redis
```
1. Change the spider's base class to RedisSpider and replace `start_urls` with `redis_key`
```python
from scrapy_redis.spiders import RedisSpider

class CnblogsSpider(RedisSpider):
    # start_urls = ['http://www.cnblogs.com/']  # no longer hard-coded
    # Start URLs are popped from this Redis list key instead:
    redis_key = 'myspider:start_urls'
```
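For context, a minimal runnable sketch of the whole spider might look like the following; the `name` attribute and the CSS selector are assumptions for illustration, not part of the original notes:

```python
from scrapy_redis.spiders import RedisSpider

class CnblogsSpider(RedisSpider):
    name = 'cnblogs'                   # assumed spider name
    redis_key = 'myspider:start_urls'  # Redis list the spider pops URLs from

    def parse(self, response):
        # Hypothetical parsing logic: yield post titles from a listing page.
        # The selector is a guess at the page markup; adjust to the real site.
        for title in response.css('a.post-item-title::text').getall():
            yield {'title': title}
```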
2. Configure `settings.py`
```python
# Use the shared Redis-backed scheduler and dupefilter.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Optional: without this pipeline each node stores items on its own;
# with it, every node writes items into the shared Redis.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# REDIS_HOST = 'localhost'
# REDIS_PORT = 6379
# REDIS_ENCODING = 'utf8'
REDIS_PARAMS = {'password': '2694'}
```
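As a variation on the connection settings above, scrapy-redis also accepts a single `REDIS_URL`, and `SCHEDULER_PERSIST` keeps the shared queue and dupefilter between runs; the host and password values below are placeholders matching the settings above:

```python
# Equivalent single-URL form of the connection settings (placeholder values).
REDIS_URL = 'redis://:2694@localhost:6379'

# Keep the request queue and dupefilter in Redis when the spider closes,
# so a restarted node resumes where the cluster left off.
SCHEDULER_PERSIST = True
```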
3. Start the spider on every machine; each node runs the same crawl command against the shared Redis queue (see the sketch below)
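A sketch of what that looks like, assuming the spider is named `cnblogs`: every node runs the same project with the same settings, and each process blocks on the shared Redis list until URLs arrive.

```bash
# Run on each machine (same project code, settings pointing at the shared Redis):
scrapy crawl cnblogs
```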
4. Push the start URL into Redis from the command line
```bash
redis-cli
auth password    # replace "password" with the actual Redis password
lpush myspider:start_urls https://www.cnblogs.com
```
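Alternatively, the same push can be scripted with the redis-py client; the host and password below are placeholders matching the settings above:

```python
import redis

# Connect with the same credentials configured in settings.py (placeholders).
r = redis.Redis(host='localhost', port=6379, password='2694')

# Seed the queue: every idle spider node will pick this up and start crawling.
r.lpush('myspider:start_urls', 'https://www.cnblogs.com')
```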