Feature points: how to deduplicate requests, persist a crawl, and run it distributed with scrapy_redis
Brief introduction:
- Installation: pip3 install scrapy_redis
- Builds on scrapy and adds more functionality: request deduplication (incremental crawling), crawl persistence, and distributed crawling
- Workflow: the scheduler queue and the fingerprint set are both kept in redis; every request gets a fingerprint, and before the request is stored in redis the fingerprint is checked against the set; the request is only stored if the fingerprint does not exist yet (see the sketch after the configuration block below)
- Configuration:
-
# Make sure every spider deduplicates through Redis
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
# Enable the Redis scheduler, which stores the request queue in Redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Do not clear the Redis queues, so the crawl can be paused/resumed
SCHEDULER_PERSIST = True
# Save items to Redis
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}
# Redis server address
REDIS_URL = 'redis://192.168.3.20:6379'
-
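- To make the workflow above concrete, here is a minimal sketch of the "check the fingerprint set first, then push to the queue" logic. It is not scrapy_redis's actual code: the key names sketch:dupefilter / sketch:requests are made up, URL canonicalization is skipped, and a plain list stands in for the scheduler queue. It assumes a redis server at the REDIS_URL above:
-
import hashlib

import redis

r = redis.from_url('redis://192.168.3.20:6379')


def try_enqueue(method, url, body=b''):
    # fingerprint: sha1 over the request method, url and body (scrapy also
    # canonicalizes the url first, which is omitted here)
    fp = hashlib.sha1()
    fp.update(method.encode())
    fp.update(url.encode())
    fp.update(body)
    fingerprint = fp.hexdigest()

    # sadd returns 1 only when the fingerprint was not in the set yet
    if r.sadd('sketch:dupefilter', fingerprint) == 0:
        return False                      # already seen -> do not enqueue
    r.lpush('sketch:requests', url)       # new request -> push onto the queue
    return True


print(try_enqueue('GET', 'http://www.dmoztools.net'))  # True  (first time)
print(try_enqueue('GET', 'http://www.dmoztools.net'))  # False (duplicate)
-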
- Inspecting the data in redis:
-
keys * shows 3 keys:
<spider name>:requests is the scheduler queue, holding the request objects waiting to be crawled; requests are fetched with a pop operation, so each one fetched is removed from the queue
<spider name>:dupefilter is the fingerprint set, holding the fingerprints of requests that have already entered the scheduler queue; by default a fingerprint is built from the request method, the url and the request body
<spider name>:items holds the scraped item data; it is only written if RedisPipeline is enabled in ITEM_PIPELINES
-
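- The same keys can also be inspected from Python with redis-py. A small sketch, assuming the spider name dmoz, the REDIS_URL above and scrapy_redis's default key layout (the default PriorityQueue keeps requests in a sorted set, the dupefilter is a set, the items key is a list):
-
import redis

r = redis.from_url('redis://192.168.3.20:6379')

print(r.keys('dmoz:*'))            # the three keys listed above
print(r.zcard('dmoz:requests'))    # pending requests in the sorted-set queue
print(r.scard('dmoz:dupefilter'))  # number of fingerprints already recorded
print(r.llen('dmoz:items'))        # items written by RedisPipeline
-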
- When does a request object get enqueued?
-
1. dont_filter=True: when a request is built with dont_filter set to True, its url is crawled again every time (for pages whose content keeps changing, e.g. a Baidu Tieba forum page); see the example after this list
2. When a brand-new url is discovered and a request is built for it
3. When the url is in start_urls it is enqueued whether or not it has been requested before (reason: the requests built for the start_url addresses are created with dont_filter=True)
4. The code, the enqueue_request method of scheduler.py:
    def enqueue_request(self, request):
        if not request.dont_filter and self.df.request_seen(request):
            self.df.log(request, self.spider)
            return False
        self.queue.push(request)
        return True
-
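- As a quick illustration of point 1, a hypothetical spider callback that forces one page to be re-crawled by passing dont_filter=True (the spider name and urls are made up for the example):
-
import scrapy


class ForumSpider(scrapy.Spider):
    name = 'forum_example'
    start_urls = ['http://example.com/forum']

    def parse(self, response):
        # normal request: dropped by the dupefilter if its fingerprint is
        # already in the redis set
        yield scrapy.Request('http://example.com/forum?page=2',
                             callback=self.parse)

        # dont_filter=True: bypasses the dupefilter, so this url is enqueued
        # and crawled again on every pass (useful for pages whose content
        # keeps changing)
        yield scrapy.Request('http://example.com/forum',
                             callback=self.parse,
                             dont_filter=True)
-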
- How scrapy_redis deduplicates
-
A sha1 hash of the request is used as its fingerprint
The fingerprint is stored in a redis set
When a new request comes in, its fingerprint is generated the same way and checked against the redis set
-
- Generating the fingerprint
-
fp = hashlib.sha1()
fp.update(to_bytes(request.method))
fp.update(to_bytes(canonicalize_url(request.url)))
fp.update(request.body or b'')
return fp.hexdigest()
-
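- For reference, a self-contained version of that excerpt that can be run on its own. The imports match what scrapy itself uses (hashlib from the standard library, canonicalize_url from w3lib, to_bytes from scrapy's utilities); wrapping it in a fingerprint() helper is just for the example:
-
import hashlib

from w3lib.url import canonicalize_url
from scrapy import Request
from scrapy.utils.python import to_bytes


def fingerprint(request):
    # sha1 over the request method, the canonicalized url and the body
    fp = hashlib.sha1()
    fp.update(to_bytes(request.method))
    fp.update(to_bytes(canonicalize_url(request.url)))
    fp.update(request.body or b'')
    return fp.hexdigest()


# canonicalize_url sorts the query string, so these two requests
# produce the same fingerprint and the second would be filtered out
print(fingerprint(Request('http://www.example.com/?a=1&b=2')))
print(fingerprint(Request('http://www.example.com/?b=2&a=1')))
-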
- Checking whether the fingerprint is already in the redis set, and inserting it if not
-
The request_seen method of dupefilter.py:
    added = self.server.sadd(self.key, fp)
    return added == 0
-
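- The point is that sadd is atomic and reports whether the member was actually added: it returns 1 for a new member and 0 if the member is already in the set, so added == 0 means "seen before". A minimal demonstration with redis-py, using a throwaway key name:
-
import redis

r = redis.from_url('redis://192.168.3.20:6379')
r.delete('demo:dupefilter')  # start from an empty throwaway set

fp = 'aaf2c048bebeb63f877c0d21b4bbba33830c4ff1'  # an example fingerprint
print(r.sadd('demo:dupefilter', fp))  # 1 -> newly added, not a duplicate
print(r.sadd('demo:dupefilter', fp))  # 0 -> already present, i.e. seen before
-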
- How did scrapy deduplicate before scrapy_redis was introduced?
-
The RFPDupeFilter in scrapy/dupefilters.py, request_seen method:
    # it also generates a fingerprint first and checks whether it is in an in-memory set;
    # if it is already there, the request has been crawled before
    # (fingerprints are also written to a requests.seen file when a job directory is configured)
    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
-
Site crawled: dmoz
Full code: https://files.cnblogs.com/files/bookwed/dmoztools.zip
Main code:
dmoz.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    name = 'dmoz'
    allowed_domains = ['dmoztools.net']
    start_urls = ['http://www.dmoztools.net']

    rules = (
        # follow the category, sub-category and item links and parse each page
        Rule(LinkExtractor(
            restrict_css=('.top-cat', '.sub-cat', '.cat-item')
        ), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for div in response.css(".title-and-desc"):
            yield {
                'name': div.css(".site-title::text").extract_first(),
                'desc': div.css(".site-descr::text").extract_first().strip(),
                'link': div.css("a::attr(href)").extract_first(),
            }
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for dmoztools project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'dmoztools'

SPIDER_MODULES = ['dmoztools.spiders']
NEWSPIDER_MODULE = 'dmoztools.spiders'

# Make sure every spider deduplicates through Redis
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
# Enable the Redis scheduler, which stores the request queue in Redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Do not clear the Redis queues, so the crawl can be paused/resumed
# (i.e. whether the queue contents are persisted)
SCHEDULER_PERSIST = True
# Schedule requests with a priority queue (the default)
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
# Other queue classes that can be used instead
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'
# Maximum idle time, to keep the distributed spider from closing while it waits
# SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'dmoztools (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'dmoztools.middlewares.DmoztoolsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'dmoztools.middlewares.DmoztoolsDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'dmoztools.pipelines.DmoztoolsPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Redis server address
REDIS_URL = 'redis://192.168.3.20:6379'
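With these settings the same project can be started on several machines (scrapy crawl dmoz on each); because they all point at the same REDIS_URL they share one request queue and one fingerprint set, which is what makes the crawl distributed. As a small sketch, the spider can also be launched from a script; the module path dmoztools.spiders.dmoz is an assumption based on the project layout above:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# assumed module path inside the dmoztools project
from dmoztools.spiders.dmoz import DmozSpider

# load settings.py, including the scrapy_redis scheduler and REDIS_URL
process = CrawlerProcess(get_project_settings())
process.crawl(DmozSpider)
process.start()  # blocks until the crawl finishes or is stopped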