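Strategy 1: Set download_delay
Set a wait time between downloads to reduce the chance of being banned.
Setting the DOWNLOAD_DELAY parameter in settings.py limits how often the crawler hits the site.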
DOWNLOAD_DELAY = 0.25  # 250 ms of delay
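With RANDOMIZE_DOWNLOAD_DELAY enabled (it is on by default), the interval between requests is randomized to between 0.5 and 1.5 times DOWNLOAD_DELAY, which further lowers the chance of the crawler being blocked.
The delay can be set project-wide in settings.py, or per spider through the spider's download_delay attribute.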
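To make the 0.5x to 1.5x range concrete, here is a minimal sketch of how a randomized wait could be derived from the base delay. The helper name `randomized_delay` is an assumption made for illustration; it mirrors the behavior described here rather than reproducing Scrapy's internals.

```python
import random


def randomized_delay(base_delay, rng=random):
    """Return a wait time drawn uniformly from 0.5x to 1.5x of base_delay."""
    return rng.uniform(0.5 * base_delay, 1.5 * base_delay)


if __name__ == "__main__":
    base = 0.25  # same value as the DOWNLOAD_DELAY example
    samples = [randomized_delay(base) for _ in range(5)]
    # Every sample falls between 0.125 s and 0.375 s.
    print([round(s, 3) for s in samples])
```

Because the waits vary, the request timing looks less mechanical than a fixed interval would.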
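Strategy 2: Disable cookies
Cookies are data that a website stores on the user's machine to identify the user (often encrypted). Disabling them defeats sites that rely on cookies to trace a crawler's activity.
Set COOKIES_ENABLED = False in settings.py. This turns off the cookies middleware, so no cookies are sent to the server.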
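The toy model below sketches why dropping cookies makes tracking harder. It is not Scrapy code: `ToySite` and `crawl` are hypothetical names, and a real site can still correlate requests by IP address or other signals.

```python
import itertools
from collections import Counter


class ToySite:
    """Hypothetical server that tracks visitors with a session cookie."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.hits = Counter()  # session id -> number of requests seen

    def handle(self, cookies):
        # Reuse the session id the client sent back, or issue a new one.
        session = cookies.get("session") or f"s{next(self._ids)}"
        self.hits[session] += 1
        return {"session": session}  # stands in for a Set-Cookie header


def crawl(site, pages, keep_cookies):
    """Request `pages` pages, optionally sending cookies back each time."""
    jar = {}
    for _ in range(pages):
        set_cookie = site.handle(jar)
        if keep_cookies:
            jar.update(set_cookie)  # behaves like an enabled cookie jar


if __name__ == "__main__":
    tracked, untracked = ToySite(), ToySite()
    crawl(tracked, pages=5, keep_cookies=True)
    crawl(untracked, pages=5, keep_cookies=False)
    print(max(tracked.hits.values()))    # one session carrying all 5 requests
    print(max(untracked.hits.values()))  # five sessions with 1 request each
```

With cookies kept, the site sees a single visitor making every request; with cookies dropped, each request looks like a new visitor.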
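Strategy 3: Use a User-Agent pool
The User-Agent is a string carrying browser and operating system information, sent as an HTTP request header. The server uses it to judge whether the client is a browser, a mail client, or a web crawler. You can inspect it in the Scrapy shell through request.headers: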
scrapy shell url
request.headers
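Outside the Scrapy shell, the same idea can be checked with the standard library. This sketch only builds objects locally (no network traffic) and shows that a client which does not override the header announces itself. `looks_like_script` is a hypothetical, deliberately naive server-side check, and the URL is a placeholder.

```python
import urllib.request

# Default header list that urllib attaches to requests it opens.
default_headers = dict(urllib.request.build_opener().addheaders)
default_ua = default_headers["User-agent"]  # starts with "Python-urllib/"

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
)

# A request object with the header overridden; nothing is sent here.
req = urllib.request.Request(
    "https://example.com/", headers={"User-Agent": BROWSER_UA}
)


def looks_like_script(user_agent):
    """Naive check: flag User-Agent strings that name a known tool."""
    markers = ("python-urllib", "scrapy", "curl", "bot")
    return any(marker in user_agent.lower() for marker in markers)


if __name__ == "__main__":
    print(default_ua, looks_like_script(default_ua))
    # urllib normalizes header names to the form "User-agent".
    sent_ua = req.get_header("User-agent")
    print(sent_ua, looks_like_script(sent_ua))
```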
Next, create rotate_useragent.py in the spiders directory.
Here is the code:
# coding: utf-8
import logging
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

logger = logging.getLogger(__name__)


class RotateUserAgentMiddleware(UserAgentMiddleware):
    """Downloader middleware that picks a random User-Agent per request."""

    def __init__(self, useragent=''):
        self.user_agent = useragent
        # Adjacent string literals are concatenated, so each two-line
        # pair below is a single User-Agent string.
        self.user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
            "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        ]

    def process_request(self, request, spider):
        # Choose one User-Agent at random for this request.
        ua = random.choice(self.user_agent_list)
        if ua:
            # Apply it unless the request already carries a User-Agent.
            request.headers.setdefault('User-Agent', ua)
            logger.debug('Current UserAgent: %s', ua)
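The rotation step itself can be tried without a Scrapy project. This is a sketch under stated assumptions: `FakeRequest` and `assign_user_agent` are hypothetical stand-ins, and the pool is shortened to two of the strings listed above. It assumes the chosen value should be written into the request's User-Agent header, and it uses `setdefault` so that a User-Agent set explicitly on a request is left alone.

```python
import random

# Shortened pool; adjacent string literals join into one User-Agent each.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


class FakeRequest:
    """Hypothetical stand-in for a request object with a headers mapping."""

    def __init__(self, headers=None):
        self.headers = dict(headers or {})


def assign_user_agent(request, pool=USER_AGENTS, rng=random):
    """Pick a random User-Agent and set it if the request has none."""
    ua = rng.choice(pool)
    request.headers.setdefault("User-Agent", ua)
    return request.headers["User-Agent"]


if __name__ == "__main__":
    # Over many requests, more than one User-Agent is typically seen.
    seen = {assign_user_agent(FakeRequest()) for _ in range(50)}
    print(len(seen))

    pinned = FakeRequest({"User-Agent": "custom/1.0"})
    print(assign_user_agent(pinned))  # custom/1.0
```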
In settings.py (the configuration file), disable the default User-Agent middleware and enable the custom one. Configure it as follows:
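# Disable the default User-Agent middleware and use the new one.
# Scrapy 1.0+ module path; older releases used
# scrapy.contrib.downloadermiddleware.useragent instead.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myProject.spiders.rotate_useragent.RotateUserAgentMiddleware': 400,
}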
Strategy 4: Use an IP pool
Strategy 5: Use distributed crawling