  • Anti-ban strategies for Scrapy

    Strategy 1: set download_delay

      Adding a wait between downloads lowers the odds of being banned.

    Setting the DOWNLOAD_DELAY option in settings.py throttles how often the crawler hits the site.

    DOWNLOAD_DELAY = 0.25    # 250 ms of delay

    With RANDOMIZE_DOWNLOAD_DELAY enabled (it is on by default), the actual wait between requests is randomized to between 0.5x and 1.5x of DOWNLOAD_DELAY, which further reduces the chance of being blocked.

    download_delay can be set in settings.py or directly as an attribute on a spider.
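    The randomized-delay behavior described above can be sketched as follows; `next_delay` is a hypothetical helper that mimics how Scrapy spaces requests, not Scrapy's actual API:

```python
import random

# settings.py values (illustrative)
DOWNLOAD_DELAY = 0.25            # base delay: 250 ms between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # default; spreads the delay over 0.5x-1.5x

def next_delay(base=DOWNLOAD_DELAY, randomize=RANDOMIZE_DOWNLOAD_DELAY):
    """Return the wait before the next request, as Scrapy would compute it."""
    if randomize:
        return random.uniform(0.5 * base, 1.5 * base)
    return base
```

    With the values above, every delay falls between 0.125 s and 0.375 s, so the request pattern is harder to fingerprint than a fixed interval.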

    Strategy 2: disable cookies

    Cookies are data (usually encrypted) that some sites store on the client to identify users. Disabling them stops sites that track a crawler's session through cookies from doing so.

      In settings.py, set COOKIES_ENABLED = False. This disables the cookies middleware, so the crawler sends no cookies to the server.

    Strategy 3: rotate the User-Agent

      The User-Agent is a string in the HTTP request headers carrying browser and operating-system information; servers use it to tell whether the client is a browser, a mail client, or a crawler. You can inspect it from the Scrapy shell:

    scrapy shell <url>

    request.headers

    Next, create rotate_useragent.py under the spiders directory. The code:

    # coding: utf-8
    import logging
    import random

    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

    logger = logging.getLogger(__name__)

    class RotateUserAgentMiddleware(UserAgentMiddleware):
        """Pick a random User-Agent for every outgoing request."""

        def __init__(self, user_agent=''):
            super().__init__(user_agent)
            # Each entry must be one complete User-Agent string.
            self.user_agent_list = [
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
                "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
                "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
                "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
                "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
                "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
                "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            ]

        def process_request(self, request, spider):
            ua = random.choice(self.user_agent_list)
            if ua:
                logger.debug('Current UserAgent: %s', ua)
                # Without this line the random choice would never reach the server.
                request.headers.setdefault(b'User-Agent', ua)

    In settings.py, disable the default User-Agent middleware and register the rotating replacement (400 is the priority slot where the default normally runs):

    # disable the default useragent middleware, enable the rotating one
    DOWNLOADER_MIDDLEWARES = {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'myProject.spiders.rotate_useragent.RotateUserAgentMiddleware': 400,
        }

    Strategy 4: use an IP (proxy) pool
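    This could be implemented as another downloader middleware, in the same spirit as the User-Agent rotation. The sketch below uses a hypothetical PROXY_POOL list (the addresses are placeholders, not working proxies) and relies on the fact that Scrapy's built-in HttpProxyMiddleware honors request.meta['proxy']:

```python
import random

# Hypothetical proxy list -- replace with live, validated proxies.
PROXY_POOL = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

class RandomProxyMiddleware:
    """Downloader middleware sketch: route each request through a random proxy."""

    def __init__(self, proxies=None):
        self.proxies = proxies or PROXY_POOL

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware picks this value up downstream.
        request.meta['proxy'] = random.choice(self.proxies)
```

    Register it in DOWNLOADER_MIDDLEWARES like the User-Agent middleware above. In practice the pool also needs health checks, since free proxies die quickly.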

    Strategy 5: crawl with a distributed setup
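    One widely used option here is the scrapy-redis extension, which shares the request queue and duplicate filter across machines through Redis. The settings.py fragment below is a sketch assuming scrapy-redis is installed and a Redis server is reachable at localhost:

```python
# settings.py -- sketch of a scrapy-redis distributed setup (assumes
# `pip install scrapy-redis` and a Redis server at localhost:6379)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True        # keep the queue between runs
REDIS_URL = "redis://localhost:6379"
```

    Each worker then pulls URLs from the shared Redis queue, so requests are spread across several machines (and IPs), which also lowers the per-IP request rate.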



  • Original article: https://www.cnblogs.com/Garvey/p/6689138.html