zoukankan      html  css  js  c++  java
  • scrapy 简单防封

    设置爬取间隔

    setting.py

    from random import random
    DOWNLOAD_DELAY = random()*5

    ps:此次的爬取间隔,在读取seeting文件确定,并非每次随机

    禁用缓存

    # Disable cookies (enabled by default)
    COOKIES_ENABLED = False
    COOKIES_ENABLES = False

    ps: enabled,enables就不纠结哪个对了,全写

    设置随机访问头

    setting.py加入头列表,并启用中间件

    USER_AGENT_LIST=[
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    
    DOWNLOADER_MIDDLEWARES = {
       # 'tj_spider.middlewares.TjSpiderDownloaderMiddleware': 543,
       'tj_spider.middlewares.RandomUserAgentMiddleware': 400,
    }

    middlewares.py加入设置随机头

    from scrapy import signals
    from settings import USER_AGENT_LIST
    import random
    
    class RandomUserAgentMiddleware(object):
        def process_request(self, request, spider):
            rand_use  = random.choice(USER_AGENT_LIST)
            if rand_use:
                request.headers.setdefault('User-Agent', rand_use)
  • 相关阅读:
    线程同步的方法
    为什么HashMap中key是引用类型而不是基本数据类型?为什么有了基本数据类型还有包装类型?
    使用MyBatis的mapper接口(动态代理对象)调用时的注意点
    redis的aof持久化模式
    redis的RDB持久化方式的优缺点
    快排算法
    JAVA8新特性
    NIO中Buffer的capacity,position和limit含义
    ArrayBlockingQueue与LinkedBlockingQueue对比
    写加锁但读没有加锁造成的脏读问题
  • 原文地址:https://www.cnblogs.com/lurenjia1994/p/9621094.html
Copyright © 2011-2022 走看看