zoukankan      html  css  js  c++  java
  • scrapy去重

    自定义去重

          -类。自定义一个类。DUPEFILTER_CLASS = 'sp2.rep.RepeatUrl'

          -配置文件中指定 :scrapy.dupefilter.RFPDupeFilter

    scrapy默认使用 scrapy.dupefilter.RFPDupeFilter 进行去重,相关配置有:

    from scrapy.dupefilter import RFPDupeFilter

    UPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter' DUPEFILTER_DEBUG = False JOBDIR = "保存范文记录的日志路径,如:/root/" # 最终路径为 /root/requests.seen

    #DUPEFILTER_CLASS = 'sp2.rep.RepeatUrl'
    class RepeatUrl:
        def __init__(self):
            self.visited_url = set()
    
        @classmethod
        def from_settings(cls, settings):
            """
            初始化时,调用
            :param settings: 
            :return: 
            """
            return cls()
    
        def request_seen(self, request):
            """
            检测当前请求是否已经被访问过
            :param request: 
            :return: True表示已经访问过;False表示未访问过
            """
            if request.url in self.visited_url:
                return True
            self.visited_url.add(request.url)
            return False
    
        def open(self):
            """
            开始爬去请求时,调用
            :return: 
            """
            print('open replication')
    
        def close(self, reason):
            """
            结束爬虫爬取时,调用
            :param reason: 
            :return: 
            """
            print('close replication')
    
        def log(self, request, spider):
            """
            记录日志
            :param request: 
            :param spider: 
            :return: 
            """
            print('repeat', request.url)
    
    自定义URL去重操作
    View Code
  • 相关阅读:
    mysql集群架构
    mysql数据库版本不同所引起的问题
    linux下mysql安装
    windows下mysql的安装
    JMeter操作mysql
    Tomcat 配置下载服务器
    JMeter接口测试——关联
    JMeter接口测试——断言
    JMeter接口测试——参数化(从文件中读取参数)
    JMeter接口测试示例(三)——添加HTTP认证
  • 原文地址:https://www.cnblogs.com/catherine007/p/8632865.html
Copyright © 2011-2022 走看看