  • Scrapy Deduplication

    I. Built-in

    1. Module

    from scrapy.dupefilters import RFPDupeFilter

    2. RFPDupeFilter methods

    a. request_seen

    Core idea: every time the spider yields a Request object, the request_seen method runs once.

    Purpose: deduplication, so that the same URL is visited only once.

    Implementation: convert the URL into a fixed-length, unique value (a fingerprint). If the fingerprint is already in the set, return True to signal that the URL has been visited; otherwise add the fingerprint to the set.

    1) request_fingerprint

    Purpose: turn a request's URL into a fixed-length, unique value. A naive md5 over the raw URL string would give different values for the two URLs below, even though they carry the same query parameters in a different order; request_fingerprint normalizes them to the same fingerprint.

    Note: request_fingerprint() only accepts a Request object.

    from scrapy.utils.request import request_fingerprint
    from scrapy.http import Request
    
    # Two URLs that differ only in query-parameter order
    url1 = 'https://test.com/?a=1&b=2'
    url2 = 'https://test.com/?b=2&a=1'
    request1 = Request(url=url1)
    request2 = Request(url=url2)
    
    # request_fingerprint only accepts Request objects
    rfp1 = request_fingerprint(request=request1)
    rfp2 = request_fingerprint(request=request2)
    print(rfp1)
    print(rfp2)
    
    if rfp1 == rfp2:
        print('same URL')
    else:
        print('different URLs')
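
    For contrast, a minimal sketch of why a plain md5 is not enough: hashing the raw URL strings with Python's standard hashlib treats the two equivalent URLs as different, whereas request_fingerprint canonicalizes the query string first.

    import hashlib
    
    url1 = 'https://test.com/?a=1&b=2'
    url2 = 'https://test.com/?b=2&a=1'
    
    # A raw md5 is sensitive to parameter order, so these two hashes differ
    md5_1 = hashlib.md5(url1.encode()).hexdigest()
    md5_2 = hashlib.md5(url2.encode()).hexdigest()
    print(md5_1 == md5_2)  # False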

    2) request_seen

    def request_seen(self, request):
        # request_fingerprint maps the request's URL to a unique, fixed-length fingerprint
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True        # True means this request was already seen
        self.fingerprints.add(fp)
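
    To see where this hook fires, here is a simplified sketch of the scheduler's enqueue logic (an approximation of Scrapy's Scheduler.enqueue_request, not the verbatim source): the dupefilter is consulted before a request is queued, and the check is skipped only when the request sets dont_filter.

    # Simplified sketch; self.df is the configured dupefilter instance
    def enqueue_request(self, request):
        if not request.dont_filter and self.df.request_seen(request):
            self.df.log(request, self.spider)
            return False               # duplicate: dropped, never downloaded
        self._enqueue(request)         # hypothetical helper that queues the request
        return True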

    b. open

    A method inherited from the parent class BaseDupeFilter; it runs when the spider starts.

    def open(self):
        # runs when the spider starts
        pass

    c. close

    Runs when the spider finishes.

    def close(self, reason):
        # runs when the spider closes
        pass

    d. log

    Logs filtered (duplicate) requests.

    def log(self, request, spider):
        # log that a request has been filtered
        pass

    e. from_settings

    Principle and purpose: the same as from_crawler in pipelines, a factory classmethod that builds the filter instance from the Scrapy settings.

    @classmethod
    def from_settings(cls, settings):
        return cls()
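
    If the filter needs values from settings.py, this classmethod is the place to read them. A minimal sketch, assuming a hypothetical filter that honours Scrapy's DUPEFILTER_DEBUG setting:

    from scrapy.dupefilters import BaseDupeFilter
    
    
    class DebuggableDupeFilter(BaseDupeFilter):  # hypothetical example class
        def __init__(self, debug=False):
            self.debug = debug
            self.fingerprints = set()
    
        @classmethod
        def from_settings(cls, settings):
            # DUPEFILTER_DEBUG is a standard Scrapy setting; getbool() parses it
            return cls(debug=settings.getbool('DUPEFILTER_DEBUG'))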

    II. Custom

    To be continued.

    1. Configuration (settings.py)

    # built-in (default)
    # DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
    DUPEFILTER_CLASS = 'toscrapy.dupefilters.MyDupeFilter'

    2. Custom dedup filter class (inherits BaseDupeFilter)

    from scrapy.dupefilters import BaseDupeFilter
    from scrapy.utils.request import request_fingerprint
    
    
    class MyDupeFilter(BaseDupeFilter):
        def __init__(self):
            self.visited_fp = set()
    
        @classmethod
        def from_settings(cls, settings):
            return cls()
    
        def request_seen(self, request):
            # If the fingerprint is already in the set, return True (already
            # visited); otherwise record it and let the request through.
            fp = request_fingerprint(request)
            if fp in self.visited_fp:
                return True
            self.visited_fp.add(fp)
    
        def open(self):  # can return a deferred
            print('spider opened')
    
        def close(self, reason):  # can return a deferred
            print('spider closed')
    
        def log(self, request, spider):  # log that a request has been filtered
            pass

    3. Prerequisite

    When yielding a Request object:

    yield scrapy.Request(url=_next, callback=self.parse, dont_filter=True)

    For deduplication to apply, dont_filter must not be True. It defaults to False, so simply omit it; passing dont_filter=True (as above) bypasses the filter entirely. A runnable sketch follows below.
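
    A minimal demo spider, assuming the hypothetical name dedup_demo and the illustrative test.com URLs used earlier; the second yielded Request normalizes to the same fingerprint as the first and is silently dropped:

    import scrapy
    
    
    class DedupDemoSpider(scrapy.Spider):
        name = 'dedup_demo'  # hypothetical spider name
        start_urls = ['https://test.com/']
    
        def parse(self, response):
            # Same query parameters in a different order: both requests
            # share one fingerprint, so only the first is scheduled.
            yield scrapy.Request('https://test.com/?a=1&b=2', callback=self.parse_page)
            yield scrapy.Request('https://test.com/?b=2&a=1', callback=self.parse_page)
    
        def parse_page(self, response):
            self.logger.info('fetched %s', response.url)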
