Scrapy Incremental Crawling

Preface

Most of the crawlers we write are one-off: they scrape a site once and stop. That covers only a limited set of scenarios; often we want to keep monitoring a site and fetch only the data that is new.

• Concept
Detect updates to a site's data and crawl the new data as it appears.

• Core mechanism (removing duplicate data)
Deduplicate with a Redis set: a URL (or record) is crawled only if it has never been stored before. A minimal sketch of the set semantics follows below.
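
A minimal sketch of the primitive this relies on, assuming a Redis server on the default local port (the key name seen_urls is just for illustration):

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)

# sadd returns 1 if the member was newly added, 0 if it was already in the set.
print(conn.sadd('seen_urls', '/frim/detail/1.html'))  # 1 -> new, crawl it
print(conn.sadd('seen_urls', '/frim/detail/1.html'))  # 0 -> already seen, skip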
    

1. Create the project

scrapy startproject zlsPro

cd zlsPro

# -t crawl generates a CrawlSpider template (Rule/LinkExtractor based)
scrapy genspider -t crawl zls www.xxx.com

scrapy crawl zls
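
For reference, startproject and genspider leave you with the standard Scrapy layout:

zlsPro/
    scrapy.cfg
    zlsPro/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            zls.py    # created by genspider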
    

2. Configure the settings file

ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
    

3. Spider implementation

The spider file, zls.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from zlsPro.items import ZlsproItem
from scrapy import Request


class ZlsSpider(CrawlSpider):
    name = 'zls'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567kan.com/index.php/vod/show/id/6.html']

    # One Redis connection, shared with the pipeline via spider.conn
    conn = Redis(host='127.0.0.1', port=6379)

    # Follow pagination links such as .../page/2.html
    rules = (
        Rule(LinkExtractor(allow=r'page/\d+\.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            name = li.xpath('./div/a/@title').extract_first()
            detail_url = li.xpath('./div/a/@href').extract_first()
            item = ZlsproItem()
            item['name'] = name
            # Deduplicate with a Redis set: sadd returns 1 only for detail
            # URLs that have never been stored before.
            if self.conn.sadd('set_detail_url', detail_url) == 1:
                yield Request('https://www.4567kan.com' + detail_url,
                              callback=self.parse_detail, meta={'item': item})
            # Otherwise the detail page was crawled in an earlier run; skip it.

    def parse_detail(self, response):
        item = response.meta['item']
        desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        item['desc'] = desc
        yield item
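
Because set_detail_url lives in Redis rather than in the spider's memory, it survives between runs: the first crawl stores every detail URL and scrapes every movie, while any later run gets 0 back from sadd for everything already seen, so only newly published movies generate requests. That persistence is what makes the crawl incremental.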
    

The pipeline file, pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class ZlsproPipeline(object):
    def process_item(self, item, spider):
        print(item)
        # Reuse the spider's Redis connection; redis-py only accepts
        # str/bytes/numbers, so serialize the item before pushing.
        conn = spider.conn
        conn.lpush('data', str(dict(item)))
        return item
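
To check what was stored, a quick read-back sketch against the same local Redis:

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)

# Number of detail URLs recorded so far (the dedup set)
print(conn.scard('set_detail_url'))

# The scraped items, newest first
for raw in conn.lrange('data', 0, -1):
    print(raw.decode('utf-8'))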
    

items.py:

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
class ZlsproItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    desc = scrapy.Field()
    

    settings.py

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for zlsPro project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'zlsPro'
    
    SPIDER_MODULES = ['zlsPro.spiders']
    NEWSPIDER_MODULE = 'zlsPro.spiders'
    
    USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'zlsPro (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    LOG_LEVEL = "ERROR"
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'zlsPro.middlewares.ZlsproSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'zlsPro.middlewares.ZlsproDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'zlsPro.pipelines.ZlsproPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    