  • python---Crawling web page data with Scrapy and Splash

    1: A quick review of Scrapy

    python---Using the Scrapy module (1)

    2: The site to crawl

    (1) Requirement

    Lately I have wanted to read a comic, but the site requires registration and payment..., so I decided to crawl the images to my local machine and read them there.

    (2) Page source

    From the page source we can see that the image URLs follow a pattern, so the crawl can be built around that pattern.

    However, there is a <script> block in front of the img tag that renders the image information with JavaScript, so matching against the raw page source directly will not give us the image URLs.

    This is where Splash comes in.

    (3) Splash

    https://www.jianshu.com/p/41e0a7e40824
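
    Before wiring Splash into Scrapy, it helps to confirm that a running Splash instance really renders the JS-generated image tags. The sketch below is my own quick manual check, not part of the project: it calls Splash's /render.html HTTP endpoint directly. The Splash address is the one configured in settings.py later in this post, and the chapter URL is only a placeholder.

    import requests

    SPLASH = "http://192.168.58.139:8050"                 # Splash instance from settings.py below
    page = "https://www.zymk.cn/2/some-chapter.html"      # placeholder: any chapter page of the comic

    # /render.html returns the page's HTML after JavaScript has run inside Splash
    resp = requests.get(
        SPLASH + "/render.html",
        params={"url": page, "wait": 1, "timeout": 30},
    )
    # If rendering worked, the <img class="comicimg"> tags should now appear in the HTML
    print("comicimg" in resp.text)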

    3: Implementation

    (1) Main spider: zymk.py

    import scrapy
    from scrapy_splash import SplashRequest
    
    from zymkPro.items import ZymkproItem
    
    
    class ZymkSpider(scrapy.Spider):
        name = 'zymk'
        start_chapter = 700
        allowed_domains = []
        start_urls = ['http://www.zymk.cn/2/']
    
        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url=url, callback=self.parse,
                                    dont_filter=True)
    
        def parse(self, response):
            # Collect every chapter listed on the comic's index page
            Cp_a = response.xpath('//ul[@id="chapterList"]/li')    # chapter <li> nodes
            for cp in Cp_a:
                cp_url = cp.xpath("./a/@href").extract_first()      # chapter link
                cp_title = cp.xpath("./a/text()").extract_first()   # chapter title
                try:
                    # Assumption: chapter titles look like "700话 ...", so split on "话"
                    # to read the chapter number
                    if int(cp_title.split("话")[0]) < self.start_chapter:
                        continue
                except ValueError:
                    print("Chapter title has no usable number, skipping")
                    continue
                if not cp_url.startswith("http"):
                    cp_url = "https://www.zymk.cn/2/%s" % cp_url

                # The chapter page builds its image URLs with JavaScript, so it is
                # fetched through Splash instead of a plain scrapy.Request
                yield SplashRequest(url=cp_url, callback=self.parseNextClsPage,
                                    args={'timeout': 3600, 'wait': 1})
    
    
        def parseNextClsPage(self, response):
            # On the Splash-rendered chapter page, work out the URL of every page image
            xh_img = response.xpath('//img[@class="comicimg"]')
            xh_img_samp = xh_img.xpath("./@src").extract_first()   # src of the first page's image

            ch_name = response.xpath('//div[@id="readEnd"]/div/p/strong/text()').extract_first()
            href_list = xh_img_samp.split(".jpg")

            # Total page count, taken from the page-selector text (e.g. "1/NN页")
            xh_pages = response.xpath("//select[@class='selectpage']")
            xh_pagesC = xh_pages.xpath("./option[1]/text()").extract_first()
            xh_pagesCount = int(xh_pagesC.split("/")[1][:-1])

            # Drop the trailing page number and the scheme from the sample src
            xh_href_form = (href_list[0][:-1]).split("//")[1]

            # Swap in the host that actually serves the image files
            xh_href_form_l = xh_href_form.split("/")
            xh_href_form_l[0] = "mhpic.xiaomingtaiji.net"
            xh_href_form = "/".join(xh_href_form_l)

            xh_href_latt = href_list[1]   # suffix that follows ".jpg"

            for i in range(1, xh_pagesCount + 1):
                # Rebuild one image URL per page and emit an item for the pipeline
                new_img_href = "http://" + xh_href_form + ("%d.jpg" % i) + xh_href_latt
                item_obj = ZymkproItem(title=ch_name, img_url=new_img_href, img_number=i)
                yield item_obj
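
    To make the string surgery in parseNextClsPage easier to follow, here is the same reconstruction as a standalone sketch with a made-up sample src (the real src format may differ):

    # Made-up sample src; the real one comes from the first <img class="comicimg"> tag
    sample_src = "http://img-host.example.net/comic/chapter700/1.jpg-zymk.middle.webp"
    page_count = 3

    prefix, suffix = sample_src.split(".jpg")    # "...chapter700/1" and "-zymk.middle.webp"
    prefix = prefix[:-1].split("//")[1]          # drop the trailing page number and the scheme
    parts = prefix.split("/")
    parts[0] = "mhpic.xiaomingtaiji.net"         # swap in the host that serves the files
    prefix = "/".join(parts)

    for i in range(1, page_count + 1):
        print("http://" + prefix + "%d.jpg" % i + suffix)
    # http://mhpic.xiaomingtaiji.net/comic/chapter700/1.jpg-zymk.middle.webp
    # http://mhpic.xiaomingtaiji.net/comic/chapter700/2.jpg-zymk.middle.webp
    # http://mhpic.xiaomingtaiji.net/comic/chapter700/3.jpg-zymk.middle.webp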

    (2) Item fields: items.py

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class ZymkproItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        img_url = scrapy.Field()
        img_number = scrapy.Field()
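
    For reference, a scrapy.Item subclass behaves like a dict; that is how zymk.py fills the fields and how the pipeline reads them back. The values below are made-up samples, only to show the access pattern:

    from zymkPro.items import ZymkproItem

    # Made-up sample values
    item = ZymkproItem(title="700话", img_url="http://example.com/1.jpg", img_number=1)
    print(item["title"], item["img_url"], item["img_number"])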

    (3) Middleware (middlewares.py): random User-Agent

    # Define here the models for your spider middleware
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    from scrapy import signals
    
    # useful for handling different item types with a single interface
    from itemadapter import is_item, ItemAdapter
    from fake_useragent import UserAgent
    
    
    
    class ZymkproSpiderMiddleware:
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the spider middleware does not modify the
        # passed objects.
    
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s
    
        def process_spider_input(self, response, spider):
            # Called for each response that goes through the spider
            # middleware and into the spider.
    
            # Should return None or raise an exception.
            return None
    
        def process_spider_output(self, response, result, spider):
            # Called with the results returned from the Spider, after
            # it has processed the response.
    
            # Must return an iterable of Request, or item objects.
            for i in result:
                yield i
    
        def process_spider_exception(self, response, exception, spider):
            # Called when a spider or process_spider_input() method
            # (from other spider middleware) raises an exception.
    
            # Should return either None or an iterable of Request or item objects.
            pass
    
        def process_start_requests(self, start_requests, spider):
            # Called with the start requests of the spider, and works
            # similarly to the process_spider_output() method, except
            # that it doesn’t have a response associated.
    
            # Must return only requests (not items).
            for r in start_requests:
                yield r
    
        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)
    
    
    class ZymkproDownloaderMiddleware:
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.
    
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s
    
        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.
    
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None
    
        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.
    
            # Must either;
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            return response
    
        def process_exception(self, request, exception, spider):
            # Called when a download handler or a process_request()
            # (from other downloader middleware) raises an exception.
    
            # Must either:
            # - return None: continue processing this exception
            # - return a Response object: stops process_exception() chain
            # - return a Request object: stops process_exception() chain
            pass
    
        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)
    
    
    class MyUseragent(object):
        def __init__(self):
            self.ua = UserAgent()

        def process_request(self, request, spider):
            # Use the request's own URL as Referer and pick a random User-Agent
            referer = request.url
            if referer:
                request.headers["referer"] = referer
            request.headers.setdefault("User-Agent", self.ua.random)
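
    A quick way to sanity-check MyUseragent outside a running crawl is to call process_request on a bare Request and inspect the headers. This is my own sketch, assuming fake_useragent is installed and can load its UA data:

    from scrapy.http import Request
    from zymkPro.middlewares import MyUseragent

    mw = MyUseragent()
    req = Request("http://www.zymk.cn/2/")
    mw.process_request(req, spider=None)

    print(req.headers.get("User-Agent"))   # a random UA picked by fake_useragent
    print(req.headers.get("referer"))      # b'http://www.zymk.cn/2/'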

    (4) Persistence: pipelines.py

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    # useful for handling different item types with a single interface
    from itemadapter import ItemAdapter
    import requests,os
    from fake_useragent import UserAgent
    
    class ZymkproPipeline:
        def process_item(self, item, spider):
            # One directory per chapter under an "upload" folder next to this file
            file_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'upload', item['title'])
            ua = UserAgent().random

            if not os.path.isdir(file_path):
                os.makedirs(file_path)

            # Random UA plus a Referer, presumably to get past the image host's
            # hotlink protection (see the open question at the end of the post)
            header = {
                'User-Agent': ua,
                'Referer': item['img_url']
            }

            response = requests.get(item['img_url'], stream=False, headers=header)
            with open(os.path.join(file_path, "%d.jpg" % item['img_number']), "wb") as fp:
                fp.write(response.content)
    
            return item
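
    One possible refinement, not in the original pipeline: if the image host answers with an error (for example 403 from hotlink protection), response.content is an HTML error page and would still be written out as a .jpg. A small hypothetical helper that checks the status code first might look like this:

    # Hypothetical helper (not in the project): only write the file when the
    # image request actually succeeded
    import os
    import requests

    def save_image(img_url, file_path, img_number, headers):
        response = requests.get(img_url, headers=headers, timeout=30)
        if response.status_code != 200:      # e.g. 403 when the Referer is rejected
            return False
        with open(os.path.join(file_path, "%d.jpg" % img_number), "wb") as fp:
            fp.write(response.content)
        return True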

    (5) Configuration: settings.py

    # Scrapy settings for zymkPro project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'zymkPro'
    
    SPIDER_MODULES = ['zymkPro.spiders']
    NEWSPIDER_MODULE = 'zymkPro.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'zymkPro (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'zh-CN,zh;q=0.9',
      'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    }
    
    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'zymkPro.middlewares.ZymkproSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    # The project's downloader middlewares are registered at the bottom of this file,
    # in the same dict as the scrapy-splash middlewares (a second DOWNLOADER_MIDDLEWARES
    # assignment would override this one).
    
    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'zymkPro.pipelines.ZymkproPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    SPLASH_URL = 'http://192.168.58.139:8050'
    
    # Project and scrapy-splash downloader middlewares, merged into one dict
    DOWNLOADER_MIDDLEWARES = {
        'zymkPro.middlewares.ZymkproDownloaderMiddleware': 543,
        'zymkPro.middlewares.MyUseragent': 544,
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

    4: Results

    5: Project link

    https://github.com/viewmountain/Scrapy-Instance

    6: Open question---how is the image hotlink protection handled?? It has been too long and I have forgotten

    The situation here is that the image URLs extracted directly with Scrapy differ from the URLs that actually serve the images, which is why the crawl only succeeded after we rewrote the host above. Is that an anti-scraping measure??

    Is the site dynamically rewriting the image links we extract??
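
    One way to investigate this (my own sketch, not something from the original post) is to probe for Referer-based hotlink protection: request the same image URL twice, once without and once with a Referer, and compare the status codes. The image URL below is a placeholder.

    import requests

    img_url = "http://mhpic.xiaomingtaiji.net/comic/chapter700/1.jpg"   # placeholder image URL
    ua = {"User-Agent": "Mozilla/5.0"}

    no_ref = requests.get(img_url, headers=ua)
    with_ref = requests.get(img_url, headers={**ua, "Referer": "https://www.zymk.cn/"})

    # If the host checks the Referer, the first request typically gets 403
    # while the second returns 200 and real image bytes
    print(no_ref.status_code, with_ref.status_code)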
