zoukankan      html  css  js  c++  java
  • 爬虫框架之Scrapy(四 ImagePipeline)

    ImagePipeline

    使用scrapy框架我们除了要下载文本,还有可能需要下载图片,scrapy提供了ImagePipeline来进行图片的下载。

    ImagePipeline还支持以下特别的功能:

    1 生成缩略图:通过配置IMAGES_THUMBS = {'size_name': (width_size,heigh_size),}

    2 过滤小图片:通过配置IMAGES_MIN_HEIGHTIMAGES_MIN_WIDTH来过滤过小的图片。

    具体其他功能可以看下参考官网手册:https://docs.scrapy.org/en/latest/topics/media-pipeline.html.

    ImagePipelines的工作流程

    1 在spider中爬取需要下载的图片链接,将其放入item中的image_urls.

    2 spider将其传送到pipieline

    3 当ImagePipeline处理时,它会检测是否有image_urls字段,如果有的话,会将url传递给scrapy调度器和下载器

    4 下载完成后会将结果写入item的另一个字段images,images包含了图片的本地路径,图片校验,和图片的url。

    示例 爬取巴士lol的英雄美图

    只爬第一页

    http://lol.tgbus.com/tu/yxmt/

    第一步:items.py

    import scrapy
    
    
    class Happy4Item(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        image_urls = scrapy.Field()
        images = scrapy.Field()

     爬虫文件lol.py

    # -*- coding: utf-8 -*-
    import scrapy
    from happy4.items import Happy4Item
    
    class LolSpider(scrapy.Spider):
        name = 'lol'
        allowed_domains = ['lol.tgbus.com']
        start_urls = ['http://lol.tgbus.com/tu/yxmt/']
    
        def parse(self, response):
            li_list = response.xpath('//div[@class="list cf mb30"]/ul//li')
            for one_li in li_list:
                item = Happy4Item()
                item['image_urls'] =one_li.xpath('./a/img/@src').extract()
                yield item

     最后 settings.py

    BOT_NAME = 'happy4'
    
    SPIDER_MODULES = ['happy4.spiders']
    NEWSPIDER_MODULE = 'happy4.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    ITEM_PIPELINES = {
       'scrapy.pipelines.images.ImagesPipeline': 1,
    }
    IMAGES_STORE = 'images'

     不需要操作管道文件,就可以爬去图片到本地

    极大的减少了代码量.

    注意:因为图片管道会尝试将所有图片都转换成JPG格式的,你看源代码的话也会发现图片管道中文件名类型直接写死为JPG的。所以如果想要保存原始类型的图片,就应该使用文件管道。

    示例 爬取mm131美女图片

    http://www.mm131.com/xinggan/

    要求爬取的就是这个网站

    这个网站是有反爬的,当你直接去下载一个图片的时候,你会发现url被重新指向了别处,或者有可能是直接报错302,这是因为它使用了referer这个请求头里的字段,当你打开一个图片的url的时候,你的请求头里必须有referer,不然就会被识别为爬虫文件,禁止你的爬取,那么如何解决呢?

    手动在爬取每个图片的时候添加referer字段。

    xingan.py

    # -*- coding: utf-8 -*-
    import scrapy
    from happy5.items import Happy5Item
    import re
    
    class XinganSpider(scrapy.Spider):
        name = 'xingan'
        allowed_domains = ['www.mm131.com']
        start_urls = ['http://www.mm131.com/xinggan/']
    
        def parse(self, response):
            every_html = response.xpath('//div[@class="main"]/dl//dd')
            for one_html in every_html[0:-1]:
                item = Happy5Item()
                # 每个图片的链接
                link = one_html.xpath('./a/@href').extract_first()
                # 每个图片的名字
                title = one_html.xpath('./a/img/@alt').extract_first()
                item['title'] = title
                # 进入到每个标题里面
                request = scrapy.Request(url=link, callback=self.parse_one, meta={'item':item})
                yield request
    
        # 每个人下面的图集
        def parse_one(self, response):
            item = response.meta['item']
            # 找到总页数
            total_page = response.xpath('//div[@class="content-page"]/span[@class="page-ch"]/text()').extract_first()
            num = int(re.findall('(d+)', total_page)[0])
            # 找到当前页数
            now_num = response.xpath('//div[@class="content-page"]/span[@class="page_now"]/text()').extract_first()
            now_num = int(now_num)
            # 当前页图片的url
            every_pic = response.xpath('//div[@class="content-pic"]/a/img/@src').extract()
            # 当前页的图片url
            item['image_urls'] = every_pic
            # 当前图片的refer
            item['referer'] = response.url
            yield item
    
            # 如果当前数小于总页数
            if now_num < num:
                if now_num == 1:
                    url1 = response.url[0:-5] + '_%d'%(now_num+1) + '.html'
                elif now_num > 1:
                    url1 = re.sub('_(d+)', '_' + str(now_num+1), response.url)
                headers = {
                    'referer':self.start_urls[0]
                }
                # 给下一页发送请求
                yield scrapy.Request(url=url1, headers=headers, callback=self.parse_one, meta={'item':item})

    items.py

    import scrapy
    
    
    class Happy5Item(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        image_urls = scrapy.Field()
        images = scrapy.Field()
        title = scrapy.Field()
        referer = scrapy.Field()

    pipelines.py

    from scrapy.pipelines.images import ImagesPipeline
    from scrapy.exceptions import DropItem
    from scrapy.http import Request
    
    class Happy5Pipeline(object):
        def process_item(self, item, spider):
            return item
    
    class QiushiImagePipeline(ImagesPipeline):
    
        # 下载图片时加入referer请求头
        def get_media_requests(self, item, info):
            for image_url in item['image_urls']:
                headers = {'referer':item['referer']}
                yield Request(image_url, meta={'item': item}, headers=headers)
                # 这里把item传过去,因为后面需要用item里面的书名和章节作为文件名
    
        # 获取图片的下载结果, 控制台查看
        def item_completed(self, results, item, info):
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            return item
    
        # 修改文件的命名和路径
        def file_path(self, request, response=None, info=None):
            item = request.meta['item']
            image_guid = request.url.split('/')[-1]
            filename = './{}/{}'.format(item['title'], image_guid)
            return filename

    settings.py

    BOT_NAME = 'happy5'
    
    SPIDER_MODULES = ['happy5.spiders']
    NEWSPIDER_MODULE = 'happy5.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    ITEM_PIPELINES = {
       # 'scrapy.pipelines.images.ImagesPipeline': 1,
       'happy5.pipelines.QiushiImagePipeline': 2,
    }
    IMAGES_STORE = 'images'

    得到图片文件:

    这种图还是要少看。

  • 相关阅读:
    github使用
    部署flask
    docker部署路飞学城
    centos7安装dnsmasq局域网dns
    消息队列rabbitmq
    记录腾讯云中矿机病毒处理过程(重装系统了fu*k)
    Golang基础
    git协同开发
    gitlab与pycharm结合
    github与gitlab与git三个基佬的故事
  • 原文地址:https://www.cnblogs.com/xiaozx/p/10776694.html
Copyright © 2011-2022 走看看