  • Scrapy in practice, part 3: capturing a mobile app with Fiddler and downloading images with a spider (overriding ImagesPipeline)

    For an introduction to using Fiddler, see (http://jingyan.baidu.com/article/03b2f78c7b6bb05ea237aed2.html)

    This example crawls the Douyu app.

    First, use Fiddler to capture the app's traffic and inspect the JSON responses [screenshot omitted].

    Analysis of successive requests shows that only the offset parameter changes between pages, which pins down the item fields; with that settled, we can start writing code.
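
    The spider below reads only a couple of fields, so each captured response has roughly this shape (a reconstruction from those fields; the real payload carries many more keys, and all values here are made up):

        # illustrative shape of one getVerticalRoom response
        {
            "data": [
                {
                    "nickname": "...",       # streamer nickname
                    "vertical_src": "..."    # URL of the vertical room cover
                    # ...other fields omitted
                }
                # ...up to 20 entries per page (limit=20)
            ]
        }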

    items.py

        import scrapy


        class DouyuItem(scrapy.Item):
            # define the fields for your item here like:
            # name = scrapy.Field()
            # nickname of the streamer, used to name the photo
            nickname = scrapy.Field()
            # URL of the photo
            imagelink = scrapy.Field()
            # local path where the photo is saved
            imagepath = scrapy.Field()
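    These declared fields are the only keys the item accepts; otherwise a DouyuItem behaves like a dict. A quick illustration (values made up):

        item = DouyuItem()
        item["nickname"] = "some_streamer"                  # fine: declared field
        item["imagelink"] = "http://example.com/cover.jpg"  # fine: declared field
        item["foo"] = 1                                     # raises KeyError: not a declared field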

    spiders/Douyu.py

        import scrapy
        import json
        from douyu.items import DouyuItem

        class DouyuSpider(scrapy.Spider):
            name = "Douyu"
            allowed_domains = ["capi.douyucdn.cn"]
            offset = 0
            url = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
            start_urls = [url + str(offset)]

            def parse(self, response):
                # parse the JSON body into Python objects and take the "data" list;
                # response.text is the decoded response body
                data = json.loads(response.text)["data"]
                if not data:
                    # an empty page means we have paged past the last room
                    return
                for each in data:
                    item = DouyuItem()
                    item["nickname"] = each["nickname"]
                    item["imagelink"] = each["vertical_src"]
                    yield item
                # each request returns limit=20 records, so step the offset by 20
                self.offset += 20
                yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
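    Before running the full spider, the endpoint can be poked at interactively with scrapy shell (the API may only respond properly when the app User-Agent configured in settings.py below is sent, so mileage can vary):

        scrapy shell "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset=0"
        >>> import json
        >>> json.loads(response.text)["data"][0]["vertical_src"]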

    pipelines.py

        import os
        import scrapy
        from scrapy.pipelines.images import ImagesPipeline
        from scrapy.utils.project import get_project_settings

        class DouyuPipeline(object):
            def process_item(self, item, spider):
                return item

        class ImagesPipelines(ImagesPipeline):

            IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

            def get_media_requests(self, item, info):
                # get_media_requests yields one Request per image link. The downloads
                # feed into item_completed as `results`, a list of
                # (success, image_info_or_failure) tuples; when success is True, the
                # second element is a dict with the keys url, path and checksum.
                image_url = item["imagelink"]
                yield scrapy.Request(image_url)

            def item_completed(self, results, item, info):
                # standard idiom (see the ImagesPipeline source): collect the paths
                # of the successfully downloaded images, then rename the file after
                # the streamer's nickname
                image_path = [x["path"] for ok, x in results if ok]
                print(image_path)
                os.rename(self.IMAGES_STORE + '/' + image_path[0],
                          self.IMAGES_STORE + "/" + item["nickname"] + ".jpg")
                item["imagepath"] = self.IMAGES_STORE + "/" + item["nickname"] + ".jpg"
                return item
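    As an aside, the post-download os.rename can be avoided by overriding file_path, which ImagesPipeline consults for every download's storage path. A minimal sketch, assuming a recent Scrapy version (the RenameImagesPipeline name is ours, not part of the original project):

        import scrapy
        from scrapy.pipelines.images import ImagesPipeline

        class RenameImagesPipeline(ImagesPipeline):
            def get_media_requests(self, item, info):
                # carry the nickname on the request so file_path can read it
                yield scrapy.Request(item["imagelink"],
                                     meta={"nickname": item["nickname"]})

            def file_path(self, request, response=None, info=None, *, item=None):
                # store each image directly as <nickname>.jpg under IMAGES_STORE
                return "%s.jpg" % request.meta["nickname"]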

    settings.py

        # -*- coding: utf-8 -*-

        # Scrapy settings for douyu project
        #
        # For simplicity, this file contains only settings considered important or
        # commonly used. You can find more settings consulting the documentation:
        #
        #     http://doc.scrapy.org/en/latest/topics/settings.html
        #     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
        #     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
        import os
        BOT_NAME = 'douyu'

        SPIDER_MODULES = ['douyu.spiders']
        NEWSPIDER_MODULE = 'douyu.spiders'


        # Crawl responsibly by identifying yourself (and your website) on the user-agent
        #USER_AGENT = 'douyu (+http://www.yourdomain.com)'

        # Obey robots.txt rules
        ROBOTSTXT_OBEY = False

        # Configure maximum concurrent requests performed by Scrapy (default: 16)
        #CONCURRENT_REQUESTS = 32

        # Configure a delay for requests for the same website (default: 0)
        # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
        # See also autothrottle settings and docs
        #DOWNLOAD_DELAY = 3
        # The download delay setting will honor only one of:
        #CONCURRENT_REQUESTS_PER_DOMAIN = 16
        #CONCURRENT_REQUESTS_PER_IP = 16

        # Disable cookies (enabled by default)
        #COOKIES_ENABLED = False

        # Disable Telnet Console (enabled by default)
        #TELNETCONSOLE_ENABLED = False

        # Override the default request headers. Note the key must be the HTTP
        # header name 'User-Agent' (not the USER_AGENT setting name) for the
        # request to pose as the Douyu iOS app:
        DEFAULT_REQUEST_HEADERS = {
            'User-Agent': 'DYZB/2.290 (iPhone; iOS 9.3.4; Scale/2.00)',
        #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        #   'Accept-Language': 'en',
        }
        # Enable or disable spider middlewares
        # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
        #SPIDER_MIDDLEWARES = {
        #    'douyu.middlewares.DouyuSpiderMiddleware': 543,
        #}

        # Enable or disable downloader middlewares
        # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
        #DOWNLOADER_MIDDLEWARES = {
        #    'douyu.middlewares.MyCustomDownloaderMiddleware': 543,
        #}

        # Enable or disable extensions
        # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
        #EXTENSIONS = {
        #    'scrapy.extensions.telnet.TelnetConsole': None,
        #}

        # Configure item pipelines
        # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
        ITEM_PIPELINES = {
            # 'scrapy.pipelines.images.ImagesPipeline': 1,
             'douyu.pipelines.ImagesPipelines': 300,
        }
        # where downloaded images are stored; pipelines.py reads this back later
        project_dir = os.path.abspath(os.path.dirname(__file__))
        IMAGES_STORE = os.path.join(project_dir, 'images')  # the folder name 'images' is arbitrary
        # Enable and configure the AutoThrottle extension (disabled by default)
        # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
        #AUTOTHROTTLE_ENABLED = True
        # The initial download delay
        #AUTOTHROTTLE_START_DELAY = 5
        # The maximum download delay to be set in case of high latencies
        #AUTOTHROTTLE_MAX_DELAY = 60
        # The average number of requests Scrapy should be sending in parallel to
        # each remote server
        #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
        # Enable showing throttling stats for every response received:
        #AUTOTHROTTLE_DEBUG = False

        # Enable and configure HTTP caching (disabled by default)
        # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
        #HTTPCACHE_ENABLED = True
        #HTTPCACHE_EXPIRATION_SECS = 0
        #HTTPCACHE_DIR = 'httpcache'
        #HTTPCACHE_IGNORE_HTTP_CODES = []
        #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
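    With everything wired up, start the crawl from the directory containing scrapy.cfg; the covers land under douyu/images/ as <nickname>.jpg (ImagesPipeline first writes each file under a full/ subfolder, which our item_completed then renames):

        scrapy crawl Douyu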

    The scraped data: [screenshot omitted]
