  • Part 20: Using Pipelines for Storage in the Scrapy Framework

    In the previous two parts we crawled 360 Images (image.so.com), but we still need to download the pictures themselves. How should they be downloaded and stored?

    This part covers three cases: 1. storing the scraped data in a MongoDB database; 2. storing it in a MySQL database; 3. downloading the images to a local folder.

    Without further ado, here is the code:

    1. Define the fields to store via the item

    # items.py
    import scrapy

    class Bole_mode(scrapy.Item):
        collection = "images"     # MongoDB collection name
        table = "images"          # MySQL table name
        id    = scrapy.Field()    # image id
        url   = scrapy.Field()    # image URL
        title = scrapy.Field()    # title
        thumb = scrapy.Field()    # thumbnail URL
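
    Note that collection and table are plain class attributes rather than scrapy.Field() entries, so the pipelines shown later can read them as item.collection / item.table while they stay out of dict(item). A quick standalone sketch of my own (the field values are made up purely for illustration) to show that behaviour:

    from bole.items import Bole_mode

    item = Bole_mode()
    item["id"] = "0001"                                  # made-up values, illustration only
    item["url"] = "https://example.com/full/0001.jpg"
    item["title"] = "demo"
    item["thumb"] = "https://example.com/thumb/0001.jpg"

    print(item.collection)   # -> "images" (class attribute, used by MongoPipeline)
    print(item.table)        # -> "images" (class attribute, used by MysqlPipeline)
    print(dict(item))        # -> only the Field entries: id, url, title, thumb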

    2. Configure the settings file with the database information

    # -*- coding: utf-8 -*-

    # Scrapy settings for bole project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'bole'

    SPIDER_MODULES = ['BLZX.spiders']
    NEWSPIDER_MODULE = 'BLZX.spiders'


    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'bole (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}

    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'bole.middlewares.BoleSpiderMiddleware': 543,
    #}

    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

    # DOWNLOADER_MIDDLEWARES = {
    #    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware':None,
    #    'bole.middlewares.ProxyMiddleware':125,
    #    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware':None
    # }

    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}

    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        "bole.pipelines.BoleImagePipeline": 2,
        "bole.pipelines.ImagePipeline": 300,
        "bole.pipelines.MongoPipeline": 301,
        "bole.pipelines.MysqlPipeline": 302,
    }

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


    # Maximum number of pages to crawl
    MAX_PAGE = 50

    # MongoDB configuration
    MONGODB_URL = "localhost"
    MONGODB_DB = "Images360"

    # MySQL configuration
    MYSQL_HOST = "localhost"
    MYSQL_DATABASE = "images360"
    MYSQL_PORT = 3306
    MYSQL_USER = "root"
    MYSQL_PASSWORD = "123456"

    # Local storage configuration
    IMAGES_STORE = r"D:\spider\bole\image"
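
    Before starting the crawl it is worth checking that the MongoDB and MySQL services configured above are actually reachable. A small standalone sketch of my own (not part of the project; it simply reuses the values from settings.py and assumes both services run locally, with the images360 database already created):

    import pymongo
    import pymysql

    # MongoDB: connect and list the collections of the Images360 database
    client = pymongo.MongoClient("localhost")
    print(client["Images360"].list_collection_names())

    # MySQL: connect to images360 and print the server version
    db = pymysql.connect(host="localhost", user="root", password="123456",
                         database="images360", port=3306, charset="utf8")
    with db.cursor() as cursor:
        cursor.execute("SELECT VERSION()")
        print(cursor.fetchone())
    db.close()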

    3. The middlewares are not modified here

    # -*- coding: utf-8 -*-

    # Define here the models for your spider middleware
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/spider-middleware.html

    from scrapy import signals


    class BoleSpiderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the spider middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        def process_spider_input(self, response, spider):
            # Called for each response that goes through the spider
            # middleware and into the spider.

            # Should return None or raise an exception.
            return None

        def process_spider_output(self, response, result, spider):
            # Called with the results returned from the Spider, after
            # it has processed the response.

            # Must return an iterable of Request, dict or Item objects.
            for i in result:
                yield i

        def process_spider_exception(self, response, exception, spider):
            # Called when a spider or process_spider_input() method
            # (from other spider middleware) raises an exception.

            # Should return either None or an iterable of Response, dict
            # or Item objects.
            pass

        def process_start_requests(self, start_requests, spider):
            # Called with the start requests of the spider, and works
            # similarly to the process_spider_output() method, except
            # that it doesn't have a response associated.

            # Must return only requests (not items).
            for r in start_requests:
                yield r

        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)


    class BoleDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.

            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None

        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.

            # Must either:
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            return response

        def process_exception(self, request, exception, spider):
            # Called when a download handler or a process_request()
            # (from other downloader middleware) raises an exception.

            # Must either:
            # - return None: continue processing this exception
            # - return a Response object: stops process_exception() chain
            # - return a Request object: stops process_exception() chain
            pass

        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)

    4. Store the scraped data with pipelines: MongoDB storage, MySQL storage, and saving the images to a local folder

    # -*- coding: utf-8 -*-
    # ========================== MongoDB ===========================
    import pymongo

    class MongoPipeline(object):
        def __init__(self, mongodb_url, mongodb_DB):
            self.mongodb_url = mongodb_url
            self.mongodb_DB = mongodb_DB

        # Read MONGODB_URL and MONGODB_DB from the settings file
        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongodb_url=crawler.settings.get("MONGODB_URL"),
                mongodb_DB=crawler.settings.get("MONGODB_DB")
            )

        # Connect to MongoDB when the spider is opened
        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongodb_url)
            self.db = self.client[self.mongodb_DB]

        def process_item(self, item, spider):
            table_name = item.collection
            self.db[table_name].insert_one(dict(item))
            return item

        # Close the MongoDB connection when the spider is closed
        def close_spider(self, spider):
            self.client.close()


    # ============================ MySQL ===========================
    import pymysql

    class MysqlPipeline():
        def __init__(self, host, database, port, user, password):
            self.host = host
            self.database = database
            self.port = port
            self.user = user
            self.password = password

        # Read the MySQL parameters from the settings file
        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                host=crawler.settings.get("MYSQL_HOST"),
                database=crawler.settings.get("MYSQL_DATABASE"),
                port=crawler.settings.get("MYSQL_PORT"),
                user=crawler.settings.get("MYSQL_USER"),
                password=crawler.settings.get("MYSQL_PASSWORD")
            )

        # Connect to MySQL when the spider is opened
        def open_spider(self, spider):
            self.db = pymysql.connect(host=self.host, database=self.database, user=self.user,
                                      password=self.password, port=self.port, charset="utf8")
            self.cursor = self.db.cursor()

        def process_item(self, item, spider):
            data = dict(item)
            keys = ",".join(data.keys())               # column names
            values = ",".join(["%s"] * len(data))      # value placeholders
            sql = "insert into %s(%s) values(%s)" % (item.table, keys, values)
            self.cursor.execute(sql, tuple(data.values()))
            self.db.commit()
            return item

        # Close the MySQL connection when the spider is closed
        def close_spider(self, spider):
            self.db.close()


    # ============================ Local files ===========================
    import scrapy
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline

    class ImagePipeline(ImagesPipeline):

        # Because the url field of the item is a single string rather than a list,
        # the following methods are overridden
        def file_path(self, request, response=None, info=None):
            url = request.url
            file_name = url.split("/")[-1]    # use the last part of the URL as the file name
            return file_name

        # results holds the download results for this item's images; it is a list of
        # (success, info) tuples covering both successful and failed downloads
        def item_completed(self, results, item, info):
            # collect the paths of the successfully downloaded images
            image_paths = [x["path"] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Image download failed!")
            return item

        def get_media_requests(self, item, info):
            # take the url field from the item and queue it for download
            yield scrapy.Request(item["url"])
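
    One thing the post does not show: MysqlPipeline assumes that the images360 database and an images table with columns matching the item fields already exist. A one-off sketch of my own to create them with pymysql (the column types are my assumption, not taken from the original project):

    import pymysql

    # Connect without selecting a database so that the database itself can be created
    db = pymysql.connect(host="localhost", user="root", password="123456",
                         port=3306, charset="utf8")
    cursor = db.cursor()
    cursor.execute("CREATE DATABASE IF NOT EXISTS images360 DEFAULT CHARACTER SET utf8")
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS images360.images (
            id    VARCHAR(64)  NOT NULL PRIMARY KEY,   -- assumed types; adjust as needed
            url   VARCHAR(512) NOT NULL,
            title VARCHAR(256),
            thumb VARCHAR(512)
        ) DEFAULT CHARSET=utf8
    """)
    db.commit()
    cursor.close()
    db.close()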

    5. Finally, the spider that does the actual crawling

    import scrapy
    import json
    import sys

    sys.path.append(r'D:\spider\bole')    # add the project root so the bole package is importable
    from bole.items import Bole_mode

    class BoleSpider(scrapy.Spider):
        name = 'boleSpider'

        def start_requests(self):
            url = "https://image.so.com/zj?ch=photography&sn={}&listtype=new&temp=1"
            page = self.settings.get("MAX_PAGE")
            for i in range(int(page) + 1):
                yield scrapy.Request(url=url.format(i * 30))

        def parse(self, response):
            photo_list = json.loads(response.text)
            for image in photo_list.get("list"):
                item = Bole_mode()    # create a fresh item per image so yielded items are not overwritten
                item["id"] = image["id"]
                item["url"] = image["qhimg_url"]
                item["title"] = image["group_title"]
                item["thumb"] = image["qhimg_thumb_url"]
                yield item
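
    With everything in place, the crawl can be started from the project root with scrapy crawl boleSpider. Alternatively, a small runner script of my own (not from the original post; it uses Scrapy's standard CrawlerProcess API) can start it from Python:

    # run.py -- place it next to scrapy.cfg and run it with `python run.py`
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())   # load the project's settings.py
    process.crawl("boleSpider")                        # the spider's `name` attribute
    process.start()                                    # blocks until the crawl is finished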

    6. And finally, a quick look at the results (only MySQL and local storage are shown; MongoDB was not opened)

    (1) MySQL storage

    (2) Local storage
