  • Scraping images from qiumeimei.com with the Scrapy framework

    1. Create the project

      scrapy startproject qiumeimei
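      This produces the standard Scrapy project skeleton, roughly as follows (exact files may vary slightly by Scrapy version):

    qiumeimei/
        scrapy.cfg
        qiumeimei/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py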

    2. Generate the spider file qiumei.py

      cd qiumeimei

      scrapy genspider qiumei www.qiumeimei.com
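      The genspider command writes a minimal spider template, roughly like the sketch below (details depend on the Scrapy version); step 4 replaces its body with the actual parsing logic:

    # -*- coding: utf-8 -*-
    import scrapy

    class QiumeiSpider(scrapy.Spider):
        name = 'qiumei'
        allowed_domains = ['www.qiumeimei.com']
        start_urls = ['http://www.qiumeimei.com/']

        def parse(self, response):
            pass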

    3. Since only the images need to be downloaded, first define a single field in items.py:

      

    import scrapy
    
    class QiumeimeiItem(scrapy.Item):
        # define the fields for your item here like:
        img_path = scrapy.Field()  # URL of the image to download
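      A QiumeimeiItem behaves like a dict with a fixed set of keys, so the spider in step 4 can simply assign to item['img_path']. For illustration (the URL here is just a placeholder):

    >>> from qiumeimei.items import QiumeimeiItem
    >>> item = QiumeimeiItem()
    >>> item['img_path'] = 'http://www.qiumeimei.com/placeholder.jpg'
    >>> dict(item)
    {'img_path': 'http://www.qiumeimei.com/placeholder.jpg'}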
    

    4. Write the spider file qiumei.py

      

    # -*- coding: utf-8 -*-
    import scrapy
    
    from qiumeimei.items import QiumeimeiItem
    
    
    class QiumeiSpider(scrapy.Spider):
        name = 'qiumei'
        # allowed_domains = ['www.qiumeimei.com']
        start_urls = ['http://www.qiumeimei.com/image']
    
        def parse(self, response):
            # The images are lazy-loaded, so the real URL sits in data-lazy-src.
            img_urls = response.css('.main>p>img::attr(data-lazy-src)').extract()
            for url in img_urls:
                item = QiumeimeiItem()
                item['img_path'] = url
                yield item
    
            # Follow the "next page" link until there is none.
            next_url = response.css('.pagination a.next::attr(href)').extract_first()
            if next_url:
                # urljoin handles both relative and absolute next-page links.
                yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)
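      Before running the full crawl, the CSS selectors can be checked interactively with scrapy shell (a quick sanity check; the selectors assume the page layout used above):

    $ scrapy shell http://www.qiumeimei.com/image
    >>> response.css('.main>p>img::attr(data-lazy-src)').extract_first()
    >>> response.css('.pagination a.next::attr(href)').extract_first()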
    

    5. Pipeline file pipelines.py. All images are saved into a single folder; the folder path is defined in settings.py, see step 6 below:

    import os
    from datetime import datetime
    
    import scrapy
    from scrapy.pipelines.images import ImagesPipeline
    from qiumeimei.settings import IMAGES_STORE as images_store
    
    
    class QiumeimeiPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # Ask the images pipeline to download each image URL.
            yield scrapy.Request(url=item['img_path'])
    
        def item_completed(self, results, item, info):
            # results is a list of (success, info_dict) tuples; 'path' is the
            # location Scrapy saved the image to, relative to IMAGES_STORE.
            old_name_list = [x['path'] for ok, x in results if ok]
            old_name = images_store + old_name_list[0]
    
            # Rename the file to a timestamp, keeping the original extension.
            img_type = item['img_path'].split('.')[-1]
            img_name = datetime.now().strftime('%Y%m%d%H%M%S%f')
            path = images_store + img_name + '.' + img_type
            print(path + ' downloaded...')
            os.rename(old_name, path)
    
            return item
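      As an alternative to renaming files after download, ImagesPipeline also allows the file name to be chosen up front by overriding file_path(). A minimal sketch (the item keyword argument assumes Scrapy 2.4 or newer):

    from datetime import datetime
    
    import scrapy
    from scrapy.pipelines.images import ImagesPipeline
    
    
    class QiumeimeiPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            yield scrapy.Request(url=item['img_path'])
    
        def file_path(self, request, response=None, info=None, *, item=None):
            # Name each image by timestamp, keeping the original extension;
            # the returned path is relative to IMAGES_STORE.
            ext = request.url.split('.')[-1]
            return datetime.now().strftime('%Y%m%d%H%M%S%f') + '.' + ext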
    

    6. Settings file settings.py

    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Folder for downloaded images; created automatically
    IMAGES_STORE = './images/'
    
    # Enable the custom image pipeline
    ITEM_PIPELINES = {
       'qiumeimei.pipelines.QiumeimeiPipeline': 300,
    }
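    Note that Scrapy's ImagesPipeline (which the pipeline in step 5 subclasses) requires the Pillow library for image handling, so make sure it is installed:

      pip install Pillow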
    

      Running the spider completed successfully and the images were downloaded.
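      For reference, the crawl is started from the project root with the standard Scrapy command:

        scrapy crawl qiumei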

  • Original article: https://www.cnblogs.com/wshr210/p/11359977.html