  • Python Crawler: Scraping Pictures of Pretty Girls (Scrapy Edition)

     

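    Notes:

    Earlier I scraped these pictures with the requests module; today I implemented the same crawler with the Scrapy framework.

    (The pictures are admittedly a bit racy, but this old monk has long since renounced worldly desires, so consider it art appreciation, hahaha.)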

    Item:

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    import scrapy
    
    class GirlpicItem(scrapy.Item):
        title = scrapy.Field()
        image = scrapy.Field()
        index = scrapy.Field()

    Spider:

    # -*- coding: utf-8 -*-
    import scrapy

    from girlpic.items import GirlpicItem


    class GirlpicSpider(scrapy.Spider):
        name = 'girlpic'
        allowed_domains = []  # allowed domains (empty list: no restriction)
        start_urls = ["http://www.mzitu.com/all/"]

        def parse(self, response):
            # Each link on the archive page leads to one picture group.
            groups = response.xpath("//div[@class='main-content']//ul[@class='archives']//a")
            for count, group in enumerate(groups, 1):
                if count > 5:
                    # Careful: stop with return here, not by exiting the process (e.g. sys.exit(0))
                    return
                groupUrl = group.xpath('@href').extract()[0]
                title = group.xpath("text()").extract()[0]
                yield scrapy.Request(url=groupUrl, callback=self.getGroup,
                                     meta={'title': title, 'groupUrl': groupUrl}, dont_filter=True)

        def getGroup(self, response):
            # The code takes the second-to-last pager <span> as the highest page number.
            maxIndex = response.xpath("//div[@class='pagenavi']//span/text()").extract()[-2]
            for index in range(1, int(maxIndex) + 1):
                pageUrl = response.meta['groupUrl'] + '/' + str(index)
                # Copy meta so every request carries its own page index.
                meta = dict(response.meta, index=index)
                yield scrapy.Request(url=pageUrl, callback=self.getPage, meta=meta, dont_filter=True)

        def getPage(self, response):
            imageurl = response.xpath("//div[@class='main-image']//img/@src").extract()[0]  # image URL
            yield scrapy.Request(url=imageurl, callback=self.FormItem, meta=response.meta, dont_filter=True)

        def FormItem(self, response):
            # The response body is the raw image data; pack it into an item.
            title = response.meta['title']
            index = response.meta['index']
            image = response.body
            yield GirlpicItem(title=title, index=index, image=image)
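
    The pagination step in getGroup can be tried out without running a crawl. The following is only a minimal sketch: page_urls is a hypothetical helper, and both the example.com URL and the sample pager texts are made up for illustration. It assumes, as the spider does, that the second-to-last pager entry is the highest page number.

```python
def page_urls(group_url, nav_texts):
    """Build one URL per picture page of a group.

    nav_texts stands in for the <span> texts of the 'pagenavi' block.
    Assumption (same as the spider): the second-to-last entry is the
    highest page number.
    """
    max_index = int(nav_texts[-2])
    return [group_url + '/' + str(index) for index in range(1, max_index + 1)]


# Hypothetical sample data, not taken from the real site.
sample_nav = ['1', '2', '3', '...', '4', 'next']
urls = page_urls('http://example.com/group', sample_nav)
print(urls)
```

    If the pager layout on the real site differs from this assumption, the [-2] index is the first thing to adjust.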

    Pipeline:

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import os


    class GirlpicPipeline(object):

        def __init__(self):
            # Root directory for the downloaded pictures; created if missing.
            self.dirpath = u'D:学习资料'
            if not os.path.exists(self.dirpath):
                os.makedirs(self.dirpath)

        def process_item(self, item, spider):
            title = item['title']
            index = item['index']
            image = item['image']
            # One sub-directory per picture group, named after its title.
            groupdir = os.path.join(self.dirpath, title)
            if not os.path.exists(groupdir):
                os.makedirs(groupdir)
            # One file per page index, written as raw bytes.
            imagepath = os.path.join(groupdir, str(index) + u'.jpg')
            with open(imagepath, 'wb') as f:
                f.write(image)
            return item
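
    As the comment at the top of the pipeline file says, the pipeline only runs once it is listed in ITEM_PIPELINES. A sketch of the corresponding settings.py entry follows. The module path girlpic.pipelines is an assumption based on the default Scrapy project layout (the spider imports from girlpic.items), and 300 is just a conventional priority value, not something given in the original post.

```python
# settings.py (sketch; the module path assumes the default project layout)
ITEM_PIPELINES = {
    # Lower numbers run earlier; 300 is an arbitrary, commonly used value.
    'girlpic.pipelines.GirlpicPipeline': 300,
}
```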
  • Original article: https://www.cnblogs.com/DOLFAMINGO/p/9245530.html