    Scraping 360 Photography Images

    Create a new project:

    scrapy startproject images360

    Create a Spider:

    scrapy genspider image image.so.com
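
    The genspider command creates images360/spiders/image.py with a skeleton roughly like the one below (the exact template output may vary slightly between Scrapy versions):

    import scrapy


    class ImageSpider(scrapy.Spider):
        name = 'image'
        allowed_domains = ['image.so.com']
        start_urls = ['http://image.so.com/']

        def parse(self, response):
            pass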

    Construct the requests:

    We crawl 50 pages with 30 images per page. First define a MAX_PAGE setting in settings.py:

    MAX_PAGE = 50

    Define start_requests in the Spider:

        def start_requests(self):
            # query parameters: the photography channel, sorted by newest
            data = {'ch': 'photography', 'listtype': 'new'}
            base_url = 'https://image.so.com/zj?'
            for page in range(1, self.settings.get('MAX_PAGE') + 1):
                # 'sn' is the result offset; each page holds 30 images
                data['sn'] = page * 30
                params = urlencode(data)
                url = base_url + params
                yield Request(url, self.parse)
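
    The snippet above assumes a few imports at the top of the spider module: urlencode builds the query string and Request comes from Scrapy, while json and ImageItem are used by the parse method shown further below (the images360.items path assumes the default project layout):

    import json
    from urllib.parse import urlencode

    from scrapy import Request
    from images360.items import ImageItem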

    Set the ROBOTSTXT_OBEY variable in settings.py to False:

    ROBOTSTXT_OBEY = False

    Extract the information. First define the Item:

    from scrapy import Item, Field
    class ImageItem(Item):
        # MongoDB collection name used by the storage pipeline
        collection = 'images'
        
        id = Field()
        url = Field()
        title = Field()
        thumb = Field()

    These fields are the image ID, URL, title, and thumbnail URL.

    Extract the relevant information in the parse method:

        def parse(self, response):
            result = json.loads(response.text)
            # iterate over the image list in the JSON response
            for image in result.get('list'):
                item = ImageItem()
                item['id'] = image.get('imageid')
                item['url'] = image.get('qhimg_url')
                item['title'] = image.get('group_title')
                item['thumb'] = image.get('qhimg_thumb_url')
                yield item

    Parse the JSON, iterate over its list field, pull out each image's information, assign it to an ImageItem, and yield the Item object.

    Store the data

    import pymongo


    class MongoPipeline(object):
        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db
        
        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DB')
            )
        
        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]
        
        def process_item(self, item, spider):
            name = item.collection
            # insert_one replaces the deprecated insert() in current pymongo
            self.db[name].insert_one(dict(item))
            return item
        
        def close_spider(self, spider):
            self.client.close()

    Add the following to settings.py:

    MONGO_URI = 'localhost'
    MONGO_DB = 'images360'

    ImagePipeline, which subclasses Scrapy's built-in ImagesPipeline to download the images:

    from scrapy import Request
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline
    
    
    class ImagePipeline(ImagesPipeline):
        def file_path(self, request, response=None, info=None):
            # save each image under its original file name
            url = request.url
            file_name = url.split('/')[-1]
            return file_name
        
        def item_completed(self, results, item, info):
            # drop items whose image failed to download
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem('Image Download Failed')
            return item
        
        def get_media_requests(self, item, info):
            # schedule a download request for each item's image URL
            yield Request(item['url'])
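
    For these two pipelines to run they must also be enabled in settings.py. The original post does not show this step, so the module paths, priority values, and download directory below are assumptions based on the default project layout:

    ITEM_PIPELINES = {
        'images360.pipelines.ImagePipeline': 300,
        'images360.pipelines.MongoPipeline': 301,
    }
    IMAGES_STORE = './images'

    With this in place, run the spider with scrapy crawl image; the downloaded images land in IMAGES_STORE and the item data ends up in the images collection of the images360 MongoDB database.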


