  • scrapy--matplotlib

Last night I read some material on saving files with Scrapy, then went to bed early. Today, working from what I had found online, I picked it back up, and after just one morning the crawl was working. So happy!!! Alright, let's get started.

As usual, pictures first. The rest of you, get moving too!

The downloaded files, sorted into category folders:

The code files in the project:

    1.matlib.py

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from Matlib.items import MatlibItem
    import pdb

    class MatlibSpider(scrapy.Spider):
        name = 'matlib'
        allowed_domains = ['matplotlib.org']
        start_urls = ['https://matplotlib.org/examples/index.html']

        def parse(self, response):
            # Follow each top-level category in the examples index, skipping
            # the per-category index pages themselves.
            #le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l2')
            le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l1', deny='/index.html$')
            for link in le.extract_links(response):
                yield scrapy.Request(link.url, callback=self.parse_url)

        def parse_url(self, response):
            # Each example page links to the raw .py source as an external reference.
            sel = response.css('a.reference.external::attr(href)').extract()[0]
            url = response.urljoin(sel)
            mpl = MatlibItem()
            #mpl['files_url'] = [url]
            #pdb.set_trace()
            # Python 2: encode the unicode url to a str so the pipeline's
            # scrapy.Request accepts it.
            mpl['files_url'] = url.encode('utf-8')
            yield mpl
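
Before running the full crawl, you can sanity-check the LinkExtractor interactively. A minimal sketch inside scrapy shell (assuming the page layout is still the same as when this was written; `response` is provided by the shell):

    # Run: scrapy shell https://matplotlib.org/examples/index.html
    from scrapy.linkextractors import LinkExtractor

    le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l1',
                       deny='/index.html$')
    for link in le.extract_links(response):
        print(link.url)   # should list the category pages, no index pages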

    2.items.py

    import scrapy

    class MatlibItem(scrapy.Item):

        files_url = scrapy.Field()   # URL of the raw .py file to download
        files     = scrapy.Field()   # filled in by the pipeline after the download
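
By the way, the stock FilesPipeline already knows how to download without any overriding if you use its default field names: it reads a list of URLs from file_urls and writes results into files. A minimal sketch of that variant (the spider would then set mpl['file_urls'] = [url]):

    # items.py variant using FilesPipeline's built-in field names;
    # with these names the stock pipeline needs no get_media_requests override.
    import scrapy

    class MatlibItem(scrapy.Item):
        file_urls = scrapy.Field()   # must be a list of URLs
        files     = scrapy.Field()   # download results added by the pipeline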

3.pipelines.py

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import scrapy
    from scrapy.pipelines.files import FilesPipeline
    import urlparse                  # Python 2; urllib.parse in Python 3
    from os.path import basename, dirname, join
    import pdb

    class MyFilePipeline(FilesPipeline):

        def get_media_requests(self, item, info):
            # This method must be in place, or the spider saves no files!!
            #for file in item['files_url']:
            #    yield scrapy.Request(file.encode('utf-8'))
            yield scrapy.Request(item['files_url'])

        '''
        def file_path(self, request, response=None, info=None):
            # Rewrite the file name and its path
            split_url = str(request.url).split('/')
            kind_name = split_url[-2]
            file_name = split_url[-1]
            return '%s/%s' % (kind_name, file_name)
        '''

        def file_path(self, request, response=None, info=None):
            # Keep only '<category>/<file name>', e.g. 'animation/bayes_update.py'
            path = urlparse.urlparse(request.url).path
            return join(basename(dirname(path)), basename(path))
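
To convince yourself what file_path() returns, the path juggling can be run on its own. A small sketch using the bayes_update.py URL from the log further below (written for Python 3, where urlparse moved to urllib.parse):

    # Standalone check of the file_path() logic
    from os.path import basename, dirname, join
    from urllib.parse import urlparse

    url = 'https://matplotlib.org/examples/animation/bayes_update.py'
    path = urlparse(url).path   # '/examples/animation/bayes_update.py'
    print(join(basename(dirname(path)), basename(path)))
    # -> 'animation/bayes_update.py'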

4.settings.py (everything else is the same as in my earlier posts, which you can look up)

    ITEM_PIPELINES = {
        #'Matlib.pipelines.MatlibPipeline': 200,
        'Matlib.pipelines.MyFilePipeline': 2,
        #'scrapy.pipelines.files.FilesPipeline': 1,
    }
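
One thing this post leaves to those earlier articles: FilesPipeline stays disabled unless FILES_STORE is set. A minimal sketch of that setting; the directory name here is just my assumption, pick any path you like:

    # settings.py excerpt: without FILES_STORE the files pipeline saves nothing
    FILES_STORE = 'examples'   # hypothetical download root; the paths returned
                               # by file_path() are created underneath it

With this, bayes_update.py would land at examples/animation/bayes_update.py.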

Problems I ran into:

      1.matlib.py

    1. url = response.urljoin(sel): the url has to be encoded to a plain str (url.encode('utf-8')) before scrapy.Request(item['files_url']) in pipelines.py can run the download

      2.pipelines.py:

    1. Without overriding file_path, files are automatically saved under a checksum-style name (see the sketch after this list):

          'checksum': '715610c4375a1d749bc26b39cf7e7199',
          'path': 'animation/bayes_update.py',
          'url': 'https://matplotlib.org/examples/animation/bayes_update.py'}],

    2. def get_media_requests(self, item, info): files are only downloaded if this method is in place!!! The write-ups I read about overriding the path never showed it, which misled me for a long time.
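
For the record, the "checksum name" comes from the default file_path(), which stores everything under full/<sha1-of-url>.<ext>. A rough sketch mirroring (not copying) what stock Scrapy does:

    # Roughly what the stock FilesPipeline's file_path() produces,
    # hence the meaningless hash-style file names without an override.
    import hashlib
    from os.path import splitext

    url = 'https://matplotlib.org/examples/animation/bayes_update.py'
    media_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    media_ext = splitext(url)[1]                  # '.py'
    print('full/%s%s' % (media_guid, media_ext))  # e.g. 'full/6d3a...91.py'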

If any of you hit other problems, feel free to leave a comment so we can all improve together.
