  • Python web scraping: downloading files with Scrapy

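     In an ordinary script, we take a file's download URL from a website, download it, and write the data to a file ourselves. Every step has to be written by hand, and the result is rarely reusable. To avoid reinventing the wheel, Scrapy provides a smooth way to download files that takes only a little code to set up.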

     mat.py

    # -*- coding: utf-8 -*-
    import scrapy
    # Recent Scrapy releases expose LinkExtractor from scrapy.linkextractors;
    # the singular scrapy.linkextractor module is deprecated.
    from scrapy.linkextractors import LinkExtractor
    from weidashang.items import matplotlib


    class MatSpider(scrapy.Spider):
        name = "mat"
        allowed_domains = ["matplotlib.org"]
        start_urls = ['https://matplotlib.org/examples']

        def parse(self, response):
            # Collect the link to each example script's page and request it.
            extractor = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l2')
            for link in extractor.extract_links(response):
                yield scrapy.Request(url=link.url, callback=self.example)

        def example(self, response):
            # On each example page, grab the "source code" link and join it
            # with the page URL to form a complete URL.
            href = response.css('a.reference.external::attr(href)').extract_first()
            url = response.urljoin(href)
            example = matplotlib()
            example['file_urls'] = [url]
            return example
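
     The URL-joining step in the example callback can be tried outside Scrapy. The sketch below assumes that response.urljoin behaves much like the standard library's urljoin with the page URL as the base. The page URL and href values are illustrative and may not match the live site.

```python
from urllib.parse import urljoin

# Hypothetical example page and the relative href found on it.
page_url = "https://matplotlib.org/examples/animation/animate_decay.html"
relative_href = "animate_decay.py"

# A relative href replaces the last path segment of the page URL.
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(page_url, "https://example.com/files/demo.py")
print(already_absolute)
```

     Because the result is always an absolute URL, it can be placed directly in the item's file_urls list.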

    pipelines.py

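    from os.path import basename, dirname, join
    from urllib.parse import urlparse  # Python 3; on Python 2 use "from urlparse import urlparse"
    from scrapy.pipelines.files import FilesPipeline
    class MyFilePlipeline(FilesPipeline):
        def file_path(self, request, response=None, info=None, *, item=None):
            # Save each file as <parent directory>/<file name>, taken from the URL path.
            path = urlparse(request.url).path
            return join(basename(dirname(path)), basename(path))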
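
     To see what the file_path override produces, its body can be run on its own. This is a minimal sketch: example_path is a hypothetical helper that copies the two lines of file_path, and the sample URL is illustrative.

```python
from os.path import basename, dirname, join
from urllib.parse import urlparse


def example_path(url):
    """Hypothetical helper mirroring the body of file_path above."""
    path = urlparse(url).path
    # Keep only the last directory name and the file name.
    return join(basename(dirname(path)), basename(path))


sample_url = "https://matplotlib.org/examples/animation/animate_decay.py"
relative_path = example_path(sample_url)
print(relative_path)
```

     The returned path is relative to FILES_STORE, so the downloaded scripts end up grouped by the directory they come from instead of under the pipeline's default naming.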

    settings.py

    ITEM_PIPELINES = {
        'weidashang.pipelines.MyFilePlipeline': 1,
    }
    FILES_STORE = 'examples_src'

    items.py

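    from scrapy import Item, Field

    class matplotlib(Item):
        file_urls = Field()  # URLs to download, filled in by the spider
        files = Field()      # download results, filled in by the pipeline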

     run.py

    from scrapy.cmdline import execute
    execute(['scrapy', 'crawl', 'mat', '-o', 'example.json'])
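
     After the crawl, example.json should hold one entry per item, with the files field filled in by the pipeline. The sketch below assumes each entry in files is a dict with url, path and checksum keys. The sample data is made up for illustration, and downloaded_paths is a hypothetical helper, not part of Scrapy.

```python
import json

# Hypothetical sample of one exported item; real checksums and paths will differ.
sample_export = """
[
  {
    "file_urls": ["https://matplotlib.org/examples/animation/animate_decay.py"],
    "files": [
      {
        "url": "https://matplotlib.org/examples/animation/animate_decay.py",
        "path": "animation/animate_decay.py",
        "checksum": "0123456789abcdef0123456789abcdef"
      }
    ]
  }
]
"""


def downloaded_paths(items):
    """Return (url, path) pairs for every file recorded in the items."""
    return [(f["url"], f["path"]) for item in items for f in item.get("files", [])]


pairs = downloaded_paths(json.loads(sample_export))
for url, path in pairs:
    print(url, "->", path)
```

     With a real crawl, the same helper could be applied to the result of json.load on the example.json file.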
  • Original article: https://www.cnblogs.com/lei0213/p/8098180.html