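  • scrapy: a detailed example of crawling media files

    Crawling image data with scrapy

    • Crawling image data with scrapy:

      • In the spider file, you only need to parse out the image URLs and submit them to the pipeline
      • In the settings file, set the storage location: IMAGES_STORE = './imgsLib'
      • In the pipelines file, define the pipeline class:
        • 1. from scrapy.pipelines.images import ImagesPipeline
        • 2. Change the pipeline class's parent class to ImagesPipeline
        • 3. Override three methods of the parent class (typically get_media_requests, file_path, and item_completed)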
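The settings side of these steps can be sketched as follows. `IMAGES_STORE` is the value given above; the `ITEM_PIPELINES` entry assumes a project package named `imgspider` and a hypothetical pipeline class `ImgPipeline`, neither of which is confirmed by the original post.

```python
# settings.py (fragment): a minimal sketch, not the author's actual file

# Directory where ImagesPipeline stores downloaded images (value from the text)
IMAGES_STORE = './imgsLib'

# Enable the custom pipeline. 'imgspider.pipelines.ImgPipeline' is an assumed
# dotted path, and 300 is an arbitrary priority (lower numbers run first).
ITEM_PIPELINES = {
    'imgspider.pipelines.ImgPipeline': 300,
}
```

ImagesPipeline depends on the Pillow library, so Pillow needs to be installed alongside scrapy for the pipeline to be enabled.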
    • Example: crawling 校花网 (www.521609.com)

      • spider.py

        import scrapy
        from imgspider.items import ImgspiderItem
        
        
        class ImgSpiderSpider(scrapy.Spider):
            name = 'img_spider'
            # allowed_domains = ['www.xxx.com']
            start_urls = ['http://www.521609.com/daxuemeinv/']
            # URL template for the list pages after the first one
            url = 'http://www.521609.com/daxuemeinv/list8%d.html'
            pageNum = 1
        
            def parse(self, response):
                li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
                for li in li_list:
                    src = li.xpath('./a[1]/img/@src').extract_first()
                    if src is None:
                        # Skip list entries that have no image
                        continue
                    # Build the absolute image URL and hand it to the pipeline
                    item = ImgspiderItem()
                    item['src'] = response.urljoin(src)
                    yield item
        
                # Request the next list page once per response (up to page 3),
                # rather than once per list entry
                if self.pageNum < 3:
                    self.pageNum += 1
                    new_url = self.url % self.pageNum
                    yield scrapy.Request(new_url, callback=self.parse)
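The URL handling in the spider (joining a site-relative `src` onto the host, and filling the page number into the list-page template) can be checked without running a crawl. The sketch below uses only the standard library. The helper names and the sample `src` path are hypothetical and not taken from the site.

```python
from urllib.parse import urljoin

BASE = 'http://www.521609.com/daxuemeinv/'
PAGE_TEMPLATE = 'http://www.521609.com/daxuemeinv/list8%d.html'


def absolute_image_url(page_url, src):
    """Resolve an <img src> value against the page it was found on."""
    return urljoin(page_url, src)


def list_page_urls(last_page):
    """Return the list-page URLs for pages 2..last_page (page 1 is BASE)."""
    return [PAGE_TEMPLATE % n for n in range(2, last_page + 1)]


# '/uploads/example.jpg' is a made-up path used only for illustration
print(absolute_image_url(BASE, '/uploads/example.jpg'))
print(list_page_urls(3))
```

Unlike plain string concatenation, `urljoin` also leaves an already absolute `src` unchanged, which is why it is the safer choice when the markup mixes relative and absolute image URLs.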
        
      • pipelines.py

        
  • Original article: https://www.cnblogs.com/bigox/p/11447918.html