  • scrapy -- a detailed example of scraping media files

    Scraping image data with scrapy

    • Image scraping based on scrapy:

      • In the spider file, only parse out the image URLs and submit them to the pipeline (wrapped in an item).
      • In the settings file, set the storage directory: IMAGES_STORE = './imgsLib' (a settings sketch follows this list).
      • In the pipelines file, define the pipeline class:
        • 1. from scrapy.pipelines.images import ImagesPipeline
        • 2. Change the pipeline class's parent class to ImagesPipeline
        • 3. Override three methods of the parent class (get_media_requests, file_path, item_completed)
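
      The pipeline also has to be enabled in the project settings. A minimal settings.py sketch, assuming the project is named imgspider and the pipeline class is called ImgspiderPipeline (ImagesPipeline additionally needs the Pillow package installed):

        IMAGES_STORE = './imgsLib'   # directory the downloaded images are written to
        ITEM_PIPELINES = {
            'imgspider.pipelines.ImgspiderPipeline': 300,
        }
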
    • Example: scraping image data from www.521609.com

      • spider.py

        import scrapy
        from imgspider.items import ImgspiderItem


        class ImgSpiderSpider(scrapy.Spider):
            name = 'img_spider'
            # allowed_domains = ['www.xxx.com']
            start_urls = ['http://www.521609.com/daxuemeinv/']
            url = 'http://www.521609.com/daxuemeinv/list8%d.html'
            pageNum = 1

            def parse(self, response):
                li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
                for li in li_list:
                    # build the absolute image URL and hand it to the pipeline via the item
                    img_src = 'http://www.521609.com' + li.xpath('./a[1]/img/@src').extract_first()
                    item = ImgspiderItem()
                    item['src'] = img_src
                    yield item

                # pagination: crawl the first three list pages
                if self.pageNum < 3:
                    self.pageNum += 1
                    new_url = self.url % self.pageNum
                    yield scrapy.Request(new_url, callback=self.parse)
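
      • items.py

        The spider above imports ImgspiderItem from imgspider.items; the original post does not show items.py, so this is a minimal assumed definition with the single src field the pipeline reads:

        import scrapy


        class ImgspiderItem(scrapy.Item):
            # absolute image URL extracted by the spider
            src = scrapy.Field()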
        
      • pipelines.py

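        A minimal pipeline class following the three steps listed above; the class name ImgspiderPipeline is an assumption, and whatever name is used must match the ITEM_PIPELINES entry in settings.py:

        import scrapy
        from scrapy.pipelines.images import ImagesPipeline


        class ImgspiderPipeline(ImagesPipeline):

            # 1. issue a download request for the image URL carried by the item
            def get_media_requests(self, item, info):
                yield scrapy.Request(item['src'])

            # 2. file name the image is saved under, relative to IMAGES_STORE
            def file_path(self, request, response=None, info=None, *, item=None):
                return request.url.split('/')[-1]

            # 3. return the item so any later pipeline still receives it
            def item_completed(self, results, item, info):
                return item

        With IMAGES_STORE and ITEM_PIPELINES configured as above, running scrapy crawl img_spider saves each image into ./imgsLib under the name returned by file_path().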
        