
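Downloading files with Scrapy: when a list of URLs is passed to the pipeline, how do you keep the files in order even though they are downloaded concurrently?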

The approach is similar to downloading images:

1. The item needs these fixed fields

    file_urls = scrapy.Field()
    files = scrapy.Field()

2. Extract the file URL and pass it to the pipeline through item["file_urls"]

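    def parse_item(self, response):
        # Requires `import re` at the top of the spider module.
        item = ScrapyanthingItem()
        # Strip the backslashes that escape the slashes in the response body
        data = response.body.decode(response.encoding).replace("\\", "")
        item["file_urls"] = re.findall(r'https://[a-zA-Z0-9./]+/index\.m3u8', data)[0]
        yield item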
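
The extraction in step 2 can be tried outside Scrapy. The following is a minimal sketch, assuming a response body in which the slashes are backslash-escaped (as in JSON); the sample body, the example.com URL, and the helper name `extract_m3u8_urls` are illustrative and not part of the original code.

```python
import re

# Same pattern as in parse_item, with the dot before "m3u8" escaped.
M3U8_PATTERN = re.compile(r'https://[a-zA-Z0-9./]+/index\.m3u8')


def extract_m3u8_urls(body):
    """Remove escaping backslashes, then return every m3u8 URL found."""
    cleaned = body.replace("\\", "")
    return M3U8_PATTERN.findall(cleaned)


# Hypothetical body: the slashes are escaped as "\/".
sample_body = '{"url":"https:\\/\\/example.com\\/video\\/index.m3u8"}'
found = extract_m3u8_urls(sample_body)
print(found)
```

Returning the whole list rather than indexing `[0]` avoids an IndexError when the page contains no match; the caller can then decide whether to skip the item.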

3. Handle file_urls in the pipeline

import uuid

from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline


class DownloadM3u8Pipeline(FilesPipeline):  # inherit from FilesPipeline
    def get_media_requests(self, item, info):
        # file_urls may hold a single URL (a string) or a list of URLs
        m3u8_url_list = item['file_urls']
        if isinstance(m3u8_url_list, str):
            m3u8_url_list = [m3u8_url_list]

        # Important: when file_urls holds several URLs, the pipeline downloads them
        # concurrently, so they do not finish in a fixed order. To keep the order,
        # give each request a sequence number and pass it through meta; file_path
        # reads it with request.meta.get("m3u8_ts_name") and builds the file name from it.
        request_list = []
        for i, m3u8_url in enumerate(m3u8_url_list, start=1):
            # loggers().debug("Downloading type: {}, url: {}, progress: {}/{}.".format(item["category"], m3u8_url, i, len(m3u8_url_list)))
            m3u8_ts_name = '%04d' % i
            request_list.append(Request(m3u8_url, meta={"item": item, "m3u8_ts_name": m3u8_ts_name}))
        # FilesPipeline calls file_path for every request and stores the file
        # under FILES_STORE from the settings (step 4)
        return request_list

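    def file_path(self, request, response=None, info=None):
        # Return the storage path relative to FILES_STORE, e.g. a/a.m3u8
        # date = datetime.date.today()
        # FilesPipeline may call file_path more than once for the same request,
        # so derive the name from the URL instead of a random uuid4 to keep it stable.
        st = uuid.uuid5(uuid.NAMESPACE_URL, request.url).hex
        geshi = request.url.split(".")[-1]  # file extension, e.g. m3u8
        # Prefix the sequence number passed through meta, if present,
        # so the stored files sort in request order
        m3u8_ts_name = request.meta.get("m3u8_ts_name")
        if m3u8_ts_name:
            return '{}_{}.{}'.format(m3u8_ts_name, st, geshi)
        return '{}.{}'.format(st, geshi)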

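    def item_completed(self, results, item, info):
        # Collect the storage paths of the successful downloads
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")  # drop the item if nothing was downloaded
        item['m3u8_paths'] = file_paths  # m3u8_paths must also be declared as a Field on the item
        return item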
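
The numbering idea from step 3 can be checked without Scrapy. The following is a minimal sketch, assuming hypothetical example.com segment URLs; the helper names `numbered_names` and `restore_order` are illustrative and do not appear in the original pipeline. It tags each URL with a zero-padded sequence number, shuffles the pairs to imitate downloads finishing out of order, and then sorts by the number to recover the request order.

```python
import random


def numbered_names(urls, width=4):
    """Pair each URL with a zero-padded sequence number, like '%04d' % i."""
    return [("%0*d" % (width, i), url) for i, url in enumerate(urls, start=1)]


def restore_order(finished):
    """Sort (name, url) pairs that finished in any order back into request order."""
    return [url for _name, url in sorted(finished)]


# Hypothetical segment URLs, deliberately not in alphabetical order.
urls = [
    "https://example.com/seg_c.ts",
    "https://example.com/seg_a.ts",
    "https://example.com/seg_b.ts",
]

tagged = numbered_names(urls)
finished = tagged[:]
random.shuffle(finished)  # imitate downloads completing out of order

restored = restore_order(finished)
print(restored == urls)
```

Zero padding matters because the names are compared as strings: with a width of 4 the order holds for up to 9999 files, and a longer list needs a larger width.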

4. Configure the settings

import os

# File storage directory
project_dir = os.path.dirname(__file__)
FILES_STORE = os.path.join(project_dir, "warehouse/files")  # FILES_STORE must be set

ITEM_PIPELINES = {
    'ScrapyAnthing.pipelines.DownloadM3u8Pipeline': 1,  # enable the file download pipeline
}
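
Once the files are stored under FILES_STORE with a sequence-number prefix, a plain sorted directory listing returns them in request order. The following is a minimal sketch, assuming file names of the form `0001_<name>.ts`; the temporary directory stands in for FILES_STORE, and the helper name `ordered_files` is illustrative.

```python
import os
import tempfile


def ordered_files(store_dir):
    """Return the file names in store_dir sorted by their zero-padded prefix."""
    return sorted(
        name
        for name in os.listdir(store_dir)
        if os.path.isfile(os.path.join(store_dir, name))
    )


# A temporary directory stands in for FILES_STORE in this sketch.
with tempfile.TemporaryDirectory() as store_dir:
    # Create empty files in a scrambled order, as concurrent downloads might.
    for name in ("0003_c.ts", "0001_a.ts", "0002_b.ts"):
        open(os.path.join(store_dir, name), "wb").close()
    listing = ordered_files(store_dir)

print(listing)
```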


Original article: https://www.cnblogs.com/tangpg/p/14600445.html