  • [Python Crawler] Implementing a Multithreaded Downloader

    Preface

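    Why use multiple threads? Can't a single thread download a file? It can, but a multithreaded download can come close to saturating the connection and reach the full speed of your broadband, while a single-threaded download usually cannot.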

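    For example: suppose your broadband is 100 Mb/s, so the theoretical maximum download speed is 100/8 = 12.5 MB/s. To download an 843 MB video, a single thread needs 560 seconds, whereas 12 threads finish in 93 seconds, about 6 times faster.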

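    If you work out the network utilization, you will also find that the single-threaded download averages 1.50 MB/s while the multithreaded download averages 9.06 MB/s, roughly 72% of the theoretical 12.5 MB/s compared with about 12% for a single thread. That is the benefit of multithreading!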
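    The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch using the numbers quoted in the text; the variable names are chosen here for illustration, and the result for the single thread differs slightly from the quoted 1.50 MB/s because the file size is rounded to 843 MB.

```python
# Figures quoted in the article.
link_mbps = 100                 # advertised bandwidth, in megabits per second
file_mb = 843                   # file size, in megabytes
single_s, multi_s = 560, 93     # measured download times, in seconds

max_mb_per_s = link_mbps / 8                # 8 bits per byte
single_speed = file_mb / single_s           # average speed, single thread
multi_speed = file_mb / multi_s             # average speed, 12 threads
speedup = single_s / multi_s                # how many times faster
utilization = multi_speed / max_mb_per_s    # share of the theoretical maximum

print(f'theoretical max: {max_mb_per_s:.2f} MB/s')
print(f'single thread:   {single_speed:.2f} MB/s')
print(f'multithreaded:   {multi_speed:.2f} MB/s')
print(f'speedup:         {speedup:.2f}x')
print(f'utilization:     {utilization:.0%}')
```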

    Installing dependencies

    The requests library is used to request resources from the server.

    pip3 install requests
    

    Test sample

    An 843 MB video file in MP4 format.

    https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo/1250921_c7af3a2b73d03604f6421ef11134af72.mp4
    

    Multiple threads

    Use ThreadPoolExecutor, an Executor subclass provided by the concurrent.futures module, to create a thread pool and run the download in multiple threads.
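    As a minimal sketch of the pattern, the snippet below submits a few tasks to a pool and collects their results. The work function is a hypothetical stand-in for an I/O-bound job such as fetching one part of a file; it is not part of the downloader.

```python
from concurrent.futures import ThreadPoolExecutor


def work(n):
    # Hypothetical stand-in for an I/O-bound task, e.g. downloading one part of a file.
    return n * n


# The with-block waits for all submitted tasks to finish before it exits.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(work, n) for n in range(8)]
    # result() blocks until the task is done and re-raises any exception it hit.
    results = [f.result() for f in futures]

print(results)
```

    The complete downloader that follows applies the same submit pattern to byte ranges of a file.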

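    from concurrent.futures import ThreadPoolExecutor
    from threading import Lock
    from requests import get, head
    import time
    
    
    class downloader:
        def __init__(self, url, num, name):
            self.url = url
            self.num = num    # number of worker threads
            self.name = name  # output file name
            self.getsize = 0  # bytes received so far, shared by all threads
            self.lock = Lock()
            # A HEAD request returns the file size without downloading the body.
            r = head(self.url, allow_redirects=True)
            self.size = int(r.headers['Content-Length'])
    
        def down(self, start, end, chunk_size=10240):
            # Download the inclusive byte range [start, end] and write it at the same offset.
            headers = {'Range': f'bytes={start}-{end}'}
            r = get(self.url, headers=headers, stream=True)
            if r.status_code != 206:
                raise RuntimeError('server did not honor the Range request')
            with open(self.name, "rb+") as f:
                f.seek(start)
                for chunk in r.iter_content(chunk_size):
                    f.write(chunk)
                    # Count the bytes actually received; a chunk may be shorter than chunk_size.
                    with self.lock:
                        self.getsize += len(chunk)
    
        def main(self):
            start_time = time.time()
            # Pre-allocate the output file so each thread can seek to its own offset.
            with open(self.name, 'wb') as f:
                f.truncate(self.size)
            tp = ThreadPoolExecutor(max_workers=self.num)
            futures = []
            start = 0
            for i in range(self.num):
                # Range ends are inclusive, so the last byte of the file is size - 1.
                end = self.size * (i + 1) // self.num - 1
                future = tp.submit(self.down, start, end)
                futures.append(future)
                start = end + 1
            while True:
                process = self.getsize / self.size * 100
                last = self.getsize
                time.sleep(1)
                curr = self.getsize
                down = (curr - last) / 1024
                if down > 1024:
                    speed = f'{down/1024:6.2f}MB/s'
                else:
                    speed = f'{down:6.2f}KB/s'
                print(f'process: {process:6.2f}% | speed: {speed}', end='\r')
                # Stop when everything has arrived or every worker has exited.
                if process >= 100 or all(f.done() for f in futures):
                    break
            tp.shutdown()
            for future in futures:
                future.result()  # re-raise any exception from a worker thread
            print(f'process: {100.00:6}% | speed:  00.00KB/s', end=' | ')
            end_time = time.time()
            total_time = end_time - start_time
            average_speed = self.size / total_time / 1024 / 1024
            print(f'total-time: {total_time:.0f}s | average-speed: {average_speed:.2f}MB/s')
    
    
    if __name__ == '__main__':
        url = 'https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo/1250921_c7af3a2b73d03604f6421ef11134af72.mp4'
        down = downloader(url, 12, 'test.mp4')
        down.main()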
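
    The key step in the downloader is dividing the file into one byte range per thread. The sketch below isolates that step so it can be checked without any network access. split_ranges is a hypothetical helper introduced here for illustration, not part of the code above, and it assumes the file has at least as many bytes as there are threads. Byte positions in an HTTP Range header are inclusive, so the ranges must cover 0 through size - 1 with no gaps or overlaps.

```python
def split_ranges(size, num):
    """Split a file of `size` bytes into `num` contiguous, inclusive (start, end) ranges.

    Assumes size >= num so that every range holds at least one byte.
    """
    ranges = []
    start = 0
    for i in range(num):
        # Integer arithmetic avoids float rounding on very large files.
        end = size * (i + 1) // num - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Small example that is easy to verify by hand.
small = split_ranges(10, 3)
print(small)

# The article's case: an 843 MB file split across 12 threads.
size = 843 * 1024 * 1024
parts = split_ranges(size, 12)
covered = sum(end - start + 1 for start, end in parts)
print(covered == size)
```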
    

    Single thread

    import requests
    import time
    start = time.time()
    url = 'https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo/1250921_c7af3a2b73d03604f6421ef11134af72.mp4'
    res = requests.get(url, stream=True)
    with open('test.mp4', 'wb') as f:
        for chunk in res.iter_content(chunk_size=10240):
            f.write(chunk)
    end = time.time()
    print(end-start)
    

    Comparison

    Downloading the same 843 MB video with multiple threads and with a single thread gives the following results:

    Item             Multithreaded    Single-threaded
    Total time       93 s             560 s
    Average speed    9.06 MB/s        1.50 MB/s

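    A note

    I also compared this against IDM, a multithreaded download manager, and found that the Python implementation downloads no more slowly than IDM. With further development to add resumable downloads and a GUI, it should be able to fully replace IDM's download functionality.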

    Future work

    • Multithreading
    • Resumable downloads
    • GUI
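
    As a rough idea of how resumable downloads could work, the sketch below builds the Range header needed to fetch only the bytes that are still missing. It assumes a single sequential download whose partial file grows as data arrives, so the file's current size tells how much is already there. The multithreaded downloader pre-allocates the whole file, so it would instead have to record the progress of each range separately. resume_range_header and the .part file name are hypothetical, introduced here for illustration.

```python
import os
import tempfile


def resume_range_header(path, total_size):
    """Build the Range header for the bytes still missing from `path`.

    Returns None when the file already holds `total_size` bytes.
    """
    have = os.path.getsize(path) if os.path.exists(path) else 0
    if have >= total_size:
        return None
    # Byte positions are inclusive, so the last byte is total_size - 1.
    return {'Range': f'bytes={have}-{total_size - 1}'}


# Usage: simulate a download that stopped after 100 of 250 bytes.
with tempfile.TemporaryDirectory() as tmp:
    part = os.path.join(tmp, 'test.mp4.part')
    with open(part, 'wb') as f:
        f.write(b'\0' * 100)
    header = resume_range_header(part, 250)   # bytes 100..249 are missing
    done = resume_range_header(part, 100)     # nothing left to fetch

print(header)
print(done)
```

    The returned dictionary could be passed as the headers argument of requests.get, with the file opened in append mode so the new bytes land after the existing ones.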

    
  • Original article: https://www.cnblogs.com/ghgxj/p/14219138.html