  • [Python Crawler] Improving Performance

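    Concurrency and Asynchronous I/O

    When writing a crawler, most of the running time is spent on I/O requests. In single-process, single-threaded mode, every URL request blocks until its response arrives, so the requests queue up and the whole run slows down.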

    import requests
    
    def fetch_async(url):
        response = requests.get(url)
        return response
    
    
    url_list = ['http://www.github.com', 'http://www.bing.com']
    
    for url in url_list:
        print(url,fetch_async(url))
    1-Synchronous execution
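    To make the cost of waiting concrete, here is a minimal sketch that runs without network access. `fake_fetch` is a hypothetical stand-in for `requests.get` that simply sleeps for a fixed delay (an assumption, not part of the original code). Run one after another, the total time is roughly the sum of the individual delays.

```python
import time

def fake_fetch(url, delay=0.2):
    """Hypothetical stand-in for requests.get: block for `delay` seconds."""
    time.sleep(delay)
    return f"response from {url}"

def fetch_all_sequentially(urls, delay=0.2):
    """Fetch every URL one after another; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [fake_fetch(url, delay) for url in urls]
    return results, time.perf_counter() - start

if __name__ == "__main__":
    urls = ["http://www.github.com", "http://www.bing.com"]
    results, elapsed = fetch_all_sequentially(urls)
    print(results)
    print(f"elapsed: {elapsed:.2f}s")  # roughly len(urls) * delay
```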
    from concurrent.futures import ThreadPoolExecutor
    import requests
    
    
    def fetch_async(url):
        response = requests.get(url)
        return response
    
    
    url_list = ['http://www.github.com', 'http://www.bing.com']
    pool = ThreadPoolExecutor(5)
    for url in url_list:
        pool.submit(fetch_async, url)
    
    pool.shutdown(wait=True)
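    2-Multithreaded execution (thread pool)
    """concurrent.futures: thread pool with callbacks"""
    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import urlparse
    import time
    import requests

    def task(url):
        response = requests.get(url)
        print(url, response.status_code)
        # Guess the encoding from the body so response.text decodes correctly
        response.encoding = response.apparent_encoding
        if response.status_code == 200:
            return {"url": url, "text": response.text}
        # Any other status falls through and returns None

    def save_to_html(res, *args, **kwargs):
        # The callback receives a Future object, e.g.
        # <Future at 0x1ed4cf245c0 state=finished returned dict>;
        # .result() unwraps the value returned by task()
        res = res.result()
        if res is None:    # task() returned nothing for a non-200 response
            return
        # Name the file after the host (e.g. www.bing.com.html); splitting the
        # whole URL on "." breaks for URLs with a path or a multi-part domain
        filename = urlparse(res["url"]).netloc + ".html"
        with open(filename, "w", encoding="utf-8") as f:
            f.write(res["text"])
        print(filename, "---> written successfully!")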
    
    def parse_html(res,*args,**kwargs):
        pass
    
    if __name__ == '__main__':
        start = time.time()
        pool = ThreadPoolExecutor()    # thread pool; if max_workers is omitted, Python 3.8+ uses min(32, cpu_count + 4) (3.5-3.7 used cpu_count * 5)
        url_list = [
            'http://www.cnblogs.com/',
            'https://huaban.com/favorite/beauty/',
            'http://www.bing.com',
            'http://www.zhihu.com',
            'http://www.sina.com',
            'http://www.baidu.com',
            'http://www.autohome.com.cn',
        ]
        for url in url_list:
            v = pool.submit(task,url)
            v.add_done_callback(save_to_html)
            v.add_done_callback(parse_html)
    
        pool.shutdown(wait=True)
        print("consume time is:",time.time()-start)
    3-Multithreading + callbacks
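    As an alternative to callbacks, results can also be collected in the submitting thread with `as_completed`. The sketch below is a simplified illustration, not the original author's code: `fake_fetch` is a hypothetical stand-in for `requests.get` that sleeps instead of touching the network, and `fetch_all_threaded` is a helper name introduced here.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_fetch(url, delay=0.2):
    """Hypothetical stand-in for requests.get: block for `delay` seconds."""
    time.sleep(delay)
    return f"response from {url}"

def fetch_all_threaded(urls, delay=0.2, max_workers=5):
    """Fetch URLs in a thread pool; return ({url: result}, elapsed seconds)."""
    start = time.perf_counter()
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each Future back to the URL it was submitted for
        future_to_url = {pool.submit(fake_fetch, url, delay): url for url in urls}
        # as_completed yields each Future as soon as it finishes
        for future in as_completed(future_to_url):
            results[future_to_url[future]] = future.result()
    return results, time.perf_counter() - start

if __name__ == "__main__":
    results, elapsed = fetch_all_threaded(["http://www.github.com", "http://www.bing.com"])
    print(results)
    print(f"elapsed: {elapsed:.2f}s")
```

    Because the simulated waits overlap, the elapsed time stays close to a single delay rather than the sum of all of them, as long as the pool has at least as many workers as URLs.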
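    from concurrent.futures import ProcessPoolExecutor
    import requests

    def fetch_async(url):
        response = requests.get(url)
        return response


    url_list = ['http://www.github.com', 'http://www.bing.com']

    # The guard is required on platforms that spawn worker processes
    # (e.g. Windows), because each worker re-imports this module
    if __name__ == '__main__':
        pool = ProcessPoolExecutor(5)
        for url in url_list:
            pool.submit(fetch_async, url)

        pool.shutdown(wait=True)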
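    4-Multiprocessing
    """concurrent.futures: process pool with callbacks"""
    from concurrent.futures import ProcessPoolExecutor
    from urllib.parse import urlparse
    import time
    import requests

    def task(url):
        response = requests.get(url)
        print(url, response.status_code)
        # Guess the encoding from the body so response.text decodes correctly
        response.encoding = response.apparent_encoding
        if response.status_code == 200:
            return {"url": url, "text": response.text}
        # Any other status falls through and returns None

    def save_to_html(res, *args, **kwargs):
        # The callback receives a Future object, e.g.
        # <Future at 0x1ed4cf245c0 state=finished returned dict>;
        # .result() unwraps the value returned by task()
        res = res.result()
        if res is None:    # task() returned nothing for a non-200 response
            return
        # Name the file after the host (e.g. www.bing.com.html); splitting the
        # whole URL on "." breaks for URLs with a path or a multi-part domain
        filename = urlparse(res["url"]).netloc + ".html"
        with open(filename, "w", encoding="utf-8") as f:
            f.write(res["text"])
        print(filename, "---> written successfully!")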
    
    def parse_html(res,*args,**kwargs):
        pass
    
    if __name__ == '__main__':
        start = time.time()
        pool = ProcessPoolExecutor()    # process pool; if max_workers is omitted, it defaults to the number of CPUs
        url_list = [
            'http://www.cnblogs.com/',
            'https://huaban.com/favorite/beauty/',
            'http://www.bing.com',
            'http://www.zhihu.com',
            'http://www.sina.com',
            'http://www.baidu.com',
            'http://www.autohome.com.cn',
        ]
        for url in url_list:
            v = pool.submit(task,url)
            v.add_done_callback(save_to_html)
            v.add_done_callback(parse_html)
    
        pool.shutdown(wait=True)
        print("consume time is:",time.time()-start)
    5-Multiprocessing + callbacks
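    A done-callback that calls `result()` directly re-raises any exception the task raised. The sketch below shows one way to check for failures first with `Future.exception()`. It uses a thread pool and a hypothetical `task` that fails on purpose, so it runs without network access. `make_callback` and `run` are names introduced here for illustration, and `ProcessPoolExecutor` returns the same `Future` type.

```python
from concurrent.futures import ThreadPoolExecutor

def task(url):
    """Hypothetical task: fail for URLs containing 'bad', succeed otherwise."""
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return {"url": url, "text": "<html></html>"}

def make_callback(saved, failed):
    """Build a done-callback that records successes and failures in the given lists."""
    def on_done(future):
        error = future.exception()  # None if the task finished normally
        if error is not None:
            failed.append(str(error))
            return
        saved.append(future.result()["url"])
    return on_done

def run(urls):
    """Submit every URL and return (saved URLs, error messages)."""
    saved, failed = [], []
    with ThreadPoolExecutor(max_workers=3) as pool:
        for url in urls:
            pool.submit(task, url).add_done_callback(make_callback(saved, failed))
    # Leaving the with-block waits for all tasks and their callbacks
    return saved, failed

if __name__ == "__main__":
    print(run(["http://ok.example", "http://bad.example"]))
```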

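    All of the code above improves request throughput. The drawback of multithreading and multiprocessing is that a thread or process sits idle while it is blocked on I/O, so asynchronous I/O is the preferred approach: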

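    Supplement: coroutines + asynchronous I/O (the references below also use examples to explain concurrency vs. parallelism, synchronous vs. asynchronous, and blocking vs. non-blocking)

    Reference: https://blog.csdn.net/weixin_41207499/article/details/80657201

    Reference: https://www.cnblogs.com/ssyfj/p/9222342.html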

    https://www.liaoxuefeng.com/wiki/1016959663602400/1017985577429536
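    The coroutine approach can be sketched with the standard library's `asyncio`. This is a minimal illustration under stated assumptions, not code from the original post: `fake_fetch` is a hypothetical stand-in that awaits `asyncio.sleep` in place of a real request, and a real crawler would swap in an asynchronous HTTP client there. All requests run on a single thread, and the event loop switches between them whenever one is waiting.

```python
import asyncio
import time

async def fake_fetch(url, delay=0.2):
    """Hypothetical stand-in for an async HTTP request: wait without blocking the loop."""
    await asyncio.sleep(delay)
    return f"response from {url}"

async def fetch_all(urls, delay=0.2):
    """Start every request at once and wait for all of them; results keep input order."""
    return await asyncio.gather(*(fake_fetch(url, delay) for url in urls))

def run(urls, delay=0.2):
    """Run the coroutines to completion; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(fetch_all(urls, delay))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    results, elapsed = run(["http://www.github.com", "http://www.bing.com"])
    print(results)
    print(f"elapsed: {elapsed:.2f}s")
```

    No extra threads or processes are created, so nothing sits idle while a request is pending. That idle time is the waste the paragraph above attributes to thread and process pools.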

  • Original article: https://www.cnblogs.com/XJT2018/p/11002526.html