doraemon's Python: Improving Crawl Efficiency (Single Thread + Multi-Task Async Coroutines)

    ### 5. Single Thread + Multi-Task Async Coroutines
    
    **Thread pool:** the synchronous loop below fetches the three URLs one after another; the thread-pool version dispatches them to three worker threads at once.
    
    ```python
    from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API
    import requests
    import time

    urls = ['http://127.0.0.1:5000/bobo','http://127.0.0.1:5000/jay','http://127.0.0.1:5000/tom']

    # Synchronous version: each request blocks until the previous one finishes
    start = time.time()
    for url in urls:
        page_text = requests.get(url).text
        print(page_text)
    print('total time:', time.time() - start)

    # Thread-pool version: map() dispatches the three requests to worker threads
    start = time.time()
    pool = Pool(3)

    def get_request(url):
        return requests.get(url).text

    # blocks until all workers finish; results keep the order of urls
    response_list = pool.map(get_request, urls)
    print(response_list)

    # parsing can be farmed out to the pool as well
    def parse(page_text):
        print(len(page_text))

    pool.map(parse, response_list)
    print('total time:', time.time() - start)
    ```
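
    As a side note (my addition, not in the original post): the standard-library `concurrent.futures.ThreadPoolExecutor` expresses the same pattern and is the more common modern choice. A minimal sketch, assuming the same local Flask test endpoints as above:

    ```python
    from concurrent.futures import ThreadPoolExecutor
    import requests
    import time

    urls = ['http://127.0.0.1:5000/bobo','http://127.0.0.1:5000/jay','http://127.0.0.1:5000/tom']

    def get_request(url):
        return requests.get(url).text

    start = time.time()
    # executor.map works like pool.map: results come back in input order
    with ThreadPoolExecutor(max_workers=3) as executor:
        for page_text in executor.map(get_request, urls):
            print(len(page_text))
    print('total time:', time.time() - start)
    ```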
    
    **Coroutine objects:** calling an `async def` function does not run its body; it only produces a coroutine object.
    
    ```python
    from time import sleep

    async def get_request(url):
        print('requesting:', url)
        sleep(2)   # plain blocking sleep; harmless here because nothing runs yet
        print('request finished:', url)

    # calling the async function returns a coroutine object; its body has not executed
    c = get_request('www.1.com')
    print(c)   # <coroutine object get_request at 0x...>
    ```
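
    A minimal sketch (my addition, assuming Python 3.7+) of how to actually drive the coroutine: hand it to an event loop, here via `asyncio.run`, and use `asyncio.sleep` so the pause does not block the loop:

    ```python
    import asyncio

    async def get_request(url):
        print('requesting:', url)
        await asyncio.sleep(2)   # awaitable, non-blocking pause
        print('request finished:', url)

    # asyncio.run creates an event loop, runs the coroutine to completion,
    # and closes the loop
    asyncio.run(get_request('www.1.com'))
    ```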
    
    **Task objects:** a task wraps a coroutine so the event loop can schedule it, and a callback can be bound to fire when the task finishes.
    
    ```python
    from time import sleep
    import asyncio
    
    # Callback function: its single default argument is the finished task object
    def callback(task):
        print('i am callback!!1')
        # result() returns whatever the wrapped coroutine returned
        print(task.result())

    async def get_request(url):
        print('requesting:', url)
        sleep(2)
        print('request finished:', url)
        return 'hello bobo'

    # create a coroutine object
    c = get_request('www.1.com')
    # wrap it in a task object
    task = asyncio.ensure_future(c)

    # bind the callback to the task; it runs after the coroutine completes
    task.add_done_callback(callback)

    # create an event loop, register the task on it, and start the loop
    loop = asyncio.get_event_loop()
    loop.run_until_complete(task)
    ```
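
    For reference, a sketch (my addition) of the same task-plus-callback flow in the Python 3.7+ style, where the task is created inside an already running loop:

    ```python
    import asyncio

    def callback(task):
        # result() is the return value of the wrapped coroutine
        print('callback got:', task.result())

    async def get_request(url):
        print('requesting:', url)
        await asyncio.sleep(2)
        return 'hello bobo'

    async def main():
        # create_task schedules the coroutine on the running loop
        task = asyncio.create_task(get_request('www.1.com'))
        task.add_done_callback(callback)
        await task

    asyncio.run(main())
    ```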
    
    #### 5.1 Multi-Task Async Coroutines
    
    ```python
    import asyncio
    import time
    start = time.time()
    urls = [
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo'
    ]
    # The code inside a coroutine must not use modules that lack async support
    # (time.sleep, requests, ...); every blocking operation inside the function
    # must be awaitable and marked with the await keyword
    async def get_request(url):
        print('requesting:', url)
        await asyncio.sleep(2)   # non-blocking sleep; yields control to the loop
        print('request finished:', url)
        return 'hello bobo'

    tasks = []   # holds all the task objects
    for url in urls:
        c = get_request(url)
        task = asyncio.ensure_future(c)
        tasks.append(task)
    
    loop = asyncio.get_event_loop()
    # asyncio.wait suspends until every task in the list has finished
    loop.run_until_complete(asyncio.wait(tasks))

    print(time.time()-start)   # ~2s total instead of ~6s: the waits overlap
    ```
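
    An equivalent formulation (my sketch) uses `asyncio.gather`, which schedules all the coroutines and returns their results in order, without building the task list by hand:

    ```python
    import asyncio
    import time

    urls = ['http://localhost:5000/bobo'] * 3

    async def get_request(url):
        await asyncio.sleep(2)   # stands in for the network wait
        return 'hello bobo'

    async def main():
        # all three coroutines run concurrently; results keep input order
        return await asyncio.gather(*(get_request(url) for url in urls))

    start = time.time()
    print(asyncio.run(main()))
    print(time.time() - start)   # ~2s, not ~6s
    ```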
    
    **Applying multi-task async coroutines in a crawler**
    
    ```python
    import asyncio
    import requests
    import time
    start = time.time()
    urls = [
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo'
    ]
    # No concurrency is gained here: requests has no async support, so each
    # call blocks the event loop for its full duration
    async def req(url):
        page_text = requests.get(url).text   # blocks the whole loop
        return page_text
    
    tasks = []
    for url in urls:
        c = req(url)
        task = asyncio.ensure_future(c)
        tasks.append(task)
    
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    
    print(time.time()-start)   # roughly the sum of all request times: no speedup
    ```
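
    If you have to keep `requests`, one common workaround (my sketch, not part of the original post) is to push each blocking call onto a thread pool with `loop.run_in_executor`, which restores concurrency at the cost of worker threads:

    ```python
    import asyncio
    import requests
    import time

    urls = ['http://localhost:5000/bobo'] * 3

    def blocking_get(url):
        return requests.get(url).text

    async def main():
        loop = asyncio.get_running_loop()
        # None selects the default ThreadPoolExecutor; each blocking call
        # then runs in its own worker thread, concurrently
        futures = [loop.run_in_executor(None, blocking_get, url) for url in urls]
        return await asyncio.gather(*futures)

    start = time.time()
    pages = asyncio.run(main())
    print(len(pages))
    print(time.time() - start)
    ```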
    
    #### 5.2 aiohttp (requests does not support async)
    
    ```python
    import asyncio
    import time
    import aiohttp           # async HTTP client used in place of requests
    from lxml import etree
    urls = [
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
    ]
    # aiohttp supports async, so the six requests genuinely overlap.
    # Detail: put async in front of every with, and await in front of every
    # blocking step (here, reading the response body).
    async def req(url):
        async with aiohttp.ClientSession() as s:
            async with s.get(url) as response:
                # response.read() returns bytes; response.text() returns str
                page_text = await response.text()
                return page_text
    
    # callback: parse the page as soon as its task finishes
    def parse(task):
        page_text = task.result()
        tree = etree.HTML(page_text)
        name = tree.xpath('//p/text()')[0]
        print(name)

    if __name__ == '__main__':
        start = time.time()
        tasks = []
        for url in urls:
            c = req(url)
            task = asyncio.ensure_future(c)
            task.add_done_callback(parse)
            tasks.append(task)
    
        loop = asyncio.get_event_loop()
        loop.run_until_complete(asyncio.wait(tasks))
    
        print(time.time()-start)
    ```
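
    Two refinements worth knowing (my sketch, not in the original): share a single `ClientSession` across all requests so connections are reused, and cap in-flight requests with a semaphore so a small target server is not overwhelmed:

    ```python
    import asyncio
    import aiohttp
    import time

    urls = ['http://localhost:5000/bobo'] * 6

    async def fetch(session, url, sem):
        async with sem:                      # at most 3 requests in flight
            async with session.get(url) as response:
                return await response.text()

    async def main():
        sem = asyncio.Semaphore(3)
        # one shared session reuses TCP connections across all requests
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, url, sem) for url in urls))

    start = time.time()
    for page_text in asyncio.run(main()):
        print(len(page_text))
    print(time.time() - start)
    ```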
    