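  • Web crawlers: performance

    1.1 Common ways to implement concurrency

      1. Introduction

          1. When writing a crawler, most of the time is spent on I/O requests. In single-process, single-threaded mode every URL request blocks until its response arrives, so the crawl as a whole is slow.

          2. Processes: each process carries its own interpreter and memory, so starting many of them is very wasteful of resources.

          3. Threads: many threads are needed, and a thread that is blocked on a request cannot do any other work while it waits.

          4. Coroutines: gevent runs everything in a single thread. Once a request has been sent, gevent does not wait on it but switches to other work. Only one thread is ever running, and whichever response comes back first is handled first.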
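To make the cost of waiting concrete, here is a minimal sketch that compares a serial loop with a thread pool. It assumes a hypothetical `fake_fetch` that only sleeps to imitate network latency; the URLs, the 0.2 s delay, and the helper names `run_serial` and `run_threaded` are illustrative and do not come from the original article.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url, delay=0.2):
    """Hypothetical stand-in for an HTTP request: the sleep imitates waiting on I/O."""
    time.sleep(delay)
    return url

# Illustrative URLs only; nothing is actually requested.
urls = ['http://example.com/page/%d' % i for i in range(5)]

def run_serial(urls):
    # One request after another: total time is roughly the sum of all delays.
    start = time.perf_counter()
    results = [fake_fetch(u) for u in urls]
    return results, time.perf_counter() - start

def run_threaded(urls):
    # All requests wait at the same time: total time is roughly one delay.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=5) as pool:
        results = list(pool.map(fake_fetch, urls))  # map keeps the input order
    return results, time.perf_counter() - start

if __name__ == '__main__':
    _, serial_time = run_serial(urls)
    _, threaded_time = run_threaded(urls)
    print('serial: %.2fs, threaded: %.2fs' % (serial_time, threaded_time))
```

With real requests the gap depends on the servers and the network, so the numbers from this sketch only illustrate the principle.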

      2. Comparing several ways to implement concurrency

        1) Concurrency with a thread pool

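    #! /usr/bin/env python
    # -*- coding: utf-8 -*-
    import requests
    from concurrent.futures import ThreadPoolExecutor
    
    def fetch_request(url):
        # Blocking HTTP GET; runs inside a worker thread
        result = requests.get(url)
        print(result.content)
    
    pool = ThreadPoolExecutor(10)       # Create a thread pool with at most 10 threads
    url_list = [
        'http://www.google.com',        # The scheme is required; a bare 'www.google.com' raises MissingSchema
        'http://www.baidu.com',
    ]
    
    for url in url_list:
        # Take a thread from the pool and
        # have it run fetch_request(url)
        pool.submit(fetch_request, url)
    
    pool.shutdown(True)     # Block the main thread until every submitted task has finished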
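One detail the listing above does not show: an exception raised inside a task submitted with `pool.submit` is stored on the returned `Future` and stays silent until `result()` is called. The following is a minimal sketch of collecting both results and failures with `as_completed`. The names `fake_fetch` and `fetch_all` and the error message are hypothetical; `fake_fetch` stands in for `requests.get` so that the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_fetch(url):
    """Hypothetical stand-in for requests.get: rejects URLs that lack a scheme."""
    if not url.startswith(('http://', 'https://')):
        raise ValueError('missing scheme: %r' % url)
    return 'body of ' + url

def fetch_all(urls, max_workers=10):
    """Return (ok, failed): url -> body for successes, url -> error text for failures."""
    ok, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which URL each future belongs to.
        future_to_url = {pool.submit(fake_fetch, u): u for u in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                ok[url] = future.result()   # re-raises the worker's exception, if any
            except ValueError as exc:
                failed[url] = str(exc)
    return ok, failed

if __name__ == '__main__':
    ok, failed = fetch_all(['www.google.com', 'http://www.baidu.com'])
    print('ok:', ok)
    print('failed:', failed)
```

With real requests the `except` clause would more likely catch `requests.exceptions.RequestException`; which exceptions to treat as recoverable is a design choice left open here.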

        2) Concurrency with a process pool

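    #! /usr/bin/env python
    # -*- coding: utf-8 -*-
    import requests
    from concurrent.futures import ProcessPoolExecutor
    
    def fetch_request(url):
        # Blocking HTTP GET; runs inside a worker process
        result = requests.get(url)
        print(result.text)
    
    url_list = [
        'http://www.google.com',        # The scheme is required; a bare 'www.google.com' raises MissingSchema
        'http://www.bing.com',
    ]
    
    if __name__ == '__main__':
        pool = ProcessPoolExecutor(10)  # Process pool with at most 10 worker processes
        # Drawback: processes are expensive to start and to keep around
        for url in url_list:
            # Take a process from the pool and
            # have it run fetch_request(url)
            pool.submit(fetch_request, url)
        pool.shutdown(True)             # Wait until every submitted task has finished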

        3) Multithreading with a callback function

    #! /usr/bin/env python
    # -*- coding: utf-8 -*-
    from concurrent.futures import ThreadPoolExecutor
    import requests
    
    def fetch_async(url):
        # Runs in a worker thread; the return value becomes the future's result
        response = requests.get(url)
        return response
    
    def callback(future):
        # Called once the future is done; result() re-raises if the request failed
        print(future.result().content)
    
    if __name__ == '__main__':
        url_list = ['http://www.github.com', 'http://www.bing.com']
        pool = ThreadPoolExecutor(5)
        for url in url_list:
            v = pool.submit(fetch_async, url)
            v.add_done_callback(callback)   # Handle the response as soon as it is ready
        pool.shutdown(wait=True)
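The callback pattern above handles each response as soon as it is ready, so callbacks fire in completion order rather than submission order. A minimal sketch of that behaviour follows, assuming a hypothetical `fake_fetch` whose sleep imitates network latency; the URLs, the delays, and the helper `crawl` are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url, delay):
    """Hypothetical stand-in for requests.get: the sleep imitates a slow server."""
    time.sleep(delay)
    return url

def crawl(jobs):
    """jobs is a list of (url, delay); returns URLs in the order their callbacks fired."""
    finished = []

    def on_done(future):
        # Runs when the task finishes; list.append is thread-safe in CPython.
        finished.append(future.result())

    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        for url, delay in jobs:
            pool.submit(fake_fetch, url, delay).add_done_callback(on_done)
    # Leaving the with-block waits for all tasks, and therefore all callbacks.
    return finished

if __name__ == '__main__':
    order = crawl([('http://slow.example', 0.3), ('http://fast.example', 0.05)])
    print(order)
```

The slow job is submitted first but its callback runs last, which is the "whoever comes back first is handled first" behaviour that the thread pool shares with the coroutine approach.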

        4) Coroutines: asynchronous requests with micro-threads

    #! /usr/bin/env python
    # -*- coding: utf-8 -*-
    from gevent import monkey
    monkey.patch_all()      # Patch the standard library before requests (and ssl) are imported
    
    import gevent
    import requests
    
    # Whichever response comes back first is handled first
    def fetch_async(method, url, req_kwargs):
        print(method, url, req_kwargs)
        response = requests.request(method=method, url=url, **req_kwargs)
        print(response.url, response.content)
    
    
    if __name__ == '__main__':
        ##### Send the requests #####
        # spawn() schedules one greenlet per request; joinall() waits for all of them
        gevent.joinall([
            gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
            gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
            gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
        ])
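The single-thread, first-come-first-handled behaviour can also be sketched without a third-party library. The example below swaps in the standard library's `asyncio` in place of gevent, so it is an analogy and not the approach used in the listing above. `fake_fetch`, `crawl`, `run`, and the delays are hypothetical, and `asyncio.sleep` stands in for a real network request.

```python
import asyncio
import threading

async def fake_fetch(url, delay):
    """Hypothetical stand-in for a request: awaiting the sleep hands control to the event loop."""
    await asyncio.sleep(delay)
    # Report which thread handled the response.
    return url, threading.get_ident()

async def crawl(jobs):
    """jobs is a list of (url, delay); returns (url, thread id) pairs in completion order."""
    tasks = [asyncio.create_task(fake_fetch(url, delay)) for url, delay in jobs]
    results = []
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return results

def run(jobs):
    return asyncio.run(crawl(jobs))

if __name__ == '__main__':
    for url, thread_id in run([('https://www.python.org/', 0.2), ('https://github.com/', 0.05)]):
        print(url, thread_id)
```

Every result carries the same thread id, and the shorter delay finishes first even though it was scheduled second.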
  • Original article: https://www.cnblogs.com/jiaxinzhu/p/12528979.html