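1.1 Common ways to implement concurrency
1. Overview
1. In a crawler, most of the time goes to network I/O. In single-process, single-thread mode every URL request blocks until its response arrives, so the crawl as a whole is slow.
2. Processes: starting a process for each request is expensive in memory and startup time.
3. Threads: cheaper than processes, but a crawler needs many of them, and a thread that is blocked on I/O cannot do any other work.
4. Coroutines: gevent runs everything in a single thread. Once a request has been sent, gevent switches to other work instead of waiting, and responses are handled in the order they come back.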
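The cost of waiting on I/O can be made concrete with a small timing sketch. It is not taken from the original notes: `fake_request`, `run_sequential`, and `run_threaded` are hypothetical helpers, and `time.sleep` stands in for the network round trip so the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(url, delay=0.2):
    """Stand-in for a blocking HTTP request: sleeps instead of using the network."""
    time.sleep(delay)
    return url


def run_sequential(urls):
    """Fetch the URLs one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for url in urls:
        fake_request(url)
    return time.perf_counter() - start


def run_threaded(urls, max_workers=5):
    """Fetch the URLs through a thread pool and return the elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces every result to be collected before the timer stops
        list(pool.map(fake_request, urls))
    return time.perf_counter() - start


if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(5)]
    print("sequential: %.2fs" % run_sequential(urls))
    print("threaded:   %.2fs" % run_threaded(urls))
```

With five simulated requests of 0.2 seconds each, the sequential run has to wait for each delay in turn, while the threaded run overlaps the waits. Real requests vary in latency, so the gap will differ in practice.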
2. Comparing several ways to implement concurrency
1) Concurrency with a thread pool
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from concurrent.futures import ThreadPoolExecutor


def fetch_request(url):
    result = requests.get(url)
    print(result.content)


pool = ThreadPoolExecutor(10)  # create a thread pool with at most 10 threads
url_list = [
    'http://www.google.com',  # requests rejects a URL without a scheme
    'http://www.baidu.com',
]
for url in url_list:
    # take a thread from the pool and have it run fetch_request
    pool.submit(fetch_request, url)
pool.shutdown(True)  # block the main thread until every submitted task has finished
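2) Concurrency with a process pool
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from concurrent.futures import ProcessPoolExecutor


def fetch_request(url):
    result = requests.get(url)
    print(result.text)


url_list = [
    'http://www.google.com',  # requests rejects a URL without a scheme
    'http://www.bing.com',
]

if __name__ == '__main__':
    pool = ProcessPoolExecutor(10)  # process pool with at most 10 worker processes
    # Drawback: processes are expensive, and a worker blocked on I/O cannot do other work
    for url in url_list:
        # take a process from the pool and have it run fetch_request
        pool.submit(fetch_request, url)
    pool.shutdown(True)  # wait until every submitted task has finished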
3) Multiple threads with a callback function
#! /usr/bin/env python
# -*- coding: utf-8 -*-
from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
    # runs in a worker thread; the return value becomes the future's result
    response = requests.get(url)
    return response


def callback(future):
    # called once the future is done; result() re-raises any exception from fetch_async
    print(future.result().content)


if __name__ == '__main__':
    url_list = ['http://www.github.com', 'http://www.bing.com']
    pool = ThreadPoolExecutor(5)
    for url in url_list:
        v = pool.submit(fetch_async, url)
        v.add_done_callback(callback)
    pool.shutdown(wait=True)
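A callback that calls `future.result()` directly re-raises whatever exception the task raised, for example when a URL has no scheme. The sketch below shows one way to record failures instead. It is an illustration under assumptions, not part of the original notes: `fake_fetch`, `make_callback`, and `fetch_all` are hypothetical names, and `fake_fetch` replaces `requests.get` so the example needs no network.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Hypothetical stand-in for requests.get: fails on URLs without a scheme."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("missing scheme: %r" % url)
    return "body of %s" % url


def make_callback(url, outcomes):
    """Build a callback that records either the result or the error for url."""
    def callback(future):
        error = future.exception()  # None when the task succeeded
        if error is not None:
            outcomes[url] = ("error", str(error))
        else:
            outcomes[url] = ("ok", future.result())
    return callback


def fetch_all(urls, max_workers=5):
    """Submit every URL and return a dict mapping url to (status, payload)."""
    outcomes = {}
    # leaving the with-block waits for all tasks, and their callbacks, to finish
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url in urls:
            future = pool.submit(fake_fetch, url)
            future.add_done_callback(make_callback(url, outcomes))
    return outcomes


if __name__ == "__main__":
    for url, outcome in fetch_all(["http://www.bing.com", "www.google.com"]).items():
        print(url, outcome)
```

Checking `future.exception()` first keeps one failed request from raising inside the callback and hiding the other results.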
4) Coroutines: asynchronous requests with micro-threads
#! /usr/bin/env python
# -*- coding: utf-8 -*-
from gevent import monkey
monkey.patch_all()  # patch the standard library before requests is imported

import gevent
import requests


# whichever request returns first is handled first
def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)


if __name__ == '__main__':
    ##### send the requests #####
    gevent.joinall([
        gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
        gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
        gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
    ])
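The "first back, first handled" behaviour can also be sketched with the standard library alone. Note that this swaps in `asyncio` in place of gevent, and `asyncio.sleep` in place of a real request. `fake_fetch` and `fetch_in_completion_order` are hypothetical names, and the delays are made-up values chosen only to force a completion order.

```python
import asyncio


async def fake_fetch(url, delay):
    """Hypothetical stand-in for a non-blocking request: waits delay seconds."""
    await asyncio.sleep(delay)
    return url


async def fetch_in_completion_order(jobs):
    """jobs is a list of (url, delay); return the URLs in the order they finished."""
    tasks = [asyncio.ensure_future(fake_fetch(url, delay)) for url, delay in jobs]
    finished = []
    # as_completed yields awaitables in the order the underlying tasks complete
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)
    return finished


if __name__ == "__main__":
    jobs = [
        ("https://www.python.org/", 0.3),
        ("https://www.yahoo.com/", 0.2),
        ("https://github.com/", 0.1),
    ]
    print(asyncio.run(fetch_in_completion_order(jobs)))
```

All three waits run on one thread. The event loop resumes each coroutine as its simulated response arrives, which is the same scheduling idea gevent applies to the patched `requests` calls.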