Python has native support for asynchronous programming through the asyncio standard library. An asynchronous IO model avoids most of the time otherwise spent waiting on IO, which makes it a good fit for crawler workloads.
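To see where the savings come from, here is a minimal, network-free sketch: `asyncio.sleep` stands in for a slow request (an assumption for illustration, not part of the crawler code below), and 20 "requests" of 0.1 s each finish in roughly 0.1 s total because they wait concurrently.

```python
import asyncio
import time

async def fake_fetch(i):
    # simulate an IO-bound request with a non-blocking sleep
    await asyncio.sleep(0.1)
    return i

async def run_all():
    # all 20 coroutines wait at the same time, so the total
    # elapsed time is ~0.1 s rather than 20 * 0.1 = 2 s
    return await asyncio.gather(*(fake_fetch(i) for i in range(20)))

start = time.time()
results = asyncio.run(run_all())
elapsed = time.time() - start
print(len(results), elapsed < 1.0)
```

Run sequentially with blocking sleeps, the same work would take about 2 s; the event loop overlaps the waits instead.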
1. Basic usage
import time
import asyncio
import aiohttp  # fetch page content asynchronously

urls = ['https://www.baidu.com'] * 400

async def get_html(url, sem):
    async with sem:  # limit how many requests run at once
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                html = await resp.text()
                return html

def main():
    loop = asyncio.get_event_loop()  # get the event loop
    sem = asyncio.Semaphore(10)  # cap concurrency at 10
    tasks = [get_html(url, sem) for url in urls]  # collect all tasks in one list
    loop.run_until_complete(asyncio.gather(*tasks))  # run the coroutines to completion
    loop.close()  # close the event loop

if __name__ == '__main__':
    start = time.time()
    main()
    print(time.time() - start)  # ~5.03 s
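The semaphore is what keeps the 400 coroutines from all hitting the server at once. The following self-contained sketch (again using `asyncio.sleep` in place of a real request) instruments the same `async with sem` pattern to show that no more than 10 jobs are ever in flight:

```python
import asyncio

max_in_flight = 0  # highest number of jobs observed running at once
in_flight = 0

async def job(sem):
    global max_in_flight, in_flight
    async with sem:  # same acquire/release pattern as get_html above
        in_flight += 1
        max_in_flight = max(max_in_flight, in_flight)
        await asyncio.sleep(0.01)  # stand-in for the HTTP request
        in_flight -= 1

async def run():
    sem = asyncio.Semaphore(10)
    await asyncio.gather(*(job(sem) for _ in range(40)))

asyncio.run(run())
print(max_in_flight)  # 10: the semaphore never admits more
```

The remaining 30 coroutines simply wait inside `async with sem` until a slot frees up, so the crawl proceeds in waves of at most 10.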
2. Multiprocessing + coroutines
To push crawl speed further, and since Python's global interpreter lock (GIL) limits what multithreading can gain us, we can combine multiprocessing with coroutines:
import time
import asyncio
import aiohttp  # fetch page content asynchronously
from multiprocessing import Pool

all_urls = ['https://www.baidu.com'] * 400

async def get_html(url, sem):
    async with sem:  # limit how many requests run at once
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                html = await resp.text()
                return html

def main(urls):
    loop = asyncio.get_event_loop()  # each worker process gets its own event loop
    sem = asyncio.Semaphore(10)  # cap concurrency at 10 per process
    tasks = [get_html(url, sem) for url in urls]  # collect this process's tasks
    loop.run_until_complete(asyncio.gather(*tasks))  # run the coroutines to completion
    loop.close()  # close the event loop

if __name__ == '__main__':
    start = time.time()
    p = Pool(4)  # four worker processes
    for i in range(4):
        p.apply_async(main, args=(all_urls[i*100:(i+1)*100],))  # 100 URLs per process
    p.close()
    p.join()
    print(time.time() - start)  # ~2.87 s
As the timings show, multiprocessing speeds up the crawl further; the exact speedup depends on the machine's CPU.
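The slicing `all_urls[i*100:(i+1)*100]` only works because 400 divides evenly by 4. A small helper (hypothetical, not part of the code above) generalizes the split to any list length and worker count, which is handy when the URL count isn't a neat multiple:

```python
def chunk(seq, n):
    # split seq into n nearly equal contiguous chunks,
    # generalizing all_urls[i*100:(i+1)*100] from above
    k, r = divmod(len(seq), n)  # base chunk size, leftover items
    chunks = []
    start = 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)  # first r chunks get one extra
        chunks.append(seq[start:end])
        start = end
    return chunks

print(chunk(list(range(10)), 4))  # → [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Each chunk can then be handed to `p.apply_async(main, args=(c,))` regardless of how many URLs there are.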