
A Simple Implementation of a High-Performance Asynchronous Crawler

Asynchronous crawlers

Why asynchrony

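Crawling a website involves blocking operations, such as requesting pages and other I/O.

If only a handful of sites are crawled, the time spent blocked is barely noticeable. But what if there are hundreds, thousands, or even tens of thousands?

So we need a way around the blocking, and that way is asynchrony.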

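Ways to implement asynchrony:

Approach 1: multithreading or multiprocessing

Approach 2: thread pools or process pools

Approach 3: asynchronous coroutines (recommended)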
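
As a rough illustration of approach 2, the sketch below uses the standard library's concurrent.futures.ThreadPoolExecutor. The fetch function is a hypothetical stand-in that sleeps for 0.2 s instead of downloading anything, so the timing only shows that the waits overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor

URLS = [
    "http://www.baidu.com",
    "http://www.sougou.com",
    "http://www.pearvideo.com",
]


def fetch(url):
    """Hypothetical stand-in for a blocking page download."""
    time.sleep(0.2)  # simulated network wait, not a real request
    return f"page for {url}"


start = time.perf_counter()
# One worker per URL, so all three simulated downloads wait at the same time
with ThreadPoolExecutor(max_workers=len(URLS)) as pool:
    pages = list(pool.map(fetch, URLS))
elapsed = time.perf_counter() - start

print(f"{len(pages)} pages in {elapsed:.2f}s")
```

Run one after another, the three calls would take about 0.6 s; with the pool the total should stay close to a single 0.2 s wait.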

Multi-task asynchronous coroutines

Example:

import asyncio
import time
urls = [
    "http://www.baidu.com",
    "http://www.sougou.com",
    "http://www.pearvideo.com",
]

async def get_pagetext(url):

    print(url, "downloading...")
    time.sleep(2)
    print(url, "download complete")

# List that holds the tasks
stacks = []
start = time.time()
for url in urls:
    c = get_pagetext(url)
    # Wrap the coroutine in a task so the event loop can schedule it
    task = asyncio.ensure_future(c)
    stacks.append(task)
# Get the event loop object
loop = asyncio.get_event_loop()
# Wrap the task list in asyncio.wait(); this is the standard pattern
loop.run_until_complete(asyncio.wait(stacks))
print("Elapsed:", time.time()-start)

Output:

http://www.baidu.com downloading...
http://www.baidu.com download complete
http://www.sougou.com downloading...
http://www.sougou.com download complete
http://www.pearvideo.com downloading...
http://www.pearvideo.com download complete
Elapsed: 6.005340576171875

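The output shows that the coroutines had no effect: the program still ran serially. Why?

Notice the 2-second sleep added inside the function.

Synchronous blocking code must not appear inside a coroutine. A call such as time.sleep() blocks the whole event loop, so no other task can run until it returns. How, then, do we simulate a blocking wait?

asyncio provides its own sleep function, asyncio.sleep(delay). Modify the code: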

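async def get_pagetext(url):

    print(url, "downloading...")
    # Blocking code from a synchronous module stalls the event loop, so the tasks cannot overlap
    # time.sleep(2)
    # When a coroutine reaches a waiting point, suspend it explicitly with the await keyword
    await asyncio.sleep(2)
    print(url, "download complete")

Run it again:

Elapsed: 2.0028979778289795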
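
To make the difference easy to reproduce, here is a minimal side-by-side sketch. It assumes Python 3.7 or later and uses asyncio.run() with asyncio.gather() in place of the ensure_future / run_until_complete pattern above. The function names and the shortened 0.2 s delay are illustrative choices, not taken from the original code.

```python
import asyncio
import time

URLS = [
    "http://www.baidu.com",
    "http://www.sougou.com",
    "http://www.pearvideo.com",
]
DELAY = 0.2  # shortened stand-in for the 2 s sleep used above


async def blocking_fetch(url):
    time.sleep(DELAY)           # synchronous: holds the event loop for the whole wait
    return url


async def cooperative_fetch(url):
    await asyncio.sleep(DELAY)  # hands control back to the event loop while waiting
    return url


async def timed(fetch):
    """Run fetch for every URL concurrently and report the elapsed time."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fetch(url) for url in URLS))
    return results, time.perf_counter() - start


_, blocking_elapsed = asyncio.run(timed(blocking_fetch))
results, cooperative_elapsed = asyncio.run(timed(cooperative_fetch))

print(f"time.sleep:    {blocking_elapsed:.2f}s")
print(f"asyncio.sleep: {cooperative_elapsed:.2f}s")
```

The first figure should come out near three times DELAY and the second near one DELAY, mirroring the 6 s versus 2 s results above.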

A simple multi-task asynchronous application:

First, create a simple Flask service:

from flask import Flask
import time

app = Flask(__name__)

@app.route('/index')
def index():
    time.sleep(2)
    return "index"

@app.route('/home')
def home():
    time.sleep(2)
    return "home"

@app.route('/backend')
def backend():
    time.sleep(2)
    return "backend"


if __name__ == '__main__':
    app.run(threaded=True)

Build the crawler code:

import requests
import asyncio
import time
start = time.time()
urls = [
    "http://127.0.0.1:5000/index", "http://127.0.0.1:5000/home", "http://127.0.0.1:5000/backend"
]

async def get_pagetext(url):
    print(url, "downloading...")
    response = requests.get(url)
    print(url, "download complete", response.text)


tasks = []

for url in urls:

    c = get_pagetext(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
end = time.time()
print("Elapsed:", end-start)

Output:

http://127.0.0.1:5000/index downloading...
http://127.0.0.1:5000/index download complete index
http://127.0.0.1:5000/home downloading...
http://127.0.0.1:5000/home download complete home
http://127.0.0.1:5000/backend downloading...
http://127.0.0.1:5000/backend download complete backend
Elapsed: 6.023314476013184

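Why does it take 6 seconds? Because requests.get() is also synchronous, blocking code. To make the requests asynchronous, use an asynchronous HTTP client module: aiohttp.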
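
As an aside, and not the route this article takes: a blocking call can also be pushed onto a thread pool from inside a coroutine with loop.run_in_executor(), which in effect combines approach 2 with approach 3. The sketch below assumes Python 3.7 or later. blocking_get is a hypothetical stand-in for requests.get(url).text that sleeps for 0.2 s rather than contacting the Flask service.

```python
import asyncio
import time

URLS = [
    "http://127.0.0.1:5000/index",
    "http://127.0.0.1:5000/home",
    "http://127.0.0.1:5000/backend",
]


def blocking_get(url):
    """Hypothetical stand-in for requests.get(url).text; no network access."""
    time.sleep(0.2)                 # simulated blocking I/O
    return url.rsplit("/", 1)[-1]   # pretend the body is the last path segment


async def get_pagetext(url):
    loop = asyncio.get_running_loop()
    # None selects the event loop's default ThreadPoolExecutor
    return await loop.run_in_executor(None, blocking_get, url)


async def main():
    start = time.perf_counter()
    pages = await asyncio.gather(*(get_pagetext(url) for url in URLS))
    return pages, time.perf_counter() - start


pages, elapsed = asyncio.run(main())
print(pages, f"{elapsed:.2f}s")
```

The event loop stays free because each blocking call runs in a worker thread. This still costs one thread per in-flight request, which is one reason a natively asynchronous client such as aiohttp tends to scale better.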

The aiohttp module

A first look:

Asynchronous requests are sent through session, an instance of aiohttp's ClientSession class.

Syntax:

    async with aiohttp.ClientSession() as session:

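Request methods:

The session's request methods are used much like those of requests: the call style is the same, and you can spoof the User-Agent through headers and pass parameters.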

        async with session.get(url=url, headers=headers, params=None) as response:
        async with session.post(url=url, headers=headers, data=None) as response:

Response data formats:

    async with aiohttp.ClientSession() as session:
        # For a POST request, use session.post(url=url, headers=headers, data=None) instead
        async with session.get(url=url, headers=headers, params=None) as response:
            # Always await before reading the response data, otherwise you get:
            # coroutine 'ClientResponse.text' was never awaited
            page_text = await response.text()     # returns the body as a string
            page_text = await response.json()     # returns the body parsed from JSON
            page_text = await response.read()     # returns the body as bytes

Implementing it with aiohttp

async def get_pagetext(url):

    print(url, "downloading...")
    # requests.get() is synchronous code; asynchronous requests need an asynchronous HTTP client module: aiohttp
    # response = requests.get(url)
    # Use a session created from ClientSession
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # The response data must be awaited before use. This version omits await,
            # which produces: coroutine 'ClientResponse.text' was never awaited
            page_text = response.text()
            print("download complete", page_text)

Output:

download complete <coroutine object ClientResponse.text at 0x7f0e0bf6ee08>
download complete <coroutine object ClientResponse.text at 0x7f0e0bf8b4c0>
download complete <coroutine object ClientResponse.text at 0x7f0e0bf8b200>
Elapsed: 2.0088424682617188
/usr/lib/python3.6/asyncio/events.py:145: RuntimeWarning: coroutine 'ClientResponse.text' was never awaited
  self._callback(*self._args)

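The elapsed time is now about 2 seconds, which shows that the requests ran asynchronously. But the output also contains a warning: RuntimeWarning: coroutine 'ClientResponse.text' was never awaited

The message means that the coroutine returned by ClientResponse.text() was never awaited, so always use the await keyword before reading the response data.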
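
The mechanism behind the warning can be reproduced without any network access. In the sketch below, FakeResponse is a hypothetical stand-in for aiohttp's ClientResponse (it is not the real class), and its text() method is a coroutine function just like the real one. It assumes Python 3.7 or later.

```python
import asyncio


class FakeResponse:
    """Hypothetical stand-in for aiohttp's ClientResponse, for illustration only."""

    async def text(self):
        await asyncio.sleep(0)  # pretend to wait for the body to arrive
        return "index"


async def demo():
    response = FakeResponse()

    # Without await, calling text() only creates a coroutine object; nothing runs yet
    not_awaited = response.text()
    is_coroutine = asyncio.iscoroutine(not_awaited)
    not_awaited.close()  # discard it cleanly so no "never awaited" warning is raised

    # With await, the coroutine actually runs and yields the body
    page_text = await response.text()
    return is_coroutine, page_text


is_coroutine, page_text = asyncio.run(demo())
print(is_coroutine, page_text)  # True index
```

Applied to the crawler above, the same idea amounts to writing page_text = await response.text() inside the async with block.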

Original article: https://www.cnblogs.com/fanhua-wushagn/p/12961822.html