
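A Powerful Crawler Tool: Asynchronous Crawling

1. Background

One of my projects needed a crawler, so I searched Google and found that, besides requests, scrapy, multithreading, and multiprocessing, there is another approach: asynchronous crawling, which uses aiohttp together with async to crawl websites.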

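2. Usage

Install it with the following command:

# Install aiohttp with pip
pip install aiohttp

When using aiohttp, keep in mind that it has to be driven through async-style asynchronous programming, and the async/await syntax requires Python 3.5 or later. The crawling itself is then done through aiohttp's asynchronous requests.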
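
To make the async style concrete, here is a minimal sketch that uses only the standard library. The names fake_request and run_demo and the delays are hypothetical stand-ins, not part of the project: asyncio.sleep plays the role of waiting on a network response. It also assumes Python 3.7 or later, because it uses asyncio.run.

```python
import asyncio
import time


async def fake_request(name, delay):
    # Hypothetical stand-in for a network call: awaiting the sleep hands
    # control back to the event loop, just as awaiting a response would.
    await asyncio.sleep(delay)
    return name


async def run_demo():
    start = time.perf_counter()
    # gather() runs the coroutines concurrently and returns their results
    # in the order the coroutines were passed in.
    results = await asyncio.gather(
        fake_request("a", 0.2),
        fake_request("b", 0.1),
        fake_request("c", 0.2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = asyncio.run(run_demo())
    print(results, round(elapsed, 2))
```

Because the three waits overlap, the total time should be close to the longest single delay rather than the sum of all three.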

3. Models (rough diagrams, not exact)

3.1 Multithreaded crawler model

3.2 Asynchronous crawler model
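
The difference between the two models can also be seen in a small experiment. This is only a rough sketch with simulated waits, not a benchmark: blocking_job, async_job, run_threaded, and run_async are hypothetical names, and the sleeps stand in for network I/O. Each job reports the ID of the thread it ran on, so we can count how many threads each model used.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_job(_):
    # Thread model: the worker thread blocks while it "waits on the network".
    time.sleep(0.05)
    return threading.get_ident()


async def async_job(_):
    # Async model: the coroutine suspends at await, and the single
    # event-loop thread moves on to another coroutine.
    await asyncio.sleep(0.05)
    return threading.get_ident()


def run_threaded(n, workers=4):
    # Returns the set of thread IDs that handled the n jobs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return set(pool.map(blocking_job, range(n)))


async def run_async(n):
    # Returns the set of thread IDs that handled the n coroutines.
    return set(await asyncio.gather(*(async_job(i) for i in range(n))))


if __name__ == "__main__":
    print("threads used by the pool:", len(run_threaded(8)))
    print("threads used by asyncio: ", len(asyncio.run(run_async(8))))
```

The thread pool spreads the jobs over up to `workers` threads, while the asyncio version handles every job on the one thread that runs the event loop.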

4. Core code

It is much like an ordinary requests crawler, except that aiohttp is asynchronous, so the functions have to be written with async.

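import asyncio
import logging
import time

import aiohttp
import pandas as pd

# NOTE: header (the request headers dict), urls (the list of URLs to crawl)
# and parser (the async function that parses a page) are defined elsewhere
# in the project and are not shown in this excerpt.

# Set the crawler's log format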
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

df = pd.DataFrame(columns=['occupation', 'companyName', 'location', 'salary', 'date', 'education', 'experience', 'companyType', 'companySize', 'type', 'request'])


# Asynchronous HTTP request; the semaphore limits how many run at once
async def fetch(sem11, session, url):
    async with sem11:
        async with session.get(url, headers=header) as response:
            return await response.text(errors='ignore')

# Download and process one page
async def download(sem, url):
    async with aiohttp.ClientSession() as session:
        try:
            html = await fetch(sem, session, url)
            # Add the page-parsing code here
            await parser(html)
        except Exception as err:
            logger.error('Failed to crawl %s: %s', url, err)


if __name__ == '__main__':
    # Measure how long the crawl takes
    print('*' * 50)
    t_start_web = time.time()
    loop = asyncio.get_event_loop()
    # Allow at most 100 requests in flight at once
    sem1 = asyncio.Semaphore(100)
    # The URL list is split among three people:
    # person 1: urls[:3413]
    # person 2: urls[3413:6826]
    # person 3: urls[6826:]
    tasks = [asyncio.ensure_future(download(sem1, url)) for url in urls[:3413]]
    tasks = asyncio.gather(*tasks)
    loop.run_until_complete(tasks)
    t_end_web = time.time()
    print('Total crawl time: %s s' % (t_end_web - t_start_web))
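
To see what the semaphore in the code above actually does, here is a small sketch that runs without any network access. Everything in it is an assumption made for illustration: fake_fetch, crawl, the state dict, and the example.com URLs are hypothetical, and asyncio.sleep stands in for session.get. It records the peak number of simultaneous "requests" and also shows one way to collect failures with return_exceptions=True instead of catching them inside each task.

```python
import asyncio


async def fake_fetch(sem, url, state):
    # Same shape as fetch() above: the semaphore is held for the whole request.
    async with sem:
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0.01)  # stands in for session.get(...)
        state["in_flight"] -= 1
        if url.endswith("/bad"):
            raise ValueError("simulated failure for " + url)
        return "<html>" + url + "</html>"


async def crawl(urls, limit):
    sem = asyncio.Semaphore(limit)
    state = {"in_flight": 0, "peak": 0}
    # return_exceptions=True puts failures into the result list as values
    # instead of raising the first one out of gather().
    results = await asyncio.gather(
        *(fake_fetch(sem, u, state) for u in urls),
        return_exceptions=True,
    )
    return results, state["peak"]


if __name__ == "__main__":
    urls = ["https://example.com/page/%d" % i for i in range(20)]
    urls.append("https://example.com/bad")
    results, peak = asyncio.run(crawl(urls, limit=5))
    failures = [r for r in results if isinstance(r, Exception)]
    print("peak concurrency:", peak, "failures:", len(failures))
```

All of the tasks are created at once, but only `limit` of them can be inside the `async with sem` block at any moment; the rest wait their turn. In a real crawler the same pattern would wrap the aiohttp request, ideally with a single ClientSession shared by every task so that connections can be reused.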

5. References

5.1 Detailed introductions

http://www.ityouknow.com/python/2019/12/28/python-aiohttp-102.html

https://www.jianshu.com/p/b8010594557f

5.2 Performance comparison

https://www.cnblogs.com/jiyongjia/p/9803991.html


Original article: https://www.cnblogs.com/future-dream/p/13347621.html