zoukankan html css js c++ java

python学习笔记——普通爬虫和协程爬虫

普通爬虫和协程爬虫

普通爬虫逻辑：

import time
from lxml import etree
import requests
urls = [
	'https://henan.qq.com/a/20190822/001115.htm',
	'https://henan.qq.com/a/20190822/001128.htm',
	'https://henan.qq.com/a/20190822/001086.htm',
	'https://henan.qq.com/a/20190822/001764.htm',
	'https://henan.qq.com/a/20190822/001163.htm',
	'https://henan.qq.com/a/20190822/001169.htm',
	'https://henan.qq.com/a/20190822/001196.htm',
	'https://henan.qq.com/a/20190822/001278.htm'
]
url = 'https://henan.qq.com/a/20190822/001764.htm'
def get_titles(url,cnt):
	reponse = requests.get(url)
	html = reponse.content
	title = etree.HTML(html).xpath('//*[@id="Main-Article-QQ"]/div[2]/div[1]/div[2]/div[1]/h1/text()')
	print('第%d个title:%s' % (cnt,''.join(title)))

if __name__ == '__main__':
	start1 = time.time()
	i = 0
	for url in urls:
		i = i + 1
		start = time.time()
		get_titles(url,i)
		print('第%d个title爬取耗时:%.5f秒' % (i,float(time.time() - start)))
	print('爬取总耗时:%.5f秒' % float(time.time() - start1))

get_titles()函数首先使用requests模块发起了一个get请求，获取html的页面源码。
然后利用etree中的xpath解析出想要获取到的内容。
xpath('')中使用的是xpath语法，可以准确的定位获取到的内容。可以在审查元素中直接右键Copy->Copy Xpath复制出xpath代码，然后使用时在尾部加上text()就可以。

协程爬虫

import time
from lxml import etree
import aiohttp
import asyncio
urls = [
	'https://henan.qq.com/a/20190822/001115.htm',
	'https://henan.qq.com/a/20190822/001128.htm',
	'https://henan.qq.com/a/20190822/001086.htm',
	'https://henan.qq.com/a/20190822/001764.htm',
	'https://henan.qq.com/a/20190822/001163.htm',
	'https://henan.qq.com/a/20190822/001169.htm',
	'https://henan.qq.com/a/20190822/001196.htm',
	'https://henan.qq.com/a/20190822/001278.htm',
]
titles = []
sem = asyncio.Semaphore(10)
async def get_title(url):
	with(await sem):
		async with aiohttp.ClientSession() as session:
			async with session.request('GET',url) as resp:
				html = await resp.read()
				title = etree.HTML(html).xpath('//*[@id="Main-Article-QQ"]/div[2]/div[1]/div[2]/div[1]/h1/text()')
				print(''.join(title))

def main():
	loop = asyncio.get_event_loop()
	tasks = [get_title(url) for url in urls]
	loop.run_until_complete(asyncio.wait(tasks))
	loop.close()
if __name__ == '__main__':
	start = time.time()
	main()
	print('总耗时: %.5f秒' % float(time.time()-start))

在协程中，由于requests库提供的相关方法不是awaitable，使得无法放在await后面，因此无法在协程中直接使用requests库进行请求。

为了解决这个问题，官方提供了一个aiohttp库，实现异步网页请求等功能。
使用aiohttp模块请求网页：

import aiohttp
async with aiohttp.ClintSession() as session:
	async with session.get('http://test.com') as resp:
		print(resp.status)
		print(await resp.text())

其中async with是封装了异步实现功能的异步上下文管理器
这段代码中将aiohttp模块中的ClientSession方法命名为session，并且将ClientResponse对象命名为resp,ClientSession.get()协程的必须参数为一个URL。再通过resp.text()获取到网页的全部内容。

查看全文

相关阅读:
使用mail架包发送邮件javax.mail.AuthenticationFailedException: failed to connect at javax.mail.Service.connec
java容器 Map Set List
COJ 1686：记忆化搜索
 POJ 3694:桥
 COJ 1685：贪心+set
COJ 1687:Set
COJ 1684:线段树
 POJ 3693:RMQ+后缀数组
 URAL 1297：后缀数组求最长回文串
 POJ 1743：后缀数组

原文地址：https://www.cnblogs.com/pr1s0n/p/12246058.html