1. A simple crawler
from urllib import request

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()           # read the whole response body
    # Save the raw bytes; 'with' closes the file even if the write fails
    with open('url.html', 'wb') as out:
        out.write(data)
    print('%d bytes received from %s.' % (len(data), url))

f('http://www.cnblogs.com/alex3714/articles/5248247.html')
Output:
C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: http://www.cnblogs.com/alex3714/articles/5248247.html
91829 bytes received from http://www.cnblogs.com/alex3714/articles/5248247.html.
Process finished with exit code 0
2. Crawling multiple pages
from urllib import request
import gevent

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()           # read the whole response body
    print('%d bytes received from %s.' % (len(data), url))

# Start three greenlets (coroutines), passing each one its URL
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
Output:
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
479631 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55394 bytes received from https://github.com/.
Process finished with exit code 0
3. Measuring the run time:
from urllib import request
import gevent
import time

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()           # read the whole response body
    print('%d bytes received from %s.' % (len(data), url))

start_time = time.time()
# Start three greenlets (coroutines), passing each one its URL
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print('cost is %s:' % (time.time() - start_time))
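Output: the order of the lines and the elapsed time show that the requests still ran serially. By default gevent cannot tell that urllib is performing I/O, so a greenlet never yields to the others while it waits on the network.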
C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
488624 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55394 bytes received from https://github.com/.
cost is 4.5304529666900635:
Process finished with exit code 0
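The serial behaviour seen in this timing can be reproduced without any network access. The following is a minimal sketch rather than the author's code: `blocking_task`, `cooperative_task`, and `timed` are hypothetical names, and `time.sleep` is assumed to stand in for the blocking `urlopen` call. An unpatched blocking call holds the whole thread, while `gevent.sleep` hands control back to gevent's event loop so the other greenlets can run.

```python
import time
import gevent

def blocking_task(delay):
    # Plain blocking call: gevent gets no chance to switch to another greenlet.
    time.sleep(delay)

def cooperative_task(delay):
    # Cooperative call: yields to the gevent hub while waiting.
    gevent.sleep(delay)

def timed(task, delay=0.1, count=3):
    # Run `count` greenlets of `task` and return the total elapsed seconds.
    start = time.perf_counter()
    gevent.joinall([gevent.spawn(task, delay) for _ in range(count)])
    return time.perf_counter() - start

blocking_cost = timed(blocking_task)
cooperative_cost = timed(cooperative_task)
print('blocking cost is %s' % blocking_cost)
print('cooperative cost is %s' % cooperative_cost)
```

With three tasks of 0.1 s each, the blocking run should take about the sum of the delays, while the cooperative run should take about as long as a single delay.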
4. Comparing synchronous and asynchronous run times:
from urllib import request
import gevent
import time
# from gevent import monkey
# monkey.patch_all()  # would make blocking I/O cooperative; left disabled here

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()           # read the whole response body
    print('%d bytes received from %s.' % (len(data), url))

urls = ['https://www.python.org/', 'https://www.yahoo.com/', 'https://github.com/']

start_time = time.time()
for url in urls:
    f(url)
print('sync cost is %s:' % (time.time() - start_time))

async_time_start = time.time()  # start time of the asynchronous run
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print('async cost is %s:' % (time.time() - async_time_start))

Run time: the two are roughly the same, so the asynchronous version shows no real advantage. Without the patch, both versions fetch the pages one after another, so the gap between 7.1 s and 4.5 s is most likely ordinary network variation rather than concurrency.

C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
480499 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55394 bytes received from https://github.com/.
sync cost is 7.112711191177368:
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
485666 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55390 bytes received from https://github.com/.
async cost is 4.510450839996338:
Process finished with exit code 0
5. By default gevent cannot tell that urllib is performing I/O. To link the two, import one more helper (the monkey patch) and apply it:
from gevent import monkey
monkey.patch_all()
# Patch as early as possible, before importing modules that use socket or ssl
from gevent import monkey
monkey.patch_all()  # replace the standard library's blocking I/O with cooperative versions

from urllib import request
import gevent
import time

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()           # read the whole response body
    print('%d bytes received from %s.' % (len(data), url))

urls = ['https://www.python.org/', 'https://www.yahoo.com/', 'https://github.com/']

start_time = time.time()
for url in urls:
    f(url)
print('sync cost is %s:' % (time.time() - start_time))

async_time_start = time.time()  # start time of the asynchronous run
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print('async cost is %s:' % (time.time() - async_time_start))

Output: in the asynchronous run the three GET lines are printed back to back and the responses arrive out of order, so the requests now overlap. The asynchronous run takes about 1.9 s, against about 5.8 s for the synchronous loop.

C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
487577 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55392 bytes received from https://github.com/.
sync cost is 5.784578323364258:
GET: https://www.python.org/
GET: https://www.yahoo.com/
GET: https://github.com/
480662 bytes received from https://www.yahoo.com/.
48751 bytes received from https://www.python.org/.
55394 bytes received from https://github.com/.
async cost is 1.8721871376037598:
Process finished with exit code 0
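The effect of the patch can also be checked offline. This is a hedged sketch, not part of the original exercise: `fake_download` is a hypothetical function, and `time.sleep` is assumed to stand in for a blocking network call. After `monkey.patch_all()`, the same `time.sleep` call becomes cooperative, and `monkey.is_module_patched` reports whether a standard library module has been replaced.

```python
from gevent import monkey
monkey.patch_all()  # patch first, before the modules that do I/O are imported

import time
import gevent

def fake_download(name, delay=0.1):
    # time.sleep stands in for a blocking network call (assumption for this demo).
    # Once patched, it yields to other greenlets instead of blocking the thread.
    time.sleep(delay)
    return name

start = time.perf_counter()
jobs = [gevent.spawn(fake_download, name) for name in ('a', 'b', 'c')]
gevent.joinall(jobs)
elapsed = time.perf_counter() - start

socket_patched = monkey.is_module_patched('socket')
print('socket patched: %s' % socket_patched)
print('results: %s' % [job.value for job in jobs])
print('cost is %s' % elapsed)
```

The three simulated downloads should finish in roughly the time of one delay, which mirrors the drop seen in the asynchronous timing once the patch is applied.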