  • urllib: Web Crawling

    1. A simple crawler

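    from urllib import request
    
    def f(url):
        print('GET: %s' % url)
        resp = request.urlopen(url)  # send the request; returns a response object
        data = resp.read()  # read the whole response body as bytes
        # A with-block closes the file even if write() fails, and the handle
        # gets its own name instead of reusing f.
        with open('url.html', 'wb') as out:
            out.write(data)
        print('%d bytes received from %s.' % (len(data), url))
    
    f('http://www.cnblogs.com/alex3714/articles/5248247.html')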
    

     Output:

    C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
    GET: http://www.cnblogs.com/alex3714/articles/5248247.html
    91829 bytes received from http://www.cnblogs.com/alex3714/articles/5248247.html.
    
    Process finished with exit code 0
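
    The crawler above assumes the request always succeeds. As a minimal sketch, not part of the original post, the same fetch can be given a timeout and basic error handling. The function name fetch, the User-Agent string, and the 10-second timeout are all assumptions chosen for illustration. The demo call uses a data: URL, which urllib resolves locally, so the example runs without a network connection; a real http(s) URL can be passed in the same way.

```python
from urllib import request


def fetch(url, timeout=10.0):
    """Return the response body as bytes, or None if the request fails."""
    try:
        # The User-Agent value is an assumption; some sites reject urllib's default.
        req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (example crawler)'})
        # The with-block closes the response even if read() raises.
        with request.urlopen(req, timeout=timeout) as resp:
            return resp.read()
    except (OSError, ValueError) as exc:
        # URLError and HTTPError are subclasses of OSError;
        # ValueError covers strings that are not URLs at all.
        print('failed to fetch %s: %s' % (url, exc))
        return None


# A data: URL is handled by urllib itself, so no network access is needed here.
body = fetch('data:text/plain,hello')
print('%d bytes received.' % len(body))
```

    Returning None on failure keeps the caller simple; a crawler that needs to tell timeouts apart from HTTP errors would probably re-raise or return the exception instead.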
    

    2. Crawling multiple pages

    from urllib import request
    import gevent
    
    def f(url):
        print('GET: %s' % url)
        resp = request.urlopen(url)  # send the request; returns a response object
        data = resp.read()  # read the whole response body as bytes
        print('%d bytes received from %s.' % (len(data), url))
    
    # Start three greenlets (coroutines), passing each one its URL
    gevent.joinall([
            gevent.spawn(f, 'https://www.python.org/'),
            gevent.spawn(f, 'https://www.yahoo.com/'),
            gevent.spawn(f, 'https://github.com/'),
    ])
    

     Output:

    GET: https://www.python.org/
    48751 bytes received from https://www.python.org/.
    GET: https://www.yahoo.com/
    479631 bytes received from https://www.yahoo.com/.
    GET: https://github.com/
    55394 bytes received from https://github.com/.
    
    Process finished with exit code 0
    

    3. Measuring the run time:

    from urllib import request
    import gevent
    import time
    
    def f(url):
        print('GET: %s' % url)
        resp = request.urlopen(url)  # send the request; returns a response object
        data = resp.read()  # read the whole response body as bytes
        print('%d bytes received from %s.' % (len(data), url))
    
    start_time=time.time()
    # Start three greenlets (coroutines), passing each one its URL
    gevent.joinall([
            gevent.spawn(f, 'https://www.python.org/'),
            gevent.spawn(f, 'https://www.yahoo.com/'),
            gevent.spawn(f, 'https://github.com/'),
    ])
    print('cost is %s:'%(time.time()-start_time))
    

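     Output: the elapsed time, and the fact that each download finishes before the next GET is printed, show that the requests still ran one after another. By default, gevent cannot tell that urllib is performing blocking I/O, so it never switches to another greenlet while a request is waiting.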

    C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
    GET: https://www.python.org/
    48751 bytes received from https://www.python.org/.
    GET: https://www.yahoo.com/
    488624 bytes received from https://www.yahoo.com/.
    GET: https://github.com/
    55394 bytes received from https://github.com/.
    cost is 4.5304529666900635:
    
    Process finished with exit code 0
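
    The serial behaviour can be reproduced without any network access. The sketch below is an illustration added here, not code from the original post: it uses an unpatched time.sleep as a stand-in for urllib's blocking call and gevent.sleep as a stand-in for I/O that gevent knows about. The task names and the 0.2-second delay are assumptions.

```python
import time

import gevent


def blocking_task(n):
    # time.sleep is not patched here, so it blocks the whole thread
    # and no other greenlet can run while it waits.
    time.sleep(0.2)
    return n


def cooperative_task(n):
    # gevent.sleep yields to the event loop, so other greenlets run meanwhile.
    gevent.sleep(0.2)
    return n


def run(task, count=3):
    """Spawn `count` greenlets running `task` and return the elapsed seconds."""
    start = time.perf_counter()
    jobs = [gevent.spawn(task, i) for i in range(count)]
    gevent.joinall(jobs)
    return time.perf_counter() - start


blocking_elapsed = run(blocking_task)
cooperative_elapsed = run(cooperative_task)
print('blocking: %.2fs, cooperative: %.2fs' % (blocking_elapsed, cooperative_elapsed))
```

    The blocking run takes roughly three times the delay and the cooperative run roughly one, which mirrors the difference between unpatched and gevent-aware I/O.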
    

    4. Comparing synchronous and asynchronous run times:

    from urllib import request
    import gevent
    import time
    #from gevent import monkey
    
    #monkey.patch_all()  # patch the standard library so blocking I/O yields to gevent (disabled in this run)
    def f(url):
        print('GET: %s' % url)
        resp = request.urlopen(url)  # send the request; returns a response object
        data = resp.read()  # read the whole response body as bytes
        print('%d bytes received from %s.' % (len(data), url))
    
    urls=['https://www.python.org/','https://www.yahoo.com/','https://github.com/']
    start_time=time.time()
    for url in urls:
        f(url)
    print('Sync cost is %s:'%(time.time()-start_time))
    
    
    async_time_start=time.time()  # start time of the asynchronous run
    gevent.joinall([
            gevent.spawn(f, 'https://www.python.org/'),
            gevent.spawn(f, 'https://www.yahoo.com/'),
            gevent.spawn(f, 'https://github.com/'),
    ])
    print('Async cost is %s:'%(time.time()-async_time_start))
    

     Run time: the two are roughly comparable, and the asynchronous version shows no clear advantage. The output order shows that the greenlets still ran one after another, so the gap between the two timings is more likely ordinary network variation than a benefit from gevent.

    C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
    GET: https://www.python.org/
    48751 bytes received from https://www.python.org/.
    GET: https://www.yahoo.com/
    480499 bytes received from https://www.yahoo.com/.
    GET: https://github.com/
    55394 bytes received from https://github.com/.
    Sync cost is 7.112711191177368:
    GET: https://www.python.org/
    48751 bytes received from https://www.python.org/.
    GET: https://www.yahoo.com/
    485666 bytes received from https://www.yahoo.com/.
    GET: https://github.com/
    55390 bytes received from https://github.com/.
    Async cost is 4.510450839996338:
    
    Process finished with exit code 0
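
    The timing code is repeated around each run. As a small sketch, offered as an assumption rather than the author's method, the measurement can be wrapped in a context manager. It uses time.perf_counter, which is intended for measuring intervals, and the helper name timed and the sleep calls standing in for the real work are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Measure the wall-clock time of the with-block and store it in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the time even if the block raises.
        results[label] = time.perf_counter() - start


timings = {}
with timed('sync', timings):
    time.sleep(0.05)  # stand-in for the synchronous for-loop over the URLs
with timed('async', timings):
    time.sleep(0.01)  # stand-in for the gevent.joinall([...]) call
for label, seconds in timings.items():
    print('%s cost is %.3f s' % (label, seconds))
```

    Because single measurements over the network vary a lot, repeating each run a few times and comparing the results gives a fairer picture than one pair of numbers.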
    

    5. Because gevent cannot tell by default that urllib is performing blocking I/O, the two have to be connected by applying a patch (monkey patching):

    from gevent import monkey
    monkey.patch_all()

    from gevent import monkey
    monkey.patch_all()  # patch first, before urllib is imported, so blocking I/O yields to gevent
    
    from urllib import request
    import gevent
    import time
    
    def f(url):
        print('GET: %s' % url)
        resp = request.urlopen(url)  # send the request; returns a response object
        data = resp.read()  # read the whole response body as bytes
        print('%d bytes received from %s.' % (len(data), url))
    
    urls=['https://www.python.org/','https://www.yahoo.com/','https://github.com/']
    start_time=time.time()
    for url in urls:
        f(url)
    print('Sync cost is %s:'%(time.time()-start_time))
    
    
    async_time_start=time.time()  # start time of the asynchronous run
    gevent.joinall([
            gevent.spawn(f, 'https://www.python.org/'),
            gevent.spawn(f, 'https://www.yahoo.com/'),
            gevent.spawn(f, 'https://github.com/'),
    ])
    print('Async cost is %s:'%(time.time()-async_time_start))
    

     Output: this time all three GET lines are printed before any download finishes, and the asynchronous run is clearly faster.

    C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
    GET: https://www.python.org/
    48751 bytes received from https://www.python.org/.
    GET: https://www.yahoo.com/
    487577 bytes received from https://www.yahoo.com/.
    GET: https://github.com/
    55392 bytes received from https://github.com/.
    Sync cost is 5.784578323364258:
    GET: https://www.python.org/
    GET: https://www.yahoo.com/
    GET: https://github.com/
    480662 bytes received from https://www.yahoo.com/.
    48751 bytes received from https://www.python.org/.
    55394 bytes received from https://github.com/.
    Async cost is 1.8721871376037598:
    
    Process finished with exit code 0
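
    With more than a handful of URLs, spawning one greenlet per URL opens every connection at once. One possible variation, sketched here as an assumption and not taken from the original post, is to run the downloads through gevent.pool.Pool, which caps the number of greenlets running at a time and returns each function's result. The names fetch_size and crawl and the concurrency of 2 are hypothetical. The demo uses data: URLs, which urllib resolves locally, so it runs offline; real http(s) URLs can be passed instead.

```python
from gevent import monkey
monkey.patch_all()  # patch before importing anything that uses sockets

from urllib import request

from gevent.pool import Pool


def fetch_size(url):
    """Download url and return (url, number of bytes received)."""
    with request.urlopen(url, timeout=10) as resp:
        return url, len(resp.read())


def crawl(urls, concurrency=2):
    # Pool limits how many greenlets run at once; map returns the
    # results as a list in the same order as the input URLs.
    pool = Pool(concurrency)
    return pool.map(fetch_size, urls)


# data: URLs are handled inside urllib, so this demo needs no network access.
demo_urls = ['data:text/plain,hello', 'data:text/plain,gevent', 'data:text/plain,pool']
results = crawl(demo_urls)
for url, size in results:
    print('%d bytes received from %s.' % (size, url))
```

    In this sketch an exception raised inside fetch_size propagates out of pool.map, so a real crawler would likely catch errors per URL.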
    
  • Original article: https://www.cnblogs.com/momo8238/p/7372538.html