  • Python multithreading, multiprocessing, and coroutine performance compared (using a web crawler as an example)

      Basic setup: a low-spec Alibaba Cloud server with a single core and 2 GB of RAM.

      First, the coroutine version:

      
    # Patch the standard library before importing requests,
    # so that the sockets requests uses are cooperative.
    from gevent import monkey
    monkey.patch_all()

    import sys
    import time

    import gevent
    import lxml.html as HTML
    import requests

    # Build the page URLs from the start/end page numbers given on the command line.
    urls = []
    for i in range(int(sys.argv[1]), int(sys.argv[2])):
        url = 'http://grri94kmi4.app.tianmaying.com/songs?page=' + str(i)
        urls.append(url)

    def get_data(url):
        t1 = time.time()
        res = requests.get(url)
        if res.status_code == 200:
            print(url + ' : ' + 'url open success' + '  time use: ' + str(time.time() - t1))
        # Parse one dict per table row.
        html = HTML.fromstring(res.content)
        trs = html.xpath('//tbody/tr')
        data = []
        for tr in trs:
            s = {}
            s['name'] = tr.xpath('./td/a/text()')[0]
            s['url'] = tr.xpath('./td/a/@href')[0]
            s['id'] = s['url'][30:]  # drop the first 30 characters of the URL, leaving the id
            s['comment'] = tr.xpath('./td[last()]/text()')[0]
            data.append(s)
        return data

    if __name__ == '__main__':
        total = time.time()
        # Spawn one greenlet per URL and wait for all of them to finish.
        task = []
        for url in urls:
            task.append(gevent.spawn(get_data, url))
        gevent.joinall(task)
        print('total time use :', time.time() - total)

      Crawling 20 links takes about 4.9 s:

      total time use : 4.873192071914673
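
      The total stays far below 20 times a single request because the requests wait on the network at the same time. A minimal sketch of that overlap, using the standard library's asyncio in place of gevent so that it runs without extra packages; `fake_fetch`, `crawl`, `timed_crawl`, and the 0.1 s delay are hypothetical stand-ins rather than part of the original crawler:

```python
import asyncio
import time

async def fake_fetch(url, delay=0.1):
    # Stand-in for a network request: the coroutine yields control while "waiting".
    await asyncio.sleep(delay)
    return url

async def crawl(urls):
    # Schedule every fetch at once and wait for all of them; results keep input order.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

def timed_crawl(urls):
    # Run the crawl and report how long the whole batch took.
    start = time.perf_counter()
    results = asyncio.run(crawl(urls))
    return results, time.perf_counter() - start

if __name__ == '__main__':
    pages = ['page-%d' % i for i in range(20)]
    results, elapsed = timed_crawl(pages)
    print('fetched', len(results), 'pages in', round(elapsed, 2), 's')
```

      With every simulated request sleeping concurrently, the batch should take roughly one delay rather than twenty.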

      Threads and processes perform about the same, at around 6 s:

      

    import sys
    import time
    # multiprocessing.Pool is a process pool; multiprocessing.dummy.Pool
    # offers the same API backed by threads.
    from multiprocessing import Pool

    import lxml.html as HTML
    import requests

    # Build the page URLs from the start/end page numbers given on the command line.
    urls = []
    for i in range(int(sys.argv[1]), int(sys.argv[2])):
        url = 'http://grri94kmi4.app.tianmaying.com/songs?page=' + str(i)
        urls.append(url)

    def get_data(url):
        t1 = time.time()
        res = requests.get(url)
        if res.status_code == 200:
            print(url + ' : ' + 'url open success' + '  time use: ' + str(time.time() - t1))
        # Parse one dict per table row.
        html = HTML.fromstring(res.content)
        trs = html.xpath('//tbody/tr')
        data = []
        for tr in trs:
            s = {}
            s['name'] = tr.xpath('./td/a/text()')[0]
            s['url'] = tr.xpath('./td/a/@href')[0]
            s['id'] = s['url'][30:]  # drop the first 30 characters of the URL, leaving the id
            s['comment'] = tr.xpath('./td[last()]/text()')[0]
            data.append(s)
        return data

    if __name__ == '__main__':
        total = time.time()
        pool = Pool()  # defaults to os.cpu_count() worker processes
        results = pool.map(get_data, urls)
        pool.close()
        pool.join()
        print('total time use :', time.time() - total)
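
      The listing above uses a process pool, and no thread version is shown. A minimal sketch of what a thread-based variant might look like, using concurrent.futures from the standard library; `crawl_with_threads`, `fake_get_data`, the 0.1 s delay, and the worker count of 20 are assumptions for illustration, with the sleep standing in for the real `get_data` so that the example runs offline:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_get_data(url, delay=0.1):
    # Hypothetical stand-in for get_data: sleeps instead of making a request.
    time.sleep(delay)
    return {'url': url}

def crawl_with_threads(fetch, urls, max_workers=20):
    # executor.map returns results in input order, like pool.map.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))

if __name__ == '__main__':
    pages = ['page-%d' % i for i in range(20)]
    total = time.time()
    results = crawl_with_threads(fake_get_data, pages)
    print('total time use :', time.time() - total)
```

      To time the real crawler this way, pass the `get_data` function and the `urls` list from the listing above instead of the stand-ins.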

      

  • Original article: https://www.cnblogs.com/peter1994/p/7641658.html