  • pandas

    apply is one of the most commonly used methods in pandas, but it is relatively slow. This post covers a few ways to speed it up.

    Data preparation

    df = pd.DataFrame(np.random.randint(0, 11, size=(1000000, 5)),
                          columns=('a','b','c','d','e'))

    Benchmarking apply

    if __name__ == '__main__':
        def func(a, b, c, d, e):
            if e == 10:
                return c * d
            elif (e < 10) and (e >= 5):
                return c + d
            elif e < 5:
                return a + b

        # Record the starting CPU time so that only the apply call is measured
        start = time.process_time()
        df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
        print(time.process_time() - start)      # 19s

    Time taken: 19s

    Speeding up with swifter

    if __name__ == '__main__':
        ### swifter speedup
        import swifter
        start = time.process_time()
        df['new'] = df.swifter.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
        print(time.process_time() - start)      # 7s; watch closely: starting the threads takes 7s, the final execution is only 1s

    Time taken: 7s
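
    The comment in the swifter snippet separates startup cost from the computation itself. One way to see that kind of split is to record wall-clock time and CPU time side by side: `time.process_time()` only counts CPU time used by the current process, so waiting (and any work done in other processes, if a library spawns them) does not show up in it. A minimal sketch, assuming a hypothetical helper named `timed` that is not part of the original code:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store (wall seconds, CPU seconds) spent inside the with-block.

    `timed` and `results` are hypothetical names used only for illustration.
    """
    wall_start = time.perf_counter()   # wall-clock timer, includes waiting
    cpu_start = time.process_time()    # CPU time of the current process only
    try:
        yield
    finally:
        results[label] = (time.perf_counter() - wall_start,
                          time.process_time() - cpu_start)


results = {}

with timed("sleep", results):
    time.sleep(0.2)                    # waiting costs wall time, almost no CPU

with timed("loop", results):
    total = sum(i * i for i in range(200_000))   # busy work costs both

for label, (wall, cpu) in results.items():
    print(f"{label}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

    Wrapping each benchmark in `with timed(...)` would make it easier to tell whether a reported figure is dominated by computation or by waiting.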

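    Vectorization

    The best way to work with pandas and numpy is vectorization, that is, operating on whole columns as vectors. It beats any method, function, or assorted acceleration trick.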

    if __name__ == '__main__':
        ### vectorization
        # Vector operations are faster than apply
        start = time.process_time()
        df['new'] = df['c'] * df['d']  # default case: e == 10
        mask = df['e'] < 10            # overwrite rows where e < 10
        df.loc[mask, 'new'] = df['c'] + df['d']
        mask = df['e'] < 5             # overwrite rows where e < 5
        df.loc[mask, 'new'] = df['a'] + df['b']
        print(time.process_time() - start)      # 1.2s

    Time taken: 1.2s

    The real point of this post is not the apply method itself. Just remember one thing: treating pandas and numpy data as vectors is the fastest approach.
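
    The same branching logic can also be written as a single vectorized expression. The sketch below swaps in `numpy.select`, which is not used in the original post, and checks on a small frame that it gives the same result as the row-by-row apply. The smaller frame size and the fixed seed are assumptions made only to keep the check fast and repeatable.

```python
import numpy as np
import pandas as pd

# Small frame with the same column layout as in the post (size is an assumption)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 11, size=(1000, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])


def func(a, b, c, d, e):
    if e == 10:
        return c * d
    elif (e < 10) and (e >= 5):
        return c + d
    elif e < 5:
        return a + b


# Row-by-row reference result
expected = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)

# Plain numpy arrays for each column
a, b, c, d, e = (df[col].to_numpy() for col in 'abcde')

# Conditions are evaluated in order and the first match wins,
# so `e >= 5` only applies to rows where e is not 10.
conditions = [e == 10, e >= 5]
choices = [c * d, c + d]
vectorized = np.select(conditions, choices, default=a + b)

# Both approaches should agree on every row
assert (expected.to_numpy() == vectorized).all()
```

    Compared with the repeated `df.loc[mask, ...]` assignments, this form lists every branch in one place, which can make the correspondence with `func` easier to check.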

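    The reference describes other, even faster methods, but I could not get them to work in my experiments, so I left them out. Feel free to try them yourself.

    Full code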

    import time
    import pandas as pd
    import numpy as np
    
    
    if __name__ == '__main__':
        df = pd.DataFrame(np.random.randint(0, 11, size=(1000000, 5)),
                          columns=('a','b','c','d','e'))
    
        def func(a,b,c,d,e):
            if e == 10:
                return c*d
            elif (e < 10) and (e>=5):
                return c+d
            elif e < 5:
                return a+b
    
        ### plain apply
        start = time.process_time()
        df['new'] = df.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
        print(time.process_time() - start)      # 19s
    
        ### swifter speedup
        import swifter
        start = time.process_time()
        df['new'] = df.swifter.apply(lambda x: func(x['a'], x['b'], x['c'], x['d'], x['e']), axis=1)
        print(time.process_time() - start)      # 7s; watch closely: starting the threads takes 7s, the final execution is only 1s
    
        ### vectorization
        # Vector operations are faster than apply
        start = time.process_time()
        df['new'] = df['c'] * df['d']  # default case: e == 10
        mask = df['e'] < 10
        df.loc[mask, 'new'] = df['c'] + df['d']
        mask = df['e'] < 5
        df.loc[mask, 'new'] = df['a'] + df['b']
        print(time.process_time() - start)      # 1.2s
    
        ### type conversion + vectorization
        df = df.astype(np.int16)       # conversion happens before timing starts
        start = time.process_time()
        df['new'] = df['c'] * df['d']  # default case: e == 10
        mask = df['e'] < 10
        df.loc[mask, 'new'] = df['c'] + df['d']
        mask = df['e'] < 5
        df.loc[mask, 'new'] = df['a'] + df['b']
        print(time.process_time() - start)      # 1.3s
    
        ### values
        start = time.process_time()
        df = df.astype(np.int16)
        df['new'] = df['c'].values * df['d'].values  # default case: e == 10
        mask = df['e'].values < 10
        df.loc[mask, 'new'] = df['c'] + df['d']
        mask = df['e'].values < 5
        df.loc[mask, 'new'] = df['a'] + df['b']
        print(time.process_time() - start)          # 1s
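
    The full code converts the frame to `np.int16` before one of the vectorized runs without saying what the conversion changes. A minimal sketch of the effect on memory, assuming the columns start out as 64-bit integers (the default integer width depends on platform and numpy version, so the example forces `np.int64` first) and using a smaller frame than the post:

```python
import numpy as np
import pandas as pd

# Same column layout as in the post; the row count is an assumption
df = pd.DataFrame(np.random.randint(0, 11, size=(100_000, 5)),
                  columns=['a', 'b', 'c', 'd', 'e']).astype(np.int64)
small = df.astype(np.int16)

# Bytes used by the column data, index excluded
bytes_int64 = df.memory_usage(index=False).sum()
bytes_int16 = small.memory_usage(index=False).sum()
print(bytes_int64, bytes_int16)

# Values are in 0..10, so the largest product c * d is 100,
# which is well inside the int16 range and cannot overflow here.
assert (small['c'] * small['d']).max() <= np.iinfo(np.int16).max
```

    The narrower dtype mainly reduces memory. In the author's measurements above it did not make the vectorized version noticeably faster (1.3s versus 1.2s).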

    References:

    https://mp.weixin.qq.com/s/cfoToYjcXXV5NJfwUr_1wA  Tricks for speeding up the pandas apply function a hundredfold

  • Original post: https://www.cnblogs.com/yanshw/p/15207172.html