    Speed comparison of correlation coefficient computation: pandas vs. Spark

    There are three common algorithms for computing correlation: Pearson, Spearman, and Kendall.

    In pandas, all three correlation coefficients can be computed directly on a DataFrame with data.corr(method=...).

    Under the hood this relies on algorithms from the scipy library.
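
    A minimal illustration of the interface (a toy example, not the benchmark itself):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(100, 4))
    for method in ("pearson", "spearman", "kendall"):
        # corr() returns a symmetric 4 x 4 DataFrame of coefficients
        print(method, df.corr(method=method).shape)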

    To speed up the computation, we used the Spark platform to run it in parallel.

    We compared the speed of three approaches: plain pandas, scipy parallelized with Spark, and Spark's MLlib.

    Overall, Spark MLlib was the fastest, Spark-parallelized scipy came second, and pandas was the slowest.

    corr execution speed test results

    Times in seconds.

    Data size   Algorithm   pandas   Spark + scipy   Spark MLlib   Notes
    1000*3600   pearsonr    203      170             37            pyspark
    1000*3600   pearsonr    203      50              not run       Spark + scipy computed only half the matrix
    1000*3600   pearsonr    203      125             37            client mode
    1000*3600   pearsonr    202      157             38            client mode
    1000*3600   spearmanr   1386     6418            37            client mode
    1000*3600   spearmanr   1327     6392            38            client mode
    1000*3600   kendall     4326     398             n/a           client mode; MLlib has no Kendall
    1000*3600   kendall     4239     346             n/a           client mode; MLlib has no Kendall
    1000*1000   spearmanr   127      294             12            client mode
    1000*1000   spearmanr   98       513             5.55          client mode
    1000*360    spearmanr   13       150             not run       pure-Python list comprehension took 160 s: res = [st.spearmanr(data.iloc[:, i], data.iloc[:, j])[0] for i in range(N) for j in range(N)]
    1000*360    kendall     40       45              n/a           pure-Python list comprehension took 116 s: res = [st.kendalltau(data.iloc[:, i], data.iloc[:, j])[0] for i in range(N) for j in range(N)]

    Note: spearmanr is dramatically slower under the Spark + scipy combination; the result looks suspect and needs further analysis. A likely cause is that every pairwise st.spearmanr call re-ranks both input columns from scratch, so the ranking work is repeated for all N*N pairs, whereas pandas ranks each column once and then computes Pearson correlation on the ranks.
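
    If that explanation holds, Spearman can be brought down to roughly Pearson cost by ranking once up front. A minimal sketch of this trick (not part of the original benchmark):

    import numpy as np
    import pandas as pd

    data = pd.DataFrame(np.random.randn(1000, 360))

    # Rank every column once, then a single vectorized Pearson pass over the
    # ranks yields the full Spearman correlation matrix.
    ranked = data.rank()
    spearman = np.corrcoef(ranked.values, rowvar=False)

    # Agrees with pandas' built-in result up to floating-point error.
    assert np.allclose(spearman, data.corr(method='spearman').values)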


    The three scripts are shown below:

    pandas script

    
    import numpy as np
    import pandas as pd
    import time

    C = 1000   # rows (observations)
    N = 3600   # columns (variables)

    # C x N matrix of standard-normal samples
    data = pd.DataFrame(np.random.randn(C * N).reshape(C, -1))


    print("============================ {}".format(data.shape))
    print("start pandas corr ---{} ".format(time.time()))
    start = time.time()
    # method is one of {'pearson', 'kendall', 'spearman'}
    res = data.corr(method='pearson')
    end_1 = time.time()

    res = data.corr(method='spearman')
    end_2 = time.time()

    res = data.corr(method='kendall')
    end_3 = time.time()

    print("pandas pearson count {} total cost : {}".format(len(res), end_1 - start))
    print("pandas spearman count {} total cost : {}".format(len(res), end_2 - end_1))
    print("pandas kendall count {} total cost : {}".format(len(res), end_3 - end_2))
    

    Spark + scipy script

    from pyspark import SparkContext
    sc = SparkContext()
    import numpy as np
    import pandas as pd
    from scipy import stats as st
    import time
    
    # t1 = st.kendalltau(x, y)
    # t2 = st.spearmanr(x, y)
    # t3 = st.pearsonr(x, y)
    
    C = 1000
    N = 3600
    
    data = pd.DataFrame(np.random.randn(C * N).reshape(C, -1))
    
    
    # Each task computes one full row of the correlation matrix; the
    # DataFrame `data` is shipped to the executors inside the task closure.
    def pearsonr(n):
        x = data.iloc[:, n]
        res = [st.pearsonr(x, data.iloc[:, i])[0] for i in range(data.shape[1])]
        return res
    
    
    def spearmanr(n):
        x = data.iloc[:, n]
        res = [st.spearmanr(x, data.iloc[:, i])[0] for i in range(data.shape[1])]
        return res
    
    
    def kendalltau(n):
        x = data.iloc[:, n]
        res = [st.kendalltau(x, data.iloc[:, i])[0] for i in range(data.shape[1])]
        return res
    
    
    start = time.time()
    # Uncomment exactly one line below to select the algorithm under test.
    res = sc.parallelize(np.arange(N)).map(lambda x: pearsonr(x)).collect()
    # res = sc.parallelize(np.arange(N)).map(lambda x: spearmanr(x)).collect()
    # res = sc.parallelize(np.arange(N)).map(lambda x: kendalltau(x)).collect()
    end = time.time()

    print("corr rows {} total cost : {}".format(len(res), end - start))
    
    
    # Pure-Python baseline over all column pairs
    s = time.time()
    res = [st.spearmanr(data.iloc[:, i], data.iloc[:, j])[0] for i in range(N) for j in range(N)]
    end = time.time()
    print(end - s)

    # Spark version of the same pairwise loop: each task receives an index
    # pair and looks up the two columns to correlate.
    pairs = [(i, j) for i in range(N) for j in range(N)]

    start = time.time()
    dd = sc.parallelize(pairs).map(lambda x: st.spearmanr(data.iloc[:, x[0]], data.iloc[:, x[1]])).collect()
    end = time.time()
    print(end - start)

    start = time.time()
    dd = sc.parallelize(pairs).map(lambda x: st.kendalltau(data.iloc[:, x[0]], data.iloc[:, x[1]])).collect()
    end = time.time()
    print(end - start)
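
    One caveat with the script above: `data` travels to the executors inside every task closure. Spark's broadcast variables ship it to each worker once instead; a sketch of the same row-wise pearsonr task using that pattern (pearsonr_row is an illustrative name, not from the original script):

    bdata = sc.broadcast(data)  # send the DataFrame to each worker once

    def pearsonr_row(n):
        df = bdata.value
        x = df.iloc[:, n]
        return [st.pearsonr(x, df.iloc[:, i])[0] for i in range(df.shape[1])]

    res = sc.parallelize(range(N)).map(pearsonr_row).collect()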
    

    Spark MLlib script

    from pyspark import SparkContext
    sc = SparkContext()
    from pyspark.mllib.stat import Statistics
    import time
    import numpy as np
    
    L = 1000   # rows (observations)
    N = 3600   # columns (variables)
    t = [np.random.randn(N) for i in range(L)]

    # RDD of row vectors; Statistics.corr computes the full N x N matrix.
    # Supported methods are "pearson" and "spearman" (no Kendall in MLlib).
    data = sc.parallelize(t)

    start = time.time()
    res = Statistics.corr(data, method="pearson")
    end = time.time()
    print("pearson : ", end - start)


    start = time.time()
    res = Statistics.corr(data, method="spearman")
    end = time.time()
    print("spearman: ", end - start)
    
    Original post: https://www.cnblogs.com/StitchSun/p/13225260.html