  • How the Gini Coefficient Works

    Reposted from: https://blog.csdn.net/u010665216/article/details/78528261

    First, we directly construct a competition-style result, consisting of actual values and predicted values:

    predictions = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
    actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

    We sort the data by predicted value in ascending order:

    data = zip(actual, predictions)
    sorted_data = sorted(data, key=lambda d: d[1])
    sorted_actual = [d[0] for d in sorted_data]
    print('Sorted Actual Values', sorted_actual)
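
    The same ordering can also be obtained with NumPy. This is only a sketch of an equivalent approach, not part of the original post; it assumes a stable sort so that tied predictions keep their original order, as Python's sorted() does:

```python
import numpy as np

predictions = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Indices that sort the predictions in ascending order.
# kind="stable" keeps tied predictions in their original order.
order = np.argsort(predictions, kind="stable")

# Reorder the actual values by those indices.
sorted_actual = np.asarray(actual)[order]
print("Sorted Actual Values", sorted_actual.tolist())
```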

    We take the cumulative sum of the sorted actual values:

    import numpy as np
    import matplotlib.pyplot as plt

    cumulative_actual = np.cumsum(sorted_actual)
    cumulative_index = np.arange(1, len(cumulative_actual)+1)
    
    plt.plot(cumulative_index, cumulative_actual)
    plt.xlabel('Cumulative Number of Predictions')
    plt.ylabel('Cumulative Actual Values')
    plt.show()

    We normalize the data to the range [0, 1] and draw the 45-degree line:

    cumulative_actual_shares = cumulative_actual / sum(actual)
    cumulative_index_shares = cumulative_index / len(predictions)
    
    # Add (0, 0) to the plot
    x_values = [0] + list(cumulative_index_shares)
    y_values = [0] + list(cumulative_actual_shares)
    
    # Display the 45° line stacked on top of the y values
    diagonal = [x - y for (x, y) in zip(x_values, y_values)]
    
    plt.stackplot(x_values, y_values, diagonal)
    plt.xlabel('Cumulative Share of Predictions')
    plt.ylabel('Cumulative Share of Actual Values')
    plt.show()

    Compute the area of the orange region (the gap between the 45-degree line and the curve):

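    import scipy.integrate
    import scipy.interpolate

    # Piecewise-linear interpolation of the cumulative-share curve
    fy = scipy.interpolate.interp1d(x_values, y_values)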
    blue_area, _ = scipy.integrate.quad(fy, 0, 1, points=x_values)
    orange_area = 0.5 - blue_area
    print('Orange Area: %.3f' % orange_area)
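
    Because the curve is piecewise linear, the same area can be cross-checked without SciPy. The following is a minimal sketch using the trapezoidal rule, which is exact for straight segments; it is an alternative to the interpolate-and-integrate approach above, not the original author's method:

```python
import numpy as np

predictions = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Actual values ordered by ascending prediction (stable sort).
sorted_actual = [a for a, _ in sorted(zip(actual, predictions), key=lambda d: d[1])]

# Cumulative shares, with the point (0, 0) prepended.
y = np.concatenate(([0.0], np.cumsum(sorted_actual) / sum(actual)))
x = np.arange(len(y)) / len(actual)

# Trapezoidal rule: width of each segment times the mean of its two heights.
blue_area = float(np.sum(np.diff(x) * (y[:-1] + y[1:]) / 2.0))
orange_area = 0.5 - blue_area
print("Orange Area: %.3f" % orange_area)  # prints "Orange Area: 0.189"
```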

    The maximum possible Gini coefficient:

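    Above, we sorted the actual values by the predicted values and obtained a Gini coefficient. Now we sort the actual values by the actual values themselves, which gives the maximum possible Gini coefficient: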

    cumulative_actual_shares_perfect = np.cumsum(sorted(actual)) / sum(actual)
    y_values_perfect = [0] + list(cumulative_actual_shares_perfect)
    
    # Display the 45° line stacked on top of the y values
    diagonal = [x - y for (x, y) in zip(x_values, y_values_perfect)]
    
    plt.stackplot(x_values, y_values_perfect, diagonal)
    plt.xlabel('Cumulative Share of Predictions')
    plt.ylabel('Cumulative Share of Actual Values')
    plt.show()
    
    # Integrate the curve function
    fy = scipy.interpolate.interp1d(x_values, y_values_perfect)
    blue_area, _ = scipy.integrate.quad(fy, 0, 1, points=x_values)
    orange_area = 0.5 - blue_area
    print('Orange Area: %.3f' % orange_area)
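
    Dividing the area obtained from the model's ordering by the area obtained from the perfect ordering gives the normalized Gini coefficient. A minimal sketch of that ratio follows; the helper name area_above_curve is hypothetical, and the trapezoidal rule is used in place of scipy.integrate.quad:

```python
import numpy as np

predictions = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]


def area_above_curve(ordered_actual):
    """Return 0.5 minus the area under the cumulative-share curve."""
    y = np.concatenate(([0.0], np.cumsum(ordered_actual) / np.sum(ordered_actual)))
    x = np.linspace(0.0, 1.0, len(y))
    # Trapezoidal rule, exact for a piecewise-linear curve.
    blue = np.sum(np.diff(x) * (y[:-1] + y[1:]) / 2.0)
    return 0.5 - float(blue)


# Ordering induced by the model's predictions (ascending, stable).
by_prediction = [a for a, _ in sorted(zip(actual, predictions), key=lambda d: d[1])]

area_model = area_above_curve(by_prediction)
area_perfect = area_above_curve(sorted(actual))
normalized = area_model / area_perfect
print("Model: %.3f, Perfect: %.3f, Ratio: %.3f" % (area_model, area_perfect, normalized))
```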

    An implementation of the scoring metric as used in data mining:

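    def gini(actual, pred):
        assert len(actual) == len(pred)
        # Columns: actual value, prediction, original index (used as tie-breaker).
        # np.float was removed in NumPy 1.24; the built-in float is equivalent.
        rows = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
        # Sort by prediction descending, breaking ties by original index.
        rows = rows[np.lexsort((rows[:, 2], -1 * rows[:, 1]))]
        totalLosses = rows[:, 0].sum()
        giniSum = rows[:, 0].cumsum().sum() / totalLosses
    
        giniSum -= (len(actual) + 1) / 2.
        return giniSum / len(actual)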
    
    
    def gini_normalized(actual, pred):
        return gini(actual, pred) / gini(actual, actual)
    
    
    gini_predictions = gini(actual, predictions)
    gini_max = gini(actual, actual)
    ngini = gini_normalized(actual, predictions)
    print('Gini: %.3f, Max. Gini: %.3f, Normalized Gini: %.3f' % (gini_predictions, gini_max, ngini))
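
    A few sanity checks help confirm how the metric behaves. The sketch below repeats the two functions (with dtype=float so they run on current NumPy) and checks three properties that follow from the sort-based definition: a perfect ordering scores 1, a fully reversed ordering scores -1, and only the ordering of the predictions matters, not their magnitudes. The test inputs reversed_pred and squared_pred are made up for illustration:

```python
import numpy as np


def gini(actual, pred):
    assert len(actual) == len(pred)
    rows = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    # Sort by prediction descending, ties broken by original index.
    rows = rows[np.lexsort((rows[:, 2], -1 * rows[:, 1]))]
    total = rows[:, 0].sum()
    gini_sum = rows[:, 0].cumsum().sum() / total
    gini_sum -= (len(actual) + 1) / 2.0
    return gini_sum / len(actual)


def gini_normalized(actual, pred):
    return gini(actual, pred) / gini(actual, actual)


predictions = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Perfect ordering: predictions equal to the actual values.
perfect = gini_normalized(actual, actual)

# Fully reversed ordering: every 0 is scored above every 1.
reversed_pred = [1 - a for a in actual]
worst = gini_normalized(actual, reversed_pred)

# A monotone transform (squaring non-negative scores) keeps the ordering.
squared_pred = [p ** 2 for p in predictions]
same = gini_normalized(actual, squared_pred)

print("Perfect: %.3f, Reversed: %.3f" % (perfect, worst))
print("Original: %.3f, Squared: %.3f" % (gini_normalized(actual, predictions), same))
```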
  • Original article: https://www.cnblogs.com/wzdLY/p/9821791.html