  • How the Gini Coefficient Works

    Reposted from: https://blog.csdn.net/u010665216/article/details/78528261

    First, we construct a competition result directly, consisting of the actual values and the predicted values:

    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.integrate
    import scipy.interpolate

    predictions = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
    actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

    We pair each actual value with its prediction and sort the pairs by predicted value in ascending order:

    data = zip(actual, predictions)
    sorted_data = sorted(data, key=lambda d: d[1])
    sorted_actual = [d[0] for d in sorted_data]
    print('Sorted Actual Values', sorted_actual)
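
    The same ordering can be sketched with NumPy. This is only an illustrative alternative, not the original author's code; the names `order` and `sorted_actual_np` are introduced here, and a stable sort is assumed so that tied predictions keep their original order, as they do with Python's built-in `sorted`.

```python
import numpy as np

predictions = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Indices that sort the predictions in ascending order.
# kind="stable" keeps tied predictions in their original order.
order = np.argsort(predictions, kind="stable")

# Reorder the actual values by those indices.
sorted_actual_np = np.asarray(actual)[order]
print('Sorted Actual Values', sorted_actual_np.tolist())
# Sorted Actual Values [0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1]
```

    Most of the 1s land near the end of the list, which is what a useful model should produce under an ascending sort.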

    We take the cumulative sum of the sorted actual values:

    cumulative_actual = np.cumsum(sorted_actual)
    cumulative_index = np.arange(1, len(cumulative_actual)+1)
    
    plt.plot(cumulative_index, cumulative_actual)
    plt.xlabel('Cumulative Number of Predictions')
    plt.ylabel('Cumulative Actual Values')
    plt.show()

    We normalize both axes to the range [0, 1] and draw the 45-degree line:

    cumulative_actual_shares = cumulative_actual / sum(actual)
    cumulative_index_shares = cumulative_index / len(predictions)
    
    # Add (0, 0) to the plot
    x_values = [0] + list(cumulative_index_shares)
    y_values = [0] + list(cumulative_actual_shares)
    
    # Gap between the 45° line and the curve; stacked on top of the y values,
    # its upper edge traces the 45° line
    diagonal = [x - y for (x, y) in zip(x_values, y_values)]
    
    plt.stackplot(x_values, y_values, diagonal)
    plt.xlabel('Cumulative Share of Predictions')
    plt.ylabel('Cumulative Share of Actual Values')
    plt.show()

    Compute the area of the orange region, which is the area between the 45-degree line and the curve:

    fy = scipy.interpolate.interp1d(x_values, y_values)
    blue_area, _ = scipy.integrate.quad(fy, 0, 1, points=x_values)
    orange_area = 0.5 - blue_area
    print('Orange Area: %.3f' % orange_area)
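
    Because the curve is piecewise linear, the area can also be cross-checked with the trapezoidal rule, without interpolation or numerical quadrature. The sketch below is an illustrative check rather than part of the original post; the names `blue_area_trapz` and `orange_area_trapz` are introduced here, and the sorted actual values are hard-coded from the sorting step.

```python
import numpy as np

# Actual values after sorting by ascending prediction.
sorted_actual = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1]

n = len(sorted_actual)
# Normalized x and y coordinates of the curve, starting at (0, 0).
x = np.arange(0, n + 1) / n
y = np.concatenate(([0.0], np.cumsum(sorted_actual) / sum(sorted_actual)))

# Trapezoidal rule; exact for a piecewise-linear curve.
blue_area_trapz = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2)
orange_area_trapz = 0.5 - blue_area_trapz
print('Orange Area: %.3f' % orange_area_trapz)
# Orange Area: 0.189
```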

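    The maximum possible Gini coefficient:

    Above, we ordered the actual values by the predicted values and obtained one Gini coefficient. Now we order the actual values by the actual values themselves, which corresponds to a perfect model and yields the maximum possible Gini coefficient: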

    cumulative_actual_shares_perfect = np.cumsum(sorted(actual)) / sum(actual)
    y_values_perfect = [0] + list(cumulative_actual_shares_perfect)
    
    # Gap between the 45° line and the perfect curve; stacked on top of the
    # y values, its upper edge traces the 45° line
    diagonal = [x - y for (x, y) in zip(x_values, y_values_perfect)]
    
    plt.stackplot(x_values, y_values_perfect, diagonal)
    plt.xlabel('Cumulative Share of Predictions')
    plt.ylabel('Cumulative Share of Actual Values')
    plt.show()
    
    # Integrate the curve function
    fy = scipy.interpolate.interp1d(x_values, y_values_perfect)
    blue_area, _ = scipy.integrate.quad(fy, 0, 1, points=x_values)
    orange_area = 0.5 - blue_area
    print('Orange Area: %.3f' % orange_area)
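
    For labels that are only 0 or 1, the perfect curve stays at zero until x = 1 - p and then rises linearly to 1, where p is the share of 1s. The blue area is therefore p / 2 and the maximum orange area is (1 - p) / 2. The sketch below checks this closed form against a numeric area; it assumes binary labels, and the names `positive_share`, `max_orange_numeric` and `max_orange_closed_form` are introduced here for illustration.

```python
import numpy as np

actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

n = len(actual)
# Share of 1s among the labels (assumes labels are only 0 or 1).
positive_share = sum(actual) / n

# Curve for a perfect ordering: all 0s first, then all 1s.
x = np.arange(0, n + 1) / n
y = np.concatenate(([0.0], np.cumsum(sorted(actual)) / sum(actual)))

# Numeric area via the trapezoidal rule (exact for piecewise-linear curves).
blue_area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2)
max_orange_numeric = 0.5 - blue_area

# Closed form for binary labels.
max_orange_closed_form = (1 - positive_share) / 2

print('Numeric: %.3f, Closed form: %.3f' % (max_orange_numeric, max_orange_closed_form))
# Numeric: 0.300, Closed form: 0.300
```

    The maximum is below 0.5 whenever the data contain any 1s, which is why the raw Gini value is usually divided by this maximum before models are compared.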

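    An implementation of the Gini scoring metric as used in data mining. Note that it sorts by descending prediction, the opposite order from the plots above, and still returns the same value: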

    def gini(actual, pred):
        assert len(actual) == len(pred)
        # Columns: actual value, prediction, original index (used to break ties)
        arr = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
        # Sort by descending prediction, then by ascending original index
        arr = arr[np.lexsort((arr[:, 2], -1 * arr[:, 1]))]
        totalLosses = arr[:, 0].sum()
        giniSum = arr[:, 0].cumsum().sum() / totalLosses
    
        giniSum -= (len(actual) + 1) / 2.
        return giniSum / len(actual)
    
    
    def gini_normalized(actual, pred):
        return gini(actual, pred) / gini(actual, actual)
    
    
    gini_predictions = gini(actual, predictions)
    gini_max = gini(actual, actual)
    ngini = gini_normalized(actual, predictions)
    print('Gini: %.3f, Max. Gini: %.3f, Normalized Gini: %.3f' % (gini_predictions, gini_max, ngini))
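
    As a usage sketch, the normalized score can be checked at its two extremes: a perfect ordering should score 1 and a fully reversed ordering should score -1. The block below restates the metric compactly so that it runs on its own; `reversed_pred`, `perfect_score` and `worst_score` are hypothetical names introduced for illustration, and the compact `gini` is meant to mirror the scoring metric above rather than replace it.

```python
import numpy as np

def gini(actual, pred):
    # Compact restatement of the scoring metric.
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # Sort by descending prediction; ties keep their original order.
    order = np.lexsort((np.arange(len(actual)), -pred))
    cumulative = actual[order].cumsum().sum() / actual.sum()
    return (cumulative - (len(actual) + 1) / 2.0) / len(actual)

def gini_normalized(actual, pred):
    return gini(actual, pred) / gini(actual, actual)

actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical worst case: every 1 gets score 0 and every 0 gets score 1.
reversed_pred = [1 - a for a in actual]

perfect_score = gini_normalized(actual, actual)
worst_score = gini_normalized(actual, reversed_pred)
print('Perfect: %.3f, Reversed: %.3f' % (perfect_score, worst_score))
# Perfect: 1.000, Reversed: -1.000
```

    For the example predictions used throughout this post, the same functions give a Gini of about 0.189, a maximum of 0.300, and a normalized Gini of about 0.630.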
  • Original article: https://www.cnblogs.com/wzdLY/p/9821791.html