  • Obtaining feature importance in GBDT and XGBoost

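    Taken from Stack Overflow. The idea is to compute how much each feature contributes to reducing node impurity; the more a feature reduces it, the more important that feature is.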

    I'll use the sklearn code, as it is generally much cleaner than the R code.

    Here's the implementation of the feature_importances_ property of the GradientBoostingClassifier (I removed some lines of code that get in the way of the conceptual stuff):

    def feature_importances_(self):
        total_sum = np.zeros((self.n_features, ), dtype=np.float64)
        for stage in self.estimators_:
            stage_sum = sum(tree.feature_importances_
                            for tree in stage) / len(stage)
            total_sum += stage_sum
    
        importances = total_sum / len(self.estimators_)
        return importances
    

    This is pretty easy to understand. self.estimators_ is an array containing the boosting stages (in the binary case each stage holds a single tree), so the for loop is effectively iterating over the individual trees in the booster. There's one hiccup, in the expression:

    stage_sum = sum(tree.feature_importances_
                    for tree in stage) / len(stage)
    

    This takes care of the non-binary response case, where multiple trees are fit in each stage in a one-vs-all way. It's simplest conceptually to focus on the binary case, where the sum has one summand, and this is just tree.feature_importances_. So in the binary case, we can rewrite this all as:

    def feature_importances_(self):
        total_sum = np.zeros((self.n_features, ), dtype=np.float64)
        for tree in self.estimators_:
            total_sum += tree.feature_importances_ 
        importances = total_sum / len(self.estimators_)
        return importances
    

    So, in words, sum up the feature importances of the individual trees, then divide by the total number of trees. It remains to see how to calculate the feature importances for a single tree.
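    Before turning to a single tree, the averaging step can be sketched in plain Python. This is only an illustration, not the sklearn code: the function name average_importances and the hand-written importance vectors are hypothetical, and each "tree" is represented directly by its importance vector.

```python
def average_importances(stages):
    """Average per-tree importance vectors within each stage, then across stages.

    `stages` is a list of stages; each stage is a list of per-tree
    importance vectors (hypothetical stand-ins for tree.feature_importances_).
    """
    n_features = len(stages[0][0])
    total_sum = [0.0] * n_features
    for stage in stages:
        for j in range(n_features):
            # Mean over the trees of this stage (one tree in the binary case).
            stage_mean = sum(tree[j] for tree in stage) / len(stage)
            total_sum[j] += stage_mean
    # Divide by the number of stages, as in the snippet above.
    return [value / len(stages) for value in total_sum]


# Binary case: one tree per stage, two stages, three features.
binary_stages = [[[0.5, 0.5, 0.0]], [[0.25, 0.25, 0.5]]]

# Non-binary case: two trees per stage (toy numbers), two features.
multiclass_stages = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]

print(average_importances(binary_stages))
print(average_importances(multiclass_stages))
```

    In the binary example each stage mean is just the single tree's vector, so the result is the plain average over trees.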

    The importance calculation of a tree is implemented at the Cython level, but it's still followable. Here's a cleaned up version of the code:

    cpdef compute_feature_importances(self, normalize=True):
        """Computes the importance of each feature (aka variable)."""
    
        while node != end_node:
            if node.left_child != _TREE_LEAF:
                # ... and node.right_child != _TREE_LEAF:
                left = &nodes[node.left_child]
                right = &nodes[node.right_child]
    
                importance_data[node.feature] += (
                    node.weighted_n_node_samples * node.impurity -
                    left.weighted_n_node_samples * left.impurity -
                    right.weighted_n_node_samples * right.impurity)
            node += 1
    
        importances /= nodes[0].weighted_n_node_samples
    
        return importances
    

    This is pretty simple. Iterate through the nodes of the tree. As long as you are not at a leaf node, calculate the weighted reduction in node impurity from the split at this node, and attribute it to the feature that was split on:

    importance_data[node.feature] += (
        node.weighted_n_node_samples * node.impurity -
        left.weighted_n_node_samples * left.impurity -
        right.weighted_n_node_samples * right.impurity)
    

    Then, when done, divide it all by the total weight of the data (in most cases, the number of observations):

    importances /= nodes[0].weighted_n_node_samples
    

    It's worth recalling that the impurity is a common metric to use when determining what split to make when growing a tree. In that light, we are simply summing up how much splitting on each feature allowed us to reduce the impurity across all the splits in the tree.
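    To make the per-tree calculation concrete, here is a minimal pure-Python sketch of the same loop on a toy tree. The node layout (a list of dicts with the keys feature, left, right, weight and impurity), the function name and all numbers are assumptions made for illustration; they mirror the fields used in the Cython snippet but are not the sklearn data structures.

```python
TREE_LEAF = -1  # sentinel meaning "no child", playing the role of _TREE_LEAF


def tree_feature_importances(nodes, n_features):
    """Sum the weighted impurity decrease of every split, per feature."""
    importances = [0.0] * n_features
    for node in nodes:
        if node["left"] == TREE_LEAF:
            continue  # leaf node: no split, nothing to attribute
        left = nodes[node["left"]]
        right = nodes[node["right"]]
        importances[node["feature"]] += (
            node["weight"] * node["impurity"]
            - left["weight"] * left["impurity"]
            - right["weight"] * right["impurity"]
        )
    # Divide by the total weight at the root (here, the number of samples).
    total_weight = nodes[0]["weight"]
    return [value / total_weight for value in importances]


# Toy tree: the root splits on feature 0, its left child splits on feature 1.
toy_nodes = [
    {"feature": 0, "left": 1, "right": 2, "weight": 10.0, "impurity": 0.5},
    {"feature": 1, "left": 3, "right": 4, "weight": 6.0, "impurity": 0.25},
    {"feature": -1, "left": TREE_LEAF, "right": TREE_LEAF, "weight": 4.0, "impurity": 0.0},
    {"feature": -1, "left": TREE_LEAF, "right": TREE_LEAF, "weight": 3.0, "impurity": 0.0},
    {"feature": -1, "left": TREE_LEAF, "right": TREE_LEAF, "weight": 3.0, "impurity": 0.0},
]

print(tree_feature_importances(toy_nodes, n_features=2))
```

    For this toy tree the root split contributes 10 * 0.5 - 6 * 0.25 - 4 * 0 = 3.5 to feature 0 and the second split contributes 6 * 0.25 = 1.5 to feature 1, giving 0.35 and 0.15 after dividing by the root weight of 10.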

  • Original post: https://www.cnblogs.com/wuxiangli/p/6756577.html