gbdt和xgboost中feature importance的获取

zoukankan html css js c++ java

gbdt和xgboost中feature importance的获取
来源于stack overflow,其实就是计算每个特征对于降低特征不纯度的贡献了多少，降低越多的，说明feature越重要

I'll use the sklearn code, as it is generally much cleaner than the R code.

Here's the implementation of the feature_importances property of the GradientBoostingClassifier (I removed some lines of code that get in the way of the conceptual stuff)
```
def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for stage in self.estimators_:
        stage_sum = sum(tree.feature_importances_
                        for tree in stage) / len(stage)
        total_sum += stage_sum

    importances = total_sum / len(self.estimators_)
    return importances
```
This is pretty easy to understand. self.estimators_ is an array containing the individual trees in the booster, so the for loop is iterating over the individual trees. There's one hickup with the
```
stage_sum = sum(tree.feature_importances_
                for tree in stage) / len(stage)
```
this is taking care of the non-binary response case. Here we fit multiple trees in each stage in a one-vs-all way. Its simplest conceptually to focus on the binary case, where the sum has one summand, and this is just tree.feature_importances_. So in the binary case, we can rewrite this all as
```
def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for tree in self.estimators_:
        total_sum += tree.feature_importances_ 
    importances = total_sum / len(self.estimators_)
    return importances
```
So, in words, sum up the feature importances of the individual trees, then divide by the total number of trees. It remains to see how to calculate the feature importances for a single tree.

The importance calculation of a tree is implemented at the cython level, but it's still followable. Here's a cleaned up version of the code
```
cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""

    while node != end_node:
        if node.left_child != _TREE_LEAF:
            # ... and node.right_child != _TREE_LEAF:
            left = &nodes[node.left_child]
            right = &nodes[node.right_child]

            importance_data[node.feature] += (
                node.weighted_n_node_samples * node.impurity -
                left.weighted_n_node_samples * left.impurity -
                right.weighted_n_node_samples * right.impurity)
        node += 1

    importances /= nodes[0].weighted_n_node_samples

    return importances
```
This is pretty simple. Iterate through the nodes of the tree. As long as you are not at a leaf node, calculate the weighted reduction in node purity from the split at this node, and attribute it to the feature that was split on
```
importance_data[node.feature] += (
    node.weighted_n_node_samples * node.impurity -
    left.weighted_n_node_samples * left.impurity -
    right.weighted_n_node_samples * right.impurity)
```
Then, when done, divide it all by the total weight of the data (in most cases, the number of observations)
```
importances /= nodes[0].weighted_n_node_samples
```
It's worth recalling that the impurity is a common metric to use when determining what split to make when growing a tree. In that light, we are simply summing up how much splitting on each feature allowed us to reduce the impurity across all the splits in the tree.
查看全文

相关阅读:
点燃圣火！ Ember.js 的初学者指南
 加班10天的过程
 创建SVN仓库的步骤
 OpenTSDB/HBase的调优过程整理
 HBase数据压缩算法编码探索
 解决修改css或js文件后，浏览器缓存未更新问题
 Debian 9 Stretch国内常用镜像源
 ELK填坑总结和优化过程
 elk中es集群web管理工具cerebro
Kafka集群管理工具kafka-manager的安装使用

原文地址：https://www.cnblogs.com/wuxiangli/p/6756577.html