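xgboost is an improved algorithm built on the GBDT principle. It is efficient and supports parallel computation.
During training it can also produce a score for each feature, showing how important that feature is to the model.
I will not go through how the source is called; this post focuses on how the scores are computed. The source of get_fscore is shown below.
The source comes from the installation package: xgboost/python-package/xgboost/core.py
As the source shows, the score returned by get_fscore (importance type 'weight') is the number of times a feature is used to split a node across all trees. This differs
somewhat from the formula in Section 10.13.1 of "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", so take care when comparing the two.
Note: the two take different perspectives, so the calculations differ slightly.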
def get_fscore(self, fmap=''):
    """Get feature importance of each feature.

    Parameters
    ----------
    fmap: str (optional)
       The name of feature map file
    """
    return self.get_score(fmap, importance_type='weight')

def get_score(self, fmap='', importance_type='weight'):
    """Get feature importance of each feature.
    Importance type can be defined as:
        'weight' - the number of times a feature is used to split the data across all trees.
        'gain' - the average gain of the feature when it is used in trees
        'cover' - the average coverage of the feature when it is used in trees

    Parameters
    ----------
    fmap: str (optional)
       The name of feature map file
    """
    if importance_type not in ['weight', 'gain', 'cover']:
        msg = "importance_type mismatch, got '{}', expected 'weight', 'gain', or 'cover'"
        raise ValueError(msg.format(importance_type))

    # if it's weight, then omap stores the number of missing values
    if importance_type == 'weight':
        # do a simpler tree dump to save time
        trees = self.get_dump(fmap, with_stats=False)

        fmap = {}
        for tree in trees:
            for line in tree.split('\n'):
                # look for the opening square bracket
                arr = line.split('[')
                # if no opening bracket (leaf node), ignore this line
                if len(arr) == 1:
                    continue

                # extract feature name from string between []
                fid = arr[1].split(']')[0].split('<')[0]

                if fid not in fmap:
                    # if the feature hasn't been seen yet
                    fmap[fid] = 1
                else:
                    fmap[fid] += 1

        return fmap

    else:
        trees = self.get_dump(fmap, with_stats=True)

        importance_type += '='
        fmap = {}
        gmap = {}
        for tree in trees:
            for line in tree.split('\n'):
                # look for the opening square bracket
                arr = line.split('[')
                # if no opening bracket (leaf node), ignore this line
                if len(arr) == 1:
                    continue

                # look for the closing bracket, extract only info within that bracket
                fid = arr[1].split(']')

                # extract gain or cover from string after closing bracket
                g = float(fid[1].split(importance_type)[1].split(',')[0])

                # extract feature name from string before closing bracket
                fid = fid[0].split('<')[0]

                if fid not in fmap:
                    # if the feature hasn't been seen yet
                    fmap[fid] = 1
                    gmap[fid] = g
                else:
                    fmap[fid] += 1
                    gmap[fid] += g

        # calculate average value (gain/cover) for each feature
        for fid in gmap:
            gmap[fid] = gmap[fid] / fmap[fid]

        return gmap
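To make the counting logic easier to follow, here is a minimal standalone sketch of the same parsing idea that runs without xgboost installed. The dump strings in SAMPLE_DUMPS are made up by hand in the style of a text tree dump with statistics, and the helper name split_stats is hypothetical; neither comes from the library.

```python
from collections import Counter

# Hypothetical text dumps of two small trees, written in the style of a
# tree dump with statistics. Feature names and numbers are invented.
SAMPLE_DUMPS = [
    "0:[f0<0.5] yes=1,no=2,missing=1,gain=10.0,cover=100\n"
    "\t1:[f1<1.5] yes=3,no=4,missing=3,gain=4.0,cover=60\n"
    "\t\t3:leaf=0.1,cover=30\n"
    "\t\t4:leaf=-0.2,cover=30\n"
    "\t2:leaf=0.3,cover=40\n",
    "0:[f0<0.7] yes=1,no=2,missing=1,gain=6.0,cover=100\n"
    "\t1:leaf=0.05,cover=55\n"
    "\t2:leaf=-0.05,cover=45\n",
]


def split_stats(dumps, stat="gain"):
    """Return (split count, mean of `stat`) per feature over all split nodes."""
    counts = Counter()
    totals = Counter()
    for tree in dumps:
        for line in tree.split("\n"):
            parts = line.split("[")
            if len(parts) == 1:
                # No opening bracket: a leaf node or an empty line.
                continue
            inside, after = parts[1].split("]")
            feature = inside.split("<")[0]
            value = float(after.split(stat + "=")[1].split(",")[0])
            counts[feature] += 1
            totals[feature] += value
    means = {f: totals[f] / counts[f] for f in counts}
    return dict(counts), means


weight, avg_gain = split_stats(SAMPLE_DUMPS, "gain")
print(weight)    # split counts, the analogue of importance_type='weight'
print(avg_gain)  # mean gain per split, the analogue of importance_type='gain'
```

In this toy data f0 is split on twice and f1 once, so f0 gets the higher 'weight' score regardless of how much each split actually improved the fit. That is the sense in which the count-based score and a gain-based score can rank features differently.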
How GBDT feature scores are computed:
Link: 1. http://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/
For a detailed walkthrough with code, follow the link above through to this one:
http://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting
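For comparison, the relative importance measure from Section 10.13.1 of the book can be written as below. This is my recollection of the book's notation, so check it against the text before relying on it. The last line is my own restatement of the xgboost 'weight' score in the same symbols and does not come from either source.

```latex
% Importance of feature \ell in a single tree T with J - 1 internal nodes.
% \hat{\imath}_t^2 is the improvement in squared error from the split at node t,
% and v(t) is the feature that node t splits on.
\mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^2 \, I\bigl(v(t) = \ell\bigr)

% Averaged over the M trees of the boosted model:
\mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m)

% The 'weight' score in the same notation (a restatement, not from the book):
% every split counts as 1, whatever improvement it produced.
\mathrm{weight}_\ell = \sum_{m=1}^{M} \sum_{t=1}^{J_m - 1} I\bigl(v_m(t) = \ell\bigr)
```

Written this way, the book weights each split by the improvement it brings, while the 'weight' score only counts splits. The 'gain' importance type is closer in spirit to the book's measure, although it averages over a feature's splits rather than summing them.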