  • [Original] How xgboost computes feature scores

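    xgboost is an improved algorithm built on the GBDT principle. It is efficient and supports parallel computation.

    During training it can also produce a score for each feature, indicating how important that feature is to the trained model.

    This post does not walk through the calling code in detail; it focuses on how the scores are computed. The source of the function get_fscore is shown below.

    Source: the installation package, xgboost/python-package/xgboost/core.py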

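    The source below shows that the score returned by get_fscore (importance_type='weight') is the number of times a feature is used to split a node across all the decision trees. This differs somewhat from the formula in section 10.13.1 of The Elements of Statistical Learning: Data Mining, Inference, and Prediction, which is worth keeping in mind.

    Note: the two approaches look at importance from different angles, so the computations differ slightly.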
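For comparison, the two measures can be written side by side. This is a sketch from memory of section 10.13.1, not a quotation: for a single tree T with J-1 internal nodes, v(t) is the variable split on at node t and \hat{\imath}_t^2 is the estimated improvement in squared error from that split; M is the number of trees. The symbol w_\ell for the split count is notation introduced here for illustration.

```latex
% Relative importance of variable \ell (squared-error improvement, averaged over trees)
\mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^{\,2}\, I\bigl(v(t) = \ell\bigr),
\qquad
\mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m)

% Split count ('weight'): every split on \ell counts as 1, whatever its improvement
w_\ell = \sum_{m=1}^{M} \sum_{t=1}^{J_m-1} I\bigl(v_m(t) = \ell\bigr)
```

The only difference is the term being summed: the first measure weights each split by how much it improves the fit, while the split count treats all splits equally.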

        def get_fscore(self, fmap=''):
            """Get feature importance of each feature.
    
            Parameters
            ----------
            fmap: str (optional)
               The name of feature map file
            """
    
            return self.get_score(fmap, importance_type='weight')
    
        def get_score(self, fmap='', importance_type='weight'):
            """Get feature importance of each feature.
            Importance type can be defined as:
                'weight' - the number of times a feature is used to split the data across all trees.
                'gain' - the average gain of the feature when it is used in trees
                'cover' - the average coverage of the feature when it is used in trees
    
            Parameters
            ----------
            fmap: str (optional)
               The name of feature map file
            """
    
            if importance_type not in ['weight', 'gain', 'cover']:
                msg = "importance_type mismatch, got '{}', expected 'weight', 'gain', or 'cover'"
                raise ValueError(msg.format(importance_type))
    
            # if it's weight, then omap stores the number of missing values
            if importance_type == 'weight':
                # do a simpler tree dump to save time
                trees = self.get_dump(fmap, with_stats=False)
    
                fmap = {}
                for tree in trees:
                    for line in tree.split('\n'):
                        # look for the opening square bracket
                        arr = line.split('[')
                        # if no opening bracket (leaf node), ignore this line
                        if len(arr) == 1:
                            continue
    
                        # extract feature name from string between []
                        fid = arr[1].split(']')[0].split('<')[0]
    
                        if fid not in fmap:
                            # if the feature hasn't been seen yet
                            fmap[fid] = 1
                        else:
                            fmap[fid] += 1
    
                return fmap
    
            else:
                trees = self.get_dump(fmap, with_stats=True)
    
                importance_type += '='
                fmap = {}
                gmap = {}
                for tree in trees:
                    for line in tree.split('\n'):
                        # look for the opening square bracket
                        arr = line.split('[')
                        # if no opening bracket (leaf node), ignore this line
                        if len(arr) == 1:
                            continue
    
                        # look for the closing bracket, extract only info within that bracket
                        fid = arr[1].split(']')
    
                        # extract gain or cover from string after closing bracket
                        g = float(fid[1].split(importance_type)[1].split(',')[0])
    
                        # extract feature name from string before closing bracket
                        fid = fid[0].split('<')[0]
    
                        if fid not in fmap:
                            # if the feature hasn't been seen yet
                            fmap[fid] = 1
                            gmap[fid] = g
                        else:
                            fmap[fid] += 1
                            gmap[fid] += g
    
                # calculate average value (gain/cover) for each feature
                for fid in gmap:
                    gmap[fid] = gmap[fid] / fmap[fid]
    
                return gmap
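
To see what the parsing above does without training a model, the same logic can be run on plain strings. The following is a minimal sketch, not the library's code: SAMPLE_TREES is a hand-written imitation of the text dump layout, with made-up feature names and numbers, and parse_scores is a hypothetical helper that mirrors the steps in get_score.

```python
from collections import defaultdict

# Hand-written sample dumps imitating the text layout that get_score parses.
# Feature names and all numbers are invented for illustration.
SAMPLE_TREES = [
    "0:[f0<0.5] yes=1,no=2,missing=1,gain=10.0,cover=100\n"
    "\t1:leaf=0.1,cover=40\n"
    "\t2:[f1<1.5] yes=3,no=4,missing=3,gain=4.0,cover=60\n"
    "\t\t3:leaf=-0.2,cover=25\n"
    "\t\t4:leaf=0.3,cover=35\n",
    "0:[f0<0.7] yes=1,no=2,missing=1,gain=6.0,cover=100\n"
    "\t1:leaf=0.05,cover=55\n"
    "\t2:leaf=-0.1,cover=45\n",
]


def parse_scores(trees, importance_type="weight"):
    """Hypothetical helper mirroring the parsing steps of get_score."""
    counts = defaultdict(int)    # number of splits per feature
    totals = defaultdict(float)  # summed gain or cover per feature
    for tree in trees:
        for line in tree.split("\n"):
            arr = line.split("[")
            if len(arr) == 1:  # no '[': a leaf line or an empty line
                continue
            inside, after = arr[1].split("]", 1)
            fid = inside.split("<")[0]  # feature name before the threshold
            counts[fid] += 1
            if importance_type != "weight":
                key = importance_type + "="
                totals[fid] += float(after.split(key)[1].split(",")[0])
    if importance_type == "weight":
        return dict(counts)
    # gain and cover are averaged over the splits that use the feature
    return {fid: totals[fid] / counts[fid] for fid in counts}


print(parse_scores(SAMPLE_TREES))          # split counts
print(parse_scores(SAMPLE_TREES, "gain"))  # average gain per split
```

In this sample, f0 is split on twice and f1 once, so 'weight' ranks f0 first. The 'gain' and 'cover' values are averages over those splits, which is why they are not comparable to the raw counts.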
    

    How GBDT feature scores are computed:

    Link: 1. http://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/

    For a detailed walkthrough with code, follow the link above to the one below:

    http://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting
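
The gap between counting splits and the measure in section 10.13.1 of The Elements of Statistical Learning (a per-feature sum of squared-error improvements, averaged over trees) can be shown on toy data. This is a minimal sketch under stated assumptions: TOY_TREES, the feature names, and every number in it are invented, and both functions are hypothetical helpers, not part of xgboost.

```python
from collections import Counter, defaultdict

# Each toy tree is a list of (feature, squared_error_improvement) pairs,
# one pair per internal node. All values are invented for illustration.
TOY_TREES = [
    [("age", 9.0), ("income", 1.0), ("income", 0.5)],
    [("age", 7.0), ("income", 0.5)],
]


def split_counts(trees):
    """Count how many internal nodes split on each feature."""
    return dict(Counter(feat for tree in trees for feat, _ in tree))


def improvement_importance(trees):
    """Sum the improvements per feature, then average over the trees."""
    totals = defaultdict(float)
    for tree in trees:
        for feat, improvement in tree:
            totals[feat] += improvement
    return {feat: total / len(trees) for feat, total in totals.items()}


print(split_counts(TOY_TREES))
print(improvement_importance(TOY_TREES))
```

Here income is used in more splits than age, yet age accounts for most of the improvement, so the two measures rank the features in opposite order.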

  • Original article: https://www.cnblogs.com/haobang008/p/5929378.html