  • How XGBoost computes feature scores

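      XGBoost is an improved algorithm built on the GBDT principle. It is efficient, supports parallel computation, and can report a score for each feature during training, which indicates how important each feature is to the trained model.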

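    This post does not walk through the calling code in detail; it focuses on how the scores are computed. The source of the function get_fscore is shown below, taken from the installed package: xgboost/python-package/xgboost/core.py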

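      As the source below shows, get_fscore returns the 'weight' score, so the feature score can be read as the number of times a feature is used to split a node across all trees.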

        def get_fscore(self, fmap=''):
            """Get feature importance of each feature.
    
            Parameters
            ----------
            fmap: str (optional)
               The name of feature map file
            """
    
            return self.get_score(fmap, importance_type='weight')
    
        def get_score(self, fmap='', importance_type='weight'):
            """Get feature importance of each feature.
            Importance type can be defined as:
                'weight' - the number of times a feature is used to split the data across all trees.
                'gain' - the average gain of the feature when it is used in trees
                'cover' - the average coverage of the feature when it is used in trees
    
            Parameters
            ----------
            fmap: str (optional)
               The name of feature map file
            """
    
            if importance_type not in ['weight', 'gain', 'cover']:
                msg = "importance_type mismatch, got '{}', expected 'weight', 'gain', or 'cover'"
                raise ValueError(msg.format(importance_type))
    
            # if it's weight, then omap stores the number of missing values
            if importance_type == 'weight':
                # do a simpler tree dump to save time
                trees = self.get_dump(fmap, with_stats=False)
    
                fmap = {}
                for tree in trees:
                    for line in tree.split('\n'):
                        # look for the opening square bracket
                        arr = line.split('[')
                        # if no opening bracket (leaf node), ignore this line
                        if len(arr) == 1:
                            continue
    
                        # extract feature name from string between []
                        fid = arr[1].split(']')[0].split('<')[0]
    
                        if fid not in fmap:
                            # if the feature hasn't been seen yet
                            fmap[fid] = 1
                        else:
                            fmap[fid] += 1
    
                return fmap
    
            else:
                trees = self.get_dump(fmap, with_stats=True)
    
                importance_type += '='
                fmap = {}
                gmap = {}
                for tree in trees:
                    for line in tree.split('\n'):
                        # look for the opening square bracket
                        arr = line.split('[')
                        # if no opening bracket (leaf node), ignore this line
                        if len(arr) == 1:
                            continue
    
                        # look for the closing bracket, extract only info within that bracket
                        fid = arr[1].split(']')
    
                        # extract gain or cover from string after closing bracket
                        g = float(fid[1].split(importance_type)[1].split(',')[0])
    
                        # extract feature name from string before closing bracket
                        fid = fid[0].split('<')[0]
    
                        if fid not in fmap:
                            # if the feature hasn't been seen yet
                            fmap[fid] = 1
                            gmap[fid] = g
                        else:
                            fmap[fid] += 1
                            gmap[fid] += g
    
                # calculate average value (gain/cover) for each feature
                for fid in gmap:
                    gmap[fid] = gmap[fid] / fmap[fid]
    
                return gmap
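
To make the 'weight' branch concrete, here is a minimal sketch that applies the same parsing steps to a small tree dump. The dump strings are hand-written to imitate the text format returned by get_dump(with_stats=False); they are an assumption for illustration, not output from a real model, and count_splits is a hypothetical helper name.

```python
from collections import Counter

# Hand-written sample imitating get_dump(with_stats=False) output.
# Assumption: split nodes look like "id:[feature<threshold] yes=..,no=..,missing=.."
# and leaf nodes look like "id:leaf=value".
SAMPLE_TREES = [
    "0:[f0<0.5] yes=1,no=2,missing=1\n"
    "\t1:[f1<1.5] yes=3,no=4,missing=3\n"
    "\t\t3:leaf=0.1\n"
    "\t\t4:leaf=-0.2\n"
    "\t2:leaf=0.3\n",
    "0:[f0<0.7] yes=1,no=2,missing=1\n"
    "\t1:leaf=0.05\n"
    "\t2:leaf=-0.05\n",
]


def count_splits(trees):
    """Count how many split nodes use each feature (hypothetical helper)."""
    counts = Counter()
    for tree in trees:
        for line in tree.split("\n"):
            parts = line.split("[")
            # Leaf nodes and empty lines contain no '[', so they are skipped.
            if len(parts) == 1:
                continue
            # The feature name is the text between '[' and '<'.
            fid = parts[1].split("]")[0].split("<")[0]
            counts[fid] += 1
    return dict(counts)


print(count_splits(SAMPLE_TREES))
```

In this sample f0 is used at the root of both trees and f1 is used once, so the 'weight' score of f0 is 2 and that of f1 is 1.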
    
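
The 'gain' and 'cover' branch works differently: it sums the statistic over every split that uses a feature and then divides by the number of such splits. The sketch below reproduces that averaging on a hand-written dump that imitates get_dump(with_stats=True); the gain and cover numbers are made up for illustration, and average_stat is a hypothetical helper name.

```python
# Hand-written sample imitating get_dump(with_stats=True) output.
# Assumption: split nodes end with ",gain=<float>,cover=<float>".
SAMPLE_TREES_WITH_STATS = [
    "0:[f0<0.5] yes=1,no=2,missing=1,gain=10.0,cover=8.0\n"
    "\t1:[f1<1.5] yes=3,no=4,missing=3,gain=4.0,cover=5.0\n"
    "\t\t3:leaf=0.1,cover=2.0\n"
    "\t\t4:leaf=-0.2,cover=3.0\n"
    "\t2:leaf=0.3,cover=3.0\n",
    "0:[f0<0.7] yes=1,no=2,missing=1,gain=6.0,cover=8.0\n"
    "\t1:leaf=0.05,cover=4.0\n"
    "\t2:leaf=-0.05,cover=4.0\n",
]


def average_stat(trees, stat):
    """Average 'gain' or 'cover' per feature over all split nodes (hypothetical helper)."""
    key = stat + "="
    totals = {}
    counts = {}
    for tree in trees:
        for line in tree.split("\n"):
            parts = line.split("[")
            # Leaf nodes and empty lines contain no '[', so they are skipped.
            if len(parts) == 1:
                continue
            inside, after = parts[1].split("]", 1)
            # The statistic is the number following "gain=" or "cover=".
            value = float(after.split(key)[1].split(",")[0])
            fid = inside.split("<")[0]
            totals[fid] = totals.get(fid, 0.0) + value
            counts[fid] = counts.get(fid, 0) + 1
    # Divide the summed statistic by the number of splits using the feature.
    return {fid: totals[fid] / counts[fid] for fid in totals}


print(average_stat(SAMPLE_TREES_WITH_STATS, "gain"))
print(average_stat(SAMPLE_TREES_WITH_STATS, "cover"))
```

Here f0 appears in two splits with gains 10.0 and 6.0, so its average gain is 8.0, while f1 appears once and keeps its single gain of 4.0. A feature that is split on often but with small gains can therefore rank high by 'weight' and low by 'gain'.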

      

  • Original post: https://www.cnblogs.com/lvpengbo/p/8822288.html