zoukankan html css js c++ java

决策树

决策树对实例进行分类的树形结构，由节点和有向边组成。其实很像平时画的流程图。

学习决策树之前要搞懂几个概念：

熵：表示随机变量不确定性的度量，定义：H（p）=-

信息增益：集合D的经验熵与特征A条件下D的经验条件熵H(D/A）之差（公式省略，自行查找）

信息增益比：信息增益g(D,A)与训练数据集D关于特征A的值得熵H_A（D）之比（公式省略）

基尼系数：（公式省略）

以上几个公式要牢记并学会推到。

具体计算过程：

ID3算法：寻找信息增益最大的特征

C4.5 寻找信息增益比最大的特征

另外有CART树算法，使用基尼系数来确定特征选择部分。

树的剪枝分为前剪枝和后剪枝。目的为防止过拟合。

前剪枝即为在树的构建过程中，若新增加的分支未使得准确率增加，则不进行该分支操作。

后剪枝即构建完决策树之后，从最底部的分支开始，若去掉该分支，分类准确性增加，则去掉该分支，否则保留。

def chooseBestFeatureToSplit(dataSet):#关键部分代码为寻找最优特征，寻找使得信息增益最大的特征
    numFeatures = len(dataSet[0]) - 1  # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0;
    bestFeature = -1
    for i in range(numFeatures):  # iterate over all the features
        featList = [example[i] for example in dataSet]  # create a list of all the examples of this feature
        uniqueVals = set(featList)  # get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):  # compare this to the best gain so far
            bestInfoGain = infoGain  # if better than current best, set to best
            bestFeature = i
    return bestFeature  # returns an integer

def createTree(dataSet, labels): #构建决策树，寻找最优特征，做为根节点，之后根据特征分为几类，进行递归，最终形成完整决策树
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:  # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del (labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

　对应SKlearn中的API接口：

DecisionTreeClassifier 分类，注意此方法只有前剪枝选项。

DecisionTreeRegressor 回归。

查看全文

相关阅读:
POJ 1953 World Cup Noise
POJ 1995 Raising Modulo Numbers （快速幂取余）
poj 1256 Anagram
POJ 1218 THE DRUNK JAILER
POJ 1316 Self Numbers
POJ 1663 Number Steps
POJ 1664 放苹果
 如何查看DIV被设置什么CSS样式
 独行DIV自适应宽度布局CSS实例与扩大应用范围
 python 从入门到精通教程一：[1]Hello,world!

原文地址：https://www.cnblogs.com/the-home-of-123/p/9174426.html