• Decision Trees: From Principle to Implementation

1. Introduction

Decision trees appear in virtually every introductory machine learning book. Their decision process closely mirrors how we reason in everyday life, so they are very easy to understand, and they also pack in a good amount of information theory, which makes them a nice entry-level application of it. Decision trees handle both regression and classification, but here, as is typical, we focus on classification.

Personally, I think of a decision tree as extracting features from a body of data and ordering them from most to least discriminative. A familiar application: asking you a series of questions to guess what you are thinking of.

Why is the first question always "male or female?" Why? Read on and you'll know.

2. Code

    from math import log
    import operator

    def createDataSet():
        dataSet = [[1, 1, 'yes'],
                   [1, 1, 'yes'],
                   [1, 0, 'no'],
                   [0, 1, 'no'],
                   [0, 1, 'no']]
        labels = ['no surfacing','flippers']
        #change to discrete values
        return dataSet, labels

    def calcShannonEnt(dataSet):
        numEntries = len(dataSet)
        labelCounts = {}
        for featVec in dataSet: #count the occurrences of each class label
            currentLabel = featVec[-1]
            if currentLabel not in labelCounts: labelCounts[currentLabel] = 0
            labelCounts[currentLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            prob = float(labelCounts[key])/numEntries
            shannonEnt -= prob * log(prob,2) #log base 2
        return shannonEnt

    def splitDataSet(dataSet, axis, value):
        retDataSet = []
        for featVec in dataSet:
            if featVec[axis] == value:
                reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
                reducedFeatVec.extend(featVec[axis+1:])
                retDataSet.append(reducedFeatVec)
        return retDataSet

    def chooseBestFeatureToSplit(dataSet):
        numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
        baseEntropy = calcShannonEnt(dataSet)
        bestInfoGain = 0.0; bestFeature = -1
        for i in range(numFeatures):        #iterate over all the features
            featList = [example[i] for example in dataSet] #this feature's value in every example
            uniqueVals = set(featList)       #get a set of unique values
            newEntropy = 0.0
            for value in uniqueVals:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet)/float(len(dataSet))
                newEntropy += prob * calcShannonEnt(subDataSet)
            infoGain = baseEntropy - newEntropy     #calculate the info gain; i.e. reduction in entropy
            if (infoGain > bestInfoGain):       #compare this to the best gain so far
                bestInfoGain = infoGain         #if better than current best, set to best
                bestFeature = i
        return bestFeature                      #returns an integer

    def majorityCnt(classList):
        classCount={}
        for vote in classList:
            if vote not in classCount: classCount[vote] = 0
            classCount[vote] += 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]

    def createTree(dataSet,labels):
        classList = [example[-1] for example in dataSet]
        if classList.count(classList[0]) == len(classList):
            return classList[0] #stop splitting when all of the classes are equal
        if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
            return majorityCnt(classList)
        bestFeat = chooseBestFeatureToSplit(dataSet)
        bestFeatLabel = labels[bestFeat]
        myTree = {bestFeatLabel:{}}
        del(labels[bestFeat])
        featValues = [example[bestFeat] for example in dataSet]
        uniqueVals = set(featValues)
        for value in uniqueVals:
            subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
            myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
        return myTree

    def classify(inputTree,featLabels,testVec):
        firstStr = next(iter(inputTree))    #the root feature name (the dict's single key)
        secondDict = inputTree[firstStr]
        featIndex = featLabels.index(firstStr)
        key = testVec[featIndex]
        valueOfFeat = secondDict[key]
        if isinstance(valueOfFeat, dict):
            classLabel = classify(valueOfFeat, featLabels, testVec)
        else: classLabel = valueOfFeat
        return classLabel

    def storeTree(inputTree,filename):
        import pickle
        fw = open(filename,'wb')    #pickle requires binary mode in Python 3
        pickle.dump(inputTree,fw)
        fw.close()

    def grabTree(filename):
        import pickle
        fr = open(filename,'rb')
        return pickle.load(fr)

3. Algorithm Details

❤ Information Gain

Pass in a dataset; this returns its Shannon entropy (information gain is then computed from entropies below):

    def calcShannonEnt(dataSet):
        numEntries = len(dataSet)
        labelCounts = {}
        for featVec in dataSet: #count the occurrences of each class label
            currentLabel = featVec[-1]
            if currentLabel not in labelCounts: labelCounts[currentLabel] = 0
            labelCounts[currentLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            prob = float(labelCounts[key])/numEntries
            shannonEnt -= prob * log(prob,2) #log base 2
        return shannonEnt

Once we can compute the entropy, we simply split the dataset on whichever feature yields the largest information gain.

e.g. run it on the dataset below:

    [[1, 1, 'yes'],
     [1, 1, 'yes'],
     [1, 0, 'no'],
     [0, 1, 'no'],
     [0, 1, 'no']]

labelCounts is a map from class label to count:

    currentLabel    labelCounts[currentLabel]    prob
    yes             2                            0.4
    no              3                            0.6

Information theory then gives the entropy: -(0.4*log2(0.4) + 0.6*log2(0.6)) ≈ 0.971.
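
A quick sanity check of that arithmetic (a small sketch; it assumes the full listing above is saved as trees.py, as in the test session at the end):

    from math import log
    from trees import createDataSet, calcShannonEnt

    myDat, labels = createDataSet()
    print(calcShannonEnt(myDat))                      # 0.9709505944546686
    print(-(0.4 * log(0.4, 2) + 0.6 * log(0.6, 2)))  # same value, by hand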

❤ Splitting the Dataset

※ Splitting on a given feature

Inputs: the dataset, the index axis of a feature (0-based), and a value of that feature.

Output: the sub-dataset of examples whose feature axis equals value, with that feature column removed.

    def splitDataSet(dataSet, axis, value):
        retDataSet = []
        for featVec in dataSet:
            if featVec[axis] == value:
                reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
                reducedFeatVec.extend(featVec[axis+1:])
                retDataSet.append(reducedFeatVec)
        return retDataSet
e.g. with myDat being

    [[1, 1, 'yes'],
     [1, 1, 'yes'],
     [1, 0, 'no'],
     [0, 1, 'no'],
     [0, 1, 'no']]

calling splitDataSet(myDat, 0, 1) returns

    [[1, 'yes'], [1, 'yes'], [0, 'no']]
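
Note that featVec[:axis] creates a new list rather than a reference, so the rows of the original dataSet are never mutated. A tiny sketch of the slicing trick (values are illustrative):

    row = [1, 1, 'yes']
    reduced = row[:0]            # empty copy, not a view of row
    reduced.extend(row[0 + 1:])  # everything after column 0
    print(reduced)               # [1, 'yes']
    print(row)                   # [1, 1, 'yes'] -- unchanged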

※ Choosing the best split

Input: the dataset.

Output: the index of the feature whose split produces the largest drop in entropy, i.e. the largest information gain.

    def chooseBestFeatureToSplit(dataSet):
        numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
        baseEntropy = calcShannonEnt(dataSet)
        bestInfoGain = 0.0; bestFeature = -1
        for i in range(numFeatures):        #iterate over all the features
            featList = [example[i] for example in dataSet] #this feature's value in every example
            uniqueVals = set(featList)       #get a set of unique values
            newEntropy = 0.0
            for value in uniqueVals:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet)/float(len(dataSet))
                newEntropy += prob * calcShannonEnt(subDataSet)
            infoGain = baseEntropy - newEntropy     #calculate the info gain; i.e. reduction in entropy
            if (infoGain > bestInfoGain):       #compare this to the best gain so far
                bestInfoGain = infoGain         #if better than current best, set to best
                bestFeature = i
        return bestFeature                      #returns an integer
e.g. with myDat being

    [[1, 1, 'yes'],
     [1, 1, 'yes'],
     [1, 0, 'no'],
     [0, 1, 'no'],
     [0, 1, 'no']]

call chooseBestFeatureToSplit(myDat):

    First pass:  split on feature 0 with value 1, then with value 0,
                 and compute the weighted entropy of the resulting subsets.
    Second pass: split on feature 1 with value 1, then with value 0,
                 and compute the weighted entropy of the resulting subsets.
    ...
    Pick the feature whose split lowers the entropy the most, i.e. the one
    with the largest information gain (worked numbers below).
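
Working through the numbers for this toy dataset (a sketch reusing the functions above; values rounded):

    from trees import createDataSet, calcShannonEnt, splitDataSet, chooseBestFeatureToSplit

    myDat, labels = createDataSet()
    base = calcShannonEnt(myDat)                        # 0.971

    # Feature 0: subsets of size 3 (2 yes / 1 no) and size 2 (all no)
    e0 = 3/5 * calcShannonEnt(splitDataSet(myDat, 0, 1)) \
       + 2/5 * calcShannonEnt(splitDataSet(myDat, 0, 0))
    print(base - e0)                                    # gain ~ 0.420

    # Feature 1: subsets of size 4 (2 yes / 2 no) and size 1 (all no)
    e1 = 4/5 * calcShannonEnt(splitDataSet(myDat, 1, 1)) \
       + 1/5 * calcShannonEnt(splitDataSet(myDat, 1, 0))
    print(base - e1)                                    # gain ~ 0.171

    print(chooseBestFeatureToSplit(myDat))              # 0: feature 0 ('no surfacing') wins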
      

❤ Recursively Building the Tree

    def majorityCnt(classList):
        classCount={}
        for vote in classList:
            if vote not in classCount: classCount[vote] = 0
            classCount[vote] += 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]

What it does: given a list of class labels, count how often each distinct label occurs and return the most frequent one.
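
A quick usage example; the collections.Counter one-liner is an equivalent alternative, not what the listing uses:

    from collections import Counter
    from trees import majorityCnt

    print(majorityCnt(['yes', 'no', 'no']))                   # 'no'
    print(Counter(['yes', 'no', 'no']).most_common(1)[0][0])  # 'no' as well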

O(∩_∩)O~ Time to build the tree!

Two inputs: the dataset and the list of feature labels.

    def createTree(dataSet,labels):
        classList = [example[-1] for example in dataSet]
        if classList.count(classList[0]) == len(classList):
            return classList[0] #stop splitting when all of the classes are equal
        if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
            return majorityCnt(classList)
        bestFeat = chooseBestFeatureToSplit(dataSet)
        bestFeatLabel = labels[bestFeat]
        myTree = {bestFeatLabel:{}}
        del(labels[bestFeat])
        featValues = [example[bestFeat] for example in dataSet]
        uniqueVals = set(featValues)
        for value in uniqueVals:
            subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
            myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
        return myTree
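
One caveat: createTree deletes entries from labels (del(labels[bestFeat])), so it mutates the list the caller passes in. If you still need the label list afterwards, e.g. for classify, pass in a copy (a small sketch, not part of the original listing):

    from trees import createDataSet, createTree, classify

    myDat, labels = createDataSet()
    myTree = createTree(myDat, labels[:])    # pass a copy; labels survives intact
    print(classify(myTree, labels, [1, 0]))  # 'no'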

O(∩_∩)O~~ Now we can use the tree to classify!

    def classify(inputTree,featLabels,testVec):
        firstStr = next(iter(inputTree))    #the root feature name (the dict's single key)
        secondDict = inputTree[firstStr]
        featIndex = featLabels.index(firstStr)
        key = testVec[featIndex]
        valueOfFeat = secondDict[key]
        if isinstance(valueOfFeat, dict):
            classLabel = classify(valueOfFeat, featLabels, testVec)
        else: classLabel = valueOfFeat
        return classLabel

Testing it:

    >>> import trees
    >>> myDat,labels=trees.createDataSet()
    >>> myTree=trees.createTree(myDat,labels)
    >>> myTree
    {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
    >>> lab=trees.classify(myTree,['no surfacing','flippers'],[0,1])
    >>> lab
    'no'
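
The listing also defines storeTree and grabTree, which persist a finished tree with pickle; building a tree is far more expensive than using one, so caching it to disk is worthwhile. A minimal round trip (the filename is illustrative):

    >>> trees.storeTree(myTree, 'classifierStorage.txt')
    >>> trees.grabTree('classifierStorage.txt')
    {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}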