  • Machine Learning 9: Tree Regression Study Notes

    Machine Learning in Action: Tree Regression

    Machine Learning in Action ch09: Problems and Fixes

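    I have recently been working through Machine Learning in Action. I prefer Python 3, but the book's code is all written for Python 2, so I port it to Python 3 while debugging.
      The earlier chapters caused little trouble, but ch09 (tree regression) raised a number of small problems. These notes record them; corrections are welcome.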

    Problem 1

      The first problem when running the book's ch09 code is a type error involving map. The message is: TypeError: unsupported operand type(s) for /: 'map' and 'int'

    def loadDataSet(fileName):
        dataMat = []
        fr = open(fileName)
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = map(float, curLine)
            dataMat.append(fltLine)
        return dataMat

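      loadDataSet uses map to convert the strings read from the text file to float. In Python 2, map returns a list, so the book's version of this simple conversion runs without error.
      In Python 3, however, map returns a lazy map object, so later arithmetic on the rows fails. The fix is simple.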

    def loadDataSet(fileName):
        dataMat = []
        fr = open(fileName, 'r')
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))  # Method 1
            # fltLine = [float(item) for item in curLine]  # Method 2
            dataMat.append(fltLine)
        fr.close()
        return dataMat

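      Two fixes are shown here:
      Method 1: convert the map object returned by map into a list.
      Method 2: use a list comprehension instead.
      Both produce exactly the same result, so the choice is a matter of taste; performance is not considered here.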
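
      The difference can be seen in a short sketch. The sample line below is made up for illustration and is not taken from the book's data files.

```python
# A made-up tab-separated line, standing in for one row of a data file.
sample_line = "0.036098\t0.155096\n"
fields = sample_line.strip().split("\t")

lazy = map(float, fields)                    # Python 3: a lazy map object, not a list
as_list = list(map(float, fields))           # Method 1
as_comp = [float(item) for item in fields]   # Method 2

print(type(lazy).__name__)    # map
print(as_list == as_comp)     # True

# Arithmetic on the map object fails in the same way as the reported error.
try:
    lazy / 2
except TypeError as exc:
    error_text = str(exc)
    print(error_text)
```

      Both methods build an ordinary list of floats, which is what the rest of the chapter's code expects.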

    Problem 2

      With Problem 1 fixed, the next run fails with a new error saying that the matrix type cannot be hashed: TypeError: unhashable type: 'matrix'.
      chooseBestSplit contains this statement:

    for splitVal in set((dataSet[:, featIndex])):

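      The error arises because iterating over a matrix yields matrix rows, which are unhashable and so cannot be placed in a set. One fix I found is to change the line to: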

    for splitVal in set((dataSet[:, featIndex].T.A.tolist())[0]):

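      That is, take column featIndex of the matrix, transpose it (.T) into a single row, convert it to a NumPy array (.A), convert that to a nested list (.tolist()), and take its first and only row ([0]).
      After this chain of conversions, set() receives plain floats and the loop behaves as originally intended.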
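
      The failure and the fix can be reproduced on a tiny matrix. This is only a sketch: the data is made up, np.matrix is called directly rather than through mat, and the .A1 line is an alternative I am adding for comparison, not something from the book.

```python
import numpy as np

# Small made-up matrix; column 0 contains a repeated value.
data = np.matrix([[1.0, 10.0], [2.0, 20.0], [1.0, 30.0]])
column = data[:, 0]

# Iterating over a matrix yields 1x1 matrices, which are unhashable.
try:
    set(column)
    set_error = None
except TypeError as exc:
    set_error = str(exc)

values_from_post = set(column.T.A.tolist()[0])   # the fix used in this post
values_from_a1 = set(column.A1.tolist())         # alternative: .A1 flattens to a 1-D array

print(set_error)
print(values_from_post == values_from_a1)
```

      Either conversion hands set() plain Python floats, so duplicate split values are removed as intended.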

    Problem 3

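      With Problem 2 fixed, the next run fails again: IndexError: index 0 is out of bounds for axis 0 with size 0.
      In the book, binSplitDataSet takes three arguments: the data set, the feature to split on, and the split value. Given a feature and a value, it splits the data set by array filtering and returns the two subsets. The book's implementation is: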

    def binSplitDataSet(dataSet, feature, value):
        mat0 = dataSet[nonzero(dataSet[:,feature] > value)[0],:][0]
        mat1 = dataSet[nonzero(dataSet[:,feature] <= value)[0],:][0]
        return mat0,mat1

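      Testing already shows that the output does not match the book's, in which mat0 is a single-row vector and mat1 is a three-row matrix.
      The fix is simple: remove the trailing [0] from both statements.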

    def binSplitDataSet(dataSet, feature, value):
        mat0 = dataSet[nonzero(dataSet[:, feature] > value)[0], :]
        mat1 = dataSet[nonzero(dataSet[:, feature] <= value)[0], :]
        return mat0, mat1

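      To explain: the method uses array filtering, keeping the rows whose value for feature is greater than (or less than or equal to) value. The comparison produces a boolean column, and nonzero() returns a tuple of index arrays, one per dimension. Element [0] of that tuple holds the row indices of the entries that passed the filter, that is, their positions in the original matrix. The last step is NumPy indexing: dataSet[index, :] extracts the corresponding rows.
      As I understand it, what is needed here is the two subsets themselves, so the trailing [0] index is superfluous: it keeps only the first row of each subset and raises the IndexError above when a subset is empty.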
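
      The behaviour of nonzero() and of the trailing [0] can be checked on the 4x4 identity matrix. This is a minimal sketch assuming a current NumPy; the variable names are mine, and np.matrix is called directly rather than through mat.

```python
import numpy as np

test_mat = np.matrix(np.eye(4))   # 4x4 identity matrix as test data

# nonzero() on the boolean column returns a tuple: (row indices, column indices).
index_tuple = np.nonzero(test_mat[:, 1] > 0.5)
rows_gt = index_tuple[0]
rows_le = np.nonzero(test_mat[:, 1] <= 0.5)[0]

mat0 = test_mat[rows_gt, :]       # rows where feature 1 > 0.5
mat1 = test_mat[rows_le, :]       # rows where feature 1 <= 0.5
print(mat0.shape, mat1.shape)     # (1, 4) (3, 4)

# With the trailing [0], only the first row of the subset survives ...
first_row_only = test_mat[rows_le, :][0]
print(first_row_only.shape)       # (1, 4)

# ... and an empty subset raises IndexError.
empty_rows = np.nonzero(test_mat[:, 1] > 5.0)[0]
try:
    test_mat[empty_rows, :][0]
    index_error = None
except IndexError as exc:
    index_error = str(exc)
print(index_error)
```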

    Summary

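      These are the problems in ch09 up to section 9.3.2. With all three fixed, the program runs normally and produces the same test results as the book.
      Screenshot of the run: [image: run results]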

    test9.py

    # -*- coding: utf-8 -*-
    
    
    import sys
    sys.path.append(".")  # sys.path entries are directories; "." is assumed to hold regTrees.py
    
    import regTrees
    from numpy import *
    import matplotlib.pyplot as plt
    
    
    
    # testMat = mat(eye(4))
    # print(testMat)
    # mat0, mat1 = regTrees.binSplitDataSet(testMat, 1, 0.5)
    # print("test9")
    # print(mat0)
    # print("mat0 over")
    # print(mat1)
    
    # myDat = regTrees.loadDataSet("ex00.txt")
    # myMat = mat(myDat)
    # regTree = regTrees.createTree(myMat)
    # print(regTree)
    #
    # myDat1 = regTrees.loadDataSet("ex0.txt")
    # myMat1 = mat(myDat1)
    # regTree1 = regTrees.createTree(myMat1)
    # print(regTree1)
    
    # myDat2 = regTrees.loadDataSet("ex2.txt")
    # myMat2 = mat(myDat2)
    # # regTree2 = regTrees.createTree(myMat2)
    # # regTree2 = regTrees.createTree(myMat2, ops = (1000, 4))
    # # print(regTree2)
    #
    # myTree = regTrees.createTree(myMat2, ops = (0, 1))
    # myDatTest = regTrees.loadDataSet("ex2test.txt")
    # myMat2Test = mat(myDatTest)
    # myNewTree = regTrees.prune(myTree, myMat2Test)
    # print(myNewTree)
    
    
    
    trainMat = mat(regTrees.loadDataSet('bikeSpeedVsIq_train.txt'))
    testMat = mat(regTrees.loadDataSet('bikeSpeedVsIq_test.txt'))
    
    # Regression tree
    myTree = regTrees.createTree(trainMat, ops = (1, 20))
    yHat = regTrees.createForeCast(myTree, testMat[:, 0])
    resultCorr0 = corrcoef(yHat, testMat[:, 1], rowvar = 0)  # corrcoef returns the correlation coefficient matrix (how closely the two vectors agree)
    print(resultCorr0)
    
    # Model tree
    myTree = regTrees.createTree(trainMat, regTrees.modelLeaf, regTrees.modelErr, (1,20))
    yHat = regTrees.createForeCast(myTree, testMat[:,0], regTrees.modelTreeEval)
    resultCorr1 = corrcoef(yHat, testMat[:,1], rowvar=0)
    print(resultCorr1)
    
    # Standard linear regression
    ws, X, Y = regTrees.linearSolve(trainMat)
    print(ws)
    for i in range(shape(testMat)[0]) :
        yHat[i] = testMat[i, 0] * ws[1, 0] + ws[0, 0]
    resultCorr2 = corrcoef(yHat, testMat[:,1], rowvar=0)
    print(resultCorr2)
    
    print("over!")
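
      Each corrcoef call in test9.py prints a 2x2 matrix; the off-diagonal entry is the correlation between the predictions and the true targets, and a value closer to 1.0 means a better fit. A minimal sketch with made-up numbers (not the bike data):

```python
import numpy as np

# Made-up targets and predictions, stored as column vectors like yHat and testMat[:, 1].
y_true = np.array([[1.0], [2.0], [3.0], [4.0]])
y_pred = np.array([[1.1], [1.9], [3.2], [3.8]])

# rowvar=False treats each column as one variable, giving a 2x2 matrix.
corr = np.corrcoef(y_pred, y_true, rowvar=False)
r = corr[0, 1]   # correlation between predictions and targets

print(corr.shape)
print(r)
```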

    regTrees.py

    '''
    Created on Feb 4, 2011
    Tree-Based Regression Methods
    @author: Peter Harrington
    '''
    from numpy import *
    
    def loadDataSet(fileName):
        dataMat = []
        fr = open(fileName, 'r')
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))  # Method 1
            # fltLine = [float(item) for item in curLine]  # Method 2
            dataMat.append(fltLine)
        fr.close()
        return dataMat
    
    def binSplitDataSet(dataSet, feature, value):
        mat0 = dataSet[nonzero(dataSet[:,feature] > value)[0],:]
        mat1 = dataSet[nonzero(dataSet[:,feature] <= value)[0],:]
        return mat0,mat1
    
    def regLeaf(dataSet):#returns the value used for each leaf
        return mean(dataSet[:,-1])
    
    def regErr(dataSet):
        return var(dataSet[:,-1]) * shape(dataSet)[0]
    
    def linearSolve(dataSet):   #helper function used in two places
        m,n = shape(dataSet)
        X = mat(ones((m,n))); Y = mat(ones((m,1)))#create a copy of data with 1 in 0th postion
        X[:,1:n] = dataSet[:,0:n-1]; Y = dataSet[:,-1]#and strip out Y
        xTx = X.T*X
        if linalg.det(xTx) == 0.0:
            raise NameError('This matrix is singular, cannot do inverse,\n'
                            'try increasing the second value of ops')
        ws = xTx.I * (X.T * Y)
        return ws,X,Y
    
    def modelLeaf(dataSet):#create linear model and return coeficients
        ws,X,Y = linearSolve(dataSet)
        return ws
    
    def modelErr(dataSet):
        ws,X,Y = linearSolve(dataSet)
        yHat = X * ws
        return sum(power(Y - yHat,2))
    
    def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1,4)):
        tolS = ops[0]; tolN = ops[1]
        #if all the target variables are the same value: quit and return value
        if len(set(dataSet[:,-1].T.tolist()[0])) == 1: #exit cond 1
            return None, leafType(dataSet)
        m,n = shape(dataSet)
        #the choice of the best feature is driven by Reduction in RSS error from mean
        S = errType(dataSet)
        bestS = inf; bestIndex = 0; bestValue = 0
        for featIndex in range(n-1):
            for splitVal in set((dataSet[:, featIndex].T.A.tolist())[0]):
                mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
                if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN): continue
                newS = errType(mat0) + errType(mat1)
                if newS < bestS: 
                    bestIndex = featIndex
                    bestValue = splitVal
                    bestS = newS
        #if the decrease (S-bestS) is less than a threshold don't do the split
        if (S - bestS) < tolS: 
            return None, leafType(dataSet) #exit cond 2
        mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
        if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):  #exit cond 3
            return None, leafType(dataSet)
        return bestIndex,bestValue#returns the best feature to split on
                                  #and the value used for that split
    
    def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1,4)):#assume dataSet is NumPy Mat so we can array filtering
        feat, val = chooseBestSplit(dataSet, leafType, errType, ops)#choose the best split
        if feat is None: return val #if the splitting hit a stop condition return val
        retTree = {}
        retTree['spInd'] = feat
        retTree['spVal'] = val
        lSet, rSet = binSplitDataSet(dataSet, feat, val)
        retTree['left'] = createTree(lSet, leafType, errType, ops)
        retTree['right'] = createTree(rSet, leafType, errType, ops)
        return retTree  
    
    def isTree(obj):
        return (type(obj).__name__=='dict')
    
    def getMean(tree):
        if isTree(tree['right']): tree['right'] = getMean(tree['right'])
        if isTree(tree['left']): tree['left'] = getMean(tree['left'])
        return (tree['left']+tree['right'])/2.0
        
    def prune(tree, testData):
        if shape(testData)[0] == 0: return getMean(tree) #if we have no test data collapse the tree
        if (isTree(tree['right']) or isTree(tree['left'])):#if the branches are not trees try to prune them
            lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        if isTree(tree['left']): tree['left'] = prune(tree['left'], lSet)
        if isTree(tree['right']): tree['right'] =  prune(tree['right'], rSet)
        #if they are now both leafs, see if we can merge them
        if not isTree(tree['left']) and not isTree(tree['right']):
            lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
            errorNoMerge = (sum(power(lSet[:,-1] - tree['left'],2)) +
                sum(power(rSet[:,-1] - tree['right'],2)))
            treeMean = (tree['left']+tree['right'])/2.0
            errorMerge = sum(power(testData[:,-1] - treeMean,2))
            if errorMerge < errorNoMerge: 
                print("merging")
                return treeMean
            else: return tree
        else: return tree
        
    def regTreeEval(model, inDat):
        return float(model)
    
    def modelTreeEval(model, inDat):
        n = shape(inDat)[1]
        X = mat(ones((1,n+1)))
        X[:,1:n+1]=inDat
        return float(X*model)
    
    def treeForeCast(tree, inData, modelEval=regTreeEval):
        if not isTree(tree): return modelEval(tree, inData)
        if inData[0, tree['spInd']] > tree['spVal']: #inData is a 1 x n matrix, so index row 0 and the split column
            if isTree(tree['left']): return treeForeCast(tree['left'], inData, modelEval)
            else: return modelEval(tree['left'], inData)
        else:
            if isTree(tree['right']): return treeForeCast(tree['right'], inData, modelEval)
            else: return modelEval(tree['right'], inData)
            
    def createForeCast(tree, testData, modelEval=regTreeEval):
        m=len(testData)
        yHat = mat(zeros((m,1)))
        for i in range(m):
            yHat[i,0] = treeForeCast(tree, mat(testData[i]), modelEval)
        return yHat
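
      regErr returns var * m, which equals the total squared deviation from the mean, and chooseBestSplit looks for the split value that minimizes the sum of this quantity over the two subsets. The idea can be sketched with plain ndarrays and a single feature; the helper names and the data below are hypothetical and not part of regTrees.py.

```python
import numpy as np

def total_sq_error(targets):
    """Variance times sample count, i.e. the sum of squared deviations from the mean."""
    return np.var(targets) * len(targets)

def best_threshold(x, y):
    """Try every observed x value as a split and keep the one with the lowest total error."""
    best_val, best_err = None, np.inf
    for val in sorted(set(x.tolist())):
        left, right = y[x > val], y[x <= val]   # same > / <= convention as binSplitDataSet
        if len(left) == 0 or len(right) == 0:
            continue                             # skip splits that leave one side empty
        err = total_sq_error(left) + total_sq_error(right)
        if err < best_err:
            best_val, best_err = val, err
    return best_val, best_err

# Made-up data with an obvious jump between x = 0.3 and x = 0.7.
x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([1.0, 1.1, 0.9, 3.0, 3.1, 2.9])
split_val, split_err = best_threshold(x, y)
print(split_val, split_err)
```

      The sketch leaves out the tolS and tolN stopping rules that chooseBestSplit applies on top of this search.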

    treeExplore.py

    from numpy import *
    
    from tkinter import *
    import regTrees
    
    import matplotlib
    matplotlib.use('TkAgg')
    from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
    from matplotlib.figure import Figure
    
    def reDraw(tolS,tolN):
        reDraw.f.clf()        # clear the figure
        reDraw.a = reDraw.f.add_subplot(111)
        if chkBtnVar.get():
            if tolN < 2: tolN = 2
            myTree=regTrees.createTree(reDraw.rawDat, regTrees.modelLeaf, regTrees.modelErr, (tolS,tolN))
            yHat = regTrees.createForeCast(myTree, reDraw.testDat, regTrees.modelTreeEval)
        else:
            myTree=regTrees.createTree(reDraw.rawDat, ops=(tolS,tolN))
            yHat = regTrees.createForeCast(myTree, reDraw.testDat)
        reDraw.a.scatter(reDraw.rawDat[:,0].tolist(), reDraw.rawDat[:,1].tolist(), s=5) #use scatter for data set
        reDraw.a.plot(reDraw.testDat, yHat, linewidth=2.0) #use plot for yHat
        reDraw.canvas.draw()  # FigureCanvasTkAgg.show() was deprecated in newer Matplotlib; draw() replaces it
        
    def getInputs():
        try: tolN = int(tolNentry.get())
        except: 
            tolN = 10 
            print("enter Integer for tolN")
            tolNentry.delete(0, END)
            tolNentry.insert(0,'10')
        try: tolS = float(tolSentry.get())
        except: 
            tolS = 1.0 
            print("enter Float for tolS")
            tolSentry.delete(0, END)
            tolSentry.insert(0,'1.0')
        return tolN,tolS
    
    def drawNewTree():
        tolN,tolS = getInputs()#get values from Entry boxes
        reDraw(tolS,tolN)
        
    root=Tk()
    
    reDraw.f = Figure(figsize=(5,4), dpi=100) #create canvas
    reDraw.canvas = FigureCanvasTkAgg(reDraw.f, master=root)
    reDraw.canvas.draw()  # draw() replaces the deprecated show()
    reDraw.canvas.get_tk_widget().grid(row=0, columnspan=3)
    
    Label(root, text="tolN").grid(row=1, column=0)
    tolNentry = Entry(root)
    tolNentry.grid(row=1, column=1)
    tolNentry.insert(0,'10')
    Label(root, text="tolS").grid(row=2, column=0)
    tolSentry = Entry(root)
    tolSentry.grid(row=2, column=1)
    tolSentry.insert(0,'1.0')
    Button(root, text="ReDraw", command=drawNewTree).grid(row=1, column=2, rowspan=3)
    chkBtnVar = IntVar()
    chkBtn = Checkbutton(root, text="Model Tree", variable = chkBtnVar)
    chkBtn.grid(row=3, column=0, columnspan=2)
    
    reDraw.rawDat = mat(regTrees.loadDataSet('sine.txt'))
    reDraw.testDat = arange(min(reDraw.rawDat[:,0]), max(reDraw.rawDat[:,0]), 0.01)
    reDraw(1.0, 10)
    
    root.mainloop()
  • Original article: https://www.cnblogs.com/Vae1990Silence/p/8484345.html