Machine Learning in Action: Tree Regression
Machine Learning in Action ch09: Problems and Fixes
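I have recently been working through Machine Learning in Action. I prefer Python 3, but all the code in the book is written in Python 2, so I port it to Python 3 as I debug it.
The earlier chapters caused little trouble, but ch09 (tree regression) turned up a number of small problems. I am recording them here; corrections are welcome.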
Problem 1
The first problem I hit when running the ch09 code from the book was a type error: TypeError: unsupported operand type(s) for /: 'map' and 'int'. The book's loader is:
def loadDataSet(fileName):
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')  # the book's data files are tab-delimited
        fltLine = map(float, curLine)       # a list in Python 2, a map object in Python 3
        dataMat.append(fltLine)
    return dataMat
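The loadDataSet function uses map to convert each field read from the text file from a string to a float. In Python 2, map returns a list, so this simple conversion works exactly as written in the book.
In Python 3, however, map returns a lazy map object, so dataMat ends up holding map objects instead of lists of floats, and later arithmetic on the data fails. The fix is simple: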
def loadDataSet(fileName):
    dataMat = []
    fr = open(fileName, 'r')
    for line in fr.readlines():
        curLine = line.strip().split('\t')  # the book's data files are tab-delimited
        fltLine = list(map(float, curLine))  # option 1
        # fltLine = [float(item) for item in curLine]  # option 2
        dataMat.append(fltLine)
    fr.close()
    return dataMat
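Two fixes are given here:
Option 1: convert the map object returned by map into a list.
Option 2: use a list comprehension instead.
Both produce exactly the same result, so pick whichever you prefer; performance is not considered here.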
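The difference is easy to see on a single line of text. The following is only a minimal sketch: the helper name `parse_line`, the sample string, and the assumption of tab-separated fields are mine, not the book's.

```python
def parse_line(line, sep="\t"):
    """Split one text line and convert every field to float (hypothetical helper)."""
    fields = line.strip().split(sep)
    return list(map(float, fields))


line = "1.0\t2.5\t3.0\n"  # made-up sample row, assumed to be tab-separated

# In Python 3 map() is lazy: nothing is converted until it is iterated.
lazy = map(float, line.strip().split("\t"))
print(type(lazy).__name__)  # map

row = parse_line(line)
print(row)

# The list comprehension gives the same list as list(map(...)).
assert row == [float(item) for item in line.strip().split("\t")]
```

Either form hands NumPy a real list of floats, which is what the rest of the chapter's code expects.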
Problem 2
With Problem 1 fixed, running the code again produces a new error saying that the matrix type cannot be hashed: TypeError: unhashable type: 'matrix'.
The chooseBestSplit function contains this statement:
for splitVal in set((dataSet[:, featIndex])):
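The error occurs because iterating over a matrix column yields 1x1 matrix objects, which are unhashable and therefore cannot be placed in a set. One fix I found is to change the statement to: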
for splitVal in set((dataSet[:, featIndex].T.A.tolist())[0]):
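That is, the column at index featIndex is first transposed (.T), then converted to a NumPy array (.A), and then to a nested list (.tolist()); the trailing [0] takes the single row of that list.
After this chain of conversions, set() receives plain floats and the loop behaves as originally intended.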
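The conversion chain can be tried in isolation. This is a sketch on made-up data; the variable names are illustrative, and I use `np.asmatrix` rather than `mat` on the assumption that the code may run on a NumPy version where the `mat` alias is no longer available.

```python
import numpy as np

# Small stand-in data set; the values are invented for illustration.
data = np.asmatrix([[1.0, 10.0],
                    [2.0, 20.0],
                    [1.0, 30.0]])
feat_index = 0
column = data[:, feat_index]  # still a 3x1 matrix, not a flat sequence

try:
    set(column)  # iterating yields 1x1 matrices, which cannot be hashed
except TypeError as err:
    print("TypeError:", err)

# Transpose to 1x3, convert to ndarray, then to a nested list; [0] is its only row.
values = column.T.A.tolist()[0]
split_values = set(values)
print(sorted(split_values))
```

A shorter spelling that should give the same set is `set(column.A1.tolist())`, since `.A1` flattens a matrix to a 1-D array.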
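Problem 3
With Problem 2 fixed, running the code again produces yet another error: IndexError: index 0 is out of bounds for axis 0 with size 0.
In the book, binSplitDataSet takes three arguments: the data set, the feature to split on, and the value of that feature. Given a feature and a value, it uses array filtering to split the data set and returns the two subsets. The book's implementation is: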
def binSplitDataSet(dataSet, feature, value):
    mat0 = dataSet[nonzero(dataSet[:,feature] > value)[0],:][0]
    mat1 = dataSet[nonzero(dataSet[:,feature] <= value)[0],:][0]
    return mat0,mat1
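In fact, the problem shows up as soon as you test the function: the output does not match the book's, in which mat0 is a single-row vector and mat1 is a three-row matrix.
The fix is simple: remove the trailing [0] from both statements.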
def binSplitDataSet(dataSet, feature, value):
    mat0 = dataSet[nonzero(dataSet[:, feature] > value)[0], :]
    mat1 = dataSet[nonzero(dataSet[:, feature] <= value)[0], :]
    return mat0, mat1
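To explain: the function uses array filtering. The comparison dataSet[:, feature] > value (or <= value) produces a boolean matrix marking the rows whose feature value passes the test. nonzero() then returns a tuple of index arrays, one per dimension, and element [0] of that tuple holds the row indices of the matching entries, in other words their positions in the original data. The last step is a NumPy indexing operation: dataSet[index, :] extracts the corresponding rows.
As I understand it, what is needed here is the two subsets themselves, so the trailing [0] is redundant: it keeps only the first row of each subset, and it raises the IndexError above whenever a subset is empty.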
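The behaviour described above can be checked on the 4x4 identity matrix that the book uses as test data. This is a sketch, not the book's listing: `bin_split` is my own name for the corrected function, and `np.asmatrix` stands in for `mat` on the assumption that the `mat` alias may be unavailable in the installed NumPy.

```python
import numpy as np


def bin_split(data_set, feature, value):
    """Split rows by comparing one feature column with a threshold."""
    rows_gt = np.nonzero(data_set[:, feature] > value)[0]   # row indices where feature > value
    rows_le = np.nonzero(data_set[:, feature] <= value)[0]  # row indices where feature <= value
    return data_set[rows_gt, :], data_set[rows_le, :]


test_mat = np.asmatrix(np.eye(4))

mask = test_mat[:, 1] > 0.5  # 4x1 boolean matrix
print(np.nonzero(mask))      # tuple of (row indices, column indices)

mat0, mat1 = bin_split(test_mat, 1, 0.5)
print(mat0.shape, mat1.shape)  # (1, 4) (3, 4)

# No row has feature 1 greater than 2.0, so the first subset is empty.
empty, everything = bin_split(test_mat, 1, 2.0)
print(empty.shape)  # (0, 4)

try:
    empty[0]  # this is what the book's trailing [0] does on an empty subset
except IndexError as err:
    print("IndexError:", err)
```

An empty subset is a perfectly valid zero-row matrix, and chooseBestSplit already rejects such splits through its tolN check, which is why returning the subsets unindexed is enough.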
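Summary
These are all the problems I ran into in ch09 up to section 9.3.2. With the three fixes applied, the program runs correctly and produces the same test results as the book.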
[Screenshot of the program output]
test9.py
#-*- coding:utf-8
import sys
sys.path.append("regTrees.py")
import regTrees
from numpy import *
import matplotlib.pyplot as plt

# testMat = mat(eye(4))
# print(testMat)
# mat0, mat1 = regTrees.binSplitDataSet(testMat, 1, 0.5)
# print("test9")
# print(mat0)
# print("mat0 over")
# print(mat1)

# myDat = regTrees.loadDataSet("ex00.txt")
# myMat = mat(myDat)
# regTree = regTrees.createTree(myMat)
# print(regTree)
#
# myDat1 = regTrees.loadDataSet("ex0.txt")
# myMat1 = mat(myDat1)
# regTree1 = regTrees.createTree(myMat1)
# print(regTree1)

# myDat2 = regTrees.loadDataSet("ex2.txt")
# myMat2 = mat(myDat2)
# # regTree2 = regTrees.createTree(myMat2)
# # regTree2 = regTrees.createTree(myMat2, ops = (1000, 4))
# # print(regTree2)
#
# myTree = regTrees.createTree(myMat2, ops = (0, 1))
# myDatTest = regTrees.loadDataSet("ex2test.txt")
# myMat2Test = mat(myDatTest)
# myNewTree = regTrees.prune(myTree, myMat2Test)
# print(myNewTree)

trainMat = mat(regTrees.loadDataSet('bikeSpeedVsIq_train.txt'))
testMat = mat(regTrees.loadDataSet('bikeSpeedVsIq_test.txt'))

# regression tree
myTree = regTrees.createTree(trainMat, ops = (1, 20))
yHat = regTrees.createForeCast(myTree, testMat[:, 0])
# corrcoef returns the correlation coefficient matrix (how similar the two vectors are)
resultCorr0 = corrcoef(yHat, testMat[:, 1], rowvar = 0)
print(resultCorr0)

# model tree
myTree = regTrees.createTree(trainMat, regTrees.modelLeaf, regTrees.modelErr, (1,20))
yHat = regTrees.createForeCast(myTree, testMat[:,0], regTrees.modelTreeEval)
resultCorr1 = corrcoef(yHat, testMat[:,1], rowvar=0)
print(resultCorr1)

# standard linear regression
ws, X, Y = regTrees.linearSolve(trainMat)
print(ws)
for i in range(shape(testMat)[0]):
    yHat[i] = testMat[i, 0] * ws[1, 0] + ws[0, 0]
resultCorr2 = corrcoef(yHat, testMat[:,1], rowvar=0)
print(resultCorr2)

print("over!")
regTrees.py
'''
Created on Feb 4, 2011
Tree-Based Regression Methods
@author: Peter Harrington
'''
from numpy import *

def loadDataSet(fileName):
    dataMat = []
    fr = open(fileName, 'r')
    for line in fr.readlines():
        curLine = line.strip().split('\t')  # the book's data files are tab-delimited
        fltLine = list(map(float, curLine))  # option 1
        # fltLine = [float(item) for item in curLine]  # option 2
        dataMat.append(fltLine)
    fr.close()
    return dataMat

def binSplitDataSet(dataSet, feature, value):
    mat0 = dataSet[nonzero(dataSet[:,feature] > value)[0],:]
    mat1 = dataSet[nonzero(dataSet[:,feature] <= value)[0],:]
    return mat0,mat1

def regLeaf(dataSet):#returns the value used for each leaf
    return mean(dataSet[:,-1])

def regErr(dataSet):
    return var(dataSet[:,-1]) * shape(dataSet)[0]

def linearSolve(dataSet):   #helper function used in two places
    m,n = shape(dataSet)
    X = mat(ones((m,n))); Y = mat(ones((m,1)))#create a copy of data with 1 in 0th position
    X[:,1:n] = dataSet[:,0:n-1]; Y = dataSet[:,-1]#and strip out Y
    xTx = X.T*X
    if linalg.det(xTx) == 0.0:
        raise NameError('This matrix is singular, cannot do inverse, try increasing the second value of ops')
    ws = xTx.I * (X.T * Y)
    return ws,X,Y

def modelLeaf(dataSet):#create linear model and return coefficients
    ws,X,Y = linearSolve(dataSet)
    return ws

def modelErr(dataSet):
    ws,X,Y = linearSolve(dataSet)
    yHat = X * ws
    return sum(power(Y - yHat,2))

def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1,4)):
    tolS = ops[0]; tolN = ops[1]
    #if all the target variables are the same value: quit and return value
    if len(set(dataSet[:,-1].T.tolist()[0])) == 1: #exit cond 1
        return None, leafType(dataSet)
    m,n = shape(dataSet)
    #the choice of the best feature is driven by Reduction in RSS error from mean
    S = errType(dataSet)
    bestS = inf; bestIndex = 0; bestValue = 0
    for featIndex in range(n-1):
        for splitVal in set((dataSet[:, featIndex].T.A.tolist())[0]):
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN): continue
            newS = errType(mat0) + errType(mat1)
            if newS < bestS:
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS
    #if the decrease (S-bestS) is less than a threshold don't do the split
    if (S - bestS) < tolS:
        return None, leafType(dataSet) #exit cond 2
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN):  #exit cond 3
        return None, leafType(dataSet)
    return bestIndex,bestValue#returns the best feature to split on
                              #and the value used for that split

def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1,4)):#assume dataSet is NumPy Mat so we can array filtering
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops)#choose the best split
    if feat == None: return val #if the splitting hit a stop condition return val
    retTree = {}
    retTree['spInd'] = feat
    retTree['spVal'] = val
    lSet, rSet = binSplitDataSet(dataSet, feat, val)
    retTree['left'] = createTree(lSet, leafType, errType, ops)
    retTree['right'] = createTree(rSet, leafType, errType, ops)
    return retTree

def isTree(obj):
    return (type(obj).__name__=='dict')

def getMean(tree):
    if isTree(tree['right']): tree['right'] = getMean(tree['right'])
    if isTree(tree['left']): tree['left'] = getMean(tree['left'])
    return (tree['left']+tree['right'])/2.0

def prune(tree, testData):
    if shape(testData)[0] == 0: return getMean(tree) #if we have no test data collapse the tree
    if (isTree(tree['right']) or isTree(tree['left'])):#if the branches are not trees try to prune them
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
    if isTree(tree['left']): tree['left'] = prune(tree['left'], lSet)
    if isTree(tree['right']): tree['right'] = prune(tree['right'], rSet)
    #if they are now both leafs, see if we can merge them
    if not isTree(tree['left']) and not isTree(tree['right']):
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        errorNoMerge = sum(power(lSet[:,-1] - tree['left'],2)) + sum(power(rSet[:,-1] - tree['right'],2))
        treeMean = (tree['left']+tree['right'])/2.0
        errorMerge = sum(power(testData[:,-1] - treeMean,2))
        if errorMerge < errorNoMerge:
            print("merging")
            return treeMean
        else: return tree
    else: return tree

def regTreeEval(model, inDat):
    return float(model)

def modelTreeEval(model, inDat):
    n = shape(inDat)[1]
    X = mat(ones((1,n+1)))
    X[:,1:n+1]=inDat
    return float(X*model)

def treeForeCast(tree, inData, modelEval=regTreeEval):
    if not isTree(tree): return modelEval(tree, inData)
    if inData[tree['spInd']] > tree['spVal']:
        if isTree(tree['left']): return treeForeCast(tree['left'], inData, modelEval)
        else: return modelEval(tree['left'], inData)
    else:
        if isTree(tree['right']): return treeForeCast(tree['right'], inData, modelEval)
        else: return modelEval(tree['right'], inData)

def createForeCast(tree, testData, modelEval=regTreeEval):
    m=len(testData)
    yHat = mat(zeros((m,1)))
    for i in range(m):
        yHat[i,0] = treeForeCast(tree, mat(testData[i]), modelEval)
    return yHat
treeExplore.py
from numpy import *
from tkinter import *
import regTrees
import matplotlib
matplotlib.use('TkAgg')
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
from matplotlib.figure import Figure

def reDraw(tolS,tolN):
    reDraw.f.clf()        # clear the figure
    reDraw.a = reDraw.f.add_subplot(111)
    if chkBtnVar.get():
        if tolN < 2: tolN = 2
        myTree=regTrees.createTree(reDraw.rawDat, regTrees.modelLeaf, regTrees.modelErr, (tolS,tolN))
        yHat = regTrees.createForeCast(myTree, reDraw.testDat, regTrees.modelTreeEval)
    else:
        myTree=regTrees.createTree(reDraw.rawDat, ops=(tolS,tolN))
        yHat = regTrees.createForeCast(myTree, reDraw.testDat)
    reDraw.a.scatter(reDraw.rawDat[:,0].tolist(), reDraw.rawDat[:,1].tolist(), s=5) #use scatter for data set
    reDraw.a.plot(reDraw.testDat, yHat, linewidth=2.0) #use plot for yHat
    reDraw.canvas.draw()  # was canvas.show(), which newer Matplotlib releases no longer provide

def getInputs():
    try: tolN = int(tolNentry.get())
    except:
        tolN = 10
        print("enter Integer for tolN")
        tolNentry.delete(0, END)
        tolNentry.insert(0,'10')
    try: tolS = float(tolSentry.get())
    except:
        tolS = 1.0
        print("enter Float for tolS")
        tolSentry.delete(0, END)
        tolSentry.insert(0,'1.0')
    return tolN,tolS

def drawNewTree():
    tolN,tolS = getInputs()#get values from Entry boxes
    reDraw(tolS,tolN)

root=Tk()

reDraw.f = Figure(figsize=(5,4), dpi=100) #create canvas
reDraw.canvas = FigureCanvasTkAgg(reDraw.f, master=root)
reDraw.canvas.draw()  # was canvas.show()
reDraw.canvas.get_tk_widget().grid(row=0, columnspan=3)

Label(root, text="tolN").grid(row=1, column=0)
tolNentry = Entry(root)
tolNentry.grid(row=1, column=1)
tolNentry.insert(0,'10')
Label(root, text="tolS").grid(row=2, column=0)
tolSentry = Entry(root)
tolSentry.grid(row=2, column=1)
tolSentry.insert(0,'1.0')
Button(root, text="ReDraw", command=drawNewTree).grid(row=1, column=2, rowspan=3)
chkBtnVar = IntVar()
chkBtn = Checkbutton(root, text="Model Tree", variable = chkBtnVar)
chkBtn.grid(row=3, column=0, columnspan=2)

reDraw.rawDat = mat(regTrees.loadDataSet('sine.txt'))
reDraw.testDat = arange(min(reDraw.rawDat[:,0]), max(reDraw.rawDat[:,0]), 0.01)
reDraw(1.0, 10)

root.mainloop()