
Decision Trees Revisited: Hands-On Decision Trees with sklearn

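The sklearn module ships a ready-made decision tree implementation, so there is no need to reinvent the wheel (I could not build one myself anyway; it looks fairly complicated).

My notes follow: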

sklearn.tree: parameters and usage tips
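Official docs: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features=None, random_state=None, min_density=None, compute_importances=None, max_leaf_nodes=None)
(This is the signature as quoted when these notes were taken; min_density and compute_importances no longer exist in later scikit-learn releases, so check the docs for your installed version.)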
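The more important parameters:
criterion: the measure used to pick the best split attribute. Two options are covered here: "gini" and "entropy".
max_depth: caps the maximum depth of the tree; very useful for preventing overfitting.
min_samples_leaf: the minimum number of samples a leaf node must contain; this helps a lot with the data fragmentation problem discussed in the previous post.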
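A minimal sketch of how these three parameters change the fitted tree, using the ten height/weight samples that appear later in this post. It assumes scikit-learn 0.21 or newer (for get_depth and get_n_leaves); random_state=0 and the variable names full and small are illustrative choices, not part of the original notes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# The ten samples from the hands-on part: [height, weight], 0 = thin, 1 = fat
X = np.array([[1.5, 50.0], [1.5, 60.0], [1.6, 40.0], [1.6, 60.0], [1.7, 60.0],
              [1.7, 80.0], [1.8, 60.0], [1.8, 90.0], [1.9, 70.0], [1.9, 80.0]])
y = np.array([0, 1] * 5)

# Unconstrained tree: grows until every leaf is pure
full = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)

# Constrained tree: at most 2 levels, and every leaf keeps at least 2 samples
small = DecisionTreeClassifier(criterion='entropy', max_depth=2,
                               min_samples_leaf=2, random_state=0).fit(X, y)

print('full :', full.get_depth(), 'levels,', full.get_n_leaves(), 'leaves')
print('small:', small.get_depth(), 'levels,', small.get_n_leaves(), 'leaves')

# Sample counts of the leaves of the constrained tree (leaves have no left child)
leaf_sizes = small.tree_.n_node_samples[small.tree_.children_left == -1]
print('leaf sizes of the small tree:', leaf_sizes)
```

The unconstrained tree fits the training data perfectly, which is exactly the behavior that tends to overfit; the constrained tree trades some training accuracy for a simpler structure.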
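Some important attributes and methods:
n_classes_: the number of classes in the tree.
classes_: all class labels known to the tree.
feature_importances_: the importance of each feature; the larger the value, the more important the feature.
fit(X, y, sample_mask=None, X_argsorted=None, check_input=True, sample_weight=None): trains the classifier on the dataset X and the labels y. One parameter worth noting is sample_weight: it has one entry per sample and carries that sample's weight.
get_params(deep=True): returns the parameters of the tree.
set_params(**params): changes the parameters of the tree.
predict(X): takes samples X and returns the tree's predictions. Several samples can be passed at once.
transform(X, threshold=None): returns only the more important features of X, which amounts to pruning the data. (Later scikit-learn releases dropped this method from the estimator; sklearn.feature_selection.SelectFromModel covers the same use.)
score(X, y, sample_weight=None): returns the score on the dataset X, y, i.e. the mean accuracy.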
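The attributes and methods listed above can be tried out on the ten samples used later in this post. This is only a sketch: the equal weights, the two new_samples rows, and random_state=0 are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# [height, weight] samples with string labels, as in the hands-on part
X = np.array([[1.5, 50.0], [1.5, 60.0], [1.6, 40.0], [1.6, 60.0], [1.7, 60.0],
              [1.7, 80.0], [1.8, 60.0], [1.8, 90.0], [1.9, 70.0], [1.9, 80.0]])
y = np.array(['thin', 'fat'] * 5)

clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
weights = np.ones(len(y))            # one weight per sample (all equal here)
clf.fit(X, y, sample_weight=weights)

print(clf.classes_)                  # class labels, sorted
print(clf.n_classes_)                # number of classes
print(clf.feature_importances_)      # one value per feature
print(clf.get_params()['criterion'])

# predict accepts several samples at once (hypothetical rows)
new_samples = np.array([[1.6, 45.0], [1.8, 95.0]])
predictions = clf.predict(new_samples)
print(predictions)

# score returns the mean accuracy on the given data
train_accuracy = clf.score(X, y)
print(train_accuracy)

# set_params only changes the configuration; call fit again to rebuild the tree
clf.set_params(max_depth=2)
clf.fit(X, y)
print(clf.tree_.max_depth)
```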
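Usage tips
1. When the data has many features, make sure there are enough samples to support the algorithm; otherwise it overfits very easily.
2. PCA is one way to reduce overfitting on high-dimensional data.
3. Start exploring with a small tree and print it out with the export function to take a look.
4. Make good use of max_depth: increase it gradually, test the model at each step, and pick the depth that works best.
5. Make good use of min_samples_split and min_samples_leaf to control the number of samples in the leaf nodes and prevent overfitting.
6. Balance the classes in the training data so that no single class dominates.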
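Tips 3 and 4 can be sketched roughly as follows. This assumes scikit-learn 0.21 or newer (for export_text); the feature names 'height' and 'weight', the depth range 1 to 4, and the 5-fold cross-validation are illustrative assumptions. With only ten samples the scores are very noisy, so treat this as a pattern rather than a result.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

# The ten samples from the hands-on part: [height, weight], 0 = thin, 1 = fat
X = np.array([[1.5, 50.0], [1.5, 60.0], [1.6, 40.0], [1.6, 60.0], [1.7, 60.0],
              [1.7, 80.0], [1.8, 60.0], [1.8, 90.0], [1.9, 70.0], [1.9, 80.0]])
y = np.array([0, 1] * 5)

# Tip 3: start with a small tree and print it
small = DecisionTreeClassifier(criterion='entropy', max_depth=1,
                               random_state=0).fit(X, y)
report = export_text(small, feature_names=['height', 'weight'])
print(report)

# Tip 4: raise max_depth step by step and compare cross-validated accuracy
scores = {}
for depth in range(1, 5):
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=depth,
                                 random_state=0)
    scores[depth] = cross_val_score(clf, X, y, cv=5).mean()
    print(depth, scores[depth])

# Depth with the highest mean score (ties resolve to the smallest depth)
best_depth = max(scores, key=scores.get)
print('best max_depth:', best_depth)
```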

Now for the hands-on part:

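# -*- coding: utf-8 -*-
from sklearn import tree
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
import numpy as np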

# Read the data
data=[]
labels=[]
# Parse the text file into lists: every column but the last is a feature, the last one is the label
with open(r'C:\Users\cchen\Desktop\sample.txt','r') as f:
    for line in f:
        linelist=line.split(' ')
        data.append([float(el) for el in linelist[:-1]])
        labels.append(linelist[-1].strip())
# print data
# [[1.5, 50.0], [1.5, 60.0], [1.6, 40.0], [1.6, 60.0], [1.7, 60.0], [1.7, 80.0], [1.8, 60.0], [1.8, 90.0], [1.9, 70.0], [1.9, 80.0]]
# print labels
x=np.array(data)
labels=np.array(labels)
# print labels
# ['thin' 'fat' 'thin' 'fat' 'thin' 'fat' 'thin' 'fat' 'thin' 'fat']
y=np.zeros(labels.shape)
# print y
# [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
# print labels=='fat'
# [False  True False  True False  True False  True False  True]
# This replacement trick is neat and worth learning: a boolean mask assigns values into the array. I would have written a loop.
y[labels=='fat']=1
# print y
# [ 0.  1.  0.  1.  0.  1.  0.  1.  0.  1.]
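# Split into training and test data, holding out 20% for testing. I feel plain slicing would do, but this looks a bit fancier (it also shuffles the samples first)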
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
# Train the decision tree using information entropy as the split criterion
clf=tree.DecisionTreeClassifier(criterion='entropy')
# print clf
# DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
#             max_features=None, max_leaf_nodes=None,
#             min_impurity_split=1e-07, min_samples_leaf=1,
#             min_samples_split=2, min_weight_fraction_leaf=0.0,
#             presort=False, random_state=None, splitter='best')
clf.fit(x_train,y_train)
# Write the decision tree to a file in Graphviz dot format
with open(r'C:\Users\cchen\Desktop\tree.dot','w') as f:
    tree.export_graphviz(clf,out_file=f)
# digraph Tree {
# node [shape=box] ;
# 0 [label="X[1] <= 70.0\nentropy = 0.9544\nsamples = 8\nvalue = [3, 5]"] ;
# 1 [label="X[0] <= 1.65\nentropy = 0.971\nsamples = 5\nvalue = [3, 2]"] ;
# 0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
# 2 [label="X[1] <= 55.0\nentropy = 0.9183\nsamples = 3\nvalue = [1, 2]"] ;
# 1 -> 2 ;
# 3 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
# 2 -> 3 ;
# 4 [label="entropy = 0.0\nsamples = 2\nvalue = [0, 2]"] ;
# 2 -> 4 ;
# 5 [label="entropy = 0.0\nsamples = 2\nvalue = [2, 0]"] ;
# 1 -> 5 ;
# 6 [label="entropy = 0.0\nsamples = 3\nvalue = [0, 3]"] ;
# 0 -> 6 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
# }
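# The importances reflect how much each feature influences the tree
# print clf.feature_importances_
# [ 0.3608012  0.6391988], so the second feature, weight (X[1]), has the larger influence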
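# Print the predictions on the training data
answer=clf.predict(x_train)
# print x_train
print(answer)
# [ 1.  0.  1.  0.  1.  0.  1.  0.]
print(y_train)
# [ 1.  0.  1.  0.  1.  0.  1.  0.]
print(np.mean(answer==y_train))
# 1.0, very accurate, but this is the data the tree was trained on
# Now let's look at the test data
answer=clf.predict(x_test)
print(answer)
# [ 0.  0.]
print(y_test)
# [ 0.  0.]
print(np.mean(answer==y_test))
# 1.0, also accurate, although two test samples are too few to say much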
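# The following is a comment from the tutorial; I did not run into this case myself
# Precision and recall
# Precision: of the samples predicted as a class, the fraction that truly belong to it
# Recall: of the samples that truly belong to a class, the fraction predicted as that class
# Predicted: array([ 0., 1., 0., 1., 0., 1., 0., 1., 0., 0.])
# Actual:    array([ 0., 1., 0., 1., 0., 1., 0., 1., 0., 1.])
# Precision for thin is 0.83: the classifier labeled 6 samples as thin and 5 of them are correct, so 5/6=0.83.
# Recall for thin is 1.00: the dataset has 5 thin samples and the classifier got all of them right (although it also labeled one fat sample as thin), so 5/5=1.
# Precision for fat is 1.00, by the same reasoning.
# Recall for fat is 0.80: the dataset has 5 fat samples and the classifier found only 4 of them (one fat sample was labeled thin), so 4/5=0.80.
# In this example the question is whether to make sure the people flagged as fat really are fat (precision), or to find as many of the fat people as possible (recall).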
precision,recall,thresholds=precision_recall_curve(y_train,clf.predict(x_train))
print(precision,recall,thresholds)
# [ 1.  1.] [ 1.  0.] [ 1.]
# classification_report expects predicted labels, so use predict rather than predict_proba
answer=clf.predict(x)
print(classification_report(y,answer,target_names=['thin','fat']))
#              precision    recall  f1-score   support
#
#        thin       1.00      1.00      1.00         5
#         fat       1.00      1.00      1.00         5
#
# avg / total       1.00      1.00      1.00        10
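The precision and recall figures quoted in the tutorial comment can be checked directly from the two arrays it gives. A small sketch, assuming 0 encodes thin and 1 encodes fat as in the code above; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Arrays quoted in the tutorial comment: 0 = thin, 1 = fat
y_pred = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 0])
y_true = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# pos_label selects which class is treated as the "positive" one
thin_precision = precision_score(y_true, y_pred, pos_label=0)  # 5 correct of 6 predicted thin
thin_recall = recall_score(y_true, y_pred, pos_label=0)        # 5 found of 5 actual thin
fat_precision = precision_score(y_true, y_pred, pos_label=1)   # 4 correct of 4 predicted fat
fat_recall = recall_score(y_true, y_pred, pos_label=1)         # 4 found of 5 actual fat

print('thin: precision %.2f, recall %.2f' % (thin_precision, thin_recall))
print('fat : precision %.2f, recall %.2f' % (fat_precision, fat_recall))
```

The single misclassified sample (a fat person labeled thin) lowers precision for thin and recall for fat at the same time, which is why the two classes show mirrored numbers.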


Original article: https://www.cnblogs.com/AlwaysT-Mac/p/6647192.html