  • Python大战机器学习: Model Evaluation, Selection, and Validation

    1. Loss Functions and Risk Functions

    (1) Loss function: common choices are the 0-1 loss, the absolute loss, the squared loss, and the log loss.

    (2) Risk function: the expectation of the loss function. Empirical risk: the average loss of the model over the dataset T.

      By the law of large numbers, as N approaches infinity the empirical risk converges to the risk function.
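
      As a minimal sketch (the toy arrays are mine, not from the original), the empirical risk under the 0-1 and log losses can be computed with scikit-learn's loss helpers, which average over the N samples by default:

from sklearn.metrics import zero_one_loss, log_loss

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]                    # hard class predictions
y_prob = [0.9, 0.8, 0.4, 0.7, 0.2, 0.3, 0.6, 0.1]    # predicted P(y=1) per sample

# Empirical risk = average loss over the samples.
print("0-1 empirical risk:", zero_one_loss(y_true, y_pred))
print("log-loss empirical risk:", log_loss(y_true, y_prob))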

    2. Errors and Overfitting

    (1) Training error and test error

      Training error: the average loss of the model over the training set.

      Test error: the average loss of the model over the test set; it reflects the learning method's ability to predict on unseen test data.

    (2) Generalization error: the learned model's predictive ability on unknown data; the smaller it is, the more effective the model. It is defined as the expected risk of the learned model.

    (3) Overfitting: predicting well on known data but poorly on unknown data. It arises when idiosyncrasies of the training samples are treated as general properties shared by all potential samples, which degrades generalization. The usual remedy is regularization, which implements the strategy of structural risk minimization.
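
      To see regularization at work, here is a small sketch (the setup is my own): LogisticRegression on the digits data with L2 penalties of varying strength. In scikit-learn a smaller C means stronger regularization; a large train-test gap signals overfitting:

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Smaller C = stronger L2 penalty = smaller structural risk.
for C in (100.0, 1.0, 0.01):
    clf = LogisticRegression(C=C, max_iter=5000).fit(x_train, y_train)
    print("C=%g  train=%.3f  test=%.3f"
          % (C, clf.score(x_train, y_train), clf.score(x_test, y_test)))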

    3. Model Evaluation Methods

    (1) Hold-out: split the data directly into three mutually exclusive parts; train models on the training set, select a model on the validation set, and use the test-set error as the estimate of the generalization error.

    (2) Cross-validation (S-fold): randomly partition the data into S disjoint subsets of equal size; train the model on S-1 subsets and test it on the remaining one. Repeat over all S combinations and average the test errors.

    (3) Leave-one-out: hold out a single example as the test set. Its drawback is the heavy computation when the dataset is large.

    (4) Bootstrap: randomly draw one sample from T, put it into the sampled set TS, and then return it to T. After N such draws, TS contains N samples (some repeated). Use TS as the training set and T - TS as the test set.
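
      The demo code at the end does not cover the bootstrap, so here is a minimal numpy sketch of the procedure (variable names are mine); the never-drawn samples, about 36.8% of T for large N, form the test set:

import numpy as np

rng = np.random.RandomState(0)
N = 10
T = np.arange(N)                     # stand-in for the dataset's sample indices

ts_idx = rng.randint(0, N, size=N)   # N draws with replacement -> training set TS
oob_idx = np.setdiff1d(T, ts_idx)    # T - TS: out-of-bag samples, used for testing

print("train (TS):", ts_idx)
print("test (T - TS):", oob_idx)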

    4. Performance Metrics

    (1) Test accuracy and test error rate

    (2) Confusion matrix

      Precision: P = TP / (TP + FP), i.e., among all instances predicted positive, the fraction that are truly positive.

      Recall: R = TP / (TP + FN), i.e., among the truly positive instances, the fraction that the classifier finds.

      The right emphasis differs by problem. A recommender system leans on precision (the fraction of recommended items the user is actually interested in); a medical diagnosis system leans on recall (the fraction of actual disease cases that get detected).

      F1 is the harmonic mean of the two: 2/F1 = 1/P + 1/R, i.e., F1 = 2PR / (P + R).
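
      A quick check with scikit-learn's metric functions (the toy labels are mine) confirms the harmonic-mean relation:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]   # TP=3, FP=1, FN=2

P = precision_score(y_true, y_pred)        # 3/4
R = recall_score(y_true, y_pred)           # 3/5
# f1_score matches 2PR/(P+R)
print(P, R, f1_score(y_true, y_pred), 2 * P * R / (P + R))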

    5. ROC Curve

      True positive rate: TPR = TP / (TP + FN)

      False positive rate: FPR = FP / (TN + FP), the fraction of all negative instances that the classifier mistakes for positive.

      Plotting TPR on the vertical axis against FPR on the horizontal axis gives the ROC curve. The diagonal corresponds to a random-guessing model, and the point (0, 1) to the ideal model; the closer the curve comes to (0, 1), the better.
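
      A minimal sketch of drawing the ROC curve from continuous classifier scores (the scores below are illustrative):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.4, 0.9]   # scores for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))

plt.plot(fpr, tpr, label="ROC")
plt.plot([0, 1], [0, 1], "--", label="random guess")  # the diagonal
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend(loc="best")
plt.show()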

    6. Bias-Variance Decomposition

      The expected generalization error can be decomposed into bias² (error from wrong model assumptions), variance (sensitivity to the particular training set), and irreducible noise; model complexity trades the first against the second.
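
      As a hedged illustration (the setup is entirely my own), a Monte-Carlo sketch estimating bias² and variance of polynomial fits at a single test point; the low-degree fit shows higher bias, the high-degree fit higher variance:

import numpy as np

rng = np.random.RandomState(0)
f = np.sin                                   # the true function
x0, n_train, n_rounds, noise = 1.0, 20, 500, 0.3

for degree in (1, 5):
    preds = []
    for _ in range(n_rounds):                # many independent training sets
        x = rng.uniform(0, np.pi, n_train)
        y = f(x) + rng.normal(0, noise, n_train)
        coef = np.polyfit(x, y, degree)      # least-squares polynomial fit
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    print("degree=%d  bias^2=%.4f  variance=%.4f" % (degree, bias2, var))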

    The full demo code for the sections above follows:

from sklearn.metrics import zero_one_loss, log_loss
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, LeaveOneOut, cross_val_score
from sklearn.datasets import load_digits, load_iris
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve, roc_auc_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, classification_report
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import validation_curve, learning_curve, GridSearchCV, RandomizedSearchCV
import matplotlib.pyplot as plt
from sklearn.preprocessing import label_binarize
from sklearn.linear_model import LogisticRegression
import numpy as np

# zero_one_loss: normalize=True gives the misclassification fraction, False the count
# y_true=[1,1,1,1,1,0,0,0,0,0]
# y_pred=[0,0,0,1,1,1,1,1,0,0]
# print("zero_one_loss<fraction>:",zero_one_loss(y_true,y_pred,normalize=True))
# print("zero_one_loss<num>:",zero_one_loss(y_true,y_pred,normalize=False))

# log_loss: y_pred holds per-class probabilities for each sample
# y_true=[1,1,1,0,0,0]
# y_pred=[[0.1,0.9],
#         [0.2,0.8],
#         [0.3,0.7],
#         [0.7,0.3],
#         [0.8,0.2],
#         [0.9,0.1]
#         ]
# print("log_loss<average>:",log_loss(y_true,y_pred,normalize=True))
# print("log_loss<total>:",log_loss(y_true,y_pred,normalize=False))

# train_test_split: hold-out split; stratify=Y preserves the class proportions
# X=[
#     [1,2,3,4],
#     [11,12,13,14],
#     [21,22,23,24],
#     [31,32,33,34],
#     [41,42,43,44],
#     [51,52,53,54],
#     [61,62,63,64],
#     [71,72,73,74]
# ]
# Y=[1,1,0,0,1,1,0,0]
# X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.4,random_state=0)
# print("X_train=",X_train)
# print("X_test=",X_test)
# print("Y_train=",Y_train)
# print("Y_test=",Y_test)
# X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.4,random_state=0,stratify=Y)
# print("X_train=",X_train)
# print("X_test=",X_test)
# print("Y_train=",Y_train)
# print("Y_test=",Y_test)

# KFold
# X=np.array([
#     [1,2,3,4],
#     [11,12,13,14],
#     [21,22,23,24],
#     [31,32,33,34],
#     [41,42,43,44],
#     [51,52,53,54],
#     [61,62,63,64],
#     [71,72,73,74],
#     [81,82,83,84]
# ])
# Y=np.array([1,1,0,0,1,1,0,0,1])
#
# random_state only takes effect when shuffle=True, so it is omitted here
# folder=KFold(n_splits=3,shuffle=False)
# for train_index,test_index in folder.split(X,Y):
#     print("Train Index:",train_index)
#     print("Test Index:",test_index)
#     print("X_train:",X[train_index])
#     print("X_test:",X[test_index])
#     print("")
#
# shuffle_folder=KFold(n_splits=3,random_state=0,shuffle=True)
# for train_index,test_index in shuffle_folder.split(X,Y):
#     print("Train Index:",train_index)
#     print("Test Index:",test_index)
#     print("X_train:",X[train_index])
#     print("X_test:",X[test_index])
#     print("")

# StratifiedKFold: every fold preserves the class proportions of Y
# stratified_folder=StratifiedKFold(n_splits=4,shuffle=False)
# usage is the same as KFold above, omitted

# LeaveOneOut: one sample per test fold; the current API takes no arguments
# loo=LeaveOneOut()

# cross_val_score
# digits=load_digits()
# X=digits.data
# Y=digits.target
#
# result=cross_val_score(LinearSVC(),X,Y,cv=10)
# print("Cross Val Score is:",result)

# accuracy_score, omitted (normalize=False returns the raw count of correct predictions)
# accuracy_score(y_true,y_pred,normalize=True)

# precision_score, omitted
# precision_score(y_true,y_pred)

# recall_score, omitted
# recall_score(y_true,y_pred)

# f1_score, omitted
# f1_score(y_true,y_pred)

# fbeta_score, omitted (beta>1 favors recall, beta<1 favors precision)
# fbeta_score(y_true,y_pred,beta=num_beta)

# classification_report
# y_true=[1,1,1,1,1,0,0,0,0,0]
# y_pred=[0,0,1,1,0,0,0,0,0,0]
# print("Classification Report:\n",classification_report(y_true,y_pred,target_names=["class_0","class_1"]))

# confusion_matrix, omitted
# confusion_matrix(y_true,y_pred,labels=[0,1])

# precision_recall_curve
# iris=load_iris()
# X=iris.data
# Y=iris.target
# # print(X,'\n',Y)
# Y=label_binarize(Y,classes=[0,1,2])   # one-hot targets for one-vs-rest
# n_classes=Y.shape[1]
# # print(n_classes,'\n',Y)
# np.random.seed(0)
# n_samples,n_features=X.shape
# # print(n_samples,'\n',n_features)
# X=np.c_[X,np.random.randn(n_samples,200*n_features)]   # pad with noise features to make the task harder
# # n_samples,n_features=X.shape
# # print(n_samples,'\n',n_features)
# x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.5,random_state=0)
# clf=OneVsRestClassifier(SVC(kernel='linear',probability=True,random_state=0))
# y_score=clf.fit(x_train,y_train).decision_function(x_test)   # a single fit is enough
# # print(y_score)
# fig=plt.figure()
# ax=fig.add_subplot(1,1,1)
# precision=dict()
# recall=dict()
# for i in range(n_classes):
#     precision[i],recall[i],_=precision_recall_curve(y_test[:,i],y_score[:,i])
#     ax.plot(recall[i],precision[i],label="target=%s"%i)
# ax.set_xlabel("Recall Score")
# ax.set_ylabel("Precision Score")
# ax.set_title("P-R")
# ax.legend(loc="best")
# ax.set_xlim(0,1.1)
# ax.set_ylim(0,1.1)
# ax.grid()
# plt.show()

# roc_curve, roc_auc_score, omitted
# roc_curve(y_test,y_score)

# mean_absolute_error, omitted
# mean_absolute_error(y_true,y_pred)

# mean_squared_error, omitted
# mean_squared_error(y_true,y_pred)

# validation_curve, omitted
# validation_curve(LinearSVC(),X,Y,param_name="C",param_range=np.logspace(-2,2),cv=10,scoring="accuracy")

# learning_curve, omitted
# train_size=np.linspace(0.1,1.0,endpoint=True,dtype='float')
# learning_curve(LinearSVC(),X,Y,cv=10,scoring="accuracy",train_sizes=train_size)

# GridSearchCV
# digits=load_digits()
# x_train,x_test,y_train,y_test=train_test_split(digits.data,digits.target,test_size=0.25,random_state=0,stratify=digits.target)
# tuned_parameters=[{'penalty':['l1','l2'],'C':[0.01,0.05,0.1,0.5,1,5,10,50,100],'solver':['liblinear'],'multi_class':['ovr']},
#                   {'penalty':['l2'],'C':[0.01,0.05,0.1,0.5,1,5,10,50,100],'solver':['lbfgs'],'multi_class':['ovr','multinomial']},
#                   ]
# clf=GridSearchCV(LogisticRegression(tol=1e-6),tuned_parameters,cv=10)
# clf.fit(x_train,y_train)
# print("Best parameters set found:",clf.best_params_)
# print("Grid scores:")
# # cv_results_ replaces the older grid_scores_ attribute
# results=clf.cv_results_
# for mean_score,std_score,params in zip(results["mean_test_score"],results["std_test_score"],results["params"]):
#     print("\t%0.3f(+/-%0.03f) for %s"%(mean_score,std_score*2,params))
#
# print("Optimized Score:",clf.score(x_test,y_test))
# print("Detailed classification report:")
# y_true,y_pred=y_test,clf.predict(x_test)
# print(classification_report(y_true,y_pred))

# RandomizedSearchCV: samples n_iter candidate settings instead of the full grid
# RandomizedSearchCV(LogisticRegression(penalty='l2',solver='lbfgs',tol=1e-6),tuned_parameters,cv=10,scoring='accuracy',n_iter=100)