The goal of cross-validation is to use as many data points as possible for training, to get the best learning result, while also having as much data as possible for testing, to get the best validation. The idea is to split the data evenly into k bins. In k-fold cross-validation you run k separate experiments: in each one, you pick one of the k bins as the test (validation) set, train your model on the remaining k-1 bins, and then evaluate it on the held-out bin. After running all k experiments you have k different test sets, and the overall performance is the average of the k test results. This way you have effectively used all of the data for training and all of the data for testing.
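The procedure above can be sketched with scikit-learn's KFold splitter. This is a minimal illustration, not the mini-project code; the synthetic dataset, k=5, and the decision tree are all assumptions made just for the demo:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data purely for illustration (not the Enron/POI data)
X, y = make_classification(n_samples=100, random_state=42)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    # One of the k bins is held out as the test set;
    # the other k-1 bins are used for training.
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    scores.append(accuracy_score(y[test_idx], pred))

# The reported performance is the average over the k experiments
mean_score = sum(scores) / k
print(mean_score)
```

In practice, `sklearn.model_selection.cross_val_score` wraps this whole loop in one call; the explicit loop is shown here only to mirror the k experiments described above.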
#!/usr/bin/python
""" Starter code for the validation mini-project.
    The first step toward building your POI identifier!
    Start by loading/formatting the data.
    After that, it's not our code anymore -- it's yours!
"""

import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from sklearn.metrics import accuracy_score
# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Open the pickle in binary mode (required on Python 3)
with open("../final_project/final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

### first element is our labels, any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)

# Hold out 30% of the data as the test set
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print(accuracy_score(labels_test, pred))

### it's all yours from here forward!