训练集(train set) 验证集(validation set) 测试集(test set)

zoukankan html css js c++ java

训练集(train set) 验证集(validation set) 测试集(test set)
1：在NN训练中我们很常用的是训练集合以及测试集合，在训练集合上训练模型（我个人认为模型就是训练的方法以及对应的参数值，更偏重于参数值吧），训练好之后拿到测试集合上验证模型的泛华（就是该模型可以拿去实战的效果）的能力。

2：但是对于上述情况，举个例子，比如是在训练一个多层网络，我们用类似minFUNC的方法来训练，那么这个优化包会直接根据我们的输入直接迭代出来一个很好地结果了，此时模型就训练好了。但是如果运用SGD这些方法去训练的话，到底迭代多少次算好？有时候可能也不收敛，只是中间过程中的一个参数值是效果最好的，那我们如何知道这个参数值？

3：个人认为有了验证集，真的很适合来使用SGD来训练，在训练过程中，比如训练了一个epoch，那么来把训练好的参数用于验证集上，然后保存在验证集合上的精度，只要改精度满足一定条件，那么训练就可以终止。

4：关于训练集、验证集以及测试集合的选择，这个网上资料很多，不在这里说了。

补充一个伪代码：

for each epoch for each training data instance propagate error through the network adjust the weights calculate the accuracy over training data for each validation data instance calculate the accuracy over the validation data if the threshold validation accuracy is met exit training else continue training
这三个名词在机器学习领域的文章中极其常见，但很多人对他们的概念并不是特别清楚，尤其是后两个经常被人混用。Ripley, B.D（1996）在他的经典专著Pattern Recognition and Neural Networks中给出了这三个词的定义。
Training set: A set of examples used for learning, which is to fit the parameters [i.e., weights] of the classifier.
Validation set: A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network.
Test set: A set of examples used only to assess the performance [generalization] of a fully specified classifier.
显然，training set是用来训练模型或确定模型参数的，如ANN中权值等； validation set是用来做模型选择（model selection），即做模型的最终优化及确定的，如ANN的结构；而 test set则纯粹是为了测试已经训练好的模型的推广能力。当然，test set这并不能保证模型的正确性，他只是说相似的数据用此模型会得出相似的结果。但实际应用中，一般只将数据集分成两类，即training set 和test set，大多数文章并不涉及validation set。
Ripley还谈到了Why separate test and validation sets?
1. The error rate estimate of the final model on validation data will be biased (smaller than the true error rate) since the validation set is used to select the final model.
2. After assessing the final model with the test set, YOU MUST NOT tune the model any further.

一，训练样本和测试样本

训练样本的目的是数学模型的参数，经过训练之后，可以认为你的模型系统确立了下来。

一般训练样本和测试样本相互独立，使用不同的数据。

在有监督(supervise)的机器学习中，数据集常被分成2~3个，即：训练集(train set) 验证集(validation set) 测试集(test set)。

http://blog.sina.com.cn/s/blog_4d2f6cf201000cjx.html

显然，

training set 是用来训练模型或确定模型参数的，如ANN中权值等；

validation set 是用来做模型选择（model selection），即做模型的最终优化及确定的，如ANN的结构；

test set 则纯粹是为了测试已经训练好的模型的推广能力。当然，test set这并不能保证模型的正确性，他只是说相似的数据用此模型会得出相似的结果。

但实际应用中，一般只将数据集分成两类，即training set 和test set，大多数文章并不涉及validation set。
查看全文

相关阅读:
python json 访问与字符串截取
 python 12306 车次数据获取
 12306 城市代码切片技巧
 python 9*9 乘法表
 python 列表转为字典的两个小方法
 python 三种遍历列表里面序号和值的方法
 虚拟机中访问连接在物理机上的摄像机(使用桥接)
C++程序调用python3
Notepad++编写运行python程序
 查看进程被哪台电脑的哪个进程连接(netstat)

原文地址：https://www.cnblogs.com/timssd/p/6034672.html