  • stratified k-fold

    If you have a skewed dataset for binary classification with 90% positive samples and only 10% negative samples, you don't want to use random k-fold cross-validation. Using simple k-fold cross-validation on a dataset like this can produce folds with very few (or even no) negative samples. In these cases, we prefer stratified k-fold cross-validation. Stratified k-fold cross-validation keeps the ratio of labels in each fold constant, so each fold will have the same 90% positive and 10% negative samples. Thus, whatever metric you choose for evaluation, it will give similar results across all folds.
    Stratified k-fold avoids the problem of a fold containing only positive or only negative samples by keeping the ratio of positive to negative samples the same in every fold.
    import pandas as pd
    from sklearn import model_selection
     
     
    if __name__ == "__main__":
      # Training data is in a csv file called train.csv
      df = pd.read_csv("train.csv")
      # we create a new column called kfold and fill it with -1
      df["kfold"] = -1
      # the next step is to randomize the rows of the data
      df = df.sample(frac=1).reset_index(drop=True)
      # fetch targets
      y = df.target.values
      # initiate the kfold class from model_selection module
      kf = model_selection.StratifiedKFold(n_splits=5)
      # fill the new kfold column
      for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
        df.loc[v_, 'kfold'] = f
      # save the new csv with kfold column
      df.to_csv("train_folds.csv", index=False)
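    As a quick sanity check, here is a minimal sketch (assuming the train_folds.csv file written above and a binary 0/1 target column) that prints the positive-class ratio of each fold; with stratification these ratios should be nearly identical:

    import pandas as pd


    if __name__ == "__main__":
        # load the file produced by the snippet above
        df = pd.read_csv("train_folds.csv")
        # for every fold, print its size and the fraction of positive samples;
        # stratified k-fold keeps this fraction roughly constant across folds
        for fold in sorted(df["kfold"].unique()):
            fold_df = df[df.kfold == fold]
            print(f"fold={fold}, size={len(fold_df)}, "
                  f"positive ratio={fold_df.target.mean():.3f}")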
     
     
  • Original article: https://www.cnblogs.com/songyuejie/p/14781202.html