Handling imbalanced training classes in machine learning classification

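What should you do when the classes in your sample data are imbalanced?

Suppose you have 10,000 examples for binary classification and 9,990 of them belong to the positive class 1. With no special handling, a model that predicts 1 for everything reaches 99.9% accuracy, which is clearly not the result you want.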
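
The arithmetic behind this example can be checked with a few lines of plain Python. This is only an illustrative sketch: the label lists are made-up toy data that mirror the 9,990 / 10 split, and the "model" is simply a constant prediction.

```python
# Hypothetical toy labels matching the example: 9,990 positives and 10 negatives.
y_true = [1] * 9990 + [0] * 10

# A "model" that ignores its input and always predicts the majority class.
y_pred = [1] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority (negative) class: share of true 0s predicted as 0.
neg_total = sum(t == 0 for t in y_true)
neg_found = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
neg_recall = neg_found / neg_total

print(f"accuracy = {accuracy:.3f}, minority recall = {neg_recall:.3f}")
# prints: accuracy = 0.999, minority recall = 0.000
```

The accuracy looks excellent even though not a single minority example is recognized.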

Here is what you can do when the samples are this imbalanced.

Data preparation

1. Undersampling
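```python
import pandas as pd

def down_sample(df):
    df1 = df[df['label'] == 1]   # positive examples
    df2 = df[df['label'] == 0]   # negative examples (assumed here to be the majority class)
    df3 = df2.sample(frac=0.25)  # keep a random 25% of the negatives
    return pd.concat([df1, df3], ignore_index=True)
```
For the class with many samples, draw fewer samples so that the classes become balanced.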
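
A fixed fraction such as 0.25 only balances the data for one particular class ratio. As a hedged variation, the sketch below downsamples every class to the size of the smallest one. The function name `undersample_to_balance`, the `seed` argument, and the toy DataFrame are assumptions made for illustration; they are not part of the original post.

```python
import pandas as pd

def undersample_to_balance(df, label_col="label", seed=0):
    """Downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    balanced = pd.concat(parts, ignore_index=True)
    # Shuffle so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)

# Hypothetical toy data: 990 positives and 10 negatives.
toy = pd.DataFrame({"x": range(1000), "label": [1] * 990 + [0] * 10})
balanced = undersample_to_balance(toy)
print(balanced["label"].value_counts())
```

The price of undersampling is that most majority-class rows are thrown away, so it fits best when the majority class is large enough to spare them.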
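2. Oversampling
```python
import pandas as pd

def up_sample(df):
    df1 = df[df['label'] == 1]  # positive examples (assumed here to be the minority class)
    df2 = df[df['label'] == 0]  # negative examples
    df3 = pd.concat([df1, df1, df1, df1, df1], ignore_index=True)  # repeat the positives 5 times
    return pd.concat([df2, df3], ignore_index=True)
```
For the class with few samples, repeat its samples.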
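
Repeating the minority class a fixed number of times balances the data only when the ratio happens to match. One possible generalization, sketched here under assumptions, draws extra minority rows with replacement until each class matches the largest one. The name `oversample_to_balance`, the `seed` argument, and the toy data are hypothetical.

```python
import pandas as pd

def oversample_to_balance(df, label_col="label", seed=0):
    """Top up every smaller class with random duplicates of its own rows."""
    n_max = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(group)  # keep all original rows
        extra = n_max - len(group)
        if extra > 0:
            # Draw the missing rows with replacement from the same class.
            parts.append(group.sample(n=extra, replace=True, random_state=seed))
    return pd.concat(parts, ignore_index=True)

# Hypothetical toy data: 10 positives and 90 negatives.
toy = pd.DataFrame({"x": range(100), "label": [1] * 10 + [0] * 90})
upsampled = oversample_to_balance(toy)
print(upsampled["label"].value_counts())
```

Because the added rows are exact duplicates, oversampling is normally applied to the training split only, after the train/test split, so that copies of a training row cannot end up in the evaluation data.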

Adjusting weights in the model

Many classification models have a parameter for setting class weights.

1. XGBoost: scale_pos_weight

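For binary 0/1 classification where the positive class is the rare one, for example 0:1 = 100:1, you can set scale_pos_weight=100. A typical value is the number of negative samples divided by the number of positive samples.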
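
The ratio can be computed from the training labels instead of being typed in by hand. The snippet below is a minimal sketch using NumPy only; the helper name `suggested_scale_pos_weight` and the label array are made up for illustration, and the XGBoost call appears only in a comment.

```python
import numpy as np

def suggested_scale_pos_weight(y):
    """Return count(negatives) / count(positives) for 0/1 labels."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples in y")
    return n_neg / n_pos

# Hypothetical training labels in which positives are rare (1,000 vs 10).
y_train = np.array([0] * 1000 + [1] * 10)
w = suggested_scale_pos_weight(y_train)
print(w)  # prints 100.0

# The value would then be passed to the model, for example:
#   xgboost.XGBClassifier(scale_pos_weight=w)
```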

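2. Random forest (RF): class_weight

Weights can be specified explicitly as a dict. For multi-output (including multilabel) problems, note the following from the scikit-learn documentation: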
- For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].
- The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
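
To make the "balanced" formula concrete, the sketch below evaluates n_samples / (n_classes * np.bincount(y)) on a made-up label array. The labels are hypothetical, and the scikit-learn calls are mentioned only in comments.

```python
import numpy as np

# Hypothetical labels: 90 samples of class 0 and 10 samples of class 1.
y = np.array([0] * 90 + [1] * 10)

n_samples = len(y)
n_classes = len(np.unique(y))

# Same expression as the "balanced" mode: rarer classes get larger weights.
weights = n_samples / (n_classes * np.bincount(y))
class_weight = {cls: float(w) for cls, w in enumerate(weights)}
print(class_weight)

# Either form can be given to the model, for example:
#   RandomForestClassifier(class_weight=class_weight)
#   RandomForestClassifier(class_weight="balanced")
```

With this 9:1 split the minority class gets weight 5.0 and the majority class about 0.56, so the total weight of each class is equal.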

Finding the best threshold on the predictions

Tune the threshold value to get the best result:
```python
Threshold = 0.45
for j in range(len(preds)):
    if preds[j] >= Threshold:
        preds[j] = 1
    else:
        preds[j] = 0
```
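
A value such as 0.45 is usually found by trying several thresholds and keeping the one that scores best. The sketch below scans a grid and picks the threshold with the highest F1 score. Using F1 as the criterion, the helper names `f1_at` and `best_threshold`, the candidate grid, and the validation arrays are all assumptions for illustration; the original post does not say how its threshold was chosen.

```python
import numpy as np

def f1_at(y_true, scores, threshold):
    """F1 score of the positive class when scores >= threshold count as 1."""
    pred = (scores >= threshold).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, scores, candidates=None):
    """Return (threshold, f1) for the candidate with the highest F1."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
    f1s = [f1_at(y_true, scores, t) for t in candidates]
    i = int(np.argmax(f1s))  # first candidate reaching the maximum
    return float(candidates[i]), f1s[i]

# Hypothetical validation labels and predicted probabilities.
y_val = np.array([0, 0, 0, 0, 1, 1])
p_val = np.array([0.12, 0.22, 0.31, 0.62, 0.57, 0.91])
t, f1 = best_threshold(y_val, p_val)
print(t, f1)
```

The search is best run on a validation set that is separate from the final test data, otherwise the reported score is tuned to the data it is measured on.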

Evaluation metrics:

Accuracy alone can be misleading on imbalanced data. Try the confusion matrix, precision, recall, and ROC AUC instead.
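
As a rough illustration of why these metrics are more informative, the sketch below computes the confusion-matrix counts, precision, and recall by hand with NumPy. The helper `binary_report` and the label arrays are hypothetical; in practice `sklearn.metrics` offers `confusion_matrix`, `precision_score`, `recall_score`, and `roc_auc_score` for the same purpose.

```python
import numpy as np

def binary_report(y_true, y_pred):
    """Confusion-matrix counts plus accuracy, precision and recall for class 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical data: 95 negatives, 5 positives; the model finds only 2 positives.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 94 + [1] + [1, 1, 0, 0, 0])
report = binary_report(y_true, y_pred)
print(report)
```

Here the accuracy is 0.96, yet the recall of 0.4 shows that most of the positive examples are missed.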

Original article: https://www.cnblogs.com/gaoss/p/9677466.html
