Dealing with Imbalanced Datasets

Motivation

Imbalanced datasets are common in practice, for example when detecting illegal users or screening for an illness. Without special handling, machine learning models tend to perform poorly on such data, especially when predicting the minority class. Suppose the data is highly imbalanced, say 9995 negative instances to 5 positive ones. A model that labels every instance as negative reaches an accuracy of 99.95%, yet the result is meaningless because not a single positive instance is found. On top of that, misclassifying the minority class is often the more severe error: imagine classifying a sick patient as healthy.
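The arithmetic in the 9995:5 example can be checked with a few lines of plain Python. This is only an illustrative sketch: the "model" is a hypothetical stand-in that predicts negative for everything, and recall on the positive class is computed next to accuracy to show what accuracy hides.

```python
# 9995 negatives (label 0) and 5 positives (label 1), as in the example above.
y_true = [0] * 9995 + [1] * 5
# A trivial stand-in "model" that predicts negative for every instance.
y_pred = [0] * len(y_true)

# Accuracy: fraction of instances whose prediction matches the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the positive class: positives found / all positives.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9995
print(f"recall   = {recall:.4f}")    # recall   = 0.0000
```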

Researchers have proposed two kinds of methods for this problem:
- Cost Sensitive Learning
  During training, different classes get different weights in the loss function, so the model focuses more on the minority class. sklearn provides class_weight and sample_weight for this. With class_weight you can specify the weight of each class explicitly, such as {0: 0.1, 1: 0.9}, or set it to "balanced", in which case the weights are computed as \(\frac{\#samples}{\#classes \times np.bincount(y)}\). With fit(sample_weight=...) you give every instance its own weight. When both are used, the loss of an instance becomes class_weight * sample_weight * loss.
- Sampling
  Sampling changes the original dataset itself rather than assigning different weights.
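The "balanced" weight formula and the class_weight * sample_weight * loss product described above can be sketched in plain Python. The helper names balanced_class_weights and weighted_loss are hypothetical, and this is not sklearn's implementation (estimators differ in how they aggregate or normalize the loss); it only illustrates the per-instance scaling.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

def weighted_loss(losses, y, class_weight, sample_weight=None):
    """Sum of class_weight * sample_weight * loss over all instances."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y)  # every instance counts equally
    return sum(class_weight[c] * s * l
               for l, c, s in zip(losses, y, sample_weight))

y = [0] * 9995 + [1] * 5
weights = balanced_class_weights(y)
print({c: round(w, 4) for c, w in weights.items()})  # {0: 0.5003, 1: 1000.0}

# With a loss of 1.0 per instance, both classes now contribute equally.
total = weighted_loss([1.0] * len(y), y, weights)
```

With these weights, the 5 positive instances carry the same total weight as the 9995 negative ones, which is the sense in which the weighting is "balanced".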
Sampling Methods

Over-sampling increases the number of minority-class instances.
- Random Over Sampling
  Sample from the minority class with replacement until the class ratio is 1:1. Because existing instances are simply duplicated, the model may overfit the minority class.
- Synthetic Minority Oversampling Technique (SMOTE)
\[ x_{new} = x_i + \lambda (x_{zi} - x_i) \]
  First find the k_neighbors nearest neighbors of \(x_i\) within the minority class, then select one of them, \(x_{zi}\), at random and generate the new point on the segment between the two, where \(\lambda\) is a random number in \([0, 1]\). There are some variants such as Borderline SMOTE, SVM SMOTE and KMeans SMOTE.
- Adaptive Synthetic (ADASYN)
  SMOTE generates new samples from randomly chosen minority points until the ratio is 1:1, whereas ADASYN decides automatically how many new points to generate for each \(x_i\): the more majority-class points there are around \(x_i\), the more points are generated for it.
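The SMOTE interpolation and the ADASYN allocation described above can be sketched from scratch as follows. This is a simplified illustration, not the imbalanced-learn implementation: the function names, the toy points, and the handling of the case where no minority point has a majority neighbor are all assumptions made for this sketch.

```python
import math
import random

def nearest_indices(i, points, k):
    """Indices of the k points closest to points[i] (Euclidean), excluding i."""
    others = (j for j in range(len(points)) if j != i)
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]

def smote(minority, k, n_new, rng):
    """Generate n_new points via x_new = x_i + lam * (x_zi - x_i)."""
    new_points = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))                 # random minority point x_i
        z = rng.choice(nearest_indices(i, minority, k))  # random minority neighbor x_zi
        lam = rng.random()                               # lam in [0, 1)
        x_i, x_zi = minority[i], minority[z]
        new_points.append(tuple(a + lam * (b - a) for a, b in zip(x_i, x_zi)))
    return new_points

def adasyn_counts(minority, majority, k, n_new):
    """Split n_new over minority points, proportional to the share of
    majority points among each point's k nearest neighbors (all classes)."""
    points = minority + majority
    n_min = len(minority)
    ratios = []
    for i in range(n_min):
        neigh = nearest_indices(i, points, k)
        ratios.append(sum(1 for j in neigh if j >= n_min) / k)
    total = sum(ratios)
    if total == 0:
        return [0] * n_min  # assumption of this sketch: nothing to generate
    return [round(n_new * r / total) for r in ratios]

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
majority = [(5.0, 6.0), (6.0, 5.0), (4.0, 5.0), (9.0, 9.0)]

synthetic = smote(minority, k=2, n_new=4, rng=random.Random(0))
counts = adasyn_counts(minority, majority, k=2, n_new=4)
print(counts)  # [0, 0, 0, 4]
```

In this toy data only (5.0, 5.0) is surrounded by majority points, so ADASYN assigns all four new points to it, while SMOTE picks its starting points uniformly at random.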
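Under-sampling reduces the number of majority-class instances.
- Random Under Sampling (RUS)
  Randomly discard majority-class instances until the classes are balanced. The discarded data is wasted, so useful information may be lost.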
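Random over-sampling and random under-sampling are simple enough to sketch with the standard library alone. The function names and the toy instance labels below are hypothetical, and the sketch assumes the majority class is at least as large as the minority class.

```python
import random

def random_over_sample(minority, majority, rng):
    """Duplicate minority instances (with replacement) until the ratio is 1:1."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return minority + extra, majority

def random_under_sample(minority, majority, rng):
    """Keep a random subset of the majority class (without replacement)."""
    return minority, rng.sample(majority, k=len(minority))

rng = random.Random(0)
minority = [f"pos{i}" for i in range(5)]
majority = [f"neg{i}" for i in range(95)]

over_min, over_maj = random_over_sample(minority, majority, rng)
under_min, under_maj = random_under_sample(minority, majority, rng)

print(len(over_min), len(over_maj))    # 95 95
print(len(under_min), len(under_maj))  # 5 5
```

The over-sampled minority class still contains only 5 distinct instances, which is where the overfitting risk comes from, and the under-sampled majority class has dropped 90 of its 95 instances, which is the data waste.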
Example

Original article: https://www.cnblogs.com/EIMadrigal/p/14738860.html
