Handling Imbalanced Positive and Negative Samples

Reposted from: https://blog.csdn.net/qq_31813549/article/details/79964973

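Over-sampling

1. The simplest approach is to generate more minority-class samples. The most basic way to do this is to draw random samples, with replacement, from the existing minority-class samples and add them as new samples: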

from sklearn.datasets import make_classification
from collections import Counter
import numpy as np

X,y = make_classification(n_samples=5000,
                          n_features=2,
                          n_informative=2,
                          n_redundant=0, 
                          n_repeated=0,
                          n_classes=3,
                          n_clusters_per_class=1,
                          weights=[0.01,0.05,0.94],
                          class_sep=0.8,
                          random_state=0)

print(Counter(y))
# =============================================================================
# Counter({2: 4674, 1: 262, 0: 64})
# =============================================================================

from imblearn.over_sampling import RandomOverSampler

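# imbalanced-learn >= 0.4 API: sampling_strategy replaces ratio, fit_resample replaces fit_sample
ros = RandomOverSampler(sampling_strategy="auto", random_state=0)
X_sample, y_sample = ros.fit_resample(X, y)
print(Counter(y_sample))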
# =============================================================================
# Counter({2: 4674, 1: 4674, 0: 4674})
# =============================================================================
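
To make the idea concrete, random over-sampling can be sketched by hand with the standard library alone. This is only an illustration of the principle, not the imbalanced-learn implementation; the function name random_over_sample and the toy data are assumptions made for this example.

```python
import random
from collections import Counter

def random_over_sample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class is as large as the biggest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Indices of the original samples that belong to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Draw with replacement to fill the gap up to the majority count.
        for _ in range(target - n):
            j = rng.choice(idx)
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out

# Hypothetical toy data: class sizes 2, 3 and 1.
X_toy = [[0.0, 0.1], [0.2, 0.1], [1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [5.0, 5.0]]
y_toy = [0, 0, 1, 1, 1, 2]
X_res, y_res = random_over_sample(X_toy, y_toy)
print(sorted(Counter(y_res).items()))
# [(0, 3), (1, 3), (2, 3)]
```

Every added row is an exact copy of an existing one, which is why this method adds no new information and can encourage over-fitting on the duplicated points.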

2. Besides random over-sampling, there are two popular methods for over-sampling the minority class:

(i) Synthetic Minority Oversampling Technique (SMOTE)

(ii) Adaptive Synthetic (ADASYN)

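SMOTE:

For a minority-class sample a, randomly pick one of its nearest neighbors b, then pick a random point c on the line segment between a and b and use it as a new minority-class sample.

ADASYN:

Focuses on generating new minority-class samples next to the original samples that a K-nearest-neighbors classifier misclassifies.

SMOTE and ADASYN synthesize new minority-class samples with the same procedure: for a minority-class sample a, choose a sample b among its nearest neighbors, then generate a new minority-class sample at a random position on the segment joining the two points. They differ only in how the minority-class samples a are chosen.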
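
The shared interpolation step can be sketched in a few lines. This is a minimal standard-library illustration, not the imbalanced-learn code; the helper name smote_point, its parameters, and the toy points are assumptions introduced here.

```python
import math
import random

def smote_point(a, minority, k=3, rng=None):
    """Return one synthetic point on the segment between a and one of its k nearest minority neighbors."""
    rng = rng or random.Random(0)
    # Candidate neighbors: every minority sample except a itself.
    others = [p for p in minority if p is not a]
    neighbors = sorted(others, key=lambda p: math.dist(a, p))[:k]
    b = rng.choice(neighbors)
    lam = rng.random()  # interpolation factor in [0, 1)
    # c = a + lam * (b - a), computed coordinate by coordinate.
    return [ai + lam * (bi - ai) for ai, bi in zip(a, b)]

# Hypothetical toy minority class: the four corners of a unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c = smote_point(minority[0], minority, k=2, rng=random.Random(42))
print(c)
```

With k=2 the two nearest neighbors of the origin are (1, 0) and (0, 1), so the synthetic point always lies on one of the two edges leaving the origin.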

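SMOTE variants (selected with the kind parameter):

kind='regular': minority-class samples are picked at random.

kind='borderline1' and kind='borderline2':

Each minority-class sample a is assigned to one of three categories:

Noise: all of its nearest neighbors belong to a class other than that of a. In danger: at least half (but not all) of its nearest neighbors belong to a class other than that of a. Safe: all of its nearest neighbors belong to the same class as a.

Both borderline variants use only the in-danger samples to generate new data:

With borderline1 SMOTE, the randomly chosen neighbor b belongs to the same class as a; with borderline2 SMOTE, b may belong to any class.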
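
The three categories can be computed from the labels of a sample's k nearest neighbors. The sketch below assumes the thresholds of the Borderline-SMOTE formulation: all neighbors from another class means noise, at least half means in danger, and anything else is treated as safe (a simplification). The function name borderline_category and the 1-D toy data are hypothetical.

```python
import math

def borderline_category(i, X, y, k=3):
    """Label sample i as 'noise', 'danger' or 'safe' from its k nearest neighbors."""
    # Neighbors are searched among the samples of every class, excluding i itself.
    order = sorted((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    # Number of neighbors whose class differs from that of sample i.
    other = sum(1 for j in order[:k] if y[j] != y[i])
    if other == k:
        return "noise"
    if other >= k / 2:
        return "danger"
    return "safe"

# Hypothetical 1-D data: class 0 is the majority, class 1 the minority.
majority = [[0.0], [0.5], [1.8], [1.9], [2.1], [2.2], [3.5], [4.0], [9.8], [10.3]]
minority = [[2.0], [10.0], [10.1], [20.0], [20.1], [20.2], [20.3]]
X = majority + minority
y = [0] * len(majority) + [1] * len(minority)

# Indices 10, 11 and 13 are the minority points at 2.0, 10.0 and 20.0.
print([borderline_category(i, X, y) for i in (10, 11, 13)])
```

Only the samples labeled 'danger' would then be passed to the interpolation step.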

from sklearn.datasets import make_classification
from collections import Counter
import numpy as np

X,y = make_classification(n_samples=5000,
                          n_features=2,
                          n_informative=2,
                          n_redundant=0, 
                          n_repeated=0,
                          n_classes=3,
                          n_clusters_per_class=1,
                          weights=[0.01,0.05,0.94],
                          class_sep=0.8,
                          random_state=0)

print(Counter(y))
# =============================================================================
# Counter({2: 4674, 1: 262, 0: 64})
# =============================================================================

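from imblearn.over_sampling import BorderlineSMOTE

# imbalanced-learn >= 0.4: the borderline variants live in BorderlineSMOTE
# (older releases used SMOTE(kind='borderline1').fit_sample)
X_sample,y_sample = BorderlineSMOTE(kind='borderline-1').fit_resample(X,y)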
print(sorted(Counter(y_sample).items()))

# =============================================================================
# [(0, 4674), (1, 4674), (2, 4674)]
# 
# =============================================================================

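kind="svm" (the separate SVMSMOTE class in imbalanced-learn 0.4 and later): fits a support vector machine classifier to find the support vectors, then generates new minority-class samples from them.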

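Under-sampling

1. Prototype generation

Given a dataset S, a prototype generation algorithm produces a set S' with |S'| < |S| whose elements are not taken from the original dataset. In other words, prototype generation reduces the number of samples, and the remaining samples are generated from the original dataset rather than drawn from it directly.

The ClusterCentroids class implements this: the samples of each class are replaced by the centroids of a K-Means clustering instead of being randomly picked from the original samples. ClusterCentroids is an efficient way to reduce the number of samples, but it works best when the original data actually groups into clusters. The number of centroids should also be chosen so that the under-sampled clusters represent the original data well.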
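
The following sketch shows why the retained points are synthetic: a plain Lloyd's K-Means iteration replaces a class by its cluster centroids. It is a standard-library illustration under simplifying assumptions (fixed iteration count, random initialization), not the scikit-learn K-Means that ClusterCentroids relies on; kmeans_centroids and the toy points are hypothetical.

```python
import math
import random

def kmeans_centroids(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids, which are usually not original samples."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Hypothetical majority class made of two well-separated blobs of four points each.
majority = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]]
# Reduce the eight majority samples to two prototypes.
centroids = kmeans_centroids(majority, k=2)
print(sorted(centroids))
```

On this toy data the two prototypes end up at the blob centers, neither of which is one of the eight original points.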

from sklearn.datasets import make_classification
from collections import Counter
import numpy as np

X,y = make_classification(n_samples=5000,
                          n_features=2,
                          n_informative=2,
                          n_redundant=0, 
                          n_repeated=0,
                          n_classes=3,
                          n_clusters_per_class=1,
                          weights=[0.01,0.05,0.94],
                          class_sep=0.8,
                          random_state=0)

print(Counter(y))
# =============================================================================
# Counter({2: 4674, 1: 262, 0: 64})
# =============================================================================

from imblearn.under_sampling import ClusterCentroids

# fit_resample is the imbalanced-learn >= 0.4 name for fit_sample
X_sample,y_sample = ClusterCentroids().fit_resample(X,y)

print(sorted(Counter(y_sample).items()))

# =============================================================================
# [(0, 64), (1, 64), (2, 64)]
# =============================================================================

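2. Prototype selection

Prototype selection algorithms draw samples directly from the original dataset. They fall roughly into two groups:

(i) controlled under-sampling techniques

(ii) cleaning under-sampling techniques

With the first group, the user can specify how many samples the under-sampled subset should contain; the second group does not accept this kind of user control.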

2.1 Controlled under-sampling techniques

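RandomUnderSampler is a fast and very simple way to balance the classes: it randomly selects a subset of the data. Setting replacement=True turns this into bootstrap sampling (sampling with replacement).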
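
A hand-written version makes the effect of the replacement flag visible. This is a sketch under the assumption that every class is cut down to the size of the smallest one and that the smallest class is left untouched; random_under_sample and the toy data are hypothetical names, not part of imbalanced-learn.

```python
import random
from collections import Counter

def random_under_sample(X, y, replacement=False, seed=0):
    """Randomly keep as many samples per class as the smallest class contains."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in sorted(counts):
        idx = [i for i, lab in enumerate(y) if lab == label]
        if len(idx) == target:
            chosen = idx  # the smallest class is kept unchanged
        elif replacement:
            # Bootstrap: the same sample may be drawn more than once.
            chosen = [rng.choice(idx) for _ in range(target)]
        else:
            chosen = rng.sample(idx, target)
        keep.extend(chosen)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: class sizes 2, 3 and 5.
X_toy = [[float(i)] for i in range(10)]
y_toy = [0] * 2 + [1] * 3 + [2] * 5
X_res, y_res = random_under_sample(X_toy, y_toy)
print(sorted(Counter(y_res).items()))
# [(0, 2), (1, 2), (2, 2)]
```

Without replacement the selected rows are all distinct; with replacement=True a row can appear several times in the output.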

from sklearn.datasets import make_classification
from collections import Counter
import numpy as np

X,y = make_classification(n_samples=5000,
                          n_features=2,
                          n_informative=2,
                          n_redundant=0, 
                          n_repeated=0,
                          n_classes=3,
                          n_clusters_per_class=1,
                          weights=[0.01,0.05,0.94],
                          class_sep=0.8,
                          random_state=0)

print(Counter(y))
# =============================================================================
# Counter({2: 4674, 1: 262, 0: 64})
# =============================================================================

from imblearn.under_sampling import RandomUnderSampler

# fit_resample is the imbalanced-learn >= 0.4 name for fit_sample
X_sample,y_sample = RandomUnderSampler().fit_resample(X,y)

print(sorted(Counter(y_sample).items()))
# =============================================================================
# 
# [(0, 64), (1, 64), (2, 64)]
# =============================================================================

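NearMiss: selects samples using heuristic rules. Here "positive samples" are the samples of the class being under-sampled, and "negative samples" are the samples of the minority class.

version-1: selects the positive samples whose average distance to their N nearest negative samples is smallest;

version-2: selects the positive samples whose average distance to their N farthest negative samples is smallest;

version-3: a two-step algorithm. First, for each negative sample, its M nearest neighbors are kept; then, among those, the positive samples whose average distance to their N nearest neighbors is largest are selected.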
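
As an illustration of the first rule only, the sketch below ranks the samples of the class to be reduced by their mean distance to the closest minority samples. It is a simplified standard-library version, not the NearMiss class itself; near_miss_1, n_keep, n_neighbors and the toy data are assumptions made for this example.

```python
import math

def near_miss_1(majority, minority, n_keep, n_neighbors=3):
    """Keep the n_keep majority samples with the smallest mean distance to their n_neighbors closest minority samples."""
    def score(p):
        # Distances from p to every minority sample, closest first.
        dists = sorted(math.dist(p, q) for q in minority)
        closest = dists[:n_neighbors]
        return sum(closest) / len(closest)
    return sorted(majority, key=score)[:n_keep]

# Hypothetical 1-D data: the minority sits near 0, the majority spreads away from it.
minority = [[0.0], [0.2], [0.4]]
majority = [[1.0], [2.0], [3.0], [4.0], [5.0]]
kept = near_miss_1(majority, minority, n_keep=3)
print(kept)
```

The retained majority samples are the ones hugging the minority class, which is what makes NearMiss-1 sensitive to noise close to the class boundary.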

from sklearn.datasets import make_classification
from collections import Counter
import numpy as np

X,y = make_classification(n_samples=5000,
                          n_features=2,
                          n_informative=2,
                          n_redundant=0, 
                          n_repeated=0,
                          n_classes=3,
                          n_clusters_per_class=1,
                          weights=[0.01,0.05,0.94],
                          class_sep=0.8,
                          random_state=0)

print(Counter(y))
# =============================================================================
# Counter({2: 4674, 1: 262, 0: 64})
# =============================================================================

from imblearn.under_sampling import NearMiss

# NearMiss is deterministic; recent imbalanced-learn releases no longer accept random_state
X_sample,y_sample = NearMiss(version=1).fit_resample(X,y)

print(sorted(Counter(y_sample).items()))
# =============================================================================
# 
# [(0, 64), (1, 64), (2, 64)]
# =============================================================================

2.2 Cleaning under-sampling techniques

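Tomek's links: two samples x and y from different classes form a Tomek link if there is no other sample z such that d(x,z) < d(x,y) or d(y,z) < d(x,y), where d(.) is the distance between two samples. In other words, the two samples are each other's nearest neighbors. In that case, x or y is likely to be noise, or the pair lies close to the class boundary. The ratio parameter of the TomekLinks class (renamed sampling_strategy in imbalanced-learn 0.4) controls which samples of a link are removed. The default ratio='auto' removes only the majority-class sample; with ratio='all', both samples are removed.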
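
The definition translates directly into a mutual-nearest-neighbor check. The brute-force sketch below is meant for small data only and is not the TomekLinks implementation; the function name tomek_links and the 1-D toy data are hypothetical.

```python
import math

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    def nearest(i):
        # Index of the sample closest to sample i (excluding i itself).
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

# Hypothetical 1-D data: the points at 2.0 and 2.3 face each other across the class boundary.
X = [[0.0], [0.9], [2.0], [2.3], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
links = tomek_links(X, y)
print(links)
```

Removing the majority-side index of each returned pair corresponds to the default behaviour described above; removing both indices corresponds to the 'all' setting.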

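EditedNearestNeighbours: this method applies a nearest-neighbors algorithm to "edit" the dataset by removing the samples that do not agree enough with their neighborhood. For each sample in a class to be under-sampled, its nearest neighbors are computed and the sample is removed if it fails the selection criterion: a sample is kept only if the majority (kind_sel='mode') or all (kind_sel='all') of its nearest neighbors belong to the same class as the sample itself.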
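
The two selection criteria can be sketched as follows. This is a simplified brute-force illustration that cleans a single class passed in explicitly, which is an assumption of this example rather than how EditedNearestNeighbours chooses its target classes; the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def edited_nearest_neighbours(X, y, target_class, k=3, kind_sel="all"):
    """Drop samples of target_class whose k nearest neighbors disagree with their label."""
    keep = []
    for i in range(len(X)):
        if y[i] != target_class:
            keep.append(i)  # only the targeted class is cleaned
            continue
        order = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: math.dist(X[i], X[j]))
        labels = [y[j] for j in order[:k]]
        if kind_sel == "all":
            agrees = all(lab == y[i] for lab in labels)
        else:  # "mode": the most common neighbor label must match
            agrees = Counter(labels).most_common(1)[0][0] == y[i]
        if agrees:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical 1-D data: one class-0 point (5.05) sits inside the class-1 cluster.
X = [[0.0], [0.1], [0.2], [0.3], [0.4], [5.05], [5.0], [5.1], [5.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
X_res, y_res = edited_nearest_neighbours(X, y, target_class=0)
print(sorted(Counter(y_res).items()))
```

The 'all' criterion is the stricter of the two, so it generally removes more samples than 'mode'.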

To be continued...

Original article: https://www.cnblogs.com/wzdLY/p/9734335.html