Imbalanced Learning Methods: A Summary of Theory and Practice

Original post: http://blog.csdn.net/hero_fantao/article/details/35784773

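Imbalanced Learning Methods

Sample imbalance problems in machine learning fall roughly into two cases:

(1) The class ratio is skewed, but every class still has plenty of samples;

(2) One of the classes has very few samples.

The second case is not our focus here. When samples are scarce they cover only a small part of the sample space, and if the number of features is large, such data is of limited value for model training. For this case the only good remedy is to collect as many minority-class samples as possible so that they cover the sample space.

The discussion below concentrates on the first case.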

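I. Sampling Methods

1. Random oversampling:

When the classes are imbalanced, randomly resample the minority class until the classes are balanced. This method merely copies minority samples, and its drawback is that it overfits easily. In a decision tree, for instance, a leaf node may well end up containing nothing but copies of a single sample, so the model generalizes poorly: training accuracy may go up, but predictions on unseen test samples can be very bad.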
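To make the "simple copying" concrete, here is a minimal sketch of random oversampling for a binary problem. The function name `random_oversample` and the toy data are made up for illustration, and it assumes `X` and `y` are NumPy arrays.

```python
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_extra = len(majority_idx) - len(minority_idx)
    if n_extra <= 0:
        return X, y  # already balanced (or the "minority" is not smaller)
    # Sampling with replacement: every added row is an exact copy of an existing one.
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]

# Toy data: 8 majority rows, 2 minority rows.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_bal, y_bal = random_oversample(X, y)
```

After balancing, the minority class has 8 rows but still only 2 distinct points, which is exactly why a tree can end up with leaves full of copies of one sample.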



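2. Random undersampling:

When the classes are imbalanced, randomly undersample the majority class, i.e. keep only part of the majority samples, until the classes are balanced. The problem with undersampling is that discarding samples may lose important data in the sample space and reduce accuracy.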
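A matching sketch for random undersampling, under the same assumptions (binary labels, NumPy arrays, a hypothetical helper name `random_undersample`):

```python
import numpy as np

def random_undersample(X, y, minority_label=1, seed=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_keep = min(len(minority_idx), len(majority_idx))
    # Sampling without replacement: the majority rows not drawn here are simply discarded.
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]

# Toy data: 8 majority rows, 2 minority rows.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_small, y_small = random_undersample(X, y)
```

Here 6 of the 8 majority rows are thrown away, which illustrates the information loss described above.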

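3. Synthetic Sampling with Data Generation

Generate new samples that approximate the existing minority samples. For each minority sample, use KNN to find its K nearest samples, then generate new samples from the difference between the current sample and those neighbors.

This approach goes beyond plain duplication: by creating new minority samples it enriches the minority class's sample space and compensates for its sparse coverage. Its drawback is that it handles every minority sample identically, running the same KNN step for each one. Consider minority samples that are already clearly separated from the majority class: generating extra samples around them adds little value.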
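The idea can be sketched as a SMOTE-style interpolation. This is an illustration only: the function name `synthesize_minority` is made up, neighbors are found by brute force inside the minority class, and at least two minority samples are assumed.

```python
import numpy as np

def synthesize_minority(X_min, n_new, k=5, seed=0):
    """Create n_new points by interpolating between minority points and their neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances inside the minority class (brute force).
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Every minority point is equally likely to be chosen as the starting point.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return X_min[base] + lam * (X_min[picked] - X_min[base])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = synthesize_minority(X_min, n_new=10, k=2)
```

Because the starting points are drawn uniformly, easy and hard minority samples receive the same treatment, which is the weakness noted above.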

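4. Adaptive Synthetic Sampling

Adaptive Synthetic Sampling (ADASYN) is a refinement of the previous method: it tries to generate more synthetic samples for the minority samples that lie close to the majority class.

The method is as follows: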
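The description that originally followed is not available here, so the steps below are a reconstruction from the source code attached at the end of this post and from the ADASYN description in [1]. The symbol names are my own choice and should be read as assumptions. Here m_l and m_s are the majority and minority sample counts, Delta_i is the number of majority samples among the K nearest neighbors of minority sample x_i, and x_zi is a randomly chosen minority neighbor of x_i.

```latex
\begin{align*}
G &= (m_l - m_s)\,\beta, \qquad \beta \in [0, 1]
  && \text{total number of samples to generate} \\
r_i &= \frac{\Delta_i}{K}, \qquad i = 1, \dots, m_s
  && \text{share of majority neighbours around } x_i \\
\hat{r}_i &= \frac{r_i}{\sum_{j=1}^{m_s} r_j}
  && \text{normalised so that } \textstyle\sum_i \hat{r}_i = 1 \\
g_i &= \hat{r}_i \, G
  && \text{samples to generate around } x_i \\
s &= x_i + (x_{zi} - x_i)\,\lambda, \qquad \lambda \in [0, 1]
  && \text{one synthetic sample}
\end{align*}
```

Minority samples surrounded by many majority samples get a large r_i and therefore a large share g_i of the synthetic samples, while well-separated minority samples get few or none.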

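II. Cost-Sensitive Learning

The methods in Part I work on the data: make the samples as balanced as possible, then train the model. The other approach is to assign different costs to different misclassifications, for example a higher cost for misclassifying a minority sample. My personal impression is that this performs about the same as the resampling in Part I: you sacrifice one thing to gain another.

What I find works well is the following. When positive and negative samples are imbalanced, take a subset of the majority class together with all of the minority samples, keeping the two roughly balanced, and train a model. Repeat this to obtain several models, then let them vote to produce the final prediction. Following AdaBoost, each model can also be given a weight. In practice, plain voting already gives quite good results.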
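The voting scheme can be sketched as follows. This is only an illustration under stated assumptions: labels are 0 (majority) and 1 (minority), the function names are made up, and a nearest-centroid rule stands in for whatever base classifier you would actually train.

```python
import numpy as np

def fit_centroids(X, y):
    """Stand-in base learner: remember the mean of each class."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def predict_centroids(model, X):
    c0, c1 = model
    d0 = np.linalg.norm(X - c0, axis=1)
    d1 = np.linalg.norm(X - c1, axis=1)
    return (d1 < d0).astype(int)

def fit_balanced_ensemble(X, y, n_models=5, seed=0):
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_models):
        # All minority rows plus an equally sized random slice of the majority class.
        sub = rng.choice(majority_idx, size=len(minority_idx), replace=False)
        idx = np.concatenate([minority_idx, sub])
        models.append(fit_centroids(X[idx], y[idx]))
    return models

def predict_vote(models, X):
    votes = np.stack([predict_centroids(m, X) for m in models])
    # Unweighted majority vote; a tie goes to class 0.
    return (votes.mean(axis=0) > 0.5).astype(int)

# Toy data: 200 majority points around (0, 0), 20 minority points around (4, 4).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(4.0, 1.0, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
models = fit_balanced_ensemble(X, y)
pred = predict_vote(models, X)
```

Each model sees a balanced training set, yet across the ensemble most majority samples are used at least once. Replacing the plain mean in `predict_vote` with a weighted average would give the AdaBoost-like weighting mentioned above.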

References:

[1] He H, Garcia E A. Learning from imbalanced data[J]. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(9): 1263-1284.

[2] https://github.com/fmfn/UnbalancedDataset (a module shared by @phunter_lau, 2014/12/07)

Source code for Adaptive Synthetic Sampling is attached below:

'''
Created on 2014/03/09

@author: dylan
'''

from sklearn.neighbors import NearestNeighbors
import numpy as np
import random


# @param: y the 0/1 classlabels, so the majority label is 1 - minorityclasslabel
# @return: (majority count, minority count)
def get_class_count(y, minorityclasslabel=1):
    minorityclasslabel_count = len(np.where(y == minorityclasslabel)[0])
    maxclasslabel_count = len(np.where(y == (1 - minorityclasslabel))[0])
    return maxclasslabel_count, minorityclasslabel_count


# @param ml: The amount of samples in the majority group
# @param ms: The amount of samples in the minority group
# @param beta: Scaling factor for the number of generated samples
# @return: the G value, which indicates how many samples should be generated in total, this can be tuned with beta
def getG(ml, ms, beta):
    return (ml - ms) * beta


# @param: X The datapoints e.g.: [f1, f2, ... ,fn]
# @param: y the 0/1 classlabels e.g: [0,1,1,1,0,...,Cn]
# @param: indicesMinority: Indices of the minority samples in X
# @param: minorityclasslabel: The minority class
# @param: K: The amount of neighbours for Knn
# @return: rlist: List of normalized r values, one per minority sample
def getRis(X, y, indicesMinority, minorityclasslabel, K):
    X = np.asarray(X)
    y = np.asarray(y)
    ymin = y[indicesMinority]
    Xmin = X[indicesMinority]
    # Ask for K + 1 neighbours: each query point is itself in X and would
    # otherwise take up one of the K neighbour slots.
    neigh = NearestNeighbors(n_neighbors=K + 1)
    neigh.fit(X)

    rlist = [0] * len(ymin)
    normalizedrlist = [0] * len(ymin)

    for i in range(len(ymin)):
        # kneighbors expects a 2-D query, hence the reshape
        indices = neigh.kneighbors(Xmin[i].reshape(1, -1), return_distance=False)[0]
        indices = [j for j in indices if j != indicesMinority[i]][:K]
        # r_i: fraction of the K nearest neighbours that belong to the majority class
        rlist[i] = len(np.where(y[indices] == (1 - minorityclasslabel))[0]) / (K + 0.0)

    normConst = sum(rlist)
    if normConst == 0:
        # No minority sample has a majority neighbour: spread the weight evenly
        return [1.0 / max(len(rlist), 1)] * len(rlist)

    for j in range(len(rlist)):
        normalizedrlist[j] = rlist[j] / normConst

    return normalizedrlist


# @return: indices of the minority samples, and y binarized to minority -> 1, rest -> 0
def get_indicesMinority(y, minorityclasslabel=1):
    y_new = np.asarray([1 if label == minorityclasslabel else 0 for label in y])
    indicesMinority = np.where(y_new == 1)[0]
    return indicesMinority, y_new


def generateSamples(X, y, minorityclasslabel=1, K=5, beta=0.3):
    X = np.asarray(X)
    y = np.asarray(y)
    syntheticdata_X = []
    syntheticdata_y = []

    # y_new is binarized, so the minority label is 1 from here on
    indicesMinority, y_new = get_indicesMinority(y, minorityclasslabel)
    ymin = y[indicesMinority]
    Xmin = X[indicesMinority]

    rlist = getRis(X, y_new, indicesMinority, 1, K)
    ml, ms = get_class_count(y_new)
    G = getG(ml, ms, beta=beta)

    # K + 1 because each minority sample is returned as its own nearest neighbour
    neigh = NearestNeighbors(n_neighbors=min(K + 1, len(Xmin)))
    neigh.fit(Xmin)

    for k in range(len(ymin)):
        # number of synthetic samples to generate around minority sample k
        g = int(np.round(rlist[k] * G))
        neighb_indx = neigh.kneighbors(Xmin[k].reshape(1, -1), return_distance=False)[0]
        neighb_indx = [j for j in neighb_indx if j != k][:K]
        if not neighb_indx:
            continue

        for l in range(g):
            ind = random.choice(neighb_indx)
            # random point on the segment between sample k and the chosen neighbour
            s = Xmin[k] + (Xmin[ind] - Xmin[k]) * random.random()
            syntheticdata_X.append(s)
            syntheticdata_y.append(ymin[k])

    print('asyn, raw X size:', X.shape)
    if syntheticdata_X:
        X = np.vstack((X, np.asarray(syntheticdata_X)))
        y = np.hstack((y, syntheticdata_y))
    print('asyn, post X size:', X.shape)

    return X, y
Source: https://www.cnblogs.com/zhizhan/p/5042922.html