    Repost | Case Study: Feature Selection Based on a Greedy Algorithm

    Greedy Algorithm Feature Selection


    A greedy algorithm always makes the choice that looks best at the moment when solving a problem. In other words, it does not consider global optimality; each choice it makes is, in some sense, a locally optimal one. Greedy algorithms do not yield the globally optimal solution for every problem; the key is the choice of greedy strategy. The chosen strategy must be free of after-effects: the process before a given state must not influence later states, and each decision may depend only on the current state.

    Algorithm design:

    1. Initialize the objective value of the problem.

    2. while (the optimization constraint is still satisfied) {

      use the selection strategy to pick a feasible solution from the solution space

      }

    3. Combine all feasible solutions into the final solution (a generic R sketch of this loop follows).
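    As a minimal sketch of steps 1-3 in R, assuming the caller supplies a candidate set and a hypothetical scoring function evaluate() (neither name is from the original post):

    greedy = function(candidates, evaluate) {
      # Step 1: initialize the objective value and the (empty) solution
      solution = c()
      bestScore = -Inf
      # Step 2: keep taking the locally best candidate while the score improves
      repeat {
        remaining = setdiff(candidates, solution)
        if (length(remaining) == 0) break
        scores = sapply(remaining, function(x) evaluate(c(solution, x)))
        if (max(scores) <= bestScore) break  # no local improvement left
        bestScore = max(scores)
        solution = c(solution, remaining[which.max(scores)])
      }
      # Step 3: the accumulated locally optimal choices form the final solution
      return(solution)
    }

    The feature-selection function below instantiates this pattern, with the AUC of a logistic regression as the score.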

    options(warn = -1)
    require(magrittr)
    require(dplyr)
    require(glmnet)
    require(pROC)  # provides auc(), used inside GreedyAlgorithm
    # Greedy Algorithm
    GreedyAlgorithm = function(dataSet) {
      # Greedy feature selection based on logistic regression, with AUC as the evaluation metric
      # 
      # Args:
      #   dataSet: A dataframe that contains a feature "label"
      # 
      # Returns:
      #   A vector of selected features
      features = data.frame(name = colnames(dataSet)) %>%
        dplyr::filter(name != "label")  # select all features of the dataSet except "label"
      features = as.vector(features$name)    
      featureSelect = c("label")  # init the feature vector to be selected
      scoreBefore = data.frame()  # init the storage which stores the (feature, aucScore) tuple from the end of each iteration
      while((nrow(scoreBefore)<2||scoreBefore[nrow(scoreBefore),2]>
            scoreBefore[nrow(scoreBefore) - 1,2])&&nrow(scoreBefore)<length(features)){
        score = data.frame()
        for(feature in features){
          if(length(intersect(feature,featureSelect)) == 0){
            trainData = dataSet[,append(featureSelect,feature)]
            model = glm(label~.,family = "binomial",data = trainData,epsilon = 1e-10)
            prediction = predict(model,trainData)
            aucValue = auc(trainData$label,prediction)
            score = rbind(score,data.frame(feature = feature,aucValue = aucValue))
          }
        }
        featureSelect = unique(append(featureSelect,as.character(score[which.max(score$aucValue),1])))
        scoreBefore = rbind(scoreBefore,score[which.max(score$aucValue),])
      }
      featureSelect = head(featureSelect,length(featureSelect)-1)  # drop the last added feature, which failed the improvement condition
      return(featureSelect[-1])  # return the selected features, excluding "label"
    }
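    A quick smoke test on simulated data (the variable names, sizes, and coefficients here are made up for illustration):

    # Toy data: two informative features and one pure-noise feature
    set.seed(42)
    toy = data.frame(x1 = rnorm(200), x2 = rnorm(200), noise = rnorm(200))
    toy$label = as.numeric(toy$x1 + 0.5*toy$x2 + rnorm(200) > 0)
    GreedyAlgorithm(toy)  # expected to pick x1 and x2 and stop before noise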

    The KS value measures a model's ability to separate positive and negative examples: the larger the value, the better the model's discrimination. As a rule of thumb, KS > 0.3 indicates reasonably good discriminative power.

    How the KS value is computed:

    Sort all samples by predicted score from low to high and split them evenly into N groups. For each group compute the actual numbers of good and bad samples, the cumulative good and bad counts, the cumulative good and bad proportions, and their difference. The actual good/bad counts are the counts within the group; the cumulative counts accumulate across groups; the cumulative proportions are the cumulative counts divided by the total good and bad counts; and the difference is the cumulative bad proportion minus the cumulative good proportion. The KS statistic is the maximum absolute value of this difference.
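    Equivalently, working from actual labels, the KS value is the maximum gap between the cumulative score distributions of the bad and good samples. A minimal sketch of that definition, assuming label 1 marks a bad sample (this helper is illustrative, not from the original post):

    ksFromLabels = function(labels, scores) {
      # Sort by score, then take the max gap between the two cumulative rates
      ord = order(scores)
      labels = labels[ord]
      cumBad = cumsum(labels)/sum(labels)           # cumulative bad-sample rate
      cumGood = cumsum(1 - labels)/sum(1 - labels)  # cumulative good-sample rate
      return(max(abs(cumBad - cumGood)))
    }

    The grouped KsValue function below approximates this by summing predicted probabilities within each group instead of counting labels.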

    # ksValue
    KsValue = function(prediction,n){
      # Compute the ks value of a model
      # 
      # Args:
      #   prediction: A vector of model predictions (probabilities)
      #   n: The number of groups
      # 
      # Returns:
      #   A vector of differences between the cumulative bad-sample rate and the
      #   cumulative good-sample rate; its maximum is the KS value
      # Note: predicted probabilities are summed as the expected bad count of each
      # group, approximating the label-based definition given above
      dataResult = sort(prediction,decreasing = T)
      a = c()
      b = c()
      c = c()
      a[1] = 0
      b[1] = 0
      c[1] = 0
      if(length(dataResult)%%n==0){
        cut = length(dataResult)/n
        for (i in 2:(n+1)) {
          a[i] = sum(dataResult[(cut*(i-2)+1):(cut*(i-1))])
          b[i] = length(dataResult[(cut*(i-2)+1):(cut*(i-1))])-a[i]
        }
      }else{
        cut = round(length(dataResult)/n)
        for (i in 2:n) {
          a[i] = sum(dataResult[(cut*(i-2)+1):(cut*(i-1))])
          b[i] = length(dataResult[(cut*(i-2)+1):(cut*(i-1))])-a[i]
        }
        # the last group takes the remainder beyond the first n-1 equal cuts
        a[n+1] = sum(dataResult[(cut*(n-1)+1):length(dataResult)])
        b[n+1] = length(dataResult[(cut*(n-1)+1):length(dataResult)])-a[n+1]
      }
      c = abs(cumsum(a)/sum(a)-cumsum(b)/sum(b))
      return(c)
    }
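    A quick check with simulated predicted probabilities (the numbers are illustrative only):

    # 900 low scores and 100 high scores, mimicking a skewed class mix
    set.seed(1)
    p = c(rbeta(900, 2, 8), rbeta(100, 8, 2))
    ksCurve = KsValue(p, 10)
    max(ksCurve)  # the KS value is the largest cumulative gap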
    require(caret)
    require(pROC)
    data = read.csv("/data/workspace/Rworkspace/data_test.csv",encoding = "UTF-8")
    data %<>%
      mutate(label = ifelse(target>30,1,0))
    data = data[,-1]
    data = data.frame(apply(data, 2, function(x) ifelse(is.na(x),median(x,na.rm = T),x)))
    # Remove near-zero-variance (almost constant) variables
    # feature1 = nearZeroVar(data)
    # data = data[,-feature1]
    # Remove highly correlated predictors
    # dataCor = cor(data)
    # highcor = findCorrelation(dataCor,0.8)
    # data = data[,-highcor]
    # Feature selection with the greedy algorithm
    # feature = GreedyAlgorithm(dataSet = data)
    load("/data/workspace/Rworkspace/featureSelect.RData")  # the dataset is large and the algorithm is slow while rendering the HTML, so the features already selected during testing are loaded directly
    set.seed(521)
    ind = base::sample(2,nrow(data),replace=T,prob=c(0.7,0.3))
    trainData = data[ind==1,]
    testData = data[ind==2,]
    model = cv.glmnet(as.matrix(trainData[,feature]),trainData[,"label"],
                        family = "binomial",type.measure = "auc",alpha = 0,
                        lambda.min.ratio = 0.0001)
    prediction = predict(model,as.matrix(testData[,feature]),s="lambda.min",type="response")
    # compute ksValue
    ksValue = KsValue(prediction,10)
    par(mfrow = c(2,1))
    plot(density(ksValue),type = 'l',main = "ksValue Plot",xlab = "cutPoint",ylab = "density_ks")
    ks_value = max(ksValue)
    text(.2,1.0,paste("ksValue = ",ks_value))
    roc(testData$label, as.vector(prediction), auc = T,plot = T,print.auc=T)

    ## 
    ## Call:
    ## roc.default(response = testData$label, predictor = as.vector(prediction),     auc = T, plot = T, print.auc = T)
    ## 
    ## Data: as.vector(prediction) in 5130 controls (testData$label 0) < 429 cases (testData$label 1).
    ## Area under the curve: 0.7385
    par(mfrow=c(1,1))

     

    Original article: https://www.cnblogs.com/timssd/p/6416193.html