  • R Data Preprocessing (Part 2)

    1. Data transformation

    Centering and scaling the raw data:

    > summary(sim.dat1)
          age           gender        income       house       store_exp       online_exp     
     Min.   :16.00   Female:554   Min.   : 41776   No :432   Min.   :155.8   Min.   :  68.82  
     1st Qu.:25.00   Male  :446   1st Qu.: 87896   Yes:568   1st Qu.:205.1   1st Qu.: 420.34  
     Median :36.00                Median : 93869             Median :329.8   Median :1941.86  
     Mean   :38.58                Mean   :109923             Mean   :373.1   Mean   :2120.18  
     3rd Qu.:53.00                3rd Qu.:119456             3rd Qu.:597.2   3rd Qu.:2440.78  
     Max.   :69.00                Max.   :319704             Max.   :597.3   Max.   :9479.44  
      store_trans     online_trans         Q1              Q2              Q3       
     Min.   : 1.00   Min.   : 1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
     1st Qu.: 3.00   1st Qu.: 6.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
     Median : 4.00   Median :14.00   Median :3.000   Median :1.000   Median :1.000  
     Mean   : 5.35   Mean   :13.55   Mean   :3.101   Mean   :1.823   Mean   :1.992  
     3rd Qu.: 7.00   3rd Qu.:20.00   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
     Max.   :20.00   Max.   :36.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
           Q4              Q5              Q6              Q7              Q8       
     Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
     1st Qu.:2.000   1st Qu.:1.750   1st Qu.:1.000   1st Qu.:2.500   1st Qu.:1.000  
     Median :3.000   Median :4.000   Median :2.000   Median :4.000   Median :2.000  
     Mean   :2.763   Mean   :2.945   Mean   :2.448   Mean   :3.434   Mean   :2.396  
     3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:3.000  
     Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
           Q9             Q10              segment   
     Min.   :1.000   Min.   :1.00   Conspicuous:200  
     1st Qu.:2.000   1st Qu.:1.00   Price      :250  
     Median :4.000   Median :2.00   Quality    :200  
     Mean   :3.085   Mean   :2.32   Style      :350  
     3rd Qu.:4.000   3rd Qu.:3.00                    
     Max.   :5.000   Max.   :5.00                    
    > library(caret)  # provides preProcess()
    > standard1=preProcess(sim.dat1,method=c('center','scale'))  # subtract each variable's mean, then divide by its standard deviation
    > head(predict(standard1,sim.dat1))
            age gender      income house   store_exp online_exp store_trans online_trans        Q1
    1 1.2989557 Female  0.24179440   Yes  0.93607103  -1.049355  -0.9064934    -1.451057 0.6199408
    2 1.7219764 Female  0.26467429   Yes  0.62930797  -1.161404  -0.3653033    -1.451057 0.6199408
    3 1.4399626   Male  0.09372058   Yes  0.70613556  -1.063370   0.4464818    -1.451057 1.3095300
    4 1.5104660   Male  0.08088761   Yes -0.15185121  -1.142839   1.2582670    -1.451057 1.3095300
    5 0.8759349   Male  0.31382955   Yes  0.03904516  -1.159840  -0.3653033    -1.199705 0.6199408
    6 1.4399626   Male -0.04952922   Yes -0.20881124  -1.111638  -0.3653033    -1.074028 0.6199408
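
The centering and scaling step can be reproduced in base R. A minimal sketch, assuming a small made-up data frame (`toy`, `standardize` and the values are illustrative and not taken from `sim.dat1`); it checks the manual result against `scale()`:

```r
# Toy data; the column names echo sim.dat1 but the values are invented
toy <- data.frame(age = c(20, 30, 40, 50),
                  income = c(50000, 60000, 90000, 120000))

# Center: subtract the column mean; scale: divide by the column standard deviation
standardize <- function(x) (x - mean(x)) / sd(x)
toy_std <- as.data.frame(lapply(toy, standardize))

# Base R's scale() performs the same computation on every column
toy_scaled <- as.data.frame(scale(toy))

print(round(colMeans(toy_std), 10))  # column means after centering
print(sapply(toy_std, sd))           # column standard deviations after scaling
```

Only numeric columns can be treated this way, which is why `gender` and `house` come through unchanged in the output above.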

    Log transformation:

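    apply(sim.dat1,1,log)  # fails: gender, house and segment are categorical, so log() receives non-numeric input
    # Syntax: apply(data, MARGIN, FUN), where MARGIN is 1 for rows or 2 for columns; FUN may also be a user-defined function
    > apply(sim.dat1[,c(1,3,5)],2,log)  # select the numeric columns (age, income, store_exp) first; MARGIN = 2 keeps the row/column layout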

    apply can work over rows or over columns; lapply takes no margin argument and, applied to a data frame, works column by column.

    head(data.frame(lapply(sim.dat1[,c(1,3,5)],log)))

           age   income store_exp
    1 4.043051 11.70324  6.271242
    2 4.143135 11.71184  6.169623
    3 4.077537 11.64573  6.196059
    4 4.094345 11.64058  5.851653
    5 3.931826 11.73007  5.939186
    6 4.077537 11.58675  5.823979
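
The difference between the two margins of `apply` and the column-wise behaviour of `lapply` is easiest to see from the shape of the results. A small sketch on invented data (`toy` and its values are assumptions for illustration):

```r
# Toy numeric data frame: 2 observations, 3 variables (values invented)
toy <- data.frame(age = c(20, 40),
                  income = c(50000, 90000),
                  store_exp = c(200, 500))

by_row <- apply(toy, 1, log)               # MARGIN = 1: each input row becomes a result column
by_col <- apply(toy, 2, log)               # MARGIN = 2: same layout as the input
by_lapply <- data.frame(lapply(toy, log))  # lapply walks the columns and returns a list

print(dim(by_row))   # variables by observations (transposed)
print(dim(by_col))   # observations by variables
print(by_lapply)
```

Because `log` works element by element, the values are the same in all three results; only the layout and the return type differ.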

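    Quantile check: depending on the business logic, values above or below a chosen quantile can be treated as outliers and handled accordingly.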

    > quantile(sim.dat1$income,0.005,na.rm=T)
        0.5% 
    51047.79 
    > quantile(sim.dat1$income,0.999,na.rm=T)
       99.9% 
    317478.4 
    # Replace non-missing incomes below the 0.5% quantile with the 0.5% quantile value
    > sim.dat1$income[sim.dat1$income < quantile(sim.dat1$income,0.005,na.rm = T) & !is.na(sim.dat1$income)] <- 51047.79
    # Replace non-missing incomes above the 99.9% quantile with the 99.9% quantile value
    > sim.dat1$income[sim.dat1$income > quantile(sim.dat1$income,0.999,na.rm = T) & !is.na(sim.dat1$income)] <- 317478.4
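
The capping step can be wrapped in a function so the quantile values do not have to be typed in by hand. A sketch under stated assumptions: `cap_quantiles` is a hypothetical helper, and the income values and the 10%/90% cutoffs are invented so that the effect is visible on seven numbers.

```r
# Hypothetical helper: cap values outside the [lower, upper] quantiles of x
cap_quantiles <- function(x, lower = 0.005, upper = 0.999) {
  # Compute both bounds once, ignoring missing values
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[!is.na(x) & x < bounds[1]] <- bounds[1]  # raise low outliers to the lower bound
  x[!is.na(x) & x > bounds[2]] <- bounds[2]  # lower high outliers to the upper bound
  x                                          # missing values are left as NA
}

# Invented incomes with one missing entry and one extreme value
income <- c(41000, 52000, 60000, NA, 75000, 90000, 500000)
capped <- cap_quantiles(income, lower = 0.1, upper = 0.9)
print(capped)
```

Both bounds are taken from the data before anything is overwritten, so the upper quantile is not affected by the lower capping.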

    2. Collinearity detection

    > library(corrplot)
    > # drop the categorical variables (columns 2, 4 and 19) before computing correlations
    > corrplot.mixed(cor(sim.dat1[,-c(2,4,19)]),order='hclust',upper = 'square')

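    Find the highly correlated columns:

    > # findCorrelation() returns column positions within the correlation matrix it is given,
    > # so the names must be looked up in the same numeric subset rather than in sim.dat1
    > num.dat <- sim.dat1[, -c(2, 4, 19)]
    > names(num.dat)[findCorrelation(cor(num.dat), cutoff = 0.8)]  # columns flagged at a correlation above 0.8, candidates for removal
    [1] "Q5"           "age"          "Q7"           "Q10"          "online_trans"
    [6] "store_exp"    "Q2"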
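
Positions taken from a correlation matrix refer to the numeric subset, not to the full data frame. The sketch below uses base R only and invented data (`toy`, `x1`, `x2`, `x3` and `grp` are assumptions); it lists the pairs above the cutoff and is a simplified stand-in, not caret's `findCorrelation` algorithm.

```r
set.seed(1)  # invented data: x2 is built to be almost collinear with x1
x1 <- rnorm(100)
toy <- data.frame(x1 = x1,
                  grp = factor(rep(c("a", "b"), 50)),  # categorical, excluded from cor()
                  x2 = x1 + rnorm(100, sd = 0.1),
                  x3 = rnorm(100))

num <- toy[, sapply(toy, is.numeric)]  # keep the numeric columns only
r <- cor(num)

# Positions (within num, not toy) of variable pairs with |r| above the cutoff
high <- which(abs(r) > 0.8 & upper.tri(r), arr.ind = TRUE)
pairs <- data.frame(var1 = colnames(r)[high[, "row"]],
                    var2 = colnames(r)[high[, "col"]],
                    r = r[high])
print(pairs)

# Looking the same position up in the full data frame names the wrong column
print(names(toy)[high[, "col"]])
```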

    3. Sparse variables: delete them outright

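    Build a sparse variable on top of the original data, with every value equal to 0, and bind it to the existing columns:

    > zero1<-rep(0,nrow(sim.dat1))
    > sim.dat1<-cbind(sim.dat1,zero1)
    > summary(sim.dat1)
    The summary now shows the extra column zero1.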

    > nearZeroVar(sim.dat1, freqCut =95/5, uniqueCut = 10)
    [1] 20

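    > sim.dat1 <- sim.dat1[,-nearZeroVar(sim.dat1,freqCut = 95/5,uniqueCut = 2)]  # drop column 20
    nearZeroVar(x,freqCut,uniqueCut)
    • x: a numeric vector, matrix, or data frame
    • freqCut: cutoff for the ratio of the frequency of the most common value to the frequency of the second most common value (95/5 in the example above)
    • uniqueCut: cutoff for the number of distinct values as a percentage of the total number of samples; a variable whose percentage is above this cutoff is not removed. A variable is flagged when it has a single distinct value, or when its frequency ratio exceeds freqCut and its percentage of distinct values is below uniqueCut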
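
The two statistics behind these cutoffs can be computed by hand. A minimal sketch, assuming a hypothetical helper `is_near_zero_var` and invented columns; it follows the rule described above and is not caret's implementation.

```r
# Hypothetical helper applying the rule described above to a single column
is_near_zero_var <- function(x, freqCut = 95/5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)           # a single distinct value: zero variance
  freq_ratio <- counts[[1]] / counts[[2]]         # most common vs. second most common value
  pct_unique <- 100 * length(counts) / length(x)  # distinct values as a percentage of n
  freq_ratio > freqCut && pct_unique < uniqueCut
}

constant <- rep(0, 1000)               # one distinct value
skewed <- c(rep(0, 990), rep(1, 10))   # ratio 990/10 = 99, 0.2% distinct values
balanced <- rep(1:5, 200)              # ratio 1, so it is kept

toy <- data.frame(constant, skewed, balanced)
drop <- which(sapply(toy, is_near_zero_var))

# Guard the empty case: negative indexing with no positions would select no columns
kept <- if (length(drop) > 0) toy[, -drop, drop = FALSE] else toy
print(names(kept))
```

The same guard is worth keeping in mind when indexing with `-nearZeroVar(...)`: if nothing is flagged, the negative index is empty and every column is dropped.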

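    Nominal variables: categories such as A, B, C, D cannot be used in arithmetic, so they are converted to 0/1 dummy variables that later computations can use.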

    Simple dummy variables

    > head(predict(dummyVars(~.,data = SegData,levelsOnly = F),SegData))  # levelsOnly = F: name each new column with the original variable name plus the factor level

    Dummy variables with interactions

    head(predict(dummyVars(~gender+house+income+income:gender,
                           data = SegData,
                           levelsOnly = F),SegData))
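
What the 0/1 encoding and the interaction columns look like can be shown with base R's `model.matrix`. This is a stand-in for illustration, not caret's `dummyVars`, and its column names differ from caret's; `toy` and its values are invented.

```r
# Invented data; the column names echo the example above
toy <- data.frame(gender = factor(c("Female", "Male", "Female")),
                  house = factor(c("Yes", "No", "Yes")),
                  income = c(100, 200, 300))

# One 0/1 indicator column per level; "- 1" drops the intercept
gender_dummies <- model.matrix(~ gender - 1, data = toy)
print(gender_dummies)

# Interaction of a numeric variable with a factor: income appears in the
# column of the matching gender level and is 0 in the other column
inter <- model.matrix(~ income:gender - 1, data = toy)
print(inter)
```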

    Saving and loading RData files

    > save.image('data_preprocessing.RData')
    > load('data_preprocessing.RData')
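
`save.image` writes the whole workspace. When only some objects are needed, `save` with explicit names works the same way on the loading side. A short sketch using a temporary file (the object `x` and the path are illustrative):

```r
# Save one object to a temporary .RData file and restore it by name
x <- 1:5
path <- tempfile(fileext = ".RData")
save(x, file = path)

rm(x)                   # remove the object from the workspace
restored <- load(path)  # load() returns the names of the objects it restored
print(restored)
print(x)
```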
  • Original article: https://www.cnblogs.com/keepgoingon/p/7159466.html