  • R Data Preprocessing (Part 2)

    1. Data Transformation

    Centering and scaling the raw data:

    > summary(sim.dat1)
          age           gender        income       house       store_exp       online_exp     
     Min.   :16.00   Female:554   Min.   : 41776   No :432   Min.   :155.8   Min.   :  68.82  
     1st Qu.:25.00   Male  :446   1st Qu.: 87896   Yes:568   1st Qu.:205.1   1st Qu.: 420.34  
     Median :36.00                Median : 93869             Median :329.8   Median :1941.86  
     Mean   :38.58                Mean   :109923             Mean   :373.1   Mean   :2120.18  
     3rd Qu.:53.00                3rd Qu.:119456             3rd Qu.:597.2   3rd Qu.:2440.78  
     Max.   :69.00                Max.   :319704             Max.   :597.3   Max.   :9479.44  
      store_trans     online_trans         Q1              Q2              Q3       
     Min.   : 1.00   Min.   : 1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
     1st Qu.: 3.00   1st Qu.: 6.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
     Median : 4.00   Median :14.00   Median :3.000   Median :1.000   Median :1.000  
     Mean   : 5.35   Mean   :13.55   Mean   :3.101   Mean   :1.823   Mean   :1.992  
     3rd Qu.: 7.00   3rd Qu.:20.00   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
     Max.   :20.00   Max.   :36.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
           Q4              Q5              Q6              Q7              Q8       
     Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
     1st Qu.:2.000   1st Qu.:1.750   1st Qu.:1.000   1st Qu.:2.500   1st Qu.:1.000  
     Median :3.000   Median :4.000   Median :2.000   Median :4.000   Median :2.000  
     Mean   :2.763   Mean   :2.945   Mean   :2.448   Mean   :3.434   Mean   :2.396  
     3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:3.000  
     Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
           Q9             Q10              segment   
     Min.   :1.000   Min.   :1.00   Conspicuous:200  
     1st Qu.:2.000   1st Qu.:1.00   Price      :250  
     Median :4.000   Median :2.00   Quality    :200  
     Mean   :3.085   Mean   :2.32   Style      :350  
     3rd Qu.:4.000   3rd Qu.:3.00                    
     Max.   :5.000   Max.   :5.00                    
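    > library(caret)# preProcess() is provided by the caret package
    > standard1=preProcess(sim.dat1,method=c('center','scale'))# subtract each variable's mean, then divide by its standard deviation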
    > head(predict(standard1,sim.dat1))
            age gender      income house   store_exp online_exp store_trans online_trans        Q1
    1 1.2989557 Female  0.24179440   Yes  0.93607103  -1.049355  -0.9064934    -1.451057 0.6199408
    2 1.7219764 Female  0.26467429   Yes  0.62930797  -1.161404  -0.3653033    -1.451057 0.6199408
    3 1.4399626   Male  0.09372058   Yes  0.70613556  -1.063370   0.4464818    -1.451057 1.3095300
    4 1.5104660   Male  0.08088761   Yes -0.15185121  -1.142839   1.2582670    -1.451057 1.3095300
    5 0.8759349   Male  0.31382955   Yes  0.03904516  -1.159840  -0.3653033    -1.199705 0.6199408
    6 1.4399626   Male -0.04952922   Yes -0.20881124  -1.111638  -0.3653033    -1.074028 0.6199408
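To make the 'center' and 'scale' steps concrete, here is a minimal base R sketch of the same arithmetic. The data frame `toy` and its values are made up for illustration and are not taken from sim.dat1.

```r
# Toy data (hypothetical values, not taken from sim.dat1)
toy <- data.frame(age = c(20, 30, 40, 50),
                  income = c(40000, 60000, 80000, 120000))

# Manual standardization: subtract the column mean, divide by the column sd
manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# Base R's scale() performs the same centering and scaling
builtin <- as.data.frame(scale(toy, center = TRUE, scale = TRUE))

# After standardization each column has mean 0 and standard deviation 1
print(round(colMeans(manual), 10))
print(sapply(manual, sd))
```

Unlike this sketch, preProcess() stores the means and standard deviations estimated from one data set, so that predict() can apply the same transformation to new data.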

    Log transformation:

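    # Syntax: apply(X, MARGIN, FUN), where MARGIN = 1 means rows and 2 means columns; FUN can be a built-in or a user-defined function
    # apply(sim.dat1,1,log) fails on the full data frame because it contains categorical variables such as gender
    > apply(sim.dat1[,c(1,3,5)],2,log)# select the numeric columns (age, income, store_exp) before taking logs; MARGIN = 1 would return the same values transposed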

    apply() can operate on either rows or columns. lapply() takes no row/column argument; applied to a data frame, it iterates over the columns.

    head(data.frame(lapply(sim.dat1[,c(1,3,5)],log)))

           age   income store_exp
    1 4.043051 11.70324  6.271242
    2 4.143135 11.71184  6.169623
    3 4.077537 11.64573  6.196059
    4 4.094345 11.64058  5.851653
    5 3.931826 11.73007  5.939186
    6 4.077537 11.58675  5.823979
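The difference between the two MARGIN values and lapply() is easiest to see on a tiny example. The following is a sketch on made-up data; `toy` is a hypothetical data frame, not part of the original post.

```r
# Toy numeric data (hypothetical values)
toy <- data.frame(age = c(20, 30, 40), income = c(40000, 60000, 80000))

# apply() coerces the data frame to a matrix; MARGIN = 2 works column by column
by_col <- apply(toy, 2, log)

# MARGIN = 1 works row by row and returns the result transposed
by_row <- apply(toy, 1, log)

# lapply() iterates over the columns of a data frame and returns a list
by_lapply <- data.frame(lapply(toy, log))

# log() is vectorized, so it also works directly on an all-numeric data frame
direct <- log(toy)

print(dim(by_col))  # 3 rows, 2 columns
print(dim(by_row))  # 2 rows, 3 columns
```

For an element-wise function such as log(), all four forms give the same numbers. The choice mainly affects the shape and type of the result.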

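    Quantile check: depending on the business logic, values above or below a chosen quantile can be treated as outliers and handled accordingly.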

    > quantile(sim.dat1$income,0.005,na.rm=T)
        0.5% 
    51047.79 
    > quantile(sim.dat1$income,0.999,na.rm=T)
       99.9% 
    317478.4 
    # Replace non-missing incomes below the 0.5% quantile with the 0.5% quantile value
    > sim.dat1$income[sim.dat1$income < quantile(sim.dat1$income,0.005,na.rm = T) & !is.na(sim.dat1$income)] <-51047.79
    # Replace non-missing incomes above the 99.9% quantile with the 99.9% quantile value
    > sim.dat1$income[sim.dat1$income > quantile(sim.dat1$income,0.999,na.rm = T) & !is.na(sim.dat1$income)] <-317478.4
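The capping step above hard-codes the quantile values. One possible way to avoid that is sketched below: both limits are computed once from the data and then applied. The helper `cap_at_quantiles()` and the sample `income` vector are hypothetical, written only for this illustration.

```r
# Cap a numeric vector at the given lower and upper quantiles.
# cap_at_quantiles() is a hypothetical helper, not a function from any package.
cap_at_quantiles <- function(x, lower = 0.005, upper = 0.999) {
  # Compute both limits once, on the original data, ignoring missing values
  limits <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  # pmax()/pmin() leave NA entries as NA, so no explicit is.na() filter is needed
  pmin(pmax(x, limits[[1]]), limits[[2]])
}

# Made-up income values with one missing entry and two extreme entries
income <- c(100, 52000, 61000, 75000, NA, 88000, 93000, 2500000)
capped <- cap_at_quantiles(income, lower = 0.1, upper = 0.9)
print(capped)
```

The wide 10%/90% limits are used here only because the toy vector is so short. On real data, the limits should follow from the business logic mentioned above.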

    2. Collinearity Detection

    > library(corrplot)
    > corrplot.mixed(cor(sim.dat1[,-c(2,4,19)]),order='hclust',upper = 'square')# drop the categorical columns (gender, house, segment) before computing correlations

    Finding highly correlated columns:

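    # Find the columns with pairwise correlation above 0.8, as candidates for removal
    > names(sim.dat1)[findCorrelation(cor(sim.dat1[, - c(2, 4, 19)]), cutoff = 0.8)]
    [1] "Q3"           "age"          "Q5"           "Q8"           "online_exp"  
    [6] "income"       "online_trans"
    # Caution: findCorrelation() returns column positions within the correlation matrix, i.e. within
    # sim.dat1[, -c(2, 4, 19)], so the names above are looked up in the wrong vector. Index the subset instead:
    > num.dat <- sim.dat1[, -c(2, 4, 19)]
    > names(num.dat)[findCorrelation(cor(num.dat), cutoff = 0.8)]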
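To see which pairs of variables are responsible for a high correlation, the correlation matrix can also be inspected directly. The sketch below uses base R only and simulated data. The variables `a`, `b`, `c` are hypothetical, and this is not how findCorrelation() chooses which column to drop.

```r
set.seed(1)
# Toy data: b is almost a copy of a, c is independent noise (all hypothetical)
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))

cm <- cor(toy)

# Look only at the upper triangle so each pair is listed once
high <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)

# Translate matrix positions into names taken from the correlation matrix itself
pairs <- data.frame(var1 = rownames(cm)[high[, "row"]],
                    var2 = colnames(cm)[high[, "col"]],
                    r = cm[high])
print(pairs)
```

Taking the names from `cm` itself keeps positions and names in the same index space, whichever columns were dropped beforehand.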

    3. Sparse Variables: Remove Them Directly

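    Construct a sparse variable on top of the original data (a constant column whose values are all 0) and append it to the data frame:

    > zero1<-rep(0,nrow(sim.dat1))
    > sim.dat1<-cbind(sim.dat1,zero1)
    > summary(sim.dat1)
    # the summary now shows one extra column, zero1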

    > nearZeroVar(sim.dat1, freqCut =95/5, uniqueCut = 10)
    [1] 20

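    > sim.dat1 <- sim.dat1[,-nearZeroVar(sim.dat1,freqCut = 95/5,uniqueCut = 2)]# drop column 20
    nearZeroVar(x,freqCut,uniqueCut)
    • x: numeric data: a numeric vector, a matrix, or a data frame
    • freqCut: cutoff for the ratio of the frequency of the most common value to that of the second most common value (95/5 in the example above)
    • uniqueCut: cutoff for the number of distinct values as a percentage of the total number of samples (10 and 2 in the calls above); a variable whose percentage is above this cutoff is not removed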
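The two statistics behind freqCut and uniqueCut can be computed by hand, as in the base R sketch below. The helpers `freq_ratio()`, `percent_unique()` and `is_near_zero_var()` are hypothetical names, and the rule they implement is a simplified reading of the parameter descriptions above, not caret's actual implementation.

```r
# Frequency ratio: count of the most common value / count of the second most common
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)  # constant column: this sketch returns Inf (an assumption)
  as.numeric(counts[1] / counts[2])
}

# Percent unique: number of distinct values as a percentage of the sample size
percent_unique <- function(x) 100 * length(unique(x)) / length(x)

# Simplified rule: flag a variable only if both conditions hold
is_near_zero_var <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  freq_ratio(x) > freqCut && percent_unique(x) <= uniqueCut
}

sparse <- c(rep(0, 98), 1, 2)   # 98 zeros, then one 1 and one 2
spread <- 1:100                 # every value is distinct

print(c(freq_ratio(sparse), percent_unique(sparse)))  # 98 and 3
print(c(freq_ratio(spread), percent_unique(spread)))  # 1 and 100
print(c(is_near_zero_var(sparse), is_near_zero_var(spread)))
```

In this sketch `sparse` is flagged because its frequency ratio (98) exceeds 95/5 = 19 and only 3% of its values are distinct. `spread` fails both conditions.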

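    Nominal variables: categories such as A, B, C, D cannot be used in arithmetic, so they are converted into 0/1 dummy variables for use in later computations.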

    Single dummy variables

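    > head(predict(dummyVars(~.,data = SegData,levelsOnly = F),SegData))# levelsOnly = F: name each new column with the original variable name plus the factor level (SegData is not defined in this post)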

    Dummy variables with interaction terms

    head(predict(dummyVars(~gender+house+income+income:gender,
                           data = SegData,
                           levelsOnly = F),SegData))
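For readers without caret, base R's model.matrix() gives a feel for the same idea. The sketch below is an alternative technique, not dummyVars(): model.matrix() applies contrasts, so its column set and naming can differ from what dummyVars() returns. The data frame `toy` is made up for illustration.

```r
# Toy data (hypothetical values); gender and house are factors
toy <- data.frame(gender = factor(c("Female", "Male", "Male", "Female")),
                  house  = factor(c("Yes", "No", "Yes", "Yes")),
                  income = c(50000, 62000, 58000, 71000))

# One 0/1 column per level of gender: "- 1" drops the intercept so no level is omitted
single <- model.matrix(~ gender - 1, data = toy)

# Main effects plus a gender-by-income interaction, again without intercept
inter <- model.matrix(~ gender + income + gender:income - 1, data = toy)

print(colnames(single))
print(colnames(inter))
```

In the interaction column, each row holds the income value where the corresponding gender indicator is 1, and 0 elsewhere.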

    Saving and loading RData

    > save.image('data_preprocessing.RData')
    > load('data_preprocessing.RData')
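save.image() writes the entire workspace. When only a few objects are needed, save() with an explicit object list is an alternative. The sketch below writes to a temporary file; the object `toy` and the path are hypothetical.

```r
# Save one selected object to a temporary .RData file
toy <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))   # hypothetical data
path <- tempfile(fileext = ".RData")
save(toy, file = path)

# Remove the object, then restore it; load() returns the names of the restored objects
rm(toy)
restored <- load(path)
print(restored)
print(toy)
```

Note that load() restores objects under their original names and silently overwrites any existing objects with the same names.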
  • Original article: https://www.cnblogs.com/keepgoingon/p/7159466.html