zoukankan      html  css  js  c++  java
  • 93、R语言教程详解

    加载数据
    > w<-read.table("test.prn",header = T)
    > w
      X.. X...1
    1   A     2
    2   B     3
    3   C     5
    4   D     5
    > library(readxl)
    > dat<-read_excel("test.xlsx")
    > dat
    # A tibble: 4 x 2
      `商品` `价格`
       <chr>  <dbl>
    1      A      2
    2      B      3
    3      C      5
    4      D      5
    > bank=read.table("bank-full.csv",header = TRUE,sep=",")
    查看数据结构
    > str(bank)
    'data.frame':    41188 obs. of  21 variables:
     $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
     $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
     $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
     $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
     $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
     $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
     $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
     $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
     $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
     $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
     $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
     $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
     $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
     $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
     $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
     $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
     $ cons.price.idx: num  94 94 94 94 94 ...
     $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
     $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
     $ nr.employed   : num  5191 5191 5191 5191 5191 ...
     $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
    查看数据的最小值,最大值,中位数,平均数,分位数
    > summary(bank)
          age                 job            marital     
     Min.   :17.00   admin.     :10422   divorced: 4612  
     1st Qu.:32.00   blue-collar: 9254   married :24928  
     Median :38.00   technician : 6743   single  :11568  
     Mean   :40.02   services   : 3969   unknown :   80  
     3rd Qu.:47.00   management : 2924                   
     Max.   :98.00   retired    : 1720                   
                     (Other)    : 6156                   
                   education        default         housing     
     university.degree  :12168   no     :32588   no     :18622  
     high.school        : 9515   unknown: 8597   unknown:  990  
     basic.9y           : 6045   yes    :    3   yes    :21576  
     professional.course: 5243                                  
     basic.4y           : 4176                                  
     basic.6y           : 2292                                  
     (Other)            : 1749                                  
          loan            contact          month       day_of_week
     no     :33950   cellular :26144   may    :13769   fri:7827   
     unknown:  990   telephone:15044   jul    : 7174   mon:8514   
     yes    : 6248                     aug    : 6178   thu:8623   
                                       jun    : 5318   tue:8090   
                                       nov    : 4101   wed:8134   
                                       apr    : 2632              
                                       (Other): 2016              
        duration         campaign          pdays      
     Min.   :   0.0   Min.   : 1.000   Min.   :  0.0  
     1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0  
     Median : 180.0   Median : 2.000   Median :999.0  
     Mean   : 258.3   Mean   : 2.568   Mean   :962.5  
     3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0  
     Max.   :4918.0   Max.   :56.000   Max.   :999.0  
                                                      
        previous            poutcome      emp.var.rate     
     Min.   :0.000   failure    : 4252   Min.   :-3.40000  
     1st Qu.:0.000   nonexistent:35563   1st Qu.:-1.80000  
     Median :0.000   success    : 1373   Median : 1.10000  
     Mean   :0.173                       Mean   : 0.08189  
     3rd Qu.:0.000                       3rd Qu.: 1.40000  
     Max.   :7.000                       Max.   : 1.40000  
                                                           
     cons.price.idx  cons.conf.idx     euribor3m    
     Min.   :92.20   Min.   :-50.8   Min.   :0.634  
     1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344  
     Median :93.75   Median :-41.8   Median :4.857  
     Mean   :93.58   Mean   :-40.5   Mean   :3.621  
     3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961  
     Max.   :94.77   Max.   :-26.9   Max.   :5.045  
                                                    
      nr.employed     y        
     Min.   :4964   no :36548  
     1st Qu.:5099   yes: 4640  
     Median :5191              
     Mean   :5167              
     3rd Qu.:5228              
     Max.   :5228              
                               
    > psych::describe(bank)
            方差  个数    平均值  标准差  均值    去掉最大   中位数   最小值  最大值  极差    偏差        峰度
                                      绝对偏差
                                 最小值
                                之后
                                的平均数
    
                   vars     n    mean     sd  median trimmed   mad     min     max   range  skew    kurtosis
    age               1 41188   40.02  10.42   38.00   39.30  10.38   17.00   98.00   81.00  0.78     0.79
    job*              2 41188    4.72   3.59    3.00    4.48   2.97    1.00   12.00   11.00  0.45    -1.39
    marital*          3 41188    2.17   0.61    2.00    2.21   0.00    1.00    4.00    3.00 -0.06    -0.34
    education*        4 41188    4.75   2.14    4.00    4.88   2.97    1.00    8.00    7.00 -0.24    -1.21
    default*          5 41188    1.21   0.41    1.00    1.14   0.00    1.00    3.00    2.00  1.44     0.07
    housing*          6 41188    2.07   0.99    3.00    2.09   0.00    1.00    3.00    2.00 -0.14    -1.95
    loan*             7 41188    1.33   0.72    1.00    1.16   0.00    1.00    3.00    2.00  1.82     1.38
    contact*          8 41188    1.37   0.48    1.00    1.33   0.00    1.00    2.00    1.00  0.56    -1.69
    month*            9 41188    5.23   2.32    5.00    5.31   2.97    1.00   10.00    9.00 -0.31    -1.03
    day_of_week*     10 41188    3.00   1.40    3.00    3.01   1.48    1.00    5.00    4.00  0.01    -1.27
    duration         11 41188  258.29 259.28  180.00  210.61 139.36    0.00 4918.00 4918.00  3.26    20.24
    campaign         12 41188    2.57   2.77    2.00    1.99   1.48    1.00   56.00   55.00  4.76    36.97
    pdays            13 41188  962.48 186.91  999.00  999.00   0.00    0.00  999.00  999.00 -4.92    22.23
    previous         14 41188    0.17   0.49    0.00    0.05   0.00    0.00    7.00    7.00  3.83    20.11
    poutcome*        15 41188    1.93   0.36    2.00    2.00   0.00    1.00    3.00    2.00 -0.88     3.98
    emp.var.rate     16 41188    0.08   1.57    1.10    0.27   0.44   -3.40    1.40    4.80 -0.72    -1.06
    cons.price.idx   17 41188   93.58   0.58   93.75   93.58   0.56   92.20   94.77    2.57 -0.23    -0.83
    cons.conf.idx    18 41188  -40.50   4.63  -41.80  -40.60   6.52  -50.80  -26.90   23.90  0.30    -0.36
    euribor3m        19 41188    3.62   1.73    4.86    3.81   0.16    0.63    5.04    4.41 -0.71    -1.41
    nr.employed      20 41188 5167.04  72.25 5191.00 5178.43  55.00 4963.60 5228.10  264.50 -1.04     0.00
    y*               21 41188    1.11   0.32    1.00    1.02   0.00    1.00    2.00    1.00  2.45     4.00
    
                   se
    age            0.05
    job*           0.02
    marital*       0.00
    education*     0.01
    default*       0.00
    housing*       0.00
    loan*          0.00
    contact*       0.00
    month*         0.01
    day_of_week*   0.01
    duration       1.28
    campaign       0.01
    pdays          0.92
    previous       0.00
    poutcome*      0.00
    emp.var.rate   0.01
    cons.price.idx 0.00
    cons.conf.idx  0.02
    euribor3m      0.01
    nr.employed    0.36
    y*             0.00
    
    查看数据是否有缺失值
    > sapply(bank,anyNA)
               age            job        marital      education 
             FALSE          FALSE          FALSE          FALSE 
           default        housing           loan        contact 
             FALSE          FALSE          FALSE          FALSE 
             month    day_of_week       duration       campaign 
             FALSE          FALSE          FALSE          FALSE 
             pdays       previous       poutcome   emp.var.rate 
             FALSE          FALSE          FALSE          FALSE 
    cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
             FALSE          FALSE          FALSE          FALSE 
                 y 
             FALSE 
    
    成功与不成功的个数
    > table(bank$y)
    
       no   yes 
    36548  4640 
    
    在是否结婚这个属性的取值与
    是否成功的数量比较
    > table(bank$y,bank$marital)
         
          divorced married single unknown
      no      4136   22396   9948      68
      yes      476    2532   1620      12
    
    > xtabs(~y+marital,data=bank)
         marital
    y     divorced married single unknown
      no      4136   22396   9948      68
      yes      476    2532   1620      12
    > tab=table(bank$y,bank$marital)
    > tab
         
          divorced married single unknown
      no      4136   22396   9948      68
      yes      476    2532   1620      12
    
    在是否结婚这个属性上的取值
    > margin.table(tab,2)
    
    divorced  married   single  unknown 
        4612    24928    11568       80 
    > margin.table(tab,1)
    
       no   yes 
    36548  4640 
    
    在是否结婚这个属性上横向看概率
    > prop.table(tab,1)
         
             divorced     married      single     unknown
      no  0.113166247 0.612783189 0.272189997 0.001860567
      yes 0.102586207 0.545689655 0.349137931 0.002586207
    在是否结婚这个属性上纵向看概率
    
    > prop.table(tab,2)
         
           divorced   married    single   unknown
      no  0.8967910 0.8984275 0.8599585 0.8500000
      yes 0.1032090 0.1015725 0.1400415 0.1500000
    
    
    平的列联表
    以第一列和第二列,展开分类group by 1,2
    以col.vars 的取值 进行次数统计
    > ftable(bank[,c(3,4,21)],row.vars = 1:2,col.vars = "y")
                                 y   no  yes
    marital  education                      
    divorced basic.4y               406   83
             basic.6y               169   13
             basic.9y               534   31
             high.school           1086  107
             illiterate               1    1
             professional.course    596   61
             university.degree     1177  160
             unknown                167   20
    married  basic.4y              2915  313
             basic.6y              1628  139
             basic.9y              3858  298
             high.school           4683  475
             illiterate              12    3
             professional.course   2799  357
             university.degree     5573  821
             unknown                928  126
    single   basic.4y               422   31
             basic.6y               301   36
             basic.9y              1174  142
             high.school           2702  448
             illiterate               1    0
             professional.course   1247  177
             university.degree     3723  683
             unknown                378  103
    unknown  basic.4y                 5    1
             basic.6y                 6    0
             basic.9y                 6    2
             high.school             13    1
             illiterate               0    0
             professional.course      6    0
             university.degree       25    6
             unknown                  7    2
    
    卡方检验,在p值小于2.2e-16时,拒绝原假设,认为数据不服从卡方分布
    > chisq.test(tab)
    
        Pearson's Chi-squared test
    
    data:  tab
    X-squared = 122.66, df = 3, p-value < 2.2e-16
    
    画直方图
    > hist(bank$age)
    > library(lattice)
    
    画连续变量的分布,就是把直方图的中位数连接起来
    以年龄为横轴,y为纵轴,数据是bank,画图,auto.key是否有图例
    > densityplot(~age,groups = y,data=bank,plot.point=FALSE,auto.key = TRUE)
    
    画Box图
    > boxplot(age~y,data=bank)
    
    双样本t分布检验,p值小于0.05时拒绝原假设
    这里的原假设是两个样本没有相关性
    得到的结果是p值为1.805e-06,拒绝两个样本没有相关性的假设
    这里认为两个样本有相关性
    > t.test(age~y,data=bank,alternative="two.sided",var.equal=FALSE)
    
        Welch Two Sample t-test
    
    data:  age by y
    t = -4.7795, df = 5258.5, p-value = 1.805e-06
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -1.4129336 -0.5909889
    sample estimates:
     mean in group no mean in group yes 
             39.91119          40.91315 
    
    
    数据可视化
    画饼图
    > tab=table(bank$marital)
    > pie(tab)
    
    画直方图
    > tab=table(bank$marital)
    > barplot(tab)
    
    画下面这个图
    > tab=table(bank$marital,bank$y)
    > plot(tab)
     
    
    
    画层叠直方图
    > tab=table(bank$marital,bank$y)
    > lattice::barchart(tab,auto.key=TRUE)
     
    
    
    加载这个包,准备画图
    > library(dplyr)
    > data=group_by(bank,marital,y)
    > data=tally(data)
    !!!!!!!!!!!!!
    > ggplot2::ggplot(data=data,mapping=aes(marital,n))+geom_bar(mapping=aes(fill=y),position="dodge",stat="identity")
    
    
    
    数据预处理
    分组之后再画图
    > labels=c('青年','中年','老年')
    > bank$age_group=cut(bank$age,breaks = c(0,35,55,100),right = FALSE,labels = labels)
    > library(ggplot2)
    > ggplot(data=bank,mapping = aes(age_group))+geom_bar(mapping = aes(fill=y),position="dodge",stat="count")
     
    
    
    
    
    
    
    衍生变量
    直接使用$符向原数据框添加新的变量
    > bank$log.cons.price.idx=log(bank$cons.price.idx)
    使用transform函数向原数据框添加变量
    > bank<-transform(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))
    使用dplyr包里的mutate函数增加变量
    > bank<-dplyr::mutate(bank,log.cons.price.idx=log(cons.price.idx))
    使用dplyr包里的transmute函数只保留新生成的变量
    > bank2<-dplyr::transmute(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))
    
    中心化
    
    > v=1:10
    > v1=v-mean(v)
    > v2=scale(v,center=TRUE,scale = FALSE)
    
    无量纲化
    
    > V1=v/sqrt(sum(v^2)/(length(v)-1))
    > v2=scale(v,center=FALSE,scale=TRUE)
    
    根据最大最小值进行归一化
    
    > v3=(v-min(v))/(max(v)-min(v))
    
    
    进行标准正态化
    
    
    > v1=(v-mean(v))/sd(v)
    > v2=scale(v,center = TRUE,scale=TRUE)
    
    
    
    
    Box-Cox变换
    使用car包里的boxCox函数
    > install.packages("car")
    > library(car)
    > boxCox(age~.,data=bank)
     
     
    
    
    
    
    
    
    使用caret包,做Box-Cox变换
    > install.packages("caret")
    > library(caret)
    > dat<-subset(bank,select="age")
    > trans<-preProcess(dat,method=C("BoxCox"))
    
    
    数据预处理下
    违反常识的异常值
    基于数据分布的异常值(离群点)识别
    bank.dirty=read.csv("bank-dirty.csv")
    summary(bank.dirty)
    
         age                  job            marital                    education    
     Min.   : 17.00   admin.     :10422   divorced: 4612   university.degree  :12165  
     1st Qu.: 32.00   blue-collar: 9254   married :24928   high.school        : 9515  
     Median : 38.00   technician : 6743   single  :11568   basic.9y           : 6043  
     Mean   : 40.03   services   : 3969   NA's    :   80   professional.course: 5242  
     3rd Qu.: 47.00   management : 2924                    basic.4y           : 4175  
     Max.   :123.00   (Other)    : 7546                    (Other)            : 2310  
     NA's   :2        NA's       :  330                    NA's               : 1738  
     default      housing        loan            contact          month      
     no  :32588   no  :18622   no  :33950   cellular :26144   may    :13769  
     yes :    3   yes :21576   yes : 6248   telephone:15044   jul    : 7174  
     NA's: 8597   NA's:  990   NA's:  990                     aug    : 6178  
                                                              jun    : 5318  
                                                              nov    : 4101  
                                                              apr    : 2632  
                                                              (Other): 2016  
     day_of_week    duration         campaign          pdays          previous    
     fri:7827    Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   Min.   :0.000  
     mon:8514    1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000  
     thu:8623    Median : 180.0   Median : 2.000   Median :999.0   Median :0.000  
     tue:8090    Mean   : 258.3   Mean   : 2.568   Mean   :962.5   Mean   :0.173  
     wed:8134    3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000  
                 Max.   :4918.0   Max.   :56.000   Max.   :999.0   Max.   :7.000  
                                                                                  
            poutcome      emp.var.rate      cons.price.idx  cons.conf.idx  
     failure    : 4252   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
     nonexistent:35563   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7  
     success    : 1373   Median : 1.10000   Median :93.75   Median :-41.8  
                         Mean   : 0.08189   Mean   :93.58   Mean   :-40.5  
                         3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4  
                         Max.   : 1.40000   Max.   :94.77   Max.   :-26.9  
                                                                           
       euribor3m      nr.employed     y        
     Min.   :0.634   Min.   :4964   no :36548  
     1st Qu.:1.344   1st Qu.:5099   yes: 4640  
     Median :4.857   Median :5191              
     Mean   :3.621   Mean   :5167              
     3rd Qu.:4.961   3rd Qu.:5228              
     Max.   :5.045   Max.   :5228              
    
    
    常识告诉我们,虽然123岁的老人存在,但概率也极低,也不太可能是银行的客户
    找出在年龄这一列的上离群值和下离群值
    
    > head(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)
          age
    39494 123
    38453  98
    38456  98
    27827  95
    38922  94
    > tail(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)
          age
    37559  17
    37580  17
    38275  17
    120    NA
    156    NA
    
    异常值的处理
    当作缺失值处理
    > bank.dirty$age[which(bank.dirty$age>98)]<-NA
    删除或者插补
    
    
    重编码
    职业类型有12个分类,不利于后续分析,把除了unknown以外的分类进行重新编码,简化成4类
    Month有12个分类,把它转化成季度
    Education的分类,除了unknow之外有7类
    
    进行重编码
    levels(bank.dirty$job) <- c( "management","services","entrepreneur","entrepreneur",
                            "management","unemployed",  "entrepreneur","services",
                            "unemployed","services","unemployed","unknown" )
    > levels(bank.dirty$month) <- c("Q2","Q3","Q4","Q3","Q2",
                             "Q1","Q2","Q4","Q4","Q3")
    > 
    > levels(bank.dirty$education) <- c( "primary","primary","primary","secondary",
                                  "primary","tertiary","tertiary","unknown")
    
    
    缺失值
    分类较多,分类是unknown,不能给我们提供信息
    有些模型不能处理缺失值,比如Logistic回归
    缺失值插补的方法
    1、    用中位数或众数插补
    > library(imputeMissings)
    > bank.clean<-impute(bank.dirty,object = compute(bank.dirty,method = "median/mode"))
    2、    最邻近(knn)插补
    library(DMwR)
    bank.clean=knnImputation(bank.dirty,k=5)
    
    3、    随机森林插补
    library(missForest)
     Imp = missForest(bank.dirty)
     bank.clean = Imp$ximp
    
    缺失值插补的R包
    1、    imputeMissings包
    2、    DMwR包
    
    
    
    
    
    
    用Logistic回归建立客户响应模型
    1、    广义线性模型
    广义线性模型擅长于处理因变量不是连续变量的问题
    1)    Y是分类变量
    2)    Y是定序变量
    3)    Y是离散取值
    2、    当Y取值是0-1二分类变量是,就是Logistic回归
    
    Logistic回归在R中的实现
    数据重编码
    bank$y=ifelse(bank$y=='yes',1,0)
    改成以Q1为参考因子
    bank$month<-relevel(bank$month,ref="Q1")
    构建Logistic回归模型
    > model<-glm(y~.,data=bank,family = 'binomial')
    > summary(model)
    
    Call:
    glm(formula = y ~ ., family = "binomial", data = bank)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -5.9958  -0.3082  -0.1887  -0.1333   3.4283  
    
    Coefficients: (1 not defined because of singularities)
                                   Estimate Std. Error z value Pr(>|z|)    
    (Intercept)                  -1.957e+02  1.935e+01 -10.116  < 2e-16 ***
    age                           1.851e-03  2.415e-03   0.767 0.443289    
    jobblue-collar               -2.659e-01  7.942e-02  -3.348 0.000814 ***
    jobentrepreneur              -2.029e-01  1.248e-01  -1.626 0.103924    
    jobhousemaid                 -3.628e-02  1.475e-01  -0.246 0.805705    
    jobmanagement                -8.054e-02  8.501e-02  -0.947 0.343423    
    jobretired                    2.928e-01  1.067e-01   2.743 0.006092 ** 
    jobself-employed             -1.680e-01  1.176e-01  -1.428 0.153332    
    jobservices                  -1.497e-01  8.552e-02  -1.751 0.079969 .  
    jobstudent                    2.674e-01  1.106e-01   2.416 0.015680 *  
    jobtechnician                 3.462e-03  7.096e-02   0.049 0.961086    
    jobunemployed                 8.514e-03  1.273e-01   0.067 0.946686    
    jobunknown                   -8.046e-02  2.390e-01  -0.337 0.736420    
    maritalmarried                1.567e-02  6.824e-02   0.230 0.818420    
    maritalsingle                 6.620e-02  7.791e-02   0.850 0.395473    
    maritalunknown                6.303e-02  4.113e-01   0.153 0.878211    
    educationbasic.6y             9.647e-02  1.202e-01   0.803 0.422195    
    educationbasic.9y            -2.154e-02  9.494e-02  -0.227 0.820557    
    educationhigh.school          3.381e-02  9.188e-02   0.368 0.712895    
    educationilliterate           1.132e+00  7.395e-01   1.531 0.125887    
    educationprofessional.course  1.136e-01  1.013e-01   1.121 0.262175    
    educationuniversity.degree    2.134e-01  9.188e-02   2.322 0.020211 *  
    educationunknown              1.361e-01  1.196e-01   1.138 0.255314    
    defaultunknown               -3.055e-01  6.712e-02  -4.552 5.32e-06 ***
    defaultyes                   -7.150e+00  1.135e+02  -0.063 0.949784    
    housingunknown               -7.385e-02  1.390e-01  -0.531 0.595260    
    housingyes                   -3.740e-03  4.121e-02  -0.091 0.927695    
    loanunknown                          NA         NA      NA       NA    
    loanyes                      -6.362e-02  5.725e-02  -1.111 0.266454    
    contacttelephone             -6.068e-01  7.124e-02  -8.518  < 2e-16 ***
    monthQ2                      -2.192e+00  1.125e-01 -19.479  < 2e-16 ***
    monthQ3                      -1.463e+00  1.148e-01 -12.747  < 2e-16 ***
    monthQ4                      -1.995e+00  1.240e-01 -16.088  < 2e-16 ***
    day_of_weekmon               -1.216e-01  6.588e-02  -1.846 0.064887 .  
    day_of_weekthu                6.375e-02  6.382e-02   0.999 0.317842    
    day_of_weektue                6.867e-02  6.545e-02   1.049 0.294118    
    day_of_weekwed                1.436e-01  6.530e-02   2.199 0.027911 *  
    duration                      4.667e-03  7.397e-05  63.092  < 2e-16 ***
    campaign                     -4.543e-02  1.158e-02  -3.922 8.77e-05 ***
    pdays                        -9.627e-04  2.162e-04  -4.452 8.50e-06 ***
    previous                     -5.806e-02  5.879e-02  -0.988 0.323369    
    poutcomenonexistent           4.507e-01  9.372e-02   4.809 1.51e-06 ***
    poutcomesuccess               9.371e-01  2.106e-01   4.451 8.56e-06 ***
    emp.var.rate                 -1.389e+00  7.693e-02 -18.057  < 2e-16 ***
    cons.price.idx                1.815e+00  1.193e-01  15.218  < 2e-16 ***
    cons.conf.idx                 3.353e-02  6.664e-03   5.033 4.84e-07 ***
    euribor3m                     6.054e-02  1.126e-01   0.537 0.590987    
    nr.employed                   4.937e-03  1.873e-03   2.635 0.008413 ** 
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 28999  on 41187  degrees of freedom
    Residual deviance: 17199  on 41141  degrees of freedom
    AIC: 17293
    
    Number of Fisher Scoring iterations: 10
    
    
    > exp(coef(model))
                     (Intercept)                          age               jobblue-collar 
                    9.856544e-86                 1.001853e+00                 7.665077e-01 
                 jobentrepreneur                 jobhousemaid                jobmanagement 
                    8.163314e-01                 9.643733e-01                 9.226187e-01 
                      jobretired             jobself-employed                  jobservices 
                    1.340142e+00                 8.453874e-01                 8.609387e-01 
                      jobstudent                jobtechnician                jobunemployed 
                    1.306514e+00                 1.003468e+00                 1.008550e+00 
                      jobunknown               maritalmarried                maritalsingle 
                    9.226922e-01                 1.015789e+00                 1.068445e+00 
                  maritalunknown            educationbasic.6y            educationbasic.9y 
                    1.065061e+00                 1.101276e+00                 9.786948e-01 
            educationhigh.school          educationilliterate educationprofessional.course 
                    1.034388e+00                 3.101297e+00                 1.120248e+00 
      educationuniversity.degree             educationunknown               defaultunknown 
                    1.237856e+00                 1.145744e+00                 7.367445e-01 
                      defaultyes               housingunknown                   housingyes 
                    7.851906e-04                 9.288126e-01                 9.962671e-01 
                     loanunknown                      loanyes             contacttelephone 
                              NA                 9.383587e-01                 5.450980e-01 
                         monthQ2                      monthQ3                      monthQ4 
                    1.116739e-01                 2.314802e-01                 1.360620e-01 
                  day_of_weekmon               day_of_weekthu               day_of_weektue 
                    8.854888e-01                 1.065828e+00                 1.071082e+00 
                  day_of_weekwed                     duration                     campaign 
                    1.154380e+00                 1.004678e+00                 9.555850e-01 
                           pdays                     previous          poutcomenonexistent 
                    9.990378e-01                 9.435960e-01                 1.569466e+00 
                 poutcomesuccess                 emp.var.rate               cons.price.idx 
                    2.552531e+00                 2.493091e-01                 6.140533e+00 
                   cons.conf.idx                    euribor3m                  nr.employed 
                    1.034103e+00                 1.062408e+00                 1.004949e+00 
    
    
    Job变量的基准水平是management,从上面的结果看,服务业和自主劳动者购买银行产品的几率(odds)是管理岗从业人员的0.88倍,未就业人员购买银行产品的几率是管理岗人员的1.25倍
    
    
    > summary(model.step)
    向前逐步回归
    > model.step=step(model,direction = "backward")
    向后逐步回归
    > model.step = step(model, direction = "forward")
    双向逐步回归
    > model.step = step(model, direction = "both")
    > summary(model.step)
    
    Call:
    glm(formula = y ~ job + education + default + contact + month + 
        day_of_week + duration + campaign + pdays + poutcome + emp.var.rate + 
        cons.price.idx + cons.conf.idx + nr.employed, family = "binomial", 
        data = bank)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -5.9884  -0.3088  -0.1887  -0.1332   3.4026  
    
    Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
    (Intercept)                  -2.031e+02  1.426e+01 -14.246  < 2e-16 ***
    jobblue-collar               -2.700e-01  7.917e-02  -3.411 0.000648 ***
    jobentrepreneur              -2.043e-01  1.242e-01  -1.645 0.100003    
    jobhousemaid                 -2.832e-02  1.464e-01  -0.193 0.846590    
    jobmanagement                -8.368e-02  8.409e-02  -0.995 0.319670    
    jobretired                    3.234e-01  9.130e-02   3.542 0.000397 ***
    jobself-employed             -1.670e-01  1.176e-01  -1.421 0.155435    
    jobservices                  -1.528e-01  8.545e-02  -1.789 0.073666 .  
    jobstudent                    2.682e-01  1.046e-01   2.565 0.010316 *  
    jobtechnician                 4.389e-03  7.093e-02   0.062 0.950665    
    jobunemployed                 8.975e-03  1.271e-01   0.071 0.943715    
    jobunknown                   -6.363e-02  2.378e-01  -0.268 0.789057    
    educationbasic.6y             8.993e-02  1.196e-01   0.752 0.452024    
    educationbasic.9y            -2.716e-02  9.416e-02  -0.288 0.772992    
    educationhigh.school          2.890e-02  9.053e-02   0.319 0.749573    
    educationilliterate           1.118e+00  7.398e-01   1.511 0.130744    
    educationprofessional.course  1.084e-01  1.004e-01   1.079 0.280686    
    educationuniversity.degree    2.103e-01  9.017e-02   2.332 0.019678 *  
    educationunknown              1.363e-01  1.195e-01   1.140 0.254110    
    defaultunknown               -3.017e-01  6.666e-02  -4.526 6.02e-06 ***
    defaultyes                   -7.141e+00  1.135e+02  -0.063 0.949831    
    contacttelephone             -6.011e-01  7.069e-02  -8.504  < 2e-16 ***
    monthQ2                      -2.210e+00  1.108e-01 -19.939  < 2e-16 ***
    monthQ3                      -1.475e+00  1.146e-01 -12.869  < 2e-16 ***
    monthQ4                      -1.982e+00  1.183e-01 -16.755  < 2e-16 ***
    day_of_weekmon               -1.210e-01  6.584e-02  -1.837 0.066174 .  
    day_of_weekthu                6.208e-02  6.374e-02   0.974 0.330066    
    day_of_weektue                6.851e-02  6.538e-02   1.048 0.294651    
    day_of_weekwed                1.420e-01  6.525e-02   2.176 0.029592 *  
    duration                      4.667e-03  7.396e-05  63.099  < 2e-16 ***
    campaign                     -4.587e-02  1.158e-02  -3.960 7.49e-05 ***
    pdays                        -8.822e-04  2.024e-04  -4.358 1.31e-05 ***
    poutcomenonexistent           5.219e-01  6.356e-02   8.211  < 2e-16 ***
    poutcomesuccess               9.996e-01  2.028e-01   4.928 8.31e-07 ***
    emp.var.rate                 -1.376e+00  6.885e-02 -19.980  < 2e-16 ***
    cons.price.idx                1.845e+00  1.041e-01  17.725  < 2e-16 ***
    cons.conf.idx                 3.622e-02  4.853e-03   7.464 8.42e-14 ***
    nr.employed                   5.883e-03  9.765e-04   6.024 1.70e-09 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 28999  on 41187  degrees of freedom
    Residual deviance: 17203  on 41150  degrees of freedom
    AIC: 17279
    
    Number of Fisher Scoring iterations: 10
    
    
    
    
    
    模型预测
    用predict函数,参数type=’response’
    Newdata参数是要预测的数据集
    
    > prob<-predict(model.step,type = 'response')
    > head(prob)
              1           2           3           4           5           6 
    0.015029328 0.006044212 0.011640349 0.010173952 0.016897254 0.007174804 
    
    假设以0.5为临界值
    > pre<-ifelse(prob>0.5,1,0)
    > table(pre,bank$y)
       
    pre     0     1
      0 35596  2667
      1   952  1973
    
    > 
    
    预测的准确率
    > (35592+1964)/(35592+2676+956+1964)
    [1] 0.911819
    
    
    
    实际有响应的客户被识别出了多少
    > 1964/(1964+2676)
    [1] 0.4232759
    
    
    
    
    模型评估
    
    > confusionMatrix(bank$y,pre,pos='1')
    Confusion Matrix and Statistics
    
              Reference
    Prediction     0     1
             0 35596   952
             1  2667  1973
                                              
                   Accuracy : 0.9121          
                     95% CI : (0.9094, 0.9149)
        No Information Rate : 0.929           
        P-Value [Acc > NIR] : 1               
                                              
                      Kappa : 0.476           
     Mcnemar's Test P-Value : <2e-16          
                                              
                Sensitivity : 0.67453         
                Specificity : 0.93030         
             Pos Pred Value : 0.42522         
             Neg Pred Value : 0.97395         
                 Prevalence : 0.07102         
             Detection Rate : 0.04790         
       Detection Prevalence : 0.11265         
          Balanced Accuracy : 0.80241         
                                              
           'Positive' Class : 1               
                                        
    Kappa 统计量(kappa statistic)
    用于评判分类器的分类结果与随机分类的差异度
    用Kappa统计量评价:
        较差:小于0.20
        一般:0.20至0.40
        稳健:0.40至0.60
        好的:0.60至0.80
    很好的:0.80至1.00
    
    
    ROC曲线
    pred<-prediction(prob,bank$y)
    perf<-performance(pred,measure = "tpr",x="fpr")
    plot(perf)
     
    
    
    
    
    
    
    
    
    
    
    
    
    RandomForest
    加载数据列
    
    > data=read.table("input.txt",header = TRUE)
    > str(data)
    'data.frame':    222 obs. of  23 variables:
     $ Acti_Profile             : num  0 0 0 0 0 0 0 0 0 0 ...
     $ Activity                 : num  1.25 0 0.938 6.562 0 ...
     $ Diastolic_PTT            : num  256 240 253 0 241 ...
     $ Diastolic                : num  73.2 78.6 74 0 78.4 ...
     $ Heart_Rate_Curve         : num  81.2 69.7 77.6 95 83.6 ...
     $ Heart_Rate_Variability_HF: num  131 250 135 144 141 ...
     $ Heart_Rate_Variability_LF: num  311 218 203 301 244 ...
     $ MAP                      : num  86 93.5 86.9 0 91.7 ...
     $ Position                 : num  0 0 0 1 0 0 0 0 0 0 ...
     $ PTT_Raw                  : num  308 288 308 0 295 ...
     $ RR_Interval              : num  734 878 773 632 714 ...
     $ Sleep_Wake               : num  1 1 1 1 1 0 1 1 0 0 ...
     $ SpO2                     : num  0 0 99 0 98.4 ...
     $ Sympatho_Vagal_Balance   : num  23 8.17 14.5 20.4 16.88 ...
     $ Systolic_PTT             : num  308 288 307 0 295 ...
     $ Systolic                 : num  113 124 113 0 119 ...
     $ Autonomic_arousals       : num  0 0 0 0 0 0 0 0 0 0 ...
     $ Cardio_complex           : num  0 0 0 1 0 0 0 0 0 0 ...
     $ Cardio_rhythm            : num  0 0 2 0 0 0 0 0 0 0 ...
     $ Classification_Arousal   : num  0 0 0 0 0 0 0 0 0 0 ...
     $ PTT_Events               : num  1 0 2 0 0 0 0 0 0 0 ...
     $ Systolic_Events          : num  1 0 1 0 0 0 0 0 0 0 ...
     $ y                        : num  1 0 1 0 0 0 0 0 0 0 ...
    加载随机森林包
    > library(randomForest)
    进行训练  以y作为因变量,其余数据作为自变量
    > rf <- randomForest(y ~ ., data=data, ntree=100, proximity=TRUE,importance=TRUE)
    > plot(rf)
     
    重要性检测
    衡量把一个变量的取值变为随机数,随机森林预测准确性的降低程度
    > importance(rf,type=1)
                                  %IncMSE
    Acti_Profile               0.00000000
    Activity                   0.99353251
    Diastolic_PTT              0.32193611
    Diastolic                  1.99891809
    Heart_Rate_Curve           0.92001352
    Heart_Rate_Variability_HF  2.07870722
    Heart_Rate_Variability_LF -0.24957163
    MAP                        0.48142975
    Position                   1.86876751
    PTT_Raw                    1.94648914
    RR_Interval                0.60557964
    Sleep_Wake                 1.00503782
    SpO2                       0.25396165
    Sympatho_Vagal_Balance     1.42906765
    Systolic_PTT               1.27965813
    Systolic                   0.77382673
    Autonomic_arousals         0.00000000
    Cardio_complex             1.00503782
    Cardio_rhythm              1.14283152
    Classification_Arousal    -0.04383997
    PTT_Events                 4.63980680
    Systolic_Events           33.29461169
    
    输出随机森林的模型
    > print(rf)
    
    Call:
     randomForest(formula = y ~ ., data = data, ntree = 100, proximity = TRUE,      importance = TRUE) 
                   Type of random forest: regression
                         Number of trees: 100
    No. of variables tried at each split: 7
    
              Mean of squared residuals: 0.003226897     残差平方和SSE
                        % Var explained: 98.7
    
    > 
    总平方和(SST):(样本数据-样本均值)的平方和
    回归平方和(SSR):(预测数据-样本均值)的平方和
    残差平方和(SSE):(样本数据-预测数据均值)的平方和
    
    SST = SSR + SSE   
    
    
    
    
    
    基尼指数:
    
    > importance(rf,type=2)
                              IncNodePurity
    Acti_Profile                0.000000000
    Activity                    0.445181480
    Diastolic_PTT               0.452221870
    Diastolic                   0.449372186
    Heart_Rate_Curve            0.473113852
    Heart_Rate_Variability_HF   0.226815300
    Heart_Rate_Variability_LF   0.205457353
    MAP                         0.536977574
    Position                    0.307333210
    PTT_Raw                     0.656726800
    RR_Interval                 0.452738011
    Sleep_Wake                  0.014423077
    SpO2                        1.793361279
    Sympatho_Vagal_Balance      0.352759689
    Systolic_PTT                0.851951505
    Systolic                    0.823955781
    Autonomic_arousals          0.000000000
    Cardio_complex              0.008047619
    Cardio_rhythm               0.141907084
    Classification_Arousal      0.085739429
    PTT_Events                  7.468690820
    Systolic_Events            39.000163018
    
    > 
    进行预测
    prediction <- predict(rf, data[,],type="response")
    输出预测结果
    table(observed =data$y,predicted=prediction) 
    plot(prediction)
    
     
    
    
    
    
    
    支持向量机
    library(e1071)
    svmfit<-svm(y~.,data=data,kernel="linear",cost=10,scale=FALSE)
    > print(svmfit)
    
    Call:
    svm(formula = y ~ ., data = data, kernel = "linear", cost = 10, scale = FALSE)
    
    
    Parameters:
       SVM-Type:  eps-regression 
     SVM-Kernel:  linear 
           cost:  10 
          gamma:  0.04545455 
        epsilon:  0.1 
    
    
    Number of Support Vectors:  20
    > plot(svmfit,data)
     
    
    
    神经网络
    
    > concrete<-read_excel("Concrete_Data.xls")
    > str(concrete)
    Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    1030 obs. of  9 variables:
     $ Cement      : num  540 540 332 332 199 ...
     $ Slag        : num  0 0 142 142 132 ...
     $ Ash         : num  0 0 0 0 0 0 0 0 0 0 ...
     $ water       : num  162 162 228 228 192 228 228 228 228 228 ...
     $ superplastic: num  2.5 2.5 0 0 0 0 0 0 0 0 ...
     $ coarseagg   : num  1040 1055 932 932 978 ...
     $ fineagg     : num  676 676 594 594 826 ...
     $ age         : num  28 28 270 365 360 90 365 28 28 28 ...
     $ strength    : num  80 61.9 40.3 41.1 44.3 ...
    
    
    > normalize <- function(x){ return ((x-min(x))/(max(x)-min(x)))}
    > concrete_norm <- as.data.frame(lapply(concrete,normalize))
    
    
    > concrete_train <- concrete_norm[1:773,]
    > concrete_test <- concrete_norm[774:1030,]
    
    
    > library(neuralnet)
    > concrete_model <- neuralnet(strength ~ Cement+Slag+Ash+water+superplastic+coarseagg+fineagg+age,data=concrete_train)
    > plot(concrete_model)
     
    
    
    
    
    
    
    model_results <- compute(concrete_model,concrete_test[1:8])
    predicted_strength <- model_results$net.result
    > cor(predicted_strength,concrete_test$strength)
                 [,1]
    [1,] 0.7205120076
    > concrete_model2 <- neuralnet(strength ~ Cement+Slag+Ash+water+superplastic+coarseagg+fineagg+age,data=concrete_train,hidden=5)
    > plot(concrete_model2)
     
    计算误差
    > model_results2 <- compute(concrete_model2,concrete_test[1:8])
    > predicted_strength2 <- model_results2$net.result
    > cor(predicted_strength2,concrete_test$strength)
                 [,1]
    [1,] 0.6727155609
    
    > 
    
    
    
    
    
    主成分分析
    身高、体重、胸围、坐高
    > test<-data.frame(
    +     X1=c(148, 139, 160, 149, 159, 142, 153, 150, 151, 139,
    +          140, 161, 158, 140, 137, 152, 149, 145, 160, 156,
    +          151, 147, 157, 147, 157, 151, 144, 141, 139, 148),
    +     X2=c(41, 34, 49, 36, 45, 31, 43, 43, 42, 31,
    +          29, 47, 49, 33, 31, 35, 47, 35, 47, 44,
    +          42, 38, 39, 30, 48, 36, 36, 30, 32, 38),
    +     X3=c(72, 71, 77, 67, 80, 66, 76, 77, 77, 68,
    +          64, 78, 78, 67, 66, 73, 82, 70, 74, 78,
    +          73, 73, 68, 65, 80, 74, 68, 67, 68, 70),
    +     X4=c(78, 76, 86, 79, 86, 76, 83, 79, 80, 74,
    +          74, 84, 83, 77, 73, 79, 79, 77, 87, 85,
    +          82, 78, 80, 75, 88, 80, 76, 76, 73, 78)
    + )
    > test.pr<-princomp(test,cor=TRUE)
    > summary(test.pr,loadings=TRUE)
    Importance of components:
                                 Comp.1        Comp.2        Comp.3        Comp.4
    Standard deviation     1.8817805390 0.55980635717 0.28179594325 0.25711843909
    Proportion of Variance 0.8852744993 0.07834578938 0.01985223841 0.01652747293
    Cumulative Proportion  0.8852744993 0.96362028866 0.98347252707 1.00000000000
    
    Loadings:
       Comp.1 Comp.2 Comp.3 Comp.4
    X1  0.497  0.543 -0.450  0.506
    X2  0.515 -0.210 -0.462 -0.691
    X3  0.481 -0.725  0.175  0.461
    X4  0.507  0.368  0.744 -0.232
    
    
    前两个主成分的累计贡献率已经达到96% 可以舍去另外两个主成分 达到降维的目的
    因此可以得到函数表达式 Z1=-0.497X'1-0.515X'2-0.481X'3-0.507X'4
                                           Z2=  0.543X'1-0.210X'2-0.725X'3-0.368X'4
    4.画主成分的碎石图并预测
     
    > screeplot(test.pr,type="lines")
    > p<-predict(test.pr)
    > p
                  Comp.1         Comp.2         Comp.3          Comp.4
     [1,] -0.06990949737 -0.23813701272 -0.35509247634 -0.266120139417
     [2,] -1.59526339772 -0.71847399061  0.32813232022 -0.118056645885
     [3,]  2.84793151061  0.38956678680 -0.09731731272 -0.279482487139
     [4,] -0.75996988424  0.80604334819 -0.04945721875 -0.162949297761
     [5,]  2.73966776853  0.01718087263  0.36012614873  0.358653043787
     [6,] -2.10583167924  0.32284393414  0.18600422367 -0.036456083707
     [7,]  1.42105591247 -0.06053164925  0.21093320662 -0.044223092351
     [8,]  0.82583976981 -0.78102575640 -0.27557797533  0.057288571933
     [9,]  0.93464401954 -0.58469241699 -0.08814135786  0.181037745585
    [10,] -2.36463819933 -0.36532199291  0.08840476284  0.045520127461
    [11,] -2.83741916086  0.34875841111  0.03310422938 -0.031146930047
    [12,]  2.60851223537  0.21278727930 -0.33398036623  0.210157574387
    [13,]  2.44253342081 -0.16769495893 -0.46918095412 -0.162987829937
    [14,] -1.86630668724  0.05021383642  0.37720280364 -0.358821916178
    [15,] -2.81347420580 -0.31790107093 -0.03291329149 -0.222035112399
    [16,] -0.06392982655  0.20718447599  0.04334339948  0.703533623798
    [17,]  1.55561022242 -1.70439673831 -0.33126406220  0.007551878960
    [18,] -1.07392250663 -0.06763418320  0.02283648409  0.048606680158
    [19,]  2.52174211878  0.97274300950  0.12164633439 -0.390667990681
    [20,]  2.14072377494  0.02217881219  0.37410972458  0.129548959692
    [21,]  0.79624421805  0.16307887263  0.12781269571 -0.294140762463
    [22,] -0.28708320594 -0.35744666106 -0.03962115883  0.080991988802
    [23,]  0.25151075072  1.25555187663 -0.55617324819  0.109068938725
    [24,] -2.05706031616  0.78894493512 -0.26552109297  0.388088642937
    [25,]  3.08596854773 -0.05775318018  0.62110421208 -0.218939612456
    [26,]  0.16367554630  0.04317931667  0.24481850312  0.560248997030
    [27,] -1.37265052598  0.02220972121 -0.23378320040 -0.257399715466
    [28,] -2.16097778154  0.13733232981  0.35589738735  0.093123683044
    [29,] -2.40434826507 -0.48613137190 -0.16154440788 -0.007914021222
    [30,] -0.50287467640  0.14734316507 -0.20590831261 -0.122078819188
    
    > 

    加载数据

    > w<-read.table("test.prn",header = T)

    > w

      X.. X...1

    1   A     2

    2   B     3

    3   C     5

    4   D     5

    > library(readxl)

    > dat<-read_excel("test.xlsx")

    > dat

    # A tibble: 4 x 2

      `商品` `价格`

       <chr>  <dbl>

    1      A      2

    2      B      3

    3      C      5

    4      D      5

    > bank=read.table("bank-full.csv",header = TRUE,sep=",")

    查看数据结构

    > str(bank)

    'data.frame':  41188 obs. of  21 variables:

     $ age           : int  56 57 37 40 56 45 59 41 24 25 ...

     $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...

     $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...

     $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...

     $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...

     $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...

     $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...

     $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...

     $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...

     $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...

     $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...

     $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...

     $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...

     $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...

     $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...

     $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...

     $ cons.price.idx: num  94 94 94 94 94 ...

     $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...

     $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...

     $ nr.employed   : num  5191 5191 5191 5191 5191 ...

     $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

    查看数据的最小值,最大值,中位数,平均数,分位数

    > summary(bank)

          age                 job            marital    

     Min.   :17.00   admin.     :10422   divorced: 4612 

     1st Qu.:32.00   blue-collar: 9254   married :24928 

     Median :38.00   technician : 6743   single  :11568 

     Mean   :40.02   services   : 3969   unknown :   80 

     3rd Qu.:47.00   management : 2924                  

     Max.   :98.00   retired    : 1720                  

                     (Other)    : 6156                  

                   education        default         housing    

     university.degree  :12168   no     :32588   no     :18622 

     high.school        : 9515   unknown: 8597   unknown:  990 

     basic.9y           : 6045   yes    :    3   yes    :21576 

     professional.course: 5243                                 

     basic.4y           : 4176                                 

     basic.6y           : 2292                                  

     (Other)            : 1749                                 

          loan            contact          month       day_of_week

     no     :33950   cellular :26144   may    :13769   fri:7827  

     unknown:  990   telephone:15044   jul    : 7174   mon:8514   

     yes    : 6248                     aug    : 6178   thu:8623  

                                       jun    : 5318   tue:8090  

                                       nov    : 4101   wed:8134  

                                       apr    : 2632             

                                       (Other): 2016             

        duration         campaign          pdays     

     Min.   :   0.0   Min.   : 1.000   Min.   :  0.0 

     1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0 

     Median : 180.0   Median : 2.000   Median :999.0 

     Mean   : 258.3   Mean   : 2.568   Mean   :962.5 

     3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0 

     Max.   :4918.0   Max.   :56.000   Max.   :999.0 

                                                     

        previous            poutcome      emp.var.rate    

     Min.   :0.000   failure    : 4252   Min.   :-3.40000 

     1st Qu.:0.000   nonexistent:35563   1st Qu.:-1.80000 

     Median :0.000   success    : 1373   Median : 1.10000 

     Mean   :0.173                       Mean   : 0.08189 

     3rd Qu.:0.000                       3rd Qu.: 1.40000 

     Max.   :7.000                       Max.   : 1.40000 

                                                          

     cons.price.idx  cons.conf.idx     euribor3m   

     Min.   :92.20   Min.   :-50.8   Min.   :0.634 

     1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344 

     Median :93.75   Median :-41.8   Median :4.857 

     Mean   :93.58   Mean   :-40.5   Mean   :3.621 

     3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961 

     Max.   :94.77   Max.   :-26.9   Max.   :5.045 

                                                   

      nr.employed     y       

     Min.   :4964   no :36548 

     1st Qu.:5099   yes: 4640 

     Median :5191             

     Mean   :5167             

     3rd Qu.:5228             

     Max.   :5228             

                              

    > psych::describe(bank)

                   方差  个数    平均值  标准差  均值    去掉最大   中位数   最小值  最大值  极差    偏差        峰度

                                                               绝对偏差

                                                      最小值

                                                      之后

                                                      的平均数

     

                   vars     n    mean     sd  median trimmed   mad     min     max   range  skew    kurtosis

    age               1 41188   40.02  10.42   38.00   39.30  10.38   17.00   98.00   81.00  0.78     0.79

    job*              2 41188    4.72   3.59    3.00    4.48   2.97    1.00   12.00   11.00  0.45    -1.39

    marital*          3 41188    2.17   0.61    2.00    2.21   0.00    1.00    4.00    3.00 -0.06    -0.34

    education*        4 41188    4.75   2.14    4.00    4.88   2.97    1.00    8.00    7.00 -0.24    -1.21

    default*          5 41188    1.21   0.41    1.00    1.14   0.00    1.00    3.00    2.00  1.44     0.07

    housing*          6 41188    2.07   0.99    3.00    2.09   0.00    1.00    3.00    2.00 -0.14    -1.95

    loan*             7 41188    1.33   0.72    1.00    1.16   0.00    1.00    3.00    2.00  1.82     1.38

    contact*          8 41188    1.37   0.48    1.00    1.33   0.00    1.00    2.00    1.00  0.56    -1.69

    month*            9 41188    5.23   2.32    5.00    5.31   2.97    1.00   10.00    9.00 -0.31    -1.03

    day_of_week*     10 41188    3.00   1.40    3.00    3.01   1.48    1.00    5.00    4.00  0.01    -1.27

    duration         11 41188  258.29 259.28  180.00  210.61 139.36    0.00 4918.00 4918.00  3.26    20.24

    campaign         12 41188    2.57   2.77    2.00    1.99   1.48    1.00   56.00   55.00  4.76    36.97

    pdays            13 41188  962.48 186.91  999.00  999.00   0.00    0.00  999.00  999.00 -4.92    22.23

    previous         14 41188    0.17   0.49    0.00    0.05   0.00    0.00    7.00    7.00  3.83    20.11

    poutcome*        15 41188    1.93   0.36    2.00    2.00   0.00    1.00    3.00    2.00 -0.88     3.98

    emp.var.rate     16 41188    0.08   1.57    1.10    0.27   0.44   -3.40    1.40    4.80 -0.72    -1.06

    cons.price.idx   17 41188   93.58   0.58   93.75   93.58   0.56   92.20   94.77    2.57 -0.23    -0.83

    cons.conf.idx    18 41188  -40.50   4.63  -41.80  -40.60   6.52  -50.80  -26.90   23.90  0.30    -0.36

    euribor3m        19 41188    3.62   1.73    4.86    3.81   0.16    0.63    5.04    4.41 -0.71    -1.41

    nr.employed      20 41188 5167.04  72.25 5191.00 5178.43  55.00 4963.60 5228.10  264.50 -1.04     0.00

    y*               21 41188    1.11   0.32    1.00    1.02   0.00    1.00    2.00    1.00  2.45     4.00

     

                   se

    age            0.05

    job*           0.02

    marital*       0.00

    education*     0.01

    default*       0.00

    housing*       0.00

    loan*          0.00

    contact*       0.00

    month*         0.01

    day_of_week*   0.01

    duration       1.28

    campaign       0.01

    pdays          0.92

    previous       0.00

    poutcome*      0.00

    emp.var.rate   0.01

    cons.price.idx 0.00

    cons.conf.idx  0.02

    euribor3m      0.01

    nr.employed    0.36

    y*             0.00

     

    查看数据是否有缺失值

    > sapply(bank,anyNA)

               age            job        marital      education

             FALSE          FALSE          FALSE          FALSE

           default        housing           loan        contact

             FALSE          FALSE          FALSE          FALSE

             month    day_of_week       duration       campaign

             FALSE          FALSE          FALSE          FALSE

             pdays       previous       poutcome   emp.var.rate

             FALSE          FALSE          FALSE          FALSE

    cons.price.idx  cons.conf.idx      euribor3m    nr.employed

             FALSE          FALSE          FALSE          FALSE

                 y

             FALSE

     

    成功与不成功的个数

    > table(bank$y)

     

       no   yes

    36548  4640

     

    在是否结婚这个属性的取值与

    是否成功的数量比较

    > table(bank$y,bank$marital)

        

          divorced married single unknown

      no      4136   22396   9948      68

      yes      476    2532   1620      12

     

    > xtabs(~y+marital,data=bank)

         marital

    y     divorced married single unknown

      no      4136   22396   9948      68

      yes      476    2532   1620      12

    > tab=table(bank$y,bank$marital)

    > tab

        

          divorced married single unknown

      no      4136   22396   9948      68

      yes      476    2532   1620      12

     

    在是否结婚这个属性上的取值

    > margin.table(tab,2)

     

    divorced  married   single  unknown

        4612    24928    11568       80

    > margin.table(tab,1)

     

       no   yes

    36548  4640

     

    在是否结婚这个属性上横向看概率

    > prop.table(tab,1)

        

             divorced     married      single     unknown

      no  0.113166247 0.612783189 0.272189997 0.001860567

      yes 0.102586207 0.545689655 0.349137931 0.002586207

    在是否结婚这个属性上纵向看概率

     

    > prop.table(tab,2)

        

           divorced   married    single   unknown

      no  0.8967910 0.8984275 0.8599585 0.8500000

      yes 0.1032090 0.1015725 0.1400415 0.1500000

     

     

    平的列联表

    以第一列和第二列,展开分类group by 1,2

    col.vars 的取值进行次数统计

    > ftable(bank[,c(3,4,21)],row.vars = 1:2,col.vars = "y")

                                 y   no  yes

    marital  education                     

    divorced basic.4y               406   83

             basic.6y               169   13

             basic.9y               534   31

             high.school           1086  107

             illiterate               1    1

             professional.course    596   61

             university.degree     1177  160

             unknown                167   20

    married  basic.4y              2915  313

             basic.6y              1628  139

             basic.9y              3858  298

             high.school           4683  475

             illiterate              12    3

             professional.course   2799  357

             university.degree     5573  821

             unknown                928  126

    single   basic.4y               422   31

             basic.6y               301   36

             basic.9y              1174  142

             high.school           2702  448

             illiterate               1    0

             professional.course   1247  177

             university.degree     3723  683

             unknown                378  103

    unknown  basic.4y                 5    1

             basic.6y                 6    0

             basic.9y                 6    2

             high.school             13    1

             illiterate               0    0

             professional.course      6    0

             university.degree       25    6

             unknown                  7    2

     

    卡方检验,在p值小于2.2e-16时,拒绝原假设,认为数据不服从卡方分布

    > chisq.test(tab)

     

            Pearson's Chi-squared test

     

    data:  tab

    X-squared = 122.66, df = 3, p-value < 2.2e-16

     

    画直方图

    > hist(bank$age)

    > library(lattice)

     

    画连续变量的分布,就是把直方图的中位数连接起来

    以年龄为横轴,y为纵轴,数据是bank,画图,auto.key是否有图例

    > densityplot(~age,groups = y,data=bank,plot.point=FALSE,auto.key = TRUE)

     

    Box

    > boxplot(age~y,data=bank)

     

    双样本t分布检验,p值小于0.05时拒绝原假设

    这里的原假设是两个样本没有相关性

    得到的结果是p值为1.805e-06,拒绝两个样本没有相关性的假设

    这里认为两个样本有相关性

    > t.test(age~y,data=bank,alternative="two.sided",var.equal=FALSE)

     

            Welch Two Sample t-test

     

    data:  age by y

    t = -4.7795, df = 5258.5, p-value = 1.805e-06

    alternative hypothesis: true difference in means is not equal to 0

    95 percent confidence interval:

     -1.4129336 -0.5909889

    sample estimates:

     mean in group no mean in group yes

             39.91119          40.91315

     

     

    数据可视化

    画饼图

    > tab=table(bank$marital)

    > pie(tab)

     

    画直方图

    > tab=table(bank$marital)

    > barplot(tab)

     

    画下面这个图

    > tab=table(bank$marital,bank$y)

    > plot(tab)

     

     

    画层叠直方图

    > tab=table(bank$marital,bank$y)

    > lattice::barchart(tab,auto.key=TRUE)

     

     

    加载这个包,准备画图

    > library(dplyr)

    > data=group_by(bank,marital,y)

    > data=tally(data)

    !!!!!!!!!!!!!

    > ggplot2::ggplot(data=data,mapping=aes(marital,n))+geom_bar(mapping=aes(fill=y),position="dodge",stat="identity")
     
     
     
    数据预处理
    分组之后再画图

    > labels=c('青年','中年','老年')

    > bank$age_group=cut(bank$age,breaks = c(0,35,55,100),right = FALSE,labels = labels)

    > library(ggplot2)

    > ggplot(data=bank,mapping = aes(age_group))+geom_bar(mapping = aes(fill=y),position="dodge",stat="count")

     

     

     

     

     

     

    衍生变量
    直接使用$符向原数据框添加新的变量

    > bank$log.cons.price.idx=log(bank$cons.price.idx)

    使用transform函数向原数据框添加变量

    > bank<-transform(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))

    使用dplyr包里的mutate函数增加变量

    > bank<-dplyr::mutate(bank,log.cons.price.idx=log(cons.price.idx))

    使用dplyr包里的transmute函数只保留新生成的变量

    > bank2<-dplyr::transmute(bank,log.cons.price.idx=log(cons.price.idx),log.nr.employed=log(nr.employed))

     

    中心化

     

    > v=1:10

    > v1=v-mean(v)

    > v2=scale(v,center=TRUE,scale = FALSE)

     

    无量纲化

     

    > V1=v/sqrt(sum(v^2)/(length(v)-1))

    > v2=scale(v,center=FALSE,scale=TRUE)

     

    根据最大最小值进行归一化

     

    > v3=(v-min(v))/(max(v)-min(v))

     

     

    进行标准正态化

     

     

    > v1=(v-mean(v))/sd(v)

    > v2=scale(v,center = TRUE,scale=TRUE)

     

     

     

     

    Box-Cox变换

    使用car包里的boxCox函数

    > install.packages("car")

    > library(car)

    > boxCox(age~.,data=bank)

     

     

     

     

     

     

    使用caret包,做Box-Cox变换

    > install.packages("caret")

    > library(caret)

    > dat<-subset(bank,select="age")

    > trans<-preProcess(dat,method=C("BoxCox"))

     

     

    数据预处理下

    违反常识的异常值

    基于数据分布的异常值(离群点)识别

    bank.dirty=read.csv("bank-dirty.csv")
    summary(bank.dirty)

     

         age                  job            marital                    education    
     Min.   : 17.00   admin.     :10422   divorced: 4612   university.degree  :12165  
     1st Qu.: 32.00   blue-collar: 9254   married :24928   high.school        : 9515  
     Median : 38.00   technician : 6743   single  :11568   basic.9y           : 6043  
     Mean   : 40.03   services   : 3969   NA's    :   80   professional.course: 5242  
     3rd Qu.: 47.00   management : 2924                    basic.4y           : 4175  
     Max.   :123.00   (Other)    : 7546                    (Other)            : 2310  
     NA's   :2        NA's       :  330                    NA's               : 1738  
     default      housing        loan            contact          month      
     no  :32588   no  :18622   no  :33950   cellular :26144   may    :13769  
     yes :    3   yes :21576   yes : 6248   telephone:15044   jul    : 7174  
     NA's: 8597   NA's:  990   NA's:  990                     aug    : 6178  
                                                              jun    : 5318  
                                                              nov    : 4101  
                                                              apr    : 2632  
                                                              (Other): 2016  
     day_of_week    duration         campaign          pdays          previous    
     fri:7827    Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   Min.   :0.000  
     mon:8514    1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000  
     thu:8623    Median : 180.0   Median : 2.000   Median :999.0   Median :0.000  
     tue:8090    Mean   : 258.3   Mean   : 2.568   Mean   :962.5   Mean   :0.173  
     wed:8134    3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000  
                 Max.   :4918.0   Max.   :56.000   Max.   :999.0   Max.   :7.000  
                                                                                  
            poutcome      emp.var.rate      cons.price.idx  cons.conf.idx  
     failure    : 4252   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
     nonexistent:35563   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7  
     success    : 1373   Median : 1.10000   Median :93.75   Median :-41.8  
                         Mean   : 0.08189   Mean   :93.58   Mean   :-40.5  
                         3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4  
                         Max.   : 1.40000   Max.   :94.77   Max.   :-26.9  
                                                                           
       euribor3m      nr.employed     y        
     Min.   :0.634   Min.   :4964   no :36548  
     1st Qu.:1.344   1st Qu.:5099   yes: 4640  
     Median :4.857   Median :5191              
     Mean   :3.621   Mean   :5167              
     3rd Qu.:4.961   3rd Qu.:5228              
     Max.   :5.045   Max.   :5228              
     
     

    常识告诉我们,虽然123岁的老人存在,但概率也极低,也不太可能是银行的客户

    找出在年龄这一列的上离群值和下离群值

     

    > head(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)

          age

    39494 123

    38453  98

    38456  98

    27827  95

    38922  94

    > tail(bank.dirty[order(bank.dirty$age,decreasing = TRUE),'age',drop=FALSE],n=5)

          age

    37559  17

    37580  17

    38275  17

    120    NA

    156    NA

     

    异常值的处理

    当作缺失值处理
    > bank.dirty$age[which(bank.dirty$age>98)]<-NA

    删除或者插补

     

     

    重编码

    职业类型有12个分类,不利于后续分析,把除了unknown以外的分类进行重新编码,简化成4

    Month12个分类,把它转化成季度

    Education的分类,除了unknow之外有7

     

    进行重编码

    levels(bank.dirty$job) <- c( "management","services","entrepreneur","entrepreneur",
                           "management","unemployed",  "entrepreneur","services",
                           "unemployed","services","unemployed","unknown" )
    > levels(bank.dirty$month) <- c("Q2","Q3","Q4","Q3","Q2",
                            "Q1","Q2","Q4","Q4","Q3")
    > 
    > levels(bank.dirty$education) <- c( "primary","primary","primary","secondary",
                                 "primary","tertiary","tertiary","unknown")
     
     

    缺失值

    分类较多,分类是unknown,不能给我们提供信息

    有些模型不能处理缺失值,比如Logistic回归

    缺失值插补的方法

    1、  用中位数或众数插补

    > library(imputeMissings)
    > bank.clean<-impute(bank.dirty,object = compute(bank.dirty,method = "median/mode"))

    2、  最邻近(knn)插补

    library(DMwR)
    bank.clean=knnImputation(bank.dirty,k=5)

     

    3、  随机森林插补

    library(missForest)

     Imp = missForest(bank.dirty)

     bank.clean = Imp$ximp

     

    缺失值插补的R

    1、  imputeMissings

    2、  DMwR

     

     

     

     

     

     

    Logistic回归建立客户响应模型

    1、广义线性模型

    广义线性模型擅长于处理因变量不是连续变量的问题

    1)  Y是分类变量

    2)  Y是定序变量

    3)  Y是离散取值

    2、Y取值是0-1二分类变量是,就是Logistic回归

     

    Logistic回归在R中的实现

    数据重编码

    bank$y=ifelse(bank$y=='yes',1,0)

    改成以Q1为参考因子

    bank$month<-relevel(bank$month,ref="Q1")

    构建Logistic回归模型

    > model<-glm(y~.,data=bank,family = 'binomial')
    > summary(model)
     
    Call:
    glm(formula = y ~ ., family = "binomial", data = bank)
     
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -5.9958  -0.3082  -0.1887  -0.1333   3.4283  
     
    Coefficients: (1 not defined because of singularities)
                                   Estimate Std. Error z value Pr(>|z|)    
    (Intercept)                  -1.957e+02  1.935e+01 -10.116  < 2e-16 ***
    age                           1.851e-03  2.415e-03   0.767 0.443289    
    jobblue-collar               -2.659e-01  7.942e-02  -3.348 0.000814 ***
    jobentrepreneur              -2.029e-01  1.248e-01  -1.626 0.103924    
    jobhousemaid                 -3.628e-02  1.475e-01  -0.246 0.805705    
    jobmanagement                -8.054e-02  8.501e-02  -0.947 0.343423    
    jobretired                    2.928e-01  1.067e-01   2.743 0.006092 ** 
    jobself-employed             -1.680e-01  1.176e-01  -1.428 0.153332    
    jobservices                  -1.497e-01  8.552e-02  -1.751 0.079969 .  
    jobstudent                    2.674e-01  1.106e-01   2.416 0.015680 *  
    jobtechnician                 3.462e-03  7.096e-02   0.049 0.961086    
    jobunemployed                 8.514e-03  1.273e-01   0.067 0.946686    
    jobunknown                   -8.046e-02  2.390e-01  -0.337 0.736420    
    maritalmarried                1.567e-02  6.824e-02   0.230 0.818420    
    maritalsingle                 6.620e-02  7.791e-02   0.850 0.395473    
    maritalunknown                6.303e-02  4.113e-01   0.153 0.878211    
    educationbasic.6y             9.647e-02  1.202e-01   0.803 0.422195    
    educationbasic.9y            -2.154e-02  9.494e-02  -0.227 0.820557    
    educationhigh.school          3.381e-02  9.188e-02   0.368 0.712895    
    educationilliterate           1.132e+00  7.395e-01   1.531 0.125887    
    educationprofessional.course  1.136e-01  1.013e-01   1.121 0.262175    
    educationuniversity.degree    2.134e-01  9.188e-02   2.322 0.020211 *  
    educationunknown              1.361e-01  1.196e-01   1.138 0.255314    
    defaultunknown               -3.055e-01  6.712e-02  -4.552 5.32e-06 ***
    defaultyes                   -7.150e+00  1.135e+02  -0.063 0.949784    
    housingunknown               -7.385e-02  1.390e-01  -0.531 0.595260    
    housingyes                   -3.740e-03  4.121e-02  -0.091 0.927695    
    loanunknown                          NA         NA      NA       NA    
    loanyes                      -6.362e-02  5.725e-02  -1.111 0.266454    
    contacttelephone             -6.068e-01  7.124e-02  -8.518  < 2e-16 ***
    monthQ2                      -2.192e+00  1.125e-01 -19.479  < 2e-16 ***
    monthQ3                      -1.463e+00  1.148e-01 -12.747  < 2e-16 ***
    monthQ4                      -1.995e+00  1.240e-01 -16.088  < 2e-16 ***
    day_of_weekmon               -1.216e-01  6.588e-02  -1.846 0.064887 .  
    day_of_weekthu                6.375e-02  6.382e-02   0.999 0.317842    
    day_of_weektue                6.867e-02  6.545e-02   1.049 0.294118    
    day_of_weekwed                1.436e-01  6.530e-02   2.199 0.027911 *  
    duration                      4.667e-03  7.397e-05  63.092  < 2e-16 ***
    campaign                     -4.543e-02  1.158e-02  -3.922 8.77e-05 ***
    pdays                        -9.627e-04  2.162e-04  -4.452 8.50e-06 ***
    previous                     -5.806e-02  5.879e-02  -0.988 0.323369    
    poutcomenonexistent           4.507e-01  9.372e-02   4.809 1.51e-06 ***
    poutcomesuccess               9.371e-01  2.106e-01   4.451 8.56e-06 ***
    emp.var.rate                 -1.389e+00  7.693e-02 -18.057  < 2e-16 ***
    cons.price.idx                1.815e+00  1.193e-01  15.218  < 2e-16 ***
    cons.conf.idx                 3.353e-02  6.664e-03   5.033 4.84e-07 ***
    euribor3m                     6.054e-02  1.126e-01   0.537 0.590987    
    nr.employed                   4.937e-03  1.873e-03   2.635 0.008413 ** 
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
     
    (Dispersion parameter for binomial family taken to be 1)
     
        Null deviance: 28999  on 41187  degrees of freedom
    Residual deviance: 17199  on 41141  degrees of freedom
    AIC: 17293
     
    Number of Fisher Scoring iterations: 10

     

     

    > exp(coef(model))
                     (Intercept)                          age               jobblue-collar 
                    9.856544e-86                 1.001853e+00                 7.665077e-01 
                 jobentrepreneur                 jobhousemaid                jobmanagement 
                    8.163314e-01                 9.643733e-01                 9.226187e-01 
                      jobretired             jobself-employed                  jobservices 
                    1.340142e+00                 8.453874e-01                 8.609387e-01 
                      jobstudent                jobtechnician                jobunemployed 
                    1.306514e+00                 1.003468e+00                 1.008550e+00 
                      jobunknown               maritalmarried                maritalsingle 
                    9.226922e-01                 1.015789e+00                 1.068445e+00 
                  maritalunknown            educationbasic.6y            educationbasic.9y 
                    1.065061e+00                 1.101276e+00                 9.786948e-01 
            educationhigh.school          educationilliterate educationprofessional.course 
                    1.034388e+00                 3.101297e+00                 1.120248e+00 
      educationuniversity.degree             educationunknown               defaultunknown 
                    1.237856e+00                 1.145744e+00                 7.367445e-01 
                      defaultyes               housingunknown                   housingyes 
                    7.851906e-04                 9.288126e-01                 9.962671e-01 
                     loanunknown                      loanyes             contacttelephone 
                              NA                 9.383587e-01                 5.450980e-01 
                         monthQ2                      monthQ3                      monthQ4 
                    1.116739e-01                 2.314802e-01                 1.360620e-01 
                  day_of_weekmon               day_of_weekthu               day_of_weektue 
                    8.854888e-01                 1.065828e+00                 1.071082e+00 
                  day_of_weekwed                     duration                     campaign 
                    1.154380e+00                 1.004678e+00                 9.555850e-01 
                           pdays                     previous          poutcomenonexistent 
                    9.990378e-01                 9.435960e-01                 1.569466e+00 
                 poutcomesuccess                 emp.var.rate               cons.price.idx 
                    2.552531e+00                 2.493091e-01                 6.140533e+00 
                   cons.conf.idx                    euribor3m                  nr.employed 
                    1.034103e+00                 1.062408e+00                 1.004949e+00 

     

     

    Job变量的基准水平是management,从上面的结果看,服务业和自主劳动者购买银行产品的几率(odds)是管理岗从业人员的0.88倍,未就业人员购买银行产品的几率是管理岗人员的1.25

     

     

    > summary(model.step)
    向前逐步回归
    > model.step=step(model,direction = "backward")
    向后逐步回归
    > model.step = step(model, direction = "forward")
    双向逐步回归
    > model.step = step(model, direction = "both")

    > summary(model.step)

     

    Call:

    glm(formula = y ~ job + education + default + contact + month +

        day_of_week + duration + campaign + pdays + poutcome + emp.var.rate +

        cons.price.idx + cons.conf.idx + nr.employed, family = "binomial",

        data = bank)

     

    Deviance Residuals:

        Min       1Q   Median       3Q      Max 

    -5.9884  -0.3088  -0.1887  -0.1332   3.4026 

     

    Coefficients:

                                   Estimate Std. Error z value Pr(>|z|)   

    (Intercept)                  -2.031e+02  1.426e+01 -14.246  < 2e-16 ***

    jobblue-collar               -2.700e-01  7.917e-02  -3.411 0.000648 ***

    jobentrepreneur              -2.043e-01  1.242e-01  -1.645 0.100003   

    jobhousemaid                 -2.832e-02  1.464e-01  -0.193 0.846590   

    jobmanagement                -8.368e-02  8.409e-02  -0.995 0.319670   

    jobretired                    3.234e-01  9.130e-02   3.542 0.000397 ***

    jobself-employed             -1.670e-01  1.176e-01  -1.421 0.155435   

    jobservices                  -1.528e-01  8.545e-02  -1.789 0.073666 . 

    jobstudent                    2.682e-01  1.046e-01   2.565 0.010316 * 

    jobtechnician                 4.389e-03  7.093e-02   0.062 0.950665   

    jobunemployed                 8.975e-03  1.271e-01   0.071 0.943715   

    jobunknown                   -6.363e-02  2.378e-01  -0.268 0.789057   

    educationbasic.6y             8.993e-02  1.196e-01   0.752 0.452024   

    educationbasic.9y            -2.716e-02  9.416e-02  -0.288 0.772992   

    educationhigh.school          2.890e-02  9.053e-02   0.319 0.749573   

    educationilliterate           1.118e+00  7.398e-01   1.511 0.130744   

    educationprofessional.course  1.084e-01  1.004e-01   1.079 0.280686   

    educationuniversity.degree    2.103e-01  9.017e-02   2.332 0.019678 * 

    educationunknown              1.363e-01  1.195e-01   1.140 0.254110   

    defaultunknown               -3.017e-01  6.666e-02  -4.526 6.02e-06 ***

    defaultyes                   -7.141e+00  1.135e+02  -0.063 0.949831   

    contacttelephone             -6.011e-01  7.069e-02  -8.504  < 2e-16 ***

    monthQ2                      -2.210e+00  1.108e-01 -19.939  < 2e-16 ***

    monthQ3                      -1.475e+00  1.146e-01 -12.869  < 2e-16 ***

    monthQ4                      -1.982e+00  1.183e-01 -16.755  < 2e-16 ***

    day_of_weekmon               -1.210e-01  6.584e-02  -1.837 0.066174 . 

    day_of_weekthu                6.208e-02  6.374e-02   0.974 0.330066   

    day_of_weektue                6.851e-02  6.538e-02   1.048 0.294651    

    day_of_weekwed                1.420e-01  6.525e-02   2.176 0.029592 * 

    duration                      4.667e-03  7.396e-05  63.099  < 2e-16 ***

    campaign                     -4.587e-02  1.158e-02  -3.960 7.49e-05 ***

    pdays                        -8.822e-04  2.024e-04  -4.358 1.31e-05 ***

    poutcomenonexistent           5.219e-01  6.356e-02   8.211  < 2e-16 ***

    poutcomesuccess               9.996e-01  2.028e-01   4.928 8.31e-07 ***

    emp.var.rate                 -1.376e+00  6.885e-02 -19.980  < 2e-16 ***

    cons.price.idx                1.845e+00  1.041e-01  17.725  < 2e-16 ***

    cons.conf.idx                 3.622e-02  4.853e-03   7.464 8.42e-14 ***

    nr.employed                   5.883e-03  9.765e-04   6.024 1.70e-09 ***

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

     

    (Dispersion parameter for binomial family taken to be 1)

     

        Null deviance: 28999  on 41187  degrees of freedom

    Residual deviance: 17203  on 41150  degrees of freedom

    AIC: 17279

     

    Number of Fisher Scoring iterations: 10

     
     

     

     

    模型预测

    predict函数,参数type=’response’

    Newdata参数是要预测的数据集

     

    > prob<-predict(model.step,type = 'response')
    > head(prob)
              1           2           3           4           5           6 
    0.015029328 0.006044212 0.011640349 0.010173952 0.016897254 0.007174804 

     

    假设以0.5为临界值

    > pre<-ifelse(prob>0.5,1,0)

    > table(pre,bank$y)

      

    pre     0     1

      0 35596  2667

      1   952  1973

     

    >

    预测的准确率

    > (35592+1964)/(35592+2676+956+1964)

    [1] 0.911819

     
     

    实际有响应的客户被识别出了多少

    > 1964/(1964+2676)
    [1] 0.4232759

     

     

    模型评估

     

    > confusionMatrix(bank$y,pre,pos='1')
    Confusion Matrix and Statistics
     
              Reference
    Prediction     0     1
             0 35596   952
             1  2667  1973
                                              
                   Accuracy : 0.9121          
                     95% CI : (0.9094, 0.9149)
        No Information Rate : 0.929           
        P-Value [Acc > NIR] : 1               
                                              
                      Kappa : 0.476           
     Mcnemar's Test P-Value : <2e-16          
                                              
                Sensitivity : 0.67453         
                Specificity : 0.93030         
             Pos Pred Value : 0.42522         
             Neg Pred Value : 0.97395         
                 Prevalence : 0.07102         
             Detection Rate : 0.04790         
       Detection Prevalence : 0.11265         
          Balanced Accuracy : 0.80241         
                                              
           'Positive' Class : 1               
                                        

    Kappa 统计量(kappa statistic)

    用于评判分类器的分类结果与随机分类的差异度

    Kappa统计量评价:

        较差:小于0.20

        一般:0.200.40

        稳健:0.400.60

        好的:0.600.80

    很好的:0.801.00

     

     

    ROC曲线

    pred<-prediction(prob,bank$y)
    perf<-performance(pred,measure = "tpr",x="fpr")
    plot(perf)
     
     
     
     
     
     
     
     
     
     
     
     
    RandomForest
    加载数据列
     

    > data=read.table("input.txt",header = TRUE)

    > str(data)

    'data.frame':  222 obs. of  23 variables:

     $ Acti_Profile             : num  0 0 0 0 0 0 0 0 0 0 ...

     $ Activity                 : num  1.25 0 0.938 6.562 0 ...

     $ Diastolic_PTT            : num  256 240 253 0 241 ...

     $ Diastolic                : num  73.2 78.6 74 0 78.4 ...

     $ Heart_Rate_Curve         : num  81.2 69.7 77.6 95 83.6 ...

     $ Heart_Rate_Variability_HF: num  131 250 135 144 141 ...

     $ Heart_Rate_Variability_LF: num  311 218 203 301 244 ...

     $ MAP                      : num  86 93.5 86.9 0 91.7 ...

     $ Position                 : num  0 0 0 1 0 0 0 0 0 0 ...

     $ PTT_Raw                  : num  308 288 308 0 295 ...

     $ RR_Interval              : num  734 878 773 632 714 ...

     $ Sleep_Wake               : num  1 1 1 1 1 0 1 1 0 0 ...

     $ SpO2                     : num  0 0 99 0 98.4 ...

     $ Sympatho_Vagal_Balance   : num  23 8.17 14.5 20.4 16.88 ...

     $ Systolic_PTT             : num  308 288 307 0 295 ...

     $ Systolic                 : num  113 124 113 0 119 ...

     $ Autonomic_arousals       : num  0 0 0 0 0 0 0 0 0 0 ...

     $ Cardio_complex           : num  0 0 0 1 0 0 0 0 0 0 ...

     $ Cardio_rhythm            : num  0 0 2 0 0 0 0 0 0 0 ...

     $ Classification_Arousal   : num  0 0 0 0 0 0 0 0 0 0 ...

     $ PTT_Events               : num  1 0 2 0 0 0 0 0 0 0 ...

     $ Systolic_Events          : num  1 0 1 0 0 0 0 0 0 0 ...

     $ y                        : num  1 0 1 0 0 0 0 0 0 0 ...

    加载随机森林包

    > library(randomForest)

    进行训练  y作为因变量,其余数据作为自变量

    > rf <- randomForest(y ~ ., data=data, ntree=100, proximity=TRUE,importance=TRUE)

    > plot(rf)

    重要性检测

    衡量把一个变量的取值变为随机数,随机森林预测准确性的降低程度

    > importance(rf,type=1)

                                  %IncMSE

    Acti_Profile               0.00000000

    Activity                   0.99353251

    Diastolic_PTT              0.32193611

    Diastolic                  1.99891809

    Heart_Rate_Curve           0.92001352

    Heart_Rate_Variability_HF  2.07870722

    Heart_Rate_Variability_LF -0.24957163

    MAP                        0.48142975

    Position                   1.86876751

    PTT_Raw                    1.94648914

    RR_Interval                0.60557964

    Sleep_Wake                 1.00503782

    SpO2                       0.25396165

    Sympatho_Vagal_Balance     1.42906765

    Systolic_PTT               1.27965813

    Systolic                   0.77382673

    Autonomic_arousals         0.00000000

    Cardio_complex             1.00503782

    Cardio_rhythm              1.14283152

    Classification_Arousal    -0.04383997

    PTT_Events                 4.63980680

    Systolic_Events           33.29461169

     

    输出随机森林的模型

    > print(rf)

     

    Call:

     randomForest(formula = y ~ ., data = data, ntree = 100, proximity = TRUE,      importance = TRUE)

                   Type of random forest: regression

                         Number of trees: 100

    No. of variables tried at each split: 7

     

              Mean of squared residuals: 0.003226897     残差平方和SSE

                        % Var explained: 98.7

     

    >

    总平方和(SST):(样本数据-样本均值)的平方和

    回归平方和(SSR):(预测数据-样本均值)的平方和

    残差平方和(SSE):(样本数据-预测数据均值)的平方和

     

    SST = SSR + SSE   

     

     

     

     

     

    基尼指数:

     

    > importance(rf,type=2)

                              IncNodePurity

    Acti_Profile                0.000000000

    Activity                    0.445181480

    Diastolic_PTT               0.452221870

    Diastolic                   0.449372186

    Heart_Rate_Curve            0.473113852

    Heart_Rate_Variability_HF   0.226815300

    Heart_Rate_Variability_LF   0.205457353

    MAP                         0.536977574

    Position                    0.307333210

    PTT_Raw                     0.656726800

    RR_Interval                 0.452738011

    Sleep_Wake                  0.014423077

    SpO2                        1.793361279

    Sympatho_Vagal_Balance      0.352759689

    Systolic_PTT                0.851951505

    Systolic                    0.823955781

    Autonomic_arousals          0.000000000

    Cardio_complex              0.008047619

    Cardio_rhythm               0.141907084

    Classification_Arousal      0.085739429

    PTT_Events                  7.468690820

    Systolic_Events            39.000163018

     

    >

    进行预测

    prediction <- predict(rf, data[,],type="response")

    输出预测结果

    table(observed =data$y,predicted=prediction)

    plot(prediction)

     

     

     

    支持向量机

    library(e1071)

    svmfit<-svm(y~.,data=data,kernel="linear",cost=10,scale=FALSE)

    > print(svmfit)

     

    Call:

    svm(formula = y ~ ., data = data, kernel = "linear", cost = 10, scale = FALSE)

     

     

    Parameters:

       SVM-Type:  eps-regression

     SVM-Kernel:  linear

           cost:  10

          gamma:  0.04545455

        epsilon:  0.1

     

     

    Number of Support Vectors:  20

    > plot(svmfit,data)

     

     

    神经网络

     

    > concrete<-read_excel("Concrete_Data.xls")

    > str(concrete)

    Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    1030 obs. of  9 variables:

     $ Cement      : num  540 540 332 332 199 ...

     $ Slag        : num  0 0 142 142 132 ...

     $ Ash         : num  0 0 0 0 0 0 0 0 0 0 ...

     $ water       : num  162 162 228 228 192 228 228 228 228 228 ...

     $ superplastic: num  2.5 2.5 0 0 0 0 0 0 0 0 ...

     $ coarseagg   : num  1040 1055 932 932 978 ...

     $ fineagg     : num  676 676 594 594 826 ...

     $ age         : num  28 28 270 365 360 90 365 28 28 28 ...

     $ strength    : num  80 61.9 40.3 41.1 44.3 ...

     

     

    > normalize <- function(x){ return ((x-min(x))/(max(x)-min(x)))}

    > concrete_norm <- as.data.frame(lapply(concrete,normalize))

     

     

    > concrete_train <- concrete_norm[1:773,]

    > concrete_test <- concrete_norm[774:1030,]

     

     

    > library(neuralnet)

    > concrete_model <- neuralnet(strength ~ Cement+Slag+Ash+water+superplastic+coarseagg+fineagg+age,data=concrete_train)

    > plot(concrete_model)

     

     

     

     

     

     

    model_results <- compute(concrete_model,concrete_test[1:8])

    predicted_strength <- model_results$net.result

    > cor(predicted_strength,concrete_test$strength)

                 [,1]

    [1,] 0.7205120076

    > concrete_model2 <- neuralnet(strength ~ Cement+Slag+Ash+water+superplastic+coarseagg+fineagg+age,data=concrete_train,hidden=5)

    > plot(concrete_model2)

    计算误差

    > model_results2 <- compute(concrete_model2,concrete_test[1:8])

    > predicted_strength2 <- model_results2$net.result

    > cor(predicted_strength2,concrete_test$strength)

                 [,1]

    [1,] 0.6727155609

     

    >

     

     

     

     

    主成分分析

    身高、体重、胸围、坐

    > test<-data.frame(

    +     X1=c(148, 139, 160, 149, 159, 142, 153, 150, 151, 139,

    +          140, 161, 158, 140, 137, 152, 149, 145, 160, 156,

    +          151, 147, 157, 147, 157, 151, 144, 141, 139, 148),

    +     X2=c(41, 34, 49, 36, 45, 31, 43, 43, 42, 31,

    +          29, 47, 49, 33, 31, 35, 47, 35, 47, 44,

    +          42, 38, 39, 30, 48, 36, 36, 30, 32, 38),

    +     X3=c(72, 71, 77, 67, 80, 66, 76, 77, 77, 68,

    +          64, 78, 78, 67, 66, 73, 82, 70, 74, 78,

    +          73, 73, 68, 65, 80, 74, 68, 67, 68, 70),

    +     X4=c(78, 76, 86, 79, 86, 76, 83, 79, 80, 74,

    +          74, 84, 83, 77, 73, 79, 79, 77, 87, 85,

    +          82, 78, 80, 75, 88, 80, 76, 76, 73, 78)

    + )

    > test.pr<-princomp(test,cor=TRUE)

    > summary(test.pr,loadings=TRUE)

    Importance of components:

                                 Comp.1        Comp.2        Comp.3        Comp.4

    Standard deviation     1.8817805390 0.55980635717 0.28179594325 0.25711843909

    Proportion of Variance 0.8852744993 0.07834578938 0.01985223841 0.01652747293

    Cumulative Proportion  0.8852744993 0.96362028866 0.98347252707 1.00000000000

     

    Loadings:

       Comp.1 Comp.2 Comp.3 Comp.4

    X1  0.497  0.543 -0.450  0.506

    X2  0.515 -0.210 -0.462 -0.691

    X3  0.481 -0.725  0.175  0.461

    X4  0.507  0.368  0.744 -0.232

     

     

    前两个主成分的累计贡献率已经达到96% 可以舍去另外两个主成分达到降维的目

    因此可以得到函数表达式 Z1=-0.497X'1-0.515X'2-0.481X'3-0.507X'4

                                           Z2=  0.543X'1-0.210X'2-0.725X'3-0.368X'4

    4.画主成分的碎石图并预测

    > screeplot(test.pr,type="lines")

    > p<-predict(test.pr)

    > p

                  Comp.1         Comp.2         Comp.3          Comp.4

     [1,] -0.06990949737 -0.23813701272 -0.35509247634 -0.266120139417

     [2,] -1.59526339772 -0.71847399061  0.32813232022 -0.118056645885

     [3,]  2.84793151061  0.38956678680 -0.09731731272 -0.279482487139

     [4,] -0.75996988424  0.80604334819 -0.04945721875 -0.162949297761

     [5,]  2.73966776853  0.01718087263  0.36012614873  0.358653043787

     [6,] -2.10583167924  0.32284393414  0.18600422367 -0.036456083707

     [7,]  1.42105591247 -0.06053164925  0.21093320662 -0.044223092351

     [8,]  0.82583976981 -0.78102575640 -0.27557797533  0.057288571933

     [9,]  0.93464401954 -0.58469241699 -0.08814135786  0.181037745585

    [10,] -2.36463819933 -0.36532199291  0.08840476284  0.045520127461

    [11,] -2.83741916086  0.34875841111  0.03310422938 -0.031146930047

    [12,]  2.60851223537  0.21278727930 -0.33398036623  0.210157574387

    [13,]  2.44253342081 -0.16769495893 -0.46918095412 -0.162987829937

    [14,] -1.86630668724  0.05021383642  0.37720280364 -0.358821916178

    [15,] -2.81347420580 -0.31790107093 -0.03291329149 -0.222035112399

    [16,] -0.06392982655  0.20718447599  0.04334339948  0.703533623798

    [17,]  1.55561022242 -1.70439673831 -0.33126406220  0.007551878960

    [18,] -1.07392250663 -0.06763418320  0.02283648409  0.048606680158

    [19,]  2.52174211878  0.97274300950  0.12164633439 -0.390667990681

    [20,]  2.14072377494  0.02217881219  0.37410972458  0.129548959692

    [21,]  0.79624421805  0.16307887263  0.12781269571 -0.294140762463

    [22,] -0.28708320594 -0.35744666106 -0.03962115883  0.080991988802

    [23,]  0.25151075072  1.25555187663 -0.55617324819  0.109068938725

    [24,] -2.05706031616  0.78894493512 -0.26552109297  0.388088642937

    [25,]  3.08596854773 -0.05775318018  0.62110421208 -0.218939612456

    [26,]  0.16367554630  0.04317931667  0.24481850312  0.560248997030

    [27,] -1.37265052598  0.02220972121 -0.23378320040 -0.257399715466

    [28,] -2.16097778154  0.13733232981  0.35589738735  0.093123683044

    [29,] -2.40434826507 -0.48613137190 -0.16154440788 -0.007914021222

    [30,] -0.50287467640  0.14734316507 -0.20590831261 -0.122078819188

     

    >

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

  • 相关阅读:
    c++,不能声明为虚函数的函数
    Abstract
    多态性vptrvtable
    C++的重写,重载,重定义
    final
    scanf()和getchar() 使用
    深入理解C++中的mutable关键字
    equal和==
    MoQ(基于.net3.5,c#3.0的mock框架)简单介绍
    VS2008快捷键
  • 原文地址:https://www.cnblogs.com/weizhen/p/6973869.html
Copyright © 2011-2022 走看看