zoukankan      html  css  js  c++  java
  • Chapter 07-Basic statistics(Part1 描述统计数据)

         在这一部分中,仍然使用mtcars(Motor Trend Car Road Tests)这一数据集,以及mpg(one mile per gallon), hp(horsepower), wt(weight)这几个变量。

    例01:

    > vars<-c("mpg","hp","wt")
    > head(mtcars[vars])
                       mpg  hp    wt
    Mazda RX4         21.0 110 2.620
    Mazda RX4 Wag     21.0 110 2.875
    Datsun 710        22.8  93 2.320
    Hornet 4 Drive    21.4 110 3.215
    Hornet Sportabout 18.7 175 3.440
    Valiant           18.1 105 3.460

     1. 一些描述方法(a menagerie of methods)

    例02: 

    > vars<-c("mpg","hp","wt")
    > summary(mtcars[vars])
          mpg              hp              wt       
     Min.   :10.40   Min.   : 52.0   Min.   :1.513  
     1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581  
     Median :19.20   Median :123.0   Median :3.325  
     Mean   :20.09   Mean   :146.7   Mean   :3.217  
     3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610  
     Max.   :33.90   Max.   :335.0   Max.   :5.424  

    summary()函数:为数值变量提供最小值,最大值,均值,四分之一/四分之三值(quartiles);

                             为因子(factor)和逻辑变量(logical vector)提供频率值(frequency)

    例03:

    > mystats<-function(x,na.omit=FALSE){
    +          if(na.omit)
    +             x<-x[!is.na(x)]
    +           m<-mean(x)
    +           n<-length(x)
    +           s<-sd(x)
    +           skew<-sum((x-m)^3/s^3)/n
    +           kurt<-sum((x-m)^4/s^4)/n-3
    +           return(c(n=n,mean=m,stdev=s,skew=skew,kurtosis=kurt))
    +          }
    > 
    > sapply(mtcars[vars],mystats) mpg hp wt n 32.000000 32.0000000 32.00000000 mean 20.090625 146.6875000 3.21725000 stdev 6.026948 68.5628685 0.97845744 skew 0.610655 0.7260237 0.42314646 kurtosis -0.372766 -0.1355511 -0.02271075

    sapply(x,FUN,options)

    · x:是数据帧(data frame)或矩阵(matrix);

    · FUN:是任意函数(arbitrary function),

               典型的FUN函数包含mean,sd,var,min,max,median,length,range,quantile这些函数,skew,kurtosis则是需要自定义添加的。

    注意:若想忽略掉缺省值(missing values),则应该使用sapply(mtcars[vars],mystats,na.omit=TRUE)

    例03(1):

    > mystats<-function(x,na.omit=TRUE){
    +          if(na.omit)
    +            x<-x[!is.na(x)]
    +          m<-mean(x)
    +          n<-length(x)
    +          s<-sd(x)
    +          skew<-sum((x-m)^3/s^3)/n
    +          kurt<-sum((x-m)^4/s^4)/n-3
    +          return(c(n=n,mean=m,stdev=s,skew=skew,kurtosis=kurt))
    +         }
    > sapply(mtcars[vars],mystats)
                   mpg          hp          wt
    n        32.000000  32.0000000 32.00000000
    mean     20.090625 146.6875000  3.21725000
    stdev     6.026948  68.5628685  0.97845744
    skew      0.610655   0.7260237  0.42314646
    kurtosis -0.372766  -0.1355511 -0.02271075

    扩展:一些用户描述的包也提供描述函数用于描述统计数据,如Hmisc,pastecs,psych,需要先安装,再使用。

    例04:

    > library(Hmisc)
    载入需要的程辑包:survival
    载入需要的程辑包:splines
    载入需要的程辑包:Formula
    Hmisc library by Frank E Harrell Jr
    
    Type library(help='Hmisc'), ?Overview, or ?Hmisc.Overview')
    to see overall documentation.
    
    
    载入程辑包:‘Hmisc’
    
    下列对象被屏蔽了from ‘package:survival’:
    
        untangle.specials
    
    下列对象被屏蔽了from ‘package:base’:
    
        format.pval, round.POSIXt, trunc.POSIXt, units
    > describe(mtcars[vars])
    mtcars[vars] 
    
     3  Variables      32  Observations
    -------------------------------------------------------------------------------------------------------------------------------------------------
    mpg 
          n missing  unique    Mean     .05     .10     .25     .50     .75     .90     .95 
         32       0      25   20.09   12.00   14.34   15.43   19.20   22.80   30.09   31.30 
    
    lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9 
    -------------------------------------------------------------------------------------------------------------------------------------------------
    hp 
          n missing  unique    Mean     .05     .10     .25     .50     .75     .90     .95 
         32       0      22   146.7   63.65   66.00   96.50  123.00  180.00  243.50  253.55 
    
    lowest :  52  62  65  66  91, highest: 215 230 245 264 335 
    -------------------------------------------------------------------------------------------------------------------------------------------------
    wt 
          n missing  unique    Mean     .05     .10     .25     .50     .75     .90     .95 
         32       0      29   3.217   1.736   1.956   2.581   3.325   3.610   4.048   5.293 
    
    lowest : 1.513 1.615 1.835 1.935 2.140, highest: 3.845 4.070 5.250 5.345 5.424 
    -------------------------------------------------------------------------------------------------------------------------------------------------

    Hmisc包中的describe()函数:返回自变量(variables)与因变量(observations)的数量,缺省值(missing)的数量以及特殊值(unique values),平均值(mean),quantile,五个最大值与五个最小值。

    例05:

    > install.packages("pastecs")
    trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.0/pastecs_1.3-15.zip'
    Content type 'application/zip' length 1629764 bytes (1.6 Mb)
    opened URL
    downloaded 1.6 Mb
    
    package ‘pastecs’ successfully unpacked and MD5 sums checked
    
    The downloaded binary packages are in
    	C:Usersseven-wangAppDataLocalTempRtmp8aVakLdownloaded_packages
    > library(pastecs)
    载入需要的程辑包:boot
    
    载入程辑包:‘boot’
    
    下列对象被屏蔽了from ‘package:survival’:
    
        aml
    
    > stat.desc(mtcars[vars])
                         mpg           hp          wt
    nbr.val       32.0000000   32.0000000  32.0000000
    nbr.null       0.0000000    0.0000000   0.0000000
    nbr.na         0.0000000    0.0000000   0.0000000
    min           10.4000000   52.0000000   1.5130000
    max           33.9000000  335.0000000   5.4240000
    range         23.5000000  283.0000000   3.9110000
    sum          642.9000000 4694.0000000 102.9520000
    median        19.2000000  123.0000000   3.3250000
    mean          20.0906250  146.6875000   3.2172500
    SE.mean        1.0654240   12.1203173   0.1729685
    CI.mean.0.95   2.1729465   24.7195501   0.3527715
    var           36.3241028 4700.8669355   0.9573790
    std.dev        6.0269481   68.5628685   0.9784574
    coef.var       0.2999881    0.4674077   0.3041285

    pastecs包中有函数stat.desc():提供对统计数据的描述。

    stat.desc(x,basic=TRUE,desc=TRUE,norm=FALSE,p=0.95)

    ·x:数据帧(data frame)或时间序列(time series);

    ·basic=TRUE(缺省状态):各种值的个数,包括缺省值(missing value),空值(null value),最小值,最大值,值域(range),和(sum).

    ·desc=TRUE(缺省状态):提供中间值(median),平均值(mean),均值的标准错误(standard error of the mean),95% confidence interval for the mean,variance,标准差(standard deviation),coefficient of variation.

    ·norm=TRUE(非缺省状态):返回正常的统计数据,包括skewness和kurtosis,以及它们的统计的重要性,the Shapiro-Wilk test of normality.

    ·p值的选项:用于计算the confidence interval for the mean(缺省状态默认为0.95 ).

    例06:

    > library(psych)
    > describe(mtcars[vars])
        var  n   mean    sd median trimmed   mad   min    max  range skew kurtosis    se
    mpg   1 32  20.09  6.03  19.20   19.70  5.41 10.40  33.90  23.50 0.61    -0.37  1.07
    hp    2 32 146.69 68.56 123.00  141.19 77.10 52.00 335.00 283.00 0.73    -0.14 12.12
    wt    3 32   3.22  0.98   3.33    3.15  0.77  1.51   5.42   3.91 0.42    -0.02  0.17

    psych包中的函数describe:提供不含缺失值的自变量(nonminssing observations),平均值(mean),标准差(standard deviation),中间值(median),边缘性均值(trimmed mean),median absolute deviation,最大值,最小值,值域(range),skew,kurtosis,the standard error of the mean.

    注意:Hmics和psych包中都有decscribe()函数,一般使用最后载入的包中的函数。

    2. 按组描述统计数

    例07:

    > aggregate(mtcars[vars],by=list(am=mtcars$am),mean)
      am      mpg       hp       wt
    1  0 17.14737 160.2632 3.768895
    2  1 24.39231 126.8462 2.411000
    > aggregate(mtcars[vars],by=list(am=mtcars$am),sd)
      am      mpg       hp        wt
    1  0 3.833966 53.90820 0.7774001
    2  1 6.166504 84.06232 0.6169816

    aggregate()函数:按组获得描述数据。

    ·有多于一组的变量,使用这样的代码:

      by=list(name1=groupvar1,name2=groupvar2,......,nameN=groupvarN)

    ·仅允许使用单变量函数,如mean,standard deviation等。

    例08(在RGui中完成的):

    > install.packages("doBy")

       --- 在此連線階段时请选用CRAN的鏡子 --- also installing the dependencies ‘mvtnorm’, ‘multcomp’

    试开URL’http://ftp.ctex.org/mirrors/CRAN/bin/windows/contrib/3.0/mvtnorm_0.9-9995.zip' Content type 'application/zip' length 222319 bytes (217 Kb) 打开了URL downloaded 217 Kb

    试开URL’http://ftp.ctex.org/mirrors/CRAN/bin/windows/contrib/3.0/multcomp_1.2-19.zip' Content type 'application/zip' length 599099 bytes (585 Kb) 打开了URL downloaded 585 Kb

    试开URL’http://ftp.ctex.org/mirrors/CRAN/bin/windows/contrib/3.0/doBy_4.5-8.zip' Content type 'application/zip' length 2606320 bytes (2.5 Mb) 打开了URL downloaded 2.5 Mb

    程序包‘mvtnorm’打开成功,MD5和检查也通过 程序包‘multcomp’打开成功,MD5和检查也通过 程序包‘doBy’打开成功,MD5和检查也通过

    下载的二进制程序包在        

    C:Usersseven-wangAppDataLocalTempRtmpEZL947downloaded_packages里 

    > library(doBy)

    载入需要的程辑包:multcomp

    载入需要的程辑包:mvtnorm

    载入需要的程辑包:survival

    载入需要的程辑包:splines

    载入需要的程辑包:MASS

    > summaryBy(mpg+hp+wt~am,data=mtcars,FUN=mystats)  

     am mpg.n mpg.mean mpg.stdev   mpg.skew mpg.kurtosis hp.n  hp.mean hp.stdev
    1  0    19 17.14737  3.833966 0.01395038   -0.8031783   19 160.2632 53.90820
    2  1    13 24.39231  6.166504 0.05256118   -1.4553520   13 126.8462 84.06232
          hp.skew hp.kurtosis wt.n  wt.mean  wt.stdev   wt.skew wt.kurtosis
    1 -0.01422519  -1.2096973   19 3.768895 0.7774001 0.9759294   0.1415676
    2  1.35988586   0.5634635   13 2.411000 0.6169816 0.2103128  -1.1737358

    doBy包中的summarBy()函数:
    summaryBy(formula,data=dataframe,FUN=function)

    ·formula:var1+var2+...+varN~groupvar1+groupvar2+...+groupvarM

     其中~左边的是数值变量,右边的是分类好的变量。

    例09(在RGui中完成):

    > install.packages("psych")

    试开URL’http://ftp.ctex.org/mirrors/CRAN/bin/windows/contrib/3.0/psych_1.3.2.zip'

    Content type 'application/zip' length 2555708 bytes (2.4 Mb)

    打开了URL

    downloaded 2.4 Mb

    程序包‘psych’打开成功,MD5和检查也通过

    下载的二进制程序包在        

           C:Usersseven-wangAppDataLocalTempRtmpEZL947downloaded_packages里

    > library(psych)

    > describeBy(mtcars[vars],mtcars$am)
    group: 0
        var  n   mean    sd median trimmed   mad   min    max  range  skew kurtosis    se
    mpg   1 19  17.15  3.83  17.30   17.12  3.11 10.40  24.40  14.00  0.01    -0.80  0.88
    hp    2 19 160.26 53.91 175.00  161.06 77.10 62.00 245.00 183.00 -0.01    -1.21 12.37
    wt    3 19   3.77  0.78   3.52    3.75  0.45  2.46   5.42   2.96  0.98     0.14  0.18
    --------------------------------------------------------------------------
    group: 1
        var  n   mean    sd median trimmed   mad   min    max  range skew kurtosis    se
    mpg   1 13  24.39  6.17  22.80   24.38  6.67 15.00  33.90  18.90 0.05    -1.46  1.71
    hp    2 13 126.85 84.06 109.00  114.73 63.75 52.00 335.00 283.00 1.36     0.56 23.31
    wt    3 13   2.41  0.62   2.32    2.39  0.68  1.51   3.57   2.06 0.21    -1.17  0.17
    psych包中的describeBy()函数:提供对统计数据的描述。

    ·但是,该函数不适用于任何的一个函数,则有多组变量时需要写成

           list(groupvar1,groupvar2,...,groupvarN)

     并且仅当组间变量交叉时,没有空括号(no empty cells when the grouping variables are crossed)

     

     例10:

    > install.packages("reshape")
    trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.0/reshape_0.8.4.zip'
    Content type 'application/zip' length 124891 bytes (121 Kb)
    opened URL
    downloaded 121 Kb
    
    package ‘reshape’ successfully unpacked and MD5 sums checked
    
    The downloaded binary packages are in
    	C:Usersseven-wangAppDataLocalTempRtmpwj6VpMdownloaded_packages
    > library(reshape)
    载入需要的程辑包:plyr
    
    载入程辑包:‘reshape’
    
    下列对象被屏蔽了from ‘package:plyr’:
    
        rename, round_any
    
    > dstats<-function(x)(c(n=length(x),mean=mean(x),sd=sd(x)))
    > dfm<-melt(mtcars,measure.vars=c("mpg","hp","wt"),id.vars=c("am","cyl"))
    > cast(dfm,am+cyl+variable~.,dstats)
       am cyl variable  n       mean         sd
    1   0   4      mpg  3  22.900000  1.4525839
    2   0   4       hp  3  84.666667 19.6553640
    3   0   4       wt  3   2.935000  0.4075230
    4   0   6      mpg  4  19.125000  1.6317169
    5   0   6       hp  4 115.250000  9.1787799
    6   0   6       wt  4   3.388750  0.1162164
    7   0   8      mpg 12  15.050000  2.7743959
    8   0   8       hp 12 194.166667 33.3598379
    9   0   8       wt 12   4.104083  0.7683069
    10  1   4      mpg  8  28.075000  4.4838599
    11  1   4       hp  8  81.875000 22.6554156
    12  1   4       wt  8   2.042250  0.4093485
    13  1   6      mpg  3  20.566667  0.7505553
    14  1   6       hp  3 131.666667 37.5277675
    15  1   6       wt  3   2.755000  0.1281601
    16  1   8      mpg  2  15.400000  0.5656854
    17  1   8       hp  2 299.500000 50.2045815
    18  1   8       wt  2   3.370000  0.2828427

    reshape包,
    dfm<-melt(dataframe,measure.vars=y,id.vars=g)

    cast(dfm,groupvar1+groupvar2+...+variable~.,FUN)





     

  • 相关阅读:
    竞赛题解
    学习笔记
    竞赛题解
    学习笔记
    竞赛题解
    竞赛题解
    竞赛题解
    「链接」原博客链接
    「杂录」THUWC 2020 游记
    「杂录」CSP-S 2019 爆炸记&题解
  • 原文地址:https://www.cnblogs.com/wangshenwen/p/3272535.html
Copyright © 2011-2022 走看看