  • R-aggregate()

    Overview

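    aggregate() is one of the most frequently used functions in data processing. It works much like GROUP BY in SQL: it splits the data into groups as requested and then applies a summary operation (sum, mean, and so on) to each group.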

    x=data.frame(name=c("张三","李四","王五","赵六"),sex=c("M","M","F","F"),age=c(20,40,22,30),height=c(166,170,150,155))

    This builds a very simple data set: the sex, age, and height of a few people. aggregate() can then compute the mean age and height for each sex:

    aggregate(x[,3:4],by=list(sex=x$sex),FUN=mean)

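    A few things to watch out for:

    • Do not include character or factor columns among the columns being aggregated. mean() is not defined for them, so you get warnings and NA values instead of useful results.
    • The by argument must be a list. To group by several fields, give each field its own element in the list, just like GROUP BY with multiple columns.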
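
    The two notes above can be sketched as follows. This is only an illustration: the data frame mirrors the one from the overview with placeholder names, and the city column is a hypothetical extra field added here to show grouping by two fields.

    ```r
    # Small data frame mirroring the overview example.
    # `city` is a made-up extra field, not part of the original data.
    x <- data.frame(name   = c("A", "B", "C", "D"),
                    sex    = c("M", "M", "F", "F"),
                    city   = c("X", "Y", "X", "X"),
                    age    = c(20, 40, 22, 30),
                    height = c(166, 170, 150, 155))

    # Note 1: hand only the numeric columns to mean().
    by_sex <- aggregate(x[, c("age", "height")],
                        by = list(sex = x$sex), FUN = mean)

    # Note 2: several grouping fields go into the same list,
    # comparable to GROUP BY sex, city.
    by_sex_city <- aggregate(x[, c("age", "height")],
                             by = list(sex = x$sex, city = x$city), FUN = mean)

    print(by_sex)
    print(by_sex_city)
    ```

    Naming the list elements (sex = ..., city = ...) gives the grouping columns readable names instead of Group.1, Group.2.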

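    This function is quite powerful. It first splits the data into groups (by row), then applies a summary function to each group, and finally combines the results into a tidy table. Depending on the type of input it has three forms, for data frames (data.frame), formulas (formula), and time series (ts):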

    aggregate(x, by, FUN, ..., simplify = TRUE)
    aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
    aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1, ts.eps = getOption("ts.eps"), ...)  
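
    As a rough illustration of the first two forms, the sketch below computes the same group means once through the data frame interface and once through the formula interface. It uses the built-in mtcars data set; the variable names by_df and by_formula are made up for this example.

    ```r
    # Data frame form: columns to summarize, plus a list of grouping vectors.
    by_df <- aggregate(mtcars[c("mpg", "hp")],
                       by = list(cyl = mtcars$cyl), FUN = mean)

    # Formula form: left-hand side is summarized, right-hand side defines groups.
    by_formula <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)

    print(by_df)
    print(by_formula)
    ```

    Both calls should return one row per value of cyl with the same mpg and hp means.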
    

    Syntax

    aggregate(x, ...)
     
    ## S3 method for class 'default':
    aggregate(x, ...)
    
    ## S3 method for class 'data.frame':
    aggregate(x, by, FUN, ..., simplify = TRUE)
    
    ## S3 method for class 'formula':
    aggregate(formula, data, FUN, ...,
              subset, na.action = na.omit)
    
    ## S3 method for class 'ts':
    aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1,
              ts.eps = getOption("ts.eps"), ...)
    
    ### For details, see ?aggregate
    

    Example1

    We can get a feel for the function by working with the mtcars data set, a data frame of road test results for different car models:

    > str(mtcars)
    'data.frame': 32 obs. of 11 variables:
    $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
    $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
    $ disp: num 160 160 108 258 360 ...
    $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
    $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
    $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
    $ qsec: num 16.5 17 18.6 19.4 17 ...
    $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
    $ am : num 1 1 1 0 0 0 0 0 0 0 ...
    $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
    $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
    

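    First use attach() to add the columns of mtcars to the variable search path, then use aggregate() to compute the mean of every column grouped by cyl (number of cylinders):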

    > attach(mtcars)
    > aggregate(mtcars, by=list(cyl), FUN=mean)
    Group.1 mpg cyl disp hp drat wt qsec vs am gear carb
    1 4 26.66364 4 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
    2 6 19.74286 6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
    3 8 15.10000 8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000
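
    The same grouping can be written without attach(), which avoids leaving mtcars on the search path. The sketch below is one possible variant, not the only way to do it: it selects a few columns explicitly and names the grouping element so the first column is called cyl rather than Group.1. The name cyl_means is chosen here for illustration.

    ```r
    # Refer to the grouping column as mtcars$cyl instead of attaching mtcars.
    # Naming the list element replaces the default Group.1 header.
    cyl_means <- aggregate(mtcars[c("mpg", "hp", "wt")],
                           by = list(cyl = mtcars$cyl), FUN = mean)
    print(cyl_means)
    ```

    Leaving cyl out of the summarized columns also removes the redundant cyl column that appears next to Group.1 in the output above.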
    

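    The by argument can also contain several grouping factors. The result then has one row for each combination of factor values that occurs in the data: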

    > aggregate(mtcars, by=list(cyl, gear), FUN=mean)
    
    Group.1 Group.2 mpg cyl disp hp drat wt qsec vs am gear carb
    1 4 3 21.500 4 120.1000 97.0000 3.700000 2.465000 20.0100 1.0 0.00 3 1.000000
    2 6 3 19.750 6 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00 3 1.000000
    3 8 3 15.050 8 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00 3 3.083333
    4 4 4 26.925 4 102.6250 76.0000 4.110000 2.378125 19.6125 1.0 0.75 4 1.500000
    5 6 4 19.750 6 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50 4 4.000000
    6 4 5 28.200 4 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00 5 2.000000
    7 6 5 19.700 6 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00 5 6.000000
    8 8 5 15.400 8 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00 5 6.000000
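
    Only combinations that actually occur get a row, which is why the table above has 8 rows and no line for 8 cylinders with 4 gears. A small sketch to check this, with illustrative variable names, counts the observed combinations and the group sizes:

    ```r
    groups <- list(cyl = mtcars$cyl, gear = mtcars$gear)

    # Means per (cyl, gear) combination, as in the example above.
    combo_means <- aggregate(mtcars[c("mpg", "hp")], by = groups, FUN = mean)

    # Number of distinct (cyl, gear) pairs present in the data.
    n_combos <- nrow(unique(mtcars[c("cyl", "gear")]))

    # Group sizes: FUN = length counts the rows in each group.
    combo_counts <- aggregate(mtcars["mpg"], by = groups, FUN = length)
    names(combo_counts)[names(combo_counts) == "mpg"] <- "n"

    print(n_combos)
    print(combo_counts)
    ```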
    

    A formula is a special kind of R object. Passing a formula to aggregate() lets you summarize only selected columns of the data frame:

    > aggregate(cbind(mpg,hp) ~ cyl+gear, data=mtcars, FUN=mean)
    cyl gear mpg hp
    1 4 3 21.500 97.0000
    2 6 3 19.750 107.5000
    3 8 3 15.050 194.1667
    4 4 4 26.925 76.0000
    5 6 4 19.750 116.5000
    6 4 5 28.200 102.0000
    7 6 5 19.700 175.0000
    8 8 5 15.400 299.5000
    

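    The formula cbind(mpg,hp) ~ cyl+gear above means: summarize the columns combined by cbind(mpg,hp) for each combination of cyl and gear. For the use of aggregate() on time series data, see the R help page for the function.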
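
    For a first impression of the time series form, here is a minimal sketch on a made-up quarterly series (the object q and its values are invented for illustration). Setting nfrequency = 1 collapses each block of four quarters into one annual value.

    ```r
    # Hypothetical quarterly series covering 2020 and 2021.
    q <- ts(c(1, 2, 3, 4, 5, 6, 7, 8), start = c(2020, 1), frequency = 4)

    # Collapse quarters into years: once by summing, once by averaging.
    annual_sum  <- aggregate(q, nfrequency = 1, FUN = sum)
    annual_mean <- aggregate(q, nfrequency = 1, FUN = mean)

    print(annual_sum)
    print(annual_mean)
    ```

    The result is again a ts object, now with one observation per year.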

    Example2

    
    ## Compute the averages for the variables in 'state.x77', grouped
    ## according to the region (Northeast, South, North Central, West) that
    ## each state belongs to.
    aggregate(state.x77, list(Region = state.region), mean)
     
    ## Compute the averages according to region and the occurrence of more
    ## than 130 days of frost.
    aggregate(state.x77,
              list(Region = state.region,
                   Cold = state.x77[,"Frost"] > 130),
              mean)
    ## (Note that no state in 'South' is THAT cold.)
     
    
    ## example with character variables and NAs
    testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
                         v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
    by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
    by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)
    aggregate(x = testDF, by = list(by1, by2), FUN = "mean")
     
    # and if you want to treat NAs as a group
    fby1 <- factor(by1, exclude = "")
    fby2 <- factor(by2, exclude = "")
    aggregate(x = testDF, by = list(fby1, fby2), FUN = "mean")
     
     
    ## Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:
    aggregate(weight ~ feed, data = chickwts, mean)
    aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
    aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
    aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)
     
    ## Dot notation:
    aggregate(. ~ Species, data = iris, mean)
    aggregate(len ~ ., data = ToothGrowth, mean)
     
    ## Often followed by xtabs():
    ag <- aggregate(len ~ ., data = ToothGrowth, mean)
    xtabs(len ~ ., data = ag)
     
     
    ## Compute the average annual approval ratings for American presidents.
    aggregate(presidents, nfrequency = 1, FUN = mean)
    ## Give the summer less weight.
    aggregate(presidents, nfrequency = 1,
              FUN = weighted.mean, w = c(1, 1, 0.5, 1))
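
    The NA handling in the testDF example above can be hard to follow on 12 rows. The smaller sketch below, on a made-up data frame df, is meant to show the same behaviour: rows whose grouping value is NA are dropped by default, and they form a group of their own once NA is turned into a factor level.

    ```r
    # Made-up data: the third row has a missing grouping value.
    df <- data.frame(v = c(1, 2, 3, 4), g = c("a", "a", NA, "b"))

    # Default: the row with g = NA is left out of the result.
    dropped <- aggregate(df["v"], by = list(g = df$g), FUN = sum)

    # exclude = NULL keeps NA as a factor level, so it becomes its own group.
    kept <- aggregate(df["v"],
                      by = list(g = factor(df$g, exclude = NULL)), FUN = sum)

    print(dropped)
    print(kept)
    ```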
    

    Example3

    #load data
    data <- ChickWeight
    head(data)
      weight Time Chick Diet
    1     42    0     1    1
    2     51    2     1    1
    3     59    4     1    1
    4     64    6     1    1
    5     76    8     1    1
    6     93   10     1    1
     
    #dimension of the data
    dim(data)
    [1] 578   4
     
    #how many chickens
    unique(data$Chick)
     [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
    [31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
    50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 < 3 < 1 < 12 < ... < 48
     
    #how many diets
    unique(data$Diet)
    [1] 1 2 3 4
    Levels: 1 2 3 4
     
    #how many time points
    unique(data$Time)
     [1]  0  2  4  6  8 10 12 14 16 18 20 21
     
    library(ggplot2)
    ggplot(data=data, aes(x=Time, y=weight, group=Chick, colour=Chick)) +
           geom_line() +
           geom_point()
    
    ------------------------------------------------------
    
    ## S3 method for class 'data.frame'
    ## aggregate(x, by, FUN, ..., simplify = TRUE)
    
    #find the mean weight depending on diet
    aggregate(data$weight, list(diet = data$Diet), mean)
      diet        x
    1    1 102.6455
    2    2 122.6167
    3    3 142.9500
    4    4 135.2627
     
    #aggregate on time
    aggregate(data$weight, list(time=data$Time), mean)
       time         x
    1     0  41.06000
    2     2  49.22000
    3     4  59.95918
    4     6  74.30612
    5     8  91.24490
    6    10 107.83673
    7    12 129.24490
    8    14 143.81250
    9    16 168.08511
    10   18 190.19149
    11   20 209.71739
    12   21 218.68889
     
    #use a different function
    aggregate(data$weight, list(time=data$Time), sd)
       time         x
    1     0  1.132272
    2     2  3.688316
    3     4  4.495179
    4     6  9.012038
    5     8 16.239780
    6    10 23.987277
    7    12 34.119600
    8    14 38.300412
    9    16 46.904079
    10   18 57.394757
    11   20 66.511708
    12   21 71.510273
     
    #we could also aggregate on time and diet
    head(aggregate(data$weight,
                   list(time = data$Time, diet = data$Diet),
                   mean
                  )
        )
      time diet        x
    1    0    1 41.40000
    2    2    1 47.25000
    3    4    1 56.47368
    4    6    1 66.78947
    5    8    1 79.68421
    6   10    1 93.05263
    tail(aggregate(data$weight,
                   list(time = data$Time, diet = data$Diet),
                   mean
                  )
        )
       time diet        x
    43   12    4 151.4000
    44   14    4 161.8000
    45   16    4 182.0000
    46   18    4 202.9000
    47   20    4 233.8889
    48   21    4 238.5556
     
    #to see the weights over time across different diets
    ggplot(data) + geom_line(aes(x=Time, y=weight, colour=Chick)) +
                 facet_wrap(~Diet) +
                 guides(col=guide_legend(ncol=3))
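
    The calls above pass data$weight as a bare vector, which is why the summary column is labelled x. As a possible alternative, the formula interface keeps the original column names. The sketch below uses ChickWeight directly; diet_means and time_diet_means are names chosen for this example.

    ```r
    # Formula interface: result columns are named Diet and weight
    # instead of the generic x.
    diet_means <- aggregate(weight ~ Diet, data = ChickWeight, FUN = mean)

    # Two grouping variables: one row per (Time, Diet) combination.
    time_diet_means <- aggregate(weight ~ Time + Diet,
                                 data = ChickWeight, FUN = mean)

    print(diet_means)
    print(head(time_diet_means))
    ```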
    

    Example4

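    The aggregate function takes more steps than some add-on package alternatives, but it is included in the base R installation and does not require the installation of another package. The code below assumes a data frame named data with the columns subject, sex, condition, before, after, and change. This is not the ChickWeight data from Example3, and its construction is not shown here.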

    # Get a count of number of subjects in each category (sex*condition)
    cdata <- aggregate(data["subject"], by=data[c("sex","condition")], FUN=length)
    cdata
    #>   sex condition subject
    #> 1   F   aspirin       5
    #> 2   M   aspirin       9
    #> 3   F   placebo      12
    #> 4   M   placebo       4
    
    # Rename "subject" column to "N"
    names(cdata)[names(cdata)=="subject"] <- "N"
    cdata
    #>   sex condition  N
    #> 1   F   aspirin  5
    #> 2   M   aspirin  9
    #> 3   F   placebo 12
    #> 4   M   placebo  4
    
    # Sort by sex first
    cdata <- cdata[order(cdata$sex),]
    cdata
    #>   sex condition  N
    #> 1   F   aspirin  5
    #> 3   F   placebo 12
    #> 2   M   aspirin  9
    #> 4   M   placebo  4
    
    # We also keep the before and after columns:
    # Get the average effect size by sex and condition
    cdata.means <- aggregate(data[c("before","after","change")], 
                             by = data[c("sex","condition")], FUN=mean)
    cdata.means
    #>   sex condition   before     after    change
    #> 1   F   aspirin 11.06000  7.640000 -3.420000
    #> 2   M   aspirin 11.26667  5.855556 -5.411111
    #> 3   F   placebo 10.13333  8.075000 -2.058333
    #> 4   M   placebo 11.47500 10.500000 -0.975000
    
    # Merge the data frames
    cdata <- merge(cdata, cdata.means)
    cdata
    #>   sex condition  N   before     after    change
    #> 1   F   aspirin  5 11.06000  7.640000 -3.420000
    #> 2   F   placebo 12 10.13333  8.075000 -2.058333
    #> 3   M   aspirin  9 11.26667  5.855556 -5.411111
    #> 4   M   placebo  4 11.47500 10.500000 -0.975000
    
    # Get the sample (n-1) standard deviation for "change"
    cdata.sd <- aggregate(data["change"],
                          by = data[c("sex","condition")], FUN=sd)
    # Rename the column to change.sd
    names(cdata.sd)[names(cdata.sd)=="change"] <- "change.sd"
    cdata.sd
    #>   sex condition change.sd
    #> 1   F   aspirin 0.8642916
    #> 2   M   aspirin 1.1307569
    #> 3   F   placebo 0.5247655
    #> 4   M   placebo 0.7804913
    
    # Merge
    cdata <- merge(cdata, cdata.sd)
    cdata
    #>   sex condition  N   before     after    change change.sd
    #> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916
    #> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655
    #> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569
    #> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913
    
    # Calculate standard error of the mean
    cdata$change.se <- cdata$change.sd / sqrt(cdata$N)
    cdata
    #>   sex condition  N   before     after    change change.sd change.se
    #> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916 0.3865230
    #> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655 0.1514867
    #> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569 0.3769190
    #> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913 0.3902456
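
    The count, mean, and standard deviation steps above can also be combined into a single call by letting FUN return a named vector. The following is a sketch under stated assumptions, not the original author's method: the data frame d is a small hypothetical stand-in for data, and its values are invented.

    ```r
    # Hypothetical stand-in for the data frame used above.
    d <- data.frame(sex       = rep(c("F", "M"), each = 4),
                    condition = rep(c("aspirin", "aspirin",
                                      "placebo", "placebo"), times = 2),
                    change    = c(-3, -4, -2, -1, -5, -6, -1, 0))

    # FUN returns several statistics at once; aggregate() stores them
    # as a matrix column named change.
    agg <- aggregate(change ~ sex + condition, data = d,
                     FUN = function(v) c(N = length(v), mean = mean(v), sd = sd(v)))

    # Flatten the matrix column into change.N, change.mean, change.sd.
    flat <- do.call(data.frame, agg)

    # Standard error of the mean, as in the last step above.
    flat$change.se <- flat$change.sd / sqrt(flat$change.N)
    print(flat)
    ```

    This avoids the repeated merge() calls, at the cost of the extra flattening step.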
    

    If you have NAs in your data and wish to skip them, pass na.rm=TRUE, which aggregate() forwards to FUN:

    cdata.means <- aggregate(data[c("before","after","change")], 
                             by = data[c("sex","condition")],
                             FUN=mean, na.rm=TRUE)
    cdata.means
    #>   sex condition   before     after    change
    #> 1   F   aspirin 11.06000  7.640000 -3.420000
    #> 2   M   aspirin 11.26667  5.855556 -5.411111
    #> 3   F   placebo 10.13333  8.075000 -2.058333
    #> 4   M   placebo 11.47500 10.500000 -0.975000 
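
    To make the effect of na.rm visible, here is a small sketch on a made-up data frame d (not the data used above). It also shows that the formula interface drops rows with missing values by default through na.action = na.omit.

    ```r
    # Made-up data with one missing value in group "a".
    d <- data.frame(g = c("a", "a", "b"), v = c(1, NA, 3))

    # Without na.rm, the mean of group "a" is NA.
    plain <- aggregate(d["v"], by = list(g = d$g), FUN = mean)

    # na.rm = TRUE is passed on to mean() through "...".
    with_na_rm <- aggregate(d["v"], by = list(g = d$g), FUN = mean, na.rm = TRUE)

    # Formula interface: incomplete rows are removed before grouping.
    via_formula <- aggregate(v ~ g, data = d, FUN = mean)

    print(plain)
    print(with_na_rm)
    print(via_formula)
    ```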
    
  • Original article: https://www.cnblogs.com/fengzzi/p/10044519.html