zoukankan      html  css  js  c++  java
  • R----data.table包介绍学习

    相比dplyr包,data.table包能够更大程度地提高数据的处理速度,这里就简单介绍一下data.tale包的使用方法。

    data.table:用于快速处理大数据集的哦

    数据的读取

    data.table包中数据读取的函数:fread()

    data.table的创建

    library(data.table)
    DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
    DT
        #    x y v
        # 1: a 1 1
        # 2: a 3 2
        # 3: a 6 3
        # 4: b 1 4
        # 5: b 3 5
        # 6: b 6 6
        # 7: c 1 7
        # 8: c 3 8
        # 9: c 6 9
    

    基础操作

    行提取

    行提取分为单行提取和多行提取。

    单行提取

    DT[2]                      # 2nd row
        #    x y v
        # 1: a 3 2
    DT[2,]                     # same
        #    x y v
        # 1: a 3 2
    

    这里DT [2]和DT [2]是完全相同的,这里的「,」只是说明还有其他参数可设置,而其他参数按默认值进行计算。下文所有这样的最后一个「,」都不再写出来。

    多行提取

    • 数字提取
    DT[1:2]
        #    x y v
        # 1: a 1 1
        # 2: a 3 2
    DT[c(2,5)]
        #   x y v
        #1: a 3 2
        #2: b 3 5
    
    • 逻辑提取
    DT[c(FALSE,TRUE)]          # even rows (usual recycling)
        #    x y v
        # 1: a 3 2
        # 2: b 1 4
        # 3: b 6 6
        # 4: c 3 8
    

    此时,C(FALSE,TRUE)会自己重复匹配成与DT的行数相同的向量

    列提取

    与行提取相同,列的提取也包含单列提取和多列提取。

    单列提取

    • 数字提取

    数字提取时,一定要把问心无愧参数设置为FALSE。

    DT[,2,with=FALSE]          # 2nd column
        #    y
        # 1: 1
        # 2: 3
        # 3: 6
        # 4: 1
        # 5: 3
        # 6: 6
        # 7: 1
        # 8: 3
        # 9: 6
    
    • 列名提取
    DT[,list(v)]               # v column (as data.table
        #    v
        # 1: 1
        # 2: 2
        # 3: 3
        # 4: 4
        # 5: 5
        # 6: 6
        # 7: 7
        # 8: 8
        # 9: 9
    

    列名的修改

    列名的修改可以使用setnames()函数,这个函数好像比对data.frame类型数据名更改的名称()和colnames()函数也要快一些。

    dt = data.table(a=1:2,b=3:4,c=5:6) # compare to data.table
    try(tracemem(dt))                  # by reference, no deep or shallow copies
    setnames(dt,"b","B")               # by name, no match() needed (warning if "b" is missing)
    setnames(dt,3,"C")                 # by position with warning if 3 > ncol(dt)
    setnames(dt,2:3,c("D","E"))        # multiple
    setnames(dt,c("a","E"),c("A","F")) # multiple by name (warning if either "a" or "E" is missing)
    setnames(dt,c("X","Y","Z"))        # replace all (length of names must be == ncol(DT))  
    

    多列提取

    • 数字提取

    如同上面对按数字对单列的提取,对多列提取也要设置与参数为FALSE。

    DT[,2:3,with=FALSE]
        #    y v
        # 1: 1 1
        # 2: 3 2
        # 3: 6 3
        # 4: 1 4
        # 5: 3 5
        # 6: 6 6
        # 7: 1 7
        # 8: 3 8
        # 9: 6 9
    DT[,c(1,3),with=FALSE] 
        #    x v
        # 1: a 1
        # 2: a 2
        # 3: a 3
        # 4: b 4
        # 5: b 5
        # 6: b 6
        # 7: c 7
        # 8: c 8
        # 9: c 9      
    
    • 按列名提取
    DT[,list(y, v)]
        #    y v
        # 1: 1 1
        # 2: 3 2
        # 3: 6 3
        # 4: 1 4
        # 5: 3 5
        # 6: 6 6
        # 7: 1 7
        # 8: 3 8
        # 9: 6 9
    

    如果按列名提取时,不使用列表,仍然能对列进行提取,只是结果以向量的形式输出。

    DT[,v]                     # v column (as vector)
        # [1] 1 2 3 4 5 6 7 8 9
    DT[,c(v)]                  # same
        # [1] 1 2 3 4 5 6 7 8 9   
    DT[, c(y, v)]
        # [1] 1 3 6 1 3 6 1 3 6 1 2 3 4 5 6 7 8 9
    

    列的添加与删除

    列的添加

    • 单列添加
    DT
        #    x y v
        # 1: a 1 1
        # 2: a 3 2
        # 3: a 6 3
        # 4: b 1 4
        # 5: b 3 5
        # 6: b 6 6
        # 7: c 1 7
        # 8: c 3 8
        # 9: c 6 9
    DT[, a := 'k']
    DT
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
        # 4: b 1 4 k
        # 5: b 3 5 k
        # 6: b 6 6 k
        # 7: c 1 7 k
        # 8: c 3 8 k
        # 9: c 6 9 k
    DT[,c:=8]        # add a numeric column, 8 for all rows
    DT
        #    x y v a c
        # 1: a 1 1 k 8
        # 2: a 3 2 k 8
        # 3: a 6 3 k 8
        # 4: b 1 4 k 8
        # 5: b 3 5 k 8
        # 6: b 6 6 k 8
        # 7: c 1 7 k 8
        # 8: c 3 8 k 8
        # 9: c 6 9 k 8 
    DT[,d:=9L]       # add an integer column, 9L for all rows
    DT[2,d:=10L]     # subassign by reference to column d
    DT
        #    x y v a c  d
        # 1: a 1 1 k 8  9
        # 2: a 3 2 k 8 10
        # 3: a 6 3 k 8  9
        # 4: b 1 4 k 8  9
        # 5: b 3 5 k 8  9
        # 6: b 6 6 k 8  9
        # 7: c 1 7 k 8  9
        # 8: c 3 8 k 8  9
        # 9: c 6 9 k 8  9
    DT[, e := d + 2]
    DT
        #    x y v a c  d  e
        # 1: a 1 1 k 8  9 11
        # 2: a 3 2 k 8 10 12
        # 3: a 6 3 k 8  9 11
        # 4: b 1 4 k 8  9 11
        # 5: b 3 5 k 8  9 11
        # 6: b 6 6 k 8  9 11
        # 7: c 1 7 k 8  9 11
        # 8: c 3 8 k 8  9 11
        # 9: c 6 9 k 8  9 11
    

    如果添加的列名,数据中已经包含则是对这一列数据的修改。

    • 多列的添加
    DT[, c('f', 'g') := list( d + 1, c)]
    DT[, ':='( f =  d + 1, g = c)]          # same
    DT
        #    x y v a c  d  e  f g
        # 1: a 1 1 k 8  9 11 10 8
        # 2: a 3 2 k 8 10 12 11 8
        # 3: a 6 3 k 8  9 11 10 8
        # 4: b 1 4 k 8  9 11 10 8
        # 5: b 3 5 k 8  9 11 10 8
        # 6: b 6 6 k 8  9 11 10 8
        # 7: c 1 7 k 8  9 11 10 8
        # 8: c 3 8 k 8  9 11 10 8
        # 9: c 6 9 k 8  9 11 10 8
    

    此处,需要注意的是新创建的列只能依照原有数据列,而不能依照新创建的列。例如这个例子中,G = C是可以运行,而摹= F则会提示错误。

    列的删除

    DT[,c:=NULL]     # remove column c
    DT
        #    x y v a  d  e  f g
        # 1: a 1 1 k  9 11 10 8
        # 2: a 3 2 k 10 12 11 8
        # 3: a 6 3 k  9 11 10 8
        # 4: b 1 4 k  9 11 10 8
        # 5: b 3 5 k  9 11 10 8
        # 6: b 6 6 k  9 11 10 8
        # 7: c 1 7 k  9 11 10 8
        # 8: c 3 8 k  9 11 10 8
        # 9: c 6 9 k  9 11 10 8
    DT[, c('d', 'e', 'f', 'g'):=NULL]     
    DT
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
        # 4: b 1 4 k
        # 5: b 3 5 k
        # 6: b 6 6 k
        # 7: c 1 7 k
        # 8: c 3 8 k
        # 9: c 6 9 k
    

    列指标的简单操作

    简单操作主要包括求和,平均值,方差和标准差等。

    DT[2:3,sum(v)]             # sum(v) over rows 2 and 3
    # [1] 5
    DT[2:3,mean(v)]             # sum(v) over rows 2 and 3
    # [1] 2.5
    

    索引键

    查看和创建索引

    索引是对列而言的,索引创建后,数据将自动按索引值进行重新排序,所以每个数据最多只能有一个索引,但是索引可以由多列组成,这些列可以是数字,因子,字符串或其他格式。

    单列索引的创建

    
    ## methdod first
    key(DT)                    # key
        # NULL 
    setkey(DT,x)               # set a 1-column key. No quotes, for convenience.
    key(DT)
    [1] "x"
    DT
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
        # 4: b 1 4 k
        # 5: b 3 5 k
        # 6: b 6 6 k
        # 7: c 1 7 k
        # 8: c 3 8 k
        # 9: c 6 9 k
    
    ## method second
    setkeyv(DT,"y")            # same (v in setkeyv stands for vector)
    key(DT)
        # [1] "y"
    

    一旦对数据进行新的索引,原有的索引将消失。

    多列索引的创建

    ## methdod first                    # key 
    setkey(DT,x,v)               # set a 1-column key. No quotes, for convenience.
    key(DT)
        # [1] "x" "v"
    DT
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
        # 4: b 1 4 k
        # 5: b 3 5 k
        # 6: b 6 6 k
        # 7: c 1 7 k
        # 8: c 3 8 k
        # 9: c 6 9 k
    
    
    ## method second
    setkeyv(DT,c("x", "y"))           # same (v in setkeyv stands for vector)
    key(DT)
        # [1] "x" "v"
    DT
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
        # 4: b 1 4 k
        # 5: b 3 5 k
        # 6: b 6 6 k
        # 7: c 1 7 k
        # 8: c 3 8 k
        # 9: c 6 9 k
    

    通过索引进行数据的提取

    按照索引对数据提取,可以加快提取数据的速度。

    单索引

    正向提取

    setkey(DT, x)
    DT["a"]                    # binary search (fast)
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
    DT[.(x=="a")]                 # same; i.e. binary search (fast)
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
    DT[x=="a"]                 # same; i.e. binary search (fast)
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
    
    • 反向提取
    DT[!.("a")]                # not join
        #    x y v a
        # 1: b 1 4 k
        # 2: b 3 5 k
        # 3: b 6 6 k
        # 4: c 1 7 k
        # 5: c 3 8 k
        # 6: c 6 9 k
    DT[!"a"]                   # same
        #    x y v a
        # 1: b 1 4 k
        # 2: b 3 5 k
        # 3: b 6 6 k
        # 4: c 1 7 k
        # 5: c 3 8 k
        # 6: c 6 9 k
    DT[!2:4]                   # all rows other than 2:4
        #    x y v a
        # 1: a 1 1 k
        # 2: b 3 5 k
        # 3: b 6 6 k
        # 4: c 1 7 k
        # 5: c 3 8 k
        # 6: c 6 9 k
    
    

    多索引

    • 正向提取
    setkey(DT, x, y)
    # Mehtod First
    DT["a"]                    # join to 1st column of key
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
    DT[.("a")]                 # same, .() is an alias for list()
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
    DT[.("a",3)]               # join to 2 columns
        #    x y v a
        # 1: a 3 2 k
    DT[.("a",3:6)]             # join 4 rows (2 missing)
        #    x y  v  a
        # 1: a 3  2  k
        # 2: a 4 NA NA
        # 3: a 5 NA NA
        # 4: a 6  3  k
    DT[.("a",3:6),nomatch=0]   # remove missing
        #    x y v a
        # 1: a 3 2 k
        # 2: a 6 3 k
    DT[.("a",3:6),roll=TRUE]   # rolling join (locf)
        #    x y v a
        # 1: a 3 2 k
        # 2: a 4 2 k
        # 3: a 5 2 k
        # 4: a 6 3 k
    
    ## Method Second
    DT[J('a')]
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
    DT[J("a",3)]               # binary search (fast)
        #    x y v a
        # 1: a 3 2 k
    DT[J("a",3:6)]              # same; i.e. binary search (fast)
        #    x y  v  a
        # 1: a 3  2  k
        # 2: a 4 NA NA
        # 3: a 5 NA NA
        # 4: a 6  3  k
    DT[J("a",3:6), nomatch = 0]
        #    x y v a
        # 1: a 3 2 k
        # 2: a 6 3 k
    DT[J("a",3:6), roll = T]
        #    x y v a
        # 1: a 3 2 k
        # 2: a 4 2 k
        # 3: a 5 2 k
        # 4: a 6 3 k
    
    
    ## Method Third
    DT[list("a")]
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
    DT[list("a",3)]
        #    x y v a
        # 1: a 3 2 k
    DT[list("a", 3:6)]
        #    x y  v  a
        # 1: a 3  2  k
        # 2: a 4 NA NA
        # 3: a 5 NA NA
        # 4: a 6  3  k
    DT[list("a", 3:6), nomatch = 0]
        #    x y v a
        # 1: a 3 2 k
        # 2: a 6 3 k
    DT[list("a", 3:6), roll = T]
        #    x y v a
        # 1: a 3 2 k
        # 2: a 4 2 k
        # 3: a 5 2 k
        # 4: a 6 3 k
    
    • 反向提取
    DT[x!="b" | y!=3]          # not yet optimized, currently vector scans
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
        # 4: b 1 4 k
        # 5: b 6 6 k
        # 6: c 1 7 k
        # 7: c 3 8 k
        # 8: c 6 9 k
    DT[!.("b",3)]              # same result but much faster
        #    x y v a
        # 1: a 1 1 k
        # 2: a 3 2 k
        # 3: a 6 3 k
        # 4: b 1 4 k
        # 5: b 6 6 k
        # 6: c 1 7 k
        # 7: c 3 8 k
        # 8: c 6 9 k
    
    

    分类汇总

    分类汇总是指按某列的分类指标进行简单操作,借助由参数实现。此外,通过参数与索引相互没有影响这里。

    单指标的分类汇总

    • 默认汇总名称
    DT[,sum(v),by=x]
        #    x V1
        # 1: a  6
        # 2: b 15
        # 3: c 24           
    DT[,sum(v),by=y] 
        #    y V1
        # 1: 1 12
        # 2: 3 15
        # 3: 6 18          
    
    • 自定义汇总名称
    DT[,list(sum.v.x = sum(v)),by=x]
        #    x sum.v.x
        # 1: a       6
        # 2: b      15
        # 3: c      24           
    DT[,list(sum.v.y = sum(v)),by=y] 
        #    y sum.v.y
        # 1: 1      12
        # 2: 3      15
        # 3: 6      18 
    DT[,sum.v.y := sum(v) ,by=y]
        #    x y v a sum.v.y
        # 1: a 1 1 k      12
        # 2: a 3 2 k      15
        # 3: a 6 3 k      18
        # 4: b 1 4 k      12
        # 5: b 3 5 k      15
        # 6: b 6 6 k      18
        # 7: c 1 7 k      12
        # 8: c 3 8 k      15
        # 9: c 6 9 k      18 	
    
    • 汇总结果与原始数据进行匹配
    DT[,sum.v.y := sum(v) ,by=y]
        #    x y v a sum.v.y
        # 1: a 1 1 k      12
        # 2: a 3 2 k      15
        # 3: a 6 3 k      18
        # 4: b 1 4 k      12
        # 5: b 3 5 k      15
        # 6: b 6 6 k      18
        # 7: c 1 7 k      12
        # 8: c 3 8 k      15
        # 9: c 6 9 k      18
    

    多指标的多个分类汇总

    • 默认汇总名称
    DT[,list(mean(v),sum(v)),by=list(x,y)]   # keyed by
        #    x y V1 V2
        # 1: a 1  1  1
        # 2: a 3  2  2
        # 3: a 6  3  3
        # 4: b 1  4  4
        # 5: b 3  5  5
        # 6: b 6  6  6
        # 7: c 1  7  7
        # 8: c 3  8  8
        # 9: c 6  9  9
    
    • 自定义汇总名称
    DT[,list(mean.v = mean(v),sum.v = sum(v)),by=list(x,y)]   # keyed by
        #   x y mean.v sum.v
        #1: a 1      1     1
        #2: a 3      2     2
        #3: a 6      3     3
        #4: b 1      4     4
        #5: b 3      5     5
        #6: b 6      6     6
        #7: c 1      7     7
        #8: c 3      8     8
        #9: c 6      9     9
    
    • 汇总结果与原始数据进行匹配
    DT[,c("mean.v", "sum.v.y") := list(mean(v),sum(v)) ,by=list(x,y)]
        #    x y v a sum.v.y mean.v
        # 1: a 1 1 k       1      1
        # 2: a 3 2 k       2      2
        # 3: a 6 3 k       3      3
        # 4: b 1 4 k       4      4
        # 5: b 3 5 k       5      5
        # 6: b 6 6 k       6      6
        # 7: c 1 7 k       7      7
        # 8: c 3 8 k       8      8
        # 9: c 6 9 k       9      9
    
    

    data.table与data.frame的转化

    data.table格式加快了处理速度,而data.frame则更为基础。两者的转化可以通过data.table(),setDT()和setDT()来实现,其中data.table()和setDT()函数可以将数据从data.frame转化为data.table,setDF()函数可以将数据从data.table转化为data.frame。注意使用data.table(),setDT()和setDT()时,参数本身的数据类型也会发生变化。

    class(DT)
        # [1] "data.table" "data.frame"  
    class(setDF(DT))
        # [1] "data.frame"
    class(DT)
        # [1] "data.frame"
    

    此外,data.table包还可以与基础包中的重复的(),唯一的(),子()函数结合使用。不仅如此,data.table包还有一些基础包的替代函数.rbind()升级版的rbindlist(),可以合并列数不同和列位置不同的数据。比dplyr包中安排()函数更快的setorder()排序函数。

    来源于:http://xukuang.github.io/blog/2016/04/data-table-in-R/

     
  • 相关阅读:
    centos6搭建docker镜像私服
    CentOS6安装docker
    通过pyenv进行多版本python管理
    ( 转)性能测试--地铁模型分析
    LoadRunner基于HTML-based script和URL-based script方式录制的区别和各自的使用场景
    一道Oracle子查询小练习
    Oracle多表连接查询
    LoadRunner关联通用函数的学习
    Selenium2(WebDriver)总结(五)---元素操作进阶(常用类)
    selenium2(WebDriver) API
  • 原文地址:https://www.cnblogs.com/nxld/p/6066797.html
Copyright © 2011-2022 走看看