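tidyr (from "tidy", as in neat) is an R package created by Hadley Wickham that makes cleaning up raw data more efficient. Its four most commonly used functions and what they do:
- gather(): collects multiple columns and converts them into key:value pairs, turning wide-format data into long format. It is a replacement for the melt function in the reshape package
- spread(): the inverse of gather(); it turns key:value pairs back into separate columns
- separate(): splits one column into several columns
- unite(): the inverse of separate(); it merges several columns into one
Long tables versus wide tables: put simply, in a long table a single observed subject can span multiple rows, whereas in a wide table each observed subject takes up exactly one row.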
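To make the long/wide distinction concrete, here is a minimal sketch that builds the same information both ways by hand in base R. The student names, subjects, and scores are made up for illustration and do not come from the examples in this post.

```r
# Hypothetical scores for two students; names and values are invented.
wide <- data.frame(
  student = c("S1", "S2"),
  math    = c(90, 75),
  english = c(82, 88),
  stringsAsFactors = FALSE
)

long <- data.frame(
  student = c("S1", "S1", "S2", "S2"),
  subject = c("math", "english", "math", "english"),
  score   = c(90, 82, 75, 88),
  stringsAsFactors = FALSE
)

# Wide: one row per student. Long: one row per student/subject pair.
nrow(wide)
nrow(long)
```

The long table has one row for every measured cell of the wide table, which is exactly the reshaping that gather() and spread() automate.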
Getting started
- Install and load the package

    install.packages("tidyr")
    library(tidyr)

- Prepare the data

    > name <- c("A","B","C")
    > gender <- c("F","F","M")
    > province <- c("JS","SH","HN")
    > age <- c(18,22,19)
    > df_wide <- data.frame(name = name, gender = gender, province = province, age = age)
    > df_wide
      name gender province age
    1    A      F       JS  18
    2    B      F       SH  22
    3    C      M       HN  19

gather()
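- Usage: gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
- data: the wide table to convert
- key: name of the new column that will hold the names of the gathered columns
- value: name of the new column that will hold the values of the gathered columns
- ...: selects which columns are gathered
- na.rm: whether to drop rows whose value is missing
- By default every column is gathered into key, as in the following example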

    > df_gather <- gather(data = df_wide, key = variable, value = value)
    Warning message:
    attributes are not identical across measure variables; they will be dropped
    > df_gather
       variable value
    1      name     A
    2      name     B
    3      name     C
    4    gender     F
    5    gender     F
    6    gender     M
    7  province    JS
    8  province    SH
    9  province    HN
    10      age    18
    11      age    22
    12      age    19

- Specify which columns to gather

    > df_wide %>% gather(key=vars,value=value,gender:age)
      name     vars value
    1    A   gender     F
    2    B   gender     F
    3    C   gender     M
    4    A province    JS
    5    B province    SH
    6    C province    HN
    7    A      age    18
    8    B      age    22
    9    C      age    19

The code above is equivalent to: df_wide %>% gather(key=vars,value=value,-name)
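The column selection and the na.rm argument described above can be checked on a small table. The following is a minimal sketch, assuming tidyr is installed; df_na and its columns q1 and q2 are hypothetical and were chosen only so that one gathered value is missing.

```r
library(tidyr)

# Hypothetical data with one missing value (not from the examples above).
df_na <- data.frame(
  name = c("A", "B"),
  q1   = c(10, NA),
  q2   = c(20, 40),
  stringsAsFactors = FALSE
)

# Selecting a range of columns and excluding the id column pick the same columns here.
by_range   <- gather(df_na, key = "quarter", value = "amount", q1:q2)
by_exclude <- gather(df_na, key = "quarter", value = "amount", -name)
all.equal(by_range, by_exclude)

# na.rm = TRUE drops the row whose gathered value is NA.
kept <- gather(df_na, key = "quarter", value = "amount", -name, na.rm = TRUE)
nrow(by_range)
nrow(kept)
```

Excluding the identifier column with a minus sign is usually the shorter form when most columns are to be gathered.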
spread()
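- Usage: spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
- data: the long table to convert
- key: the column whose values become the new column names
- value: the column whose values are spread across the new columns
- fill: the value to use for cells that would otherwise be missing after reshaping
- Purpose: spreads one column out into multiple columns
- Example data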

    > name <- c("A","A","A","B","B")
    > product <- c("P1","P2","P3","P1","P4")
    > price <- c(100,130,55,100,78)
    > df_long <- data.frame(name = name, product = product, price = price)
    > df_long
      name product price
    1    A      P1   100
    2    A      P2   130
    3    A      P3    55
    4    B      P1   100
    5    B      P4    78

- Spreading the column

    > df_long_expand <- spread(data = df_long, key = product, value = price)
    > df_long_expand
      name  P1  P2 P3 P4
    1    A 100 130 55 NA
    2    B 100  NA NA 78

The reshaped data frame contains missing values. To put a specific value in those cells instead of NA, use the fill argument.

    > spread(data = df_long, key = product, value = price,fill = 0)
      name  P1  P2 P3 P4
    1    A 100 130 55  0
    2    B 100   0  0 78

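Since spread() is described as the inverse of gather(), a round trip is a useful sanity check. The sketch below rebuilds the df_long example, spreads it, and gathers it back with na.rm = TRUE so that the NA padding introduced by spread() is dropped. The names wide_again and long_again are introduced here for illustration only.

```r
library(tidyr)

df_long <- data.frame(
  name    = c("A", "A", "A", "B", "B"),
  product = c("P1", "P2", "P3", "P1", "P4"),
  price   = c(100, 130, 55, 100, 78),
  stringsAsFactors = FALSE
)

# Long to wide: one column per product, NA where a name has no such product.
wide_again <- spread(df_long, key = product, value = price)

# Wide back to long, dropping the NA cells that spread() had to invent.
long_again <- gather(wide_again, key = "product", value = "price",
                     -name, na.rm = TRUE)

# Row order differs after the round trip, so sort both before comparing.
long_again <- long_again[order(long_again$name, long_again$product), ]
df_sorted  <- df_long[order(df_long$name, df_long$product), ]
all(long_again$price == df_sorted$price)
```

Without na.rm = TRUE the gathered table would have eight rows (two names times four products) instead of the original five.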
separate()
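- Usage: separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
- data: the data frame
- col: the column to split
- into: names of the new columns, given as a character vector
- sep: the separator to split the column on
- remove: whether to drop the original column after splitting
- Example data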

    > id <- c(1,2)
    > datetime <- c(as.POSIXlt("2015-12-31 13:23:44"), as.POSIXlt("2016-01-28 21:14:12"))
    > df <- data.frame(id = id, datetime = datetime)
    > df
      id            datetime
    1  1 2015-12-31 13:23:44
    2  2 2016-01-28 21:14:12

- Use separate() to split the datetime values into year, month, day, hour, minute, and second

    > # split into date and time
    > separate1 <- separate(df,col="datetime",into=c("date","time"),sep=" ",remove=FALSE)
    > separate1
      id            datetime       date     time
    1  1 2015-12-31 13:23:44 2015-12-31 13:23:44
    2  2 2016-01-28 21:14:12 2016-01-28 21:14:12
    >
    > separate2 <- separate(separate1,col="date",into=c("year","month","day"),sep="-",remove=FALSE)
    > separate2
      id            datetime       date year month day     time
    1  1 2015-12-31 13:23:44 2015-12-31 2015    12  31 13:23:44
    2  2 2016-01-28 21:14:12 2016-01-28 2016    01  28 21:14:12
    >
    > separate3 <- separate(separate2,col="time",into=c("hh","mm","ss"),sep=":",remove=TRUE)
    > separate3
      id            datetime       date year month day hh mm ss
    1  1 2015-12-31 13:23:44 2015-12-31 2015    12  31 13 23 44
    2  2 2016-01-28 21:14:12 2016-01-28 2016    01  28 21 14 12

Chained (pipe) form

    > df %>%
    +   separate(.,col="datetime",into=c("date","time"),sep=" ",remove=TRUE) %>%
    +   separate(.,col="date",into=c("year","month","day"),sep="-",remove=TRUE) %>%
    +   separate(.,col="time",into=c("hh","mm","ss"),sep=":",remove=TRUE)
      id year month day hh mm ss
    1  1 2015    12  31 13 23 44
    2  2 2016    01  28 21 14 12

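Because sep is interpreted as a regular expression, the three separate() steps can also be collapsed into one call. The sketch below is one possible variant, not the approach used above: it stores the datetimes as plain strings (an assumption made here to sidestep time-zone handling) and adds convert = TRUE so the pieces come back as integers. The names df_chr and parts are hypothetical.

```r
library(tidyr)

# Datetimes kept as character strings; df_chr is a stand-in for df.
df_chr <- data.frame(
  id = c(1, 2),
  datetime = c("2015-12-31 13:23:44", "2016-01-28 21:14:12"),
  stringsAsFactors = FALSE
)

parts <- separate(
  df_chr,
  col = datetime,
  into = c("year", "month", "day", "hh", "mm", "ss"),
  sep = "[- :]",   # regular expression: split on hyphen, space, or colon
  convert = TRUE   # convert the new columns from character to integer
)
str(parts)
```

Note that convert = TRUE turns "01" into the integer 1, so leave it at the default FALSE if the leading zeros matter, for example when the pieces will be pasted back together later.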
unite()
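- The inverse of separate(): it merges several columns into one
- Usage: unite(data, col, ..., sep = "_", remove = TRUE)
- data: the data frame
- col: name of the new, combined column
- ...: the columns to combine
- sep: the string placed between the combined values, an underscore by default
- remove: whether to drop the columns that were combined
- Example (df1 here holds the separated columns, as produced by the chained separate() call above)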

    > df1
      id year month day hh mm ss
    1  1 2015    12  31 13 23 44
    2  2 2016    01  28 21 14 12
    > df1 %>%
    +   unite(.,col="date",year,month,day,sep="-") %>%
    +   unite(.,col="time",hh,mm,ss,sep=":") %>%
    +   unite(.,col="datetime",date,time,sep=" ")
      id            datetime
    1  1 2015-12-31 13:23:44
    2  2 2016-01-28 21:14:12

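The remove argument listed above is not exercised in the example, so here is a minimal sketch of remove = FALSE. The df1 built below is a hand-written stand-in with the same columns as the one printed above, stored as character strings, and with_date is a name introduced only for this illustration.

```r
library(tidyr)

# Stand-in for df1: the separated pieces as character columns.
df1 <- data.frame(
  id = c(1, 2),
  year = c("2015", "2016"), month = c("12", "01"), day = c("31", "28"),
  hh = c("13", "21"), mm = c("23", "14"), ss = c("44", "12"),
  stringsAsFactors = FALSE
)

# remove = FALSE keeps year, month, and day alongside the new date column.
with_date <- unite(df1, col = "date", year, month, day,
                   sep = "-", remove = FALSE)
names(with_date)
with_date$date
```

Keeping the source columns is handy when the combined value is only needed for display or joining while the individual parts are still used for filtering or grouping.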