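R has long been criticized for being slow. I recently ran an analysis on a fairly large dataset and it was unbearably slow. The bottleneck turned out to be indexed assignment into an R data frame, which is extremely slow:
Here is the test: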
gene_list = 1:100000
eQTL_mat = matrix(nrow = length(gene_list), ncol = 7)  # create a matrix
eQTL_df = as.data.frame(matrix(nrow = length(gene_list), ncol = 7))  # create a data frame
eQTL_list = replicate(length(gene_list), list())  # create a list
try_func = function() return(1:7)

# test eQTL
system.time(
  sapply(gene_list, function(x) return(try_func()))
)
### user system elapsed
### 0.108 0.001 0.108
system.time(
  for (gene_ind in 1:length(gene_list)) {
    eQTL_mat[gene_ind, ] = try_func()
  }
)
### user system elapsed
### 0.137 0.000 0.138
system.time(
  for (gene_ind in 1:length(gene_list)) {
    eQTL_df[gene_ind, ] = try_func()
  }
)
### user system elapsed
### 90.623 165.868 259.065
system.time(
  for (gene_ind in 1:length(gene_list)) {
    eQTL_list[[gene_ind]] = 1:7
  }
)
### user system elapsed
### 0.089 0.000 0.090
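See the results? Shocking! Row-by-row assignment into the data frame took about 259 seconds elapsed, while the matrix and the list each finished in roughly 0.1 seconds. Data frames really are not suited to this kind of work on large data!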
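One possible way to act on these timings is to do the row-by-row filling in a matrix or a list and only build the data frame once at the end. The following is a minimal sketch, not a definitive recipe. It assumes every column has the same type (as in the test, where each row is `1:7`), and the names `n_genes`, `res_mat`, `res_list`, `res_df` and `res_df2` are hypothetical ones introduced for illustration. The size is reduced so the sketch runs quickly.

```r
# Hypothetical, smaller problem size so the sketch finishes fast
n_genes <- 1000
try_func <- function() 1:7

# Option 1: fill a preallocated matrix row by row,
# then convert to a data frame once, after the loop
res_mat <- matrix(NA_integer_, nrow = n_genes, ncol = 7)
for (gene_ind in seq_len(n_genes)) {
  res_mat[gene_ind, ] <- try_func()
}
res_df <- as.data.frame(res_mat)

# Option 2: collect each row in a preallocated list,
# then bind all rows in a single call
res_list <- vector("list", n_genes)
for (gene_ind in seq_len(n_genes)) {
  res_list[[gene_ind]] <- try_func()
}
res_df2 <- as.data.frame(do.call(rbind, res_list))
```

Either way, the data frame assignment inside the loop is avoided entirely. If the columns need different types, a matrix will not work as-is, and the list-based variant would need to be adapted.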