Working with large datasets

There are three issues to consider when working with large datasets:

(a) efficient programming to speed execution,

(b) storing data externally to limit memory issues,

(c) using specialized statistical routines designed to efficiently analyze massive amounts of data.

Efficient programming

1. Vectorize calculations when possible. Use R’s built-in functions for manipulating vectors, matrices, and lists (for example, sapply, lapply, and mapply) and avoid loops (for and while) when feasible.

2. Use matrices rather than data frames (they have less overhead).

3. When using the read.table() family of functions to input external data into data frames, specify the colClasses and nrows options explicitly, set comment.char = "", and specify "NULL" for columns that aren’t needed. This decreases memory usage and can speed up processing considerably. When reading external data into a matrix, use the scan() function instead.

4. Test programs on subsets of the data, in order to optimize code and remove bugs, before attempting a run on the full dataset.

5. Delete temporary objects and objects that are no longer needed. The call rm(list=ls()) will remove all objects from memory, providing a clean slate. Specific objects can be removed with rm(object).

6. Use the function .ls.objects() described in Jeromy Anglim’s blog entry “Memory Management in R: A Few Tips and Tricks” (jeromyanglim.blogspot.com), to list all workspace objects sorted by size (MB). This function will help you find and deal with memory hogs.

7. Profile your programs to see how much time is being spent in each function. You can accomplish this with the Rprof() and summaryRprof() functions. The system.time() function can also help. The profr and proftools packages provide functions that can help in analyzing profiling output.

8. The Rcpp package can be used to transfer R objects to C++ functions and back when more optimized subroutines are needed.
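Several of the tips above can be sketched in a few lines of base R. The example below is only an illustration under stated assumptions: the toy matrix, the temporary CSV file, and all variable names are hypothetical and do not come from the original text. It compares an explicit loop with a vectorized call (tip 1), reads a file with colClasses, nrows, and comment.char set explicitly while skipping an unneeded column (tip 3), times a call with system.time() (tip 7), and removes objects that are no longer needed (tip 5).

```r
# Hypothetical toy data: a 10,000 x 10 numeric matrix.
set.seed(1234)
m <- matrix(rnorm(1e5), ncol = 10)

# Tip 1: column sums with explicit loops ...
loop_sums <- numeric(ncol(m))
for (j in seq_len(ncol(m))) {
  for (i in seq_len(nrow(m))) {
    loop_sums[j] <- loop_sums[j] + m[i, j]
  }
}
# ... versus the vectorized built-in.
vec_sums <- colSums(m)
same_result <- isTRUE(all.equal(loop_sums, vec_sums))

# Tip 3: write a small hypothetical CSV file, then read it back with
# explicit column classes. "NULL" drops the third column on input.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id    = 1:5,
                     score = c(1.5, 2.5, 3.5, 4.5, 5.5),
                     note  = letters[1:5]),
          tmp, row.names = FALSE)
df <- read.table(tmp, header = TRUE, sep = ",",
                 colClasses = c("integer", "numeric", "NULL"),
                 nrows = 5, comment.char = "")

# Tip 7: time a single call (user, system, and elapsed seconds).
timing <- system.time(colSums(m))

# Tip 5: remove temporary objects and files once they are no longer needed.
unlink(tmp)
rm(loop_sums, tmp)
```

On real data the savings from colClasses and nrows grow with file size, so it is worth testing on a subset first (tip 4) before running against the full file.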

Storing data outside of RAM

There are several packages available for storing data outside of R’s main memory.

Analytic packages for large datasets

The biglm and speedglm packages fit linear and generalized linear models to large datasets in a memory efficient manner. This offers lm() and glm() type functionality when dealing with massive datasets.
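As a rough sketch of how this looks in practice, the example below fits a linear model in chunks with biglm() and update(), so that only one chunk needs to be in memory at a time. The simulated data, the chunk size, and the variable names are assumptions made for illustration; in a real application each chunk would typically be read from a file or database. The biglm package must be installed.

```r
library(biglm)

# Hypothetical simulated data, split into three chunks of 1,000 rows.
set.seed(1234)
n <- 3000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)
chunks <- split(dat, rep(1:3, each = n / 3))

# Fit on the first chunk, then fold in the remaining chunks one at a time.
fit <- biglm(y ~ x1 + x2, data = chunks[[1]])
for (chunk in chunks[-1]) {
  fit <- update(fit, chunk)
}

# Coefficients from the chunked fit and from an ordinary lm() fit on all rows.
chunked_coef <- coef(fit)
full_coef <- coef(lm(y ~ x1 + x2, data = dat))
```

The two sets of coefficients should agree up to numerical precision; the point of the chunked fit is that it never needs the full data frame in memory.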

Several packages offer analytic functions for working with the massive matrices produced by the bigmemory package. The biganalytics package offers k-means clustering, column statistics, and a wrapper to biglm. The bigtabulate package provides table(), split(), and tapply() functionality, and the bigalgebra package provides advanced linear algebra functions.

The biglars package offers least-angle regression, lasso, and stepwise regression for datasets that are too large to be held in memory, when used in conjunction with the ff package.

The Brobdingnag package can be used to manipulate large numbers (numbers larger than 2^1024).
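A minimal sketch of the idea, assuming the Brobdingnag package is installed: an ordinary double overflows to Inf at 2^1024, whereas a brob object stores the number by its logarithm and so can represent far larger values. The exponent 2000 and the variable names are arbitrary choices for illustration.

```r
library(Brobdingnag)

# In ordinary double precision, 2^1024 overflows.
overflow <- 2^1024

# The same kind of calculation with a brob object stays finite internally.
big <- as.brob(2)^2000

# The natural log of the result comes back as an ordinary number.
log_big <- log(big)
```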

See also the CRAN Task View “High-Performance and Parallel Computing with R” (cran.r-project.org/web/views).

tanhao2013@foxmail.com || http://weibo.com/buttonwood

Original article: https://www.cnblogs.com/buttonwood/p/2593953.html