
Hash Table Performance in R: Part I (Repost)

What Is It?

A hash table, or associative array, is a well-known key-value data structure. In R there is no direct equivalent, but you do have some options. You can use a vector of any type, a list, or an environment.

But as you’ll see, the performance of each of these options is compromised in some way. In the average case, a hash table lookup for a key should take constant time, or O(1), while in the worst case it takes O(n) time, n being the number of elements in the hash table.
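To make the average case versus worst case distinction concrete, here is a toy hash table with separate chaining. It is only an illustrative sketch: the names toy_hash, toy_new, toy_set and toy_get are made up for this example, and the hash function (a sum of character codes) is deliberately naive.

```r
# Toy hash table with separate chaining, for illustration only.
# All function names here are hypothetical, not part of R or any package.

# Map a string to a bucket number in 1..n_buckets by summing its code points.
toy_hash <- function(key, n_buckets) {
    (sum(utf8ToInt(key)) %% n_buckets) + 1L
}

# A table is a list of buckets; each bucket is a named list of values.
toy_new <- function(n_buckets = 8L) {
    replicate(n_buckets, list(), simplify = FALSE)
}

# Store a (non-NULL) value under key and return the updated table.
toy_set <- function(table, key, value) {
    b <- toy_hash(key, length(table))
    table[[b]][[key]] <- value
    table
}

# Look up key. Only one bucket is scanned, not the whole table.
toy_get <- function(table, key) {
    b <- toy_hash(key, length(table))
    table[[b]][[key]]
}

ht <- toy_new(8L)
ht <- toy_set(ht, "foo", 1L)
ht <- toy_set(ht, "bar", 2L)
toy_get(ht, "foo")
## [1] 1

# "ab" and "ba" have the same character sum, so they share a bucket.
toy_hash("ab", 8L) == toy_hash("ba", 8L)
## [1] TRUE
```

When keys spread evenly over the buckets, each lookup scans only a short bucket, which is the O(1) average case. If every key lands in the same bucket, as "ab" and "ba" do here, a lookup degrades to scanning all n entries, which is the O(n) worst case.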

For the tests below, we’ll implement a hash table with a few R data structures and make some comparisons. We’ll create hash tables with only unique keys and then perform a search for every key in the table. Here’s our unique random string creator function:

library(plyr)
library(ggplot2)

# Create n unique strings. We use upper and lower case letters only. Create
# length 1 strings first, and if we haven't satisfied the number of strings
# requested, we'll create length 2 strings, and so on until we've created n
# strings.
#
# Returns them in random order
#
unique_strings <- function(n){
    string_i <- 1
    string_len <- 1
    ans <- character(n)
    chars <- c(letters,LETTERS)
    new_strings <- function(len,pfx){
        for(i in 1:length(chars)){
            if (len == 1){
                ans[string_i] <<- paste(pfx,chars[i],sep='')
                string_i <<- string_i + 1
            } else {
                new_strings(len-1,pfx=paste(pfx,chars[i],sep=''))
            }
            if (string_i > n) return()
        }
    }
    while(string_i <= n){
        new_strings(string_len,'')
        string_len <- string_len + 1
    }
    sample(ans)
}

Using a Vector

All vectors in R can be named, so let’s see how looking up a vector element by name compares to looking up an element by index.

Here’s our fake hash table creator:

# Create a named integer vector size n. Note that elements are initialized to 0
#
fake_hash_table <- function(n){
    ans <- integer(n)
    names(ans) <- unique_strings(n)
    ans
}

Now let’s create hash tables of size 2¹⁰ through 2¹⁵ and search each element:

timings1 <- adply(2^(10:15),.margins=1,.fun=function(i){
    ht <- fake_hash_table(i)
    data.frame(
        size=c(i,i),
        # [3] picks the elapsed time reported by system.time()
        seconds=c(
            system.time(for (j in 1:i)ht[names(ht)[j]]==0L)[3],
            system.time(for (k in 1:i)ht[k]==0L)[3]),
        index = c('1_string','2_numeric')
    )
})

We perform element lookup by using an equality test, so

ht[names(ht)[j]]==0L

performs the lookup using a named string, while

ht[k]==0L

performs the lookup by index.
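As a small illustration of the two lookup styles, here is a sketch on a three-element vector. The element names are made up for the example; the point is that both forms return the same element, but the lookup by name first has to find the position of the key.

```r
# A small named integer vector standing in for the "hash table".
ht <- c(apple = 0L, pear = 0L, plum = 0L)

by_name  <- ht["pear"]   # R has to find which name equals "pear"
by_index <- ht[2]        # R goes straight to position 2

# Both return the same named element.
identical(by_name, by_index)
## [1] TRUE

# match() makes the name search explicit: it returns the position of the key.
match("pear", names(ht))
## [1] 2
```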

p1 <- ggplot(timings1,aes(x=size,y=seconds,group=index)) +
    geom_point(aes(color=index)) + geom_line(aes(color=index)) +
    scale_x_log10(
        breaks=2^(10:15),
        labels=c(expression(2^10), expression(2^11),expression(2^12),
            expression(2^13), expression(2^14), expression(2^15))
    ) +
    theme(
        axis.text = element_text(colour = "black")
    ) +
    ggtitle('Search Time for Integer Vectors')
p1

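So we see that as the hash size increases, the time it takes to search all the keys by name grows much faster than linearly (roughly quadratically, since each lookup by name has to scan the names and we do one lookup per key), whereas searching by index stays close to flat at this scale.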

And note that we’re talking over 30 seconds to search the 2¹⁵ size hash table, and I’ve got a pretty performant laptop with 8 GB of memory, a 2.7 GHz Intel processor, and SSD drives.

Using a List

List elements can be named, right? They’re also more flexible than integer vectors since they can store any R object. What does their performance look like comparing name lookup to index lookup?

timings2 <- adply(2^(10:15),.margins=1,.fun=function(i){
    strings <- unique_strings(i)
    ht <- list()
    lapply(strings, function(s) ht[[s]] <<- 0L)
    data.frame(
        size=c(i,i),
        seconds=c(
            system.time(for (j in 1:i) ht[[strings[j]]]==0L)[3],
            system.time(for (k in 1:i) ht[[k]]==0L)[3]),
        index = c('1_string','2_numeric')
    )
})
p2 <- ggplot(timings2,aes(x=size,y=seconds,group=index)) +
    geom_point(aes(color=index)) + geom_line(aes(color=index)) +
    scale_x_log10(
        breaks=2^(10:15),
        labels=c(expression(2^10), expression(2^11),expression(2^12),
            expression(2^13), expression(2^14), expression(2^15))
    ) +
    theme(
        axis.text = element_text(colour = "black")
    ) +
    ggtitle('Search Time for Lists')
p2

A little better, but named indexing still grows much faster than linearly. We cut our 30 seconds down to over 6 seconds for the largest hash table, though.

Let’s see if we can do better.

Using an Environment

R environments store bindings of variables to values. In fact they are well suited to implementing a hash table since internally that’s how they are implemented!

e <- new.env()

# Assigning in the environment the usual way
with(e, foo <- 'bar')

e
## <environment: 0x3e81d48>
ls.str(e)
## foo :  chr "bar"

You can also use environments the same way one would use a list:

e$bar <- 'baz'
ls.str(e)
## bar :  chr "baz"
## foo :  chr "bar"

By default R environments are hashed, but you can also create them without hashing. Let’s evaluate the two:

timings3 <- adply(2^(10:15),.margins=1,.fun=function(i){
    strings <- unique_strings(i)
    ht1 <- new.env(hash=TRUE)
    ht2 <- new.env(hash=FALSE)
    lapply(strings, function(s){ ht1[[s]] <<- 0L; ht2[[s]] <<- 0L;})
    data.frame(
        size=c(i,i),
        seconds=c(
            system.time(for (j in 1:i) ht1[[strings[j]]]==0L)[3],
            system.time(for (k in 1:i) ht2[[strings[k]]]==0L)[3]),
        envir = c('2_hashed','1_unhashed')
    )
})

Note that we’re performing the lookup by named string in both cases. The only difference is that ht1 is hashed and ht2 is not.

p3 <- ggplot(timings3,aes(x=size,y=seconds,group=envir)) +
    geom_point(aes(color=envir)) + geom_line(aes(color=envir)) +
    scale_x_log10(
        breaks=2^(10:15),
        labels=c(expression(2^10), expression(2^11),expression(2^12),
            expression(2^13), expression(2^14), expression(2^15))
    ) +
    theme(
        axis.text = element_text(colour = "black")
    ) +
    ggtitle('Search Time for Environments')
p3

Bam! See that blue line? That’s near constant time for searching the entire 2¹⁵ size hash table!

An interesting note about un-hashed environments: they are implemented with lists under the hood! You’d expect the red line of the above plot to mirror the red line of the list plot, but they differ ever so slightly in performance, with lists being faster.

In Conclusion

When you need a true associative array indexed by string, then you definitely want to use R hashed environments. They are more flexible than vectors as they can store any R object, and they are faster than lists since they are hashed. Plus you can work with them just like lists.
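For readers who want to check this on their own machine without plyr or ggplot2, here is a rough base R sketch that times lookup by name for all three structures at a single size. It is not the benchmark used above: the keys are simple generated strings rather than the output of unique_strings(), the size 2^12 is an arbitrary choice, and the helper time_lookups is a name invented for this example.

```r
# Compare lookup by name across the three structures at one size.
# Assumption: generated keys "k1", "k2", ... stand in for unique_strings().
n    <- 2^12
keys <- paste0("k", seq_len(n))

# Named integer vector, list, and hashed environment holding the same keys.
vec <- integer(n)
names(vec) <- keys
lst <- setNames(as.list(integer(n)), keys)
env <- new.env(hash = TRUE)
for (k in keys) env[[k]] <- 0L

# Time one lookup per key and return the elapsed seconds.
time_lookups <- function(lookup) {
    system.time(for (k in keys) lookup(k))[["elapsed"]]
}

timings <- c(
    vector      = time_lookups(function(k) vec[k] == 0L),
    list        = time_lookups(function(k) lst[[k]] == 0L),
    environment = time_lookups(function(k) env[[k]] == 0L)
)
timings
```

The absolute numbers will differ by machine and R version, so treat them as a sanity check of the ordering rather than a reproduction of the plots.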

But using environments comes with caveats. I’ll explain in Part II.

Reposted from: http://jeffreyhorner.tumblr.com/post/114524915928/hash-table-performance-in-r-part-i

Original address: https://www.cnblogs.com/payton/p/4364759.html