  • Hash Table Performance in R: Part I (reposted)

    What Is It?

    A hash table, or associative array, is a well known key-value data structure. In R there is no equivalent, but you do have some options. You can use a vector of any type, a list, or an environment.

    But as you’ll see, with all of these options performance is compromised in some way. In the average case a hash table lookup for a key should perform in constant time, or O(1), while in the worst case it will perform in O(n) time, n being the number of elements in the hash table.
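
    As a minimal sketch of the three options mentioned above, here is the same key lookup written against a named vector, a list, and an environment. The keys and values are made up for illustration and are not part of the benchmarks that follow.

```r
# Three base-R containers used as key-value stores (illustrative keys/values).
v <- c(apple = 1L, banana = 2L)          # named integer vector: one value type only
l <- list(apple = 1L, banana = "two")    # named list: values of any type
e <- new.env()                           # environment: bindings of names to values
assign("apple", 1L, envir = e)
assign("banana", "two", envir = e)

v[["banana"]]              # lookup by name in a vector
l[["banana"]]              # lookup by name in a list
get("banana", envir = e)   # lookup by name in an environment
```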

    For the tests below, we’ll implement a hash table with a few R data structures and make some comparisons. We’ll create hash tables with only unique keys and then perform a search for every key in the table. Here’s our unique random string creator function:

    library(plyr)
    library(ggplot2)
    
    # Create n unique strings. We use upper and lower case letters only. Create
    # length 1 strings first, and if we haven't satisfied the number of strings
    # requested, we'll create length 2 strings, and so on until we've created n
    # strings.
    #
    # Returns them in random order
    #
    unique_strings <- function(n){
        string_i <- 1
        string_len <- 1
        ans <- character(n)
        chars <- c(letters,LETTERS)
        new_strings <- function(len,pfx){
            for(i in seq_along(chars)){
                if (len == 1){
                    ans[string_i] <<- paste(pfx,chars[i],sep='')
                    string_i <<- string_i + 1
                } else {
                    new_strings(len-1,pfx=paste(pfx,chars[i],sep=''))
                }
                if (string_i > n) return()
            }
        }
        while(string_i <= n){
            new_strings(string_len,'')
            string_len <- string_len + 1
        }
        sample(ans)
    }
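
    For comparison, here is a more compact way to build the same kind of key set. This is not the function used for the timings in this post; unique_strings_alt is a hypothetical name, and it generates each whole length class at once with expand.grid, so it may build more strings than it returns.

```r
# Hypothetical alternative to unique_strings(): build all strings of length 1,
# then length 2, and so on, until at least n exist, then keep n of them.
unique_strings_alt <- function(n) {
    chars <- c(letters, LETTERS)
    ans <- character(0)
    len <- 1
    while (length(ans) < n) {
        # Cartesian product of `chars` with itself, `len` times
        grid <- expand.grid(rep(list(chars), len), stringsAsFactors = FALSE)
        ans <- c(ans, do.call(paste0, unname(as.list(grid))))
        len <- len + 1
    }
    sample(ans[seq_len(n)])   # return the first n strings in random order
}

s <- unique_strings_alt(100)
length(s)          # number of strings returned
anyDuplicated(s)   # 0 means every string is unique
```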

    Using a Vector

    All vectors in R can be named, so let’s see how looking up a vector element by name compares to looking up an element by index.

    Here’s our fake hash table creator:

    # Create a named integer vector size n. Note that elements are initialized to 0
    #
    fake_hash_table <- function(n){
        ans <- integer(n)
        names(ans) <- unique_strings(n)
        ans
    }

    Now let’s create hash tables of size 2^10 through 2^15 and search for each element:

    timings1 <- adply(2^(10:15),.margins=1,.fun=function(i){
        ht <- fake_hash_table(i)
        data.frame(
            size=c(i,i),
            seconds=c(
                system.time(for (j in 1:i)ht[names(ht)[j]]==0L)[3],
                system.time(for (k in 1:i)ht[k]==0L)[3]),
            index = c('1_string','2_numeric')
        )
    })

    We perform element lookup by using an equality test, so

    ht[names(ht)[j]]==0L

    performs the lookup using a named string, while

    ht[k]==0L

    performs the lookup by index.
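
    To make the two kinds of lookup concrete, here they are on a three-element vector. This is a small sketch with made-up names, showing that both subscripts reach the same element by different routes.

```r
ht <- c(a = 0L, b = 0L, c = 0L)   # small named integer vector (illustrative)

by_name  <- ht[names(ht)[2]]      # character subscript: "b" must be matched against names(ht)
by_index <- ht[2]                 # integer subscript: direct positional access

identical(by_name, by_index)      # both subscripts select the same element
by_name == 0L                     # the equality test used inside the timing loops
```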

    p1 <- ggplot(timings1,aes(x=size,y=seconds,group=index)) + 
        geom_point(aes(color=index)) + geom_line(aes(color=index)) + 
        scale_x_log10(
        breaks=2^(10:15),
        labels=c(expression(2^10), expression(2^11),expression(2^12),
            expression(2^13), expression(2^14), expression(2^15))
        ) +
        theme(
        axis.text = element_text(colour = "black")
        ) + 
        ggtitle('Search Time for Integer Vectors')
    p1

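    So we see that as the hash size increases, the time it takes to search all the keys by name grows much faster than linearly, whereas the time to search by index stays nearly flat. Strictly speaking the growth looks closer to quadratic than exponential, which is what you would expect if each name lookup has to scan the names one by one.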
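
    One way to see why searching every key by name can get expensive is to model a name lookup as a plain left-to-right scan and count the comparisons. This is only a cost model, not R's actual internals, and scan_lookup is a hypothetical helper.

```r
# Hypothetical helper: find `key` by scanning the names one at a time and
# report how many comparisons were needed.
scan_lookup <- function(x, key) {
    nms <- names(x)
    for (pos in seq_along(nms)) {
        if (nms[pos] == key) return(list(value = x[[pos]], comparisons = pos))
    }
    list(value = NULL, comparisons = length(nms))
}

n <- 1000
x <- setNames(integer(n), paste0("k", seq_len(n)))

# Look up every key once and add up the comparisons
total <- sum(vapply(names(x), function(k) scan_lookup(x, k)$comparisons, numeric(1)))
total              # total comparisons over all n lookups
n * (n + 1) / 2    # closed form: grows with the square of n
```

    Under this model, doubling the number of keys roughly quadruples the work of searching for all of them.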

    And note that we’re talking over 30 seconds to search the hash table of size 2^15, and I’ve got a pretty performant laptop with 8 GB of memory, a 2.7 GHz Intel processor, and SSD drives.

    Using a List

    List elements can be named, right? They’re also more flexible than integer vectors since they can store any R object. What does their performance look like comparing name lookup to index lookup?

    timings2 <- adply(2^(10:15),.mar=1,.fun=function(i){
        strings <- unique_strings(i)
        ht <- list()
        lapply(strings, function(s) ht[[s]] <<- 0L)
        data.frame(
            size=c(i,i),
            seconds=c(
                system.time(for (j in 1:i) ht[[strings[j]]]==0L)[3],
                system.time(for (k in 1:i) ht[[k]]==0L)[3]),
            index = c('1_string','2_numeric')
        )
    })
    p2 <- ggplot(timings2,aes(x=size,y=seconds,group=index)) + 
        geom_point(aes(color=index)) + geom_line(aes(color=index)) + 
        scale_x_log10(
        breaks=2^(10:15),
        labels=c(expression(2^10), expression(2^11),expression(2^12),
            expression(2^13), expression(2^14), expression(2^15))
        ) +
        theme(
        axis.text = element_text(colour = "black")
        ) +
        ggtitle('Search Time for Lists')
    p2

    A little better, but lookup by name still grows much faster than linearly. We did cut our 30 seconds down to a little over 6 seconds for the largest hash table, though.
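
    If you do use a list this way, it helps to know how its name lookups behave. The sketch below uses made-up keys to show that `[[` matches names exactly, that `$` accepts unique prefixes, and that a missing key yields NULL rather than an error.

```r
ht <- list(alpha = 1L, beta = 2L)   # illustrative keys

ht[["alpha"]]              # exact match on the full name
ht[["al"]]                 # no partial matching with `[[` by default
ht$al                      # `$` partial-matches a unique prefix on lists
is.null(ht[["missing"]])   # absent keys give NULL, not an error
```

    For hash-table style use, `[[` is the safer accessor, since a prefix of an existing key will not be mistaken for that key.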

    Let’s see if we can do better.

    Using an Environment

    R environments store bindings of variables to values. In fact they are well suited to implementing a hash table since internally that’s how they are implemented!

    e <- new.env()
    
    # Assigning in the environment the usual way
    with(e, foo <- 'bar')
    
    e
    ## <environment: 0x3e81d48>
    ls.str(e)
    ## foo :  chr "bar"

    You can also use environments the same way one would use a list:

    e$bar <- 'baz'
    ls.str(e)
    ## bar :  chr "baz"
    ## foo :  chr "bar"
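
    Beyond `$`, base R has functions that cover the usual hash table operations on an environment. A short sketch with the same toy keys:

```r
e <- new.env()
e[["foo"]] <- "bar"                 # same effect as e$foo <- "bar"
assign("bar", "baz", envir = e)     # function form of assignment

exists("foo", envir = e, inherits = FALSE)   # key membership test
get("bar", envir = e)                        # fetch one value
mget(c("foo", "bar"), envir = e)             # fetch several values as a named list
ls(e)                                        # list the keys
rm("foo", envir = e)                         # delete a key
length(e)                                    # number of remaining bindings
```

    Passing inherits = FALSE keeps exists() from also searching the environment's parents.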

    By default R environments are hashed, but you can also create them without hashing. Let’s evaluate the two:

    timings3 <- adply(2^(10:15),.mar=1,.fun=function(i){
        strings <- unique_strings(i)
        ht1 <- new.env(hash=TRUE)
        ht2 <- new.env(hash=FALSE)
        lapply(strings, function(s){ ht1[[s]] <<- 0L; ht2[[s]] <<- 0L;})
        data.frame(
            size=c(i,i),
            seconds=c(
                system.time(for (j in 1:i) ht1[[strings[j]]]==0L)[3],
                system.time(for (k in 1:i) ht2[[strings[k]]]==0L)[3]),
            envir = c('2_hashed','1_unhashed')
        )
    })

    Note that we’re performing the lookup by named string in both cases. The only difference is that ht1 is hashed and ht2 is not.
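
    Here is a smaller sketch of the same setup, with made-up keys rather than the benchmark strings. It shows that the two kinds of environment expose the same interface, and that environments are modified in place, so a plain assignment inside a loop is enough. The size argument is only an initial size for the hashed table and is optional.

```r
n <- 1000L
keys <- paste0("key", seq_len(n))            # illustrative keys

hashed   <- new.env(hash = TRUE, size = n)   # size: initial hash table size
unhashed <- new.env(hash = FALSE)

for (k in keys) {
    hashed[[k]]   <- 0L    # environments have reference semantics,
    unhashed[[k]] <- 0L    # so `<-` updates them in place
}

# Same contents through the same interface
all(vapply(keys, function(k) identical(hashed[[k]], unhashed[[k]]), logical(1)))
```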

    p3 <- ggplot(timings3,aes(x=size,y=seconds,group=envir)) + 
        geom_point(aes(color=envir)) + geom_line(aes(color=envir)) + 
        scale_x_log10(
        breaks=2^(10:15),
        labels=c(expression(2^10), expression(2^11),expression(2^12),
            expression(2^13), expression(2^14), expression(2^15))
        ) +
        theme(
        axis.text = element_text(colour = "black")
        ) +
        ggtitle('Search Time for Environments')
    p3

    Bam! See that blue line? That’s near constant time for searching the entire hash table of size 2^15!

    An interesting note about un-hashed environments: they are implemented with lists underneath the hood! You’d expect the red line of the above plot to mirror the red line of the list plot, but they differ ever so slightly in performance, with lists being faster.

    In Conclusion

    When you need a true associative array indexed by string, then you definitely want to use R hashed environments. They are more flexible than vectors as they can store any R object, and they are faster than lists since they are hashed. Plus you can work with them just like lists.
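
    Putting that together, here is a minimal sketch of a hash table interface on top of a hashed environment. The ht_* function names are hypothetical, and using emptyenv() as the parent is one design choice among several; it keeps lookups from falling through to other environments.

```r
# Hypothetical thin wrappers around a hashed environment.
ht_new  <- function() new.env(hash = TRUE, parent = emptyenv())
ht_has  <- function(ht, key) exists(key, envir = ht, inherits = FALSE)
ht_set  <- function(ht, key, value) {
    assign(key, value, envir = ht)
    invisible(ht)
}
ht_get  <- function(ht, key, default = NULL) {
    # Return `default` for missing keys instead of raising an error
    if (ht_has(ht, key)) get(key, envir = ht, inherits = FALSE) else default
}
ht_del  <- function(ht, key) {
    if (ht_has(ht, key)) rm(list = key, envir = ht)
    invisible(ht)
}
ht_keys <- function(ht) ls(ht, all.names = TRUE)

ht <- ht_new()
ht_set(ht, "answer", 42L)
ht_set(ht, "model", lm)           # values can be any R object
ht_get(ht, "answer")
ht_get(ht, "nope", default = NA)  # missing key falls back to the default
ht_del(ht, "model")
ht_keys(ht)
```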

    But using environments comes with caveats. I’ll explain in Part II.

    Reposted from: http://jeffreyhorner.tumblr.com/post/114524915928/hash-table-performance-in-r-part-i

  • Original address: https://www.cnblogs.com/payton/p/4364759.html