  • Computing the median of a Spark RDD

    lookup(key)

    Return the list of values in the RDD for the given key. This operation is efficient when the RDD has a known partitioner, because only the partition that the key maps to is searched.

    >>> l = range(1000)
    >>> rdd = sc.parallelize(zip(l, l), 10)
    >>> rdd.lookup(42)  # slow
    [42]
    >>> sorted = rdd.sortByKey()
    >>> sorted.lookup(42)  # fast
    [42]
    >>> sorted.lookup(1024)
    []
    >>> rdd2 = sc.parallelize([(('a', 'b'), 'c')]).groupByKey()
    >>> list(rdd2.lookup(('a', 'b'))[0])
    ['c']
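
To illustrate why the lookup on the sorted RDD is the fast one, here is a toy, Spark-free model of a range-partitioned dataset. The names `ToyRangePartitioned`, `bounds`, and `scan_all` are hypothetical and not part of Spark, and the sketch assumes distinct keys; Spark's own partitioner chooses its boundaries differently. It only shows the idea that knowing which partition a key maps to lets the lookup search one partition instead of all of them.

```python
from bisect import bisect_left


class ToyRangePartitioned:
    """Toy stand-in for a key-sorted, range-partitioned pair RDD (not Spark's API)."""

    def __init__(self, pairs, num_partitions):
        ordered = sorted(pairs)
        # Ceiling division so that at most num_partitions chunks are produced.
        size = max(1, -(-len(ordered) // num_partitions))
        self.partitions = [ordered[i:i + size]
                           for i in range(0, len(ordered), size)] or [[]]
        # Largest key of every partition except the last one (assumes distinct keys).
        self.bounds = [part[-1][0] for part in self.partitions[:-1]]

    def lookup(self, key):
        # The bounds identify the single partition that can hold the key.
        idx = bisect_left(self.bounds, key)
        return [v for k, v in self.partitions[idx] if k == key]


def scan_all(partitions, key):
    """Without a known partitioner, every partition has to be searched."""
    return [v for part in partitions for k, v in part if k == key]
```

Both functions return the same values; the difference is that `lookup` touches one partition while `scan_all` touches all of them.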


    You need to sort the RDD and take the middle element, or the average of the two middle elements when the count is even. Here is an example with RDD[Int]:

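      import org.apache.spark.SparkContext._  // pair-RDD implicits (only required on older Spark versions)
      import org.apache.spark.rdd.RDD

      val rdd: RDD[Int] = ???  // placeholder: supply your own data

      // Sort, then key each value by its rank: (index, value)
      val sorted = rdd.sortBy(identity).zipWithIndex().map {
        case (v, idx) => (idx, v)
      }.cache()  // reused by count() and by each lookup()

      val count = sorted.count()

      val median: Double = if (count % 2 == 0) {
        // even count: average the two middle elements
        val l = count / 2 - 1
        val r = l + 1
        (sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
      } else sorted.lookup(count / 2).head.toDouble  // odd count: the middle element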


    Experiment:
    all_data = sc.parallelize([25,1,2,3,4,5,6,7,8,100])
    all_data.sortBy(lambda x:x).zipWithIndex().map(lambda x: (x[1],x[0])).collect()
    [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 25), (9, 100)]
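
The experiment stops at the indexed pairs. A possible PySpark port of the Scala median logic is sketched below. The helper names `median_by_rank` and `rdd_median` are hypothetical and not part of PySpark; the sketch assumes a numeric RDD and uses only the RDD methods `sortBy`, `zipWithIndex`, `map`, `cache`, `count`, and `lookup`. Raising an error on an empty RDD is a choice made for this sketch, not behaviour taken from the original code.

```python
def median_by_rank(count, value_at):
    """Median of `count` sorted values; value_at(i) returns the i-th smallest."""
    if count == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = count // 2
    if count % 2 == 0:
        # Even count: average the two middle elements.
        return (value_at(mid - 1) + value_at(mid)) / 2.0
    # Odd count: the single middle element.
    return float(value_at(mid))


def rdd_median(rdd):
    """Median of a numeric RDD via sortBy + zipWithIndex + lookup."""
    # Key every value by its rank in sorted order: (index, value).
    indexed = (rdd.sortBy(lambda x: x)
                  .zipWithIndex()
                  .map(lambda pair: (pair[1], pair[0]))
                  .cache())  # reused by count() and by each lookup()
    return median_by_rank(indexed.count(), lambda i: indexed.lookup(i)[0])
```

With the `all_data` RDD from the experiment, `rdd_median(all_data)` should give 5.5, the average of the values at indices 4 and 5 (5 and 6).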



  • Original article: https://www.cnblogs.com/bonelee/p/7154234.html