zoukankan html css js c++ java

spark rdd median 中位数求解

lookup(key)

Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.

>>> l = range(1000)
>>> rdd = sc.parallelize(zip(l, l), 10)
>>> rdd.lookup(42)  # slow
[42]
>>> sorted = rdd.sortByKey()
>>> sorted.lookup(42)  # fast
[42]
>>> sorted.lookup(1024)
[]
>>> rdd2 = sc.parallelize([(('a', 'b'), 'c')]).groupByKey()
>>> list(rdd2.lookup(('a', 'b'))[0])
['c']

You need to sort RDD and take element in the middle or average of two elements. Here is example with RDD[Int]:

  import org.apache.spark.SparkContext._

  val rdd: RDD[Int] = ???

  val sorted = rdd.sortBy(identity).zipWithIndex().map {
    case (v, idx) => (idx, v)
  }

  val count = sorted.count()

  val median: Double = if (count % 2 == 0) {
    val l = count / 2 - 1
    val r = l + 1
    (sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
  } else sorted.lookup(count / 2).head.toDouble


实验：

all_data = sc.parallelize([25,1,2,3,4,5,6,7,8,100])
all_data.sortBy(lambda x:x).zipWithIndex().map(lambda x: (x[1],x[0])).collect
[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 25), (9, 100)]

查看全文

相关阅读:
[Unity] 如何通过 C# 代码控制角色动画的播放
 [Unity] Unity粒子系统报错: Ensure Read/Write is enabled on the Particle System's Texture.
[Unity] Shader Graph Error 当前渲染管道与此主节点不兼容(The current render pipeline is not compatible with this master node)
[运维] 请求 nginx 出现 502 Bad Gateway 的解决方案!
[排错] SpringBoot 警告 Could not find acceptable representation
[理解] C++ 中的源文件和头文件
 [经验] 如何将 Java 项目发布到云服务器上并可以访问
 均衡与竞争
 指数退避算法
 判断两个多项式是否为同一个方程

原文地址：https://www.cnblogs.com/bonelee/p/7154234.html