  • Distinct Count on Big Data (1): Introduction

    Databases frequently need a Distinct Count operation, for example to find the number of students enrolled in each elective course:

    select course, count(distinct sid)
    from stu_table
    group by course;
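
    As a rough illustration of what count(distinct ...) with group by computes, the query can be sketched in plain Python. The rows and the helper name distinct_count_by_key are made up for this example and are not part of any database API:

```python
from collections import defaultdict

# Hypothetical (course, sid) rows standing in for stu_table.
rows = [
    ("math", 1), ("math", 2), ("math", 1),   # student 1 is listed twice
    ("physics", 2), ("physics", 3),
]

def distinct_count_by_key(pairs):
    """Return {key: number of distinct values seen for that key}."""
    seen = defaultdict(set)
    for key, value in pairs:
        seen[key].add(value)      # a set stores each value only once
    return {key: len(values) for key, values in seen.items()}

enrollment = distinct_count_by_key(rows)
print(enrollment)  # {'math': 2, 'physics': 2}
```

    Note that this exact approach keeps every distinct value in memory, so its cost grows with the cardinality of each group.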
    

    Hive

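    In big-data scenarios, one important report metric is UV (Unique Visitors), the number of distinct users within a given time period. For example, to see how users are distributed across apps over one week, the HiveQL is: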

    select app, count(distinct uid) as uv
    from log_table
    where week_cal = '2016-03-27'
    group by app;
    

    Pig

    Similarly, in Pig:

    -- all users
    define DISTINCT_COUNT(A, a) returns dist {
        B = foreach $A generate $a;
        unique_B = distinct B;
        C = group unique_B all;
        $dist = foreach C generate SIZE(unique_B);
    }
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = DISTINCT_COUNT(A, uid);
    
    -- <app, users>
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = distinct A;
    C = group B by app;
    D = foreach C generate group as app, COUNT($1) as uv;
    -- alternative using SIZE, suitable for small-cardinality scenarios
    D = foreach C generate group as app, SIZE($1) as uv;
    

    DataFu provides Pig with a cardinality-estimation UDF, datafu.pig.stats.HyperLogLogPlusPlus. It uses the HyperLogLog++ algorithm to compute an approximate distinct count more quickly:

    define HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = group A by app;
    C = foreach B generate group as app, HyperLogLogPlusPlus($1) as uv;
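
    To give a feel for how this kind of estimator works, here is a minimal sketch of the basic HyperLogLog algorithm in plain Python. It is not DataFu's code and omits the HyperLogLog++ refinements (sparse representation, bias correction); the class name, the SHA-1 based hash, and the default precision p=12 are assumptions chosen for the example:

```python
import hashlib
import math

class HyperLogLog:
    """Basic HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        if p < 7:
            raise ValueError("this sketch assumes p >= 7")
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        bits = 64 - self.p
        idx = h >> bits                            # first p bits pick a register
        rest = h & ((1 << bits) - 1)               # remaining bits
        rank = bits - rest.bit_length() + 1        # position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank             # keep the maximum rank seen

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)         # small-range (linear counting) correction
        return raw

hll = HyperLogLog()
for i in range(50000):
    hll.add(f"uid-{i}")
    hll.add(f"uid-{i}")      # duplicates do not change the registers
print(round(hll.count()))    # an estimate close to 50000, not an exact count
```

    The state is a fixed-size register array regardless of how many items are added, which is why such estimators are cheaper than an exact distinct count at high cardinality.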
    

    Spark

    In Spark, after loading the data, a chain of RDD transformations (map, distinct, reduceByKey) computes the distinct count:

    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .map { line => (line._1, 1) }
      .reduceByKey(_ + _)
    
    // or
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .mapValues{ _ => 1 }
      .reduceByKey(_ + _)
    
    // or 
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .map(_._1)
      .countByValue()
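
    The order of the steps matters: duplicates must be removed before counting. The following plain-Python sketch mirrors the distinct-then-count pipeline on a small in-memory list; the records are invented for the example, and set and Counter merely stand in for the RDD operations:

```python
from collections import Counter

# Hypothetical log records as (app, uid) pairs; repeats are repeat visits.
records = [
    ("mail", "u1"), ("mail", "u1"), ("mail", "u2"),
    ("maps", "u1"), ("maps", "u3"), ("maps", "u3"),
]

# Counting without deduplication gives visits per app, not unique visitors.
pv = Counter(app for app, _ in records)

# Deduplicate the (app, uid) pairs first (like .distinct()),
# then count occurrences of each app (like .map(_._1).countByValue()).
distinct_pairs = set(records)
uv = Counter(app for app, _ in distinct_pairs)

print(dict(sorted(pv.items())))  # {'mail': 3, 'maps': 3}
print(dict(sorted(uv.items())))  # {'mail': 2, 'maps': 2}
```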
    

    Spark also provides an API for approximate distinct counts:

    rdd.map { row => (row.app, row.uid) }
        .countApproxDistinctByKey(0.001)
    

    The implementation is based on the HyperLogLog algorithm:

    The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available here.
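
    The argument 0.001 passed to countApproxDistinctByKey is the target relative standard deviation. As a back-of-the-envelope sketch, the textbook HyperLogLog error of roughly 1.04 / sqrt(m) for m registers shows how quickly the sketch size grows as the target tightens. The helper name hll_precision and the lower bound of 4 are assumptions for this example, and Spark's internal mapping may use slightly different constants:

```python
import math

def hll_precision(relative_sd):
    """Smallest p with 1.04 / sqrt(2**p) <= relative_sd (textbook estimate)."""
    registers_needed = (1.04 / relative_sd) ** 2
    # Assumed floor of p = 4, a common minimum precision.
    return max(4, math.ceil(math.log2(registers_needed)))

for sd in (0.05, 0.01, 0.001):
    p = hll_precision(sd)
    print(f"relative_sd={sd}: p={p}, registers={2 ** p}")
```

    Under this approximation, a target of 0.001 needs about two million registers per key, so a looser target may be more practical when there are many keys.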

    Alternatively, convert an RDD that has a schema into a DataFrame, call registerTempTable, and then run the SQL statement:

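    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // brings rdd.toDF() into scope
    val df = rdd.toDF()
    df.registerTempTable("app_table")
    
    val appUsers = sqlContext.sql("select app, count(distinct uid) as uv from app_table group by app")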
    
  • Original article: https://www.cnblogs.com/en-heng/p/5332703.html