Distinct Count in Big Data (Part 1): Introduction

    Distinct Count is a common operation in databases; for example, counting how many students are enrolled in each course:

    select course, count(distinct sid)
    from stu_table
    group by course;
    

    Hive

    In big data reporting, a key metric is UV (Unique Visitors), i.e., the number of distinct users within a given time window. For example, to get per-app user counts for one week, the HiveQL is:

    select app, count(distinct uid) as uv
    from log_table
    where week_cal = '2016-03-27'
    group by app;
    

    Pig

    The Pig equivalent:

    -- total distinct users over the whole dataset
    define DISTINCT_COUNT(A, a) returns dist {
        B = foreach $A generate $a;    -- project the target column
        unique_B = distinct B;         -- deduplicate
        C = group unique_B all;        -- collapse into a single group
        $dist = foreach C generate SIZE(unique_B);
    }
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = DISTINCT_COUNT(A, uid);
    
    -- per-app distinct users: <app, uv>
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = distinct A;                    -- deduplicate <app, uid> pairs
    C = group B by app;
    D = foreach C generate group as app, COUNT($1) as uv;
    -- SIZE also works; suitable for small cardinality scenarios
    D = foreach C generate group as app, SIZE($1) as uv;
    

    DataFu provides Pig with a cardinality-estimation UDF, datafu.pig.stats.HyperLogLogPlusPlus, which uses the HyperLogLog++ algorithm for much faster Distinct Count:

    -- the DataFu jar must be registered first, e.g. register datafu-<version>.jar;
    define HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = group A by app;
    C = foreach B generate group as app, HyperLogLogPlusPlus($1) as uv;
    

    Spark

    In Spark, after loading the data, Distinct Count can be done with a chain of RDD transformations: map, distinct, and reduceByKey:

    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .map { line => (line._1, 1) }
      .reduceByKey(_ + _)
    
    // or
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .mapValues{ _ => 1 }
      .reduceByKey(_ + _)
    
    // or 
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .map(_._1)
      .countByValue()
    

    Spark also ships an API for approximate Distinct Count:

    rdd.map { row => (row.app, row.uid) }
      .countApproxDistinctByKey(0.001)
    
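    The argument is the target relative standard deviation of the estimate, so 0.001 asks for roughly 0.1% error in exchange for larger per-key sketches. For one overall figure rather than per-key counts, RDD also exposes countApproxDistinct; a minimal sketch, assuming the same rdd of rows with app and uid fields:

    // overall number of distinct uids, with a ~1% target standard error
    val totalUv = rdd.map(_.uid).countApproxDistinct(relativeSD = 0.01)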

    The implementation is based on the HyperLogLog algorithm; the Spark API docs note:

    The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm".
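    The core idea of HyperLogLog: hash every element, use the first p bits of the hash to pick one of 2^p registers, and keep in each register the maximum number of leading zeros observed in the remaining bits; a harmonic mean over the registers then yields the cardinality estimate. Below is a toy Scala sketch of that idea (illustrative only; ToyHLL is a made-up name, and the small- and large-range corrections from the paper are omitted):

    import scala.util.hashing.MurmurHash3

    // Toy HyperLogLog with 2^p registers, illustrating only the core idea.
    class ToyHLL(p: Int = 14) {
      private val m = 1 << p                  // number of registers
      private val registers = new Array[Int](m)

      def add(item: String): Unit = {
        val h = MurmurHash3.stringHash(item)
        val idx = h >>> (32 - p)              // first p bits select a register
        val rest = h << p                     // the remaining 32 - p bits
        val rank = Integer.numberOfLeadingZeros(rest) + 1
        registers(idx) = math.max(registers(idx), math.min(rank, 32 - p + 1))
      }

      def estimate(): Double = {
        val alpha = 0.7213 / (1 + 1.079 / m)  // bias-correction constant
        val harmonic = registers.map(r => math.pow(2.0, -r)).sum
        alpha * m * m / harmonic
      }
    }

    val hll = new ToyHLL()
    (1 to 100000).foreach(i => hll.add("user" + i))
    println(hll.estimate())                   // close to 100000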

    Alternatively, convert the schema-ed RDD into a DataFrame, register it as a temp table with registerTempTable, and run the SQL directly:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._   // required for rdd.toDF()

    val df = rdd.toDF()
    df.registerTempTable("app_table")

    val appUsers = sqlContext.sql("select app, count(distinct uid) as uv from app_table group by app")
    
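    The same aggregation can also be written with the DataFrame API instead of a temp table. A sketch using the aggregate functions from org.apache.spark.sql.functions as named in Spark 1.x (approxCountDistinct was later renamed approx_count_distinct):

    import org.apache.spark.sql.functions.{countDistinct, approxCountDistinct}

    // exact per-app distinct count
    val exactUv = df.groupBy("app").agg(countDistinct("uid").as("uv"))

    // HyperLogLog++-based estimate; same result shape, far cheaper at scale
    val approxUv = df.groupBy("app").agg(approxCountDistinct("uid").as("uv"))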
Original post: https://www.cnblogs.com/en-heng/p/5332703.html