zoukankan      html  css  js  c++  java
  • Hive Windowing and Analytics Functions

    LanguageManual WindowingAndAnalytics

    Windowing and Analytics Functions

    链接:

    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-EnhancementstoHiveQL

    大概:

    Window function

        - LEAD函数

        - LAG函数

        - FIRST_VALUE函数

        - LAST_VALUE函数

    OVER clause    ---    一个OVER子句仅适用于前面的函数调用

    analytics function

    DISTINCT support

    aggregate function

     例子:

    COUNT(DISTINCT a)  OVER (PARTITION BY ORDER BY ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

    其中    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING    应该就是window specification

    这些都是对Hive QL的增强.

    窗口和分析函数

    Windowing and Analytics Functions

    1. PARTITION BY with one partitioning column, no ORDER BY or window specification(PARTITION BY一个partitioning column,没有ORDER BYwindow 规范)
    2. PARTITION BY with two partitioning columns, no ORDER BY or window specification(PARTITION BY由两个partitioning columns,不指定ORDERBYWindow规范)
    3. PARTITION BY with one partitioning column, one ORDER BY column, and no window specification(PARTITION BY按一个partitioning column,一个ORDER BY column,而没有window规范)
    4. PARTITION BY with two partitioning columns, two ORDER BY columns, and no window specification(PARTITION BY由两个partitioning columns,两个ORDER BY columns,并且没有window规范(规格))
    5. PARTITION BY with partitioning, ORDER BY, and window specification(PARTITION BY分区、ORDER BYwindow规范)
    6. WINDOW clause(WINDOW 分句)
    7. LEAD using default 1 row lead and not specifying default value(LEAD[领导]使用默认1 row lead[1排引线]而不指定default value)
    8. LAG specifying a lag of 3 rows and default value of 0(LAG[走得极慢,落后],指定a lag of 3 rowsdefault value0)
    9. Distinct counting for each partition(每个partitionDistinct[明显的]计数)

    Enhancements to Hive QL

    Version

        Introduced in Hive version 0.11.

        This section introduces the Hive QL enhancements for windowing and analytics functions.

     本节介绍用于windowing and analyticsHive QL增强功能。

        See "Windowing Specifications in HQL" (attached to HIVE-4197) for details. 

        HIVE-896 has more information, including links to earlier documentation in the initial comments.(hive-896有更多信息,包括the initial comments[最初注释]中的links to earlier documentation)

        All of the windowing and analytics functions operate as per the SQL standard.(所有的windowing and analytics functions都按照the SQL standard)

        The current release supports the following functions for windowing and analytics:(current release[当前版本]支持以下windowing and analytics功能:)

    1. Windowing functions

    LEAD(领导)

        The number of rows to lead can optionally be specified. (可以选择指定要leadThe number of rows)If the number of rows to lead is not specified, the lead is one row.(如果未指定要leadthe number of rows,则lead one row)

        Returns null when the lead for the current row extends beyond the end of the window.(the current rowthe lead扩展到the end of the window,返回NULL)

    LAG(滞后)

        The number of rows to lag can optionally be specified.(可以选择指定要lag[滞后]The number of rows) If the number of rows to lag is not specified, the lag is one row.(如果未指定要lagthe number of rows,则lagone row)

    Returns null when the lag for the current row extends before the beginning of the window.(the current rowlagthe beginning of the window之前扩展时,返回NULL)

    FIRST_VALUE

        This takes at most two parameters. (这最多需要两个parameters)The first parameter is the column for which you want the first value, the second (optional) parameter must be a boolean which is false by default. (The first parameter是您想要第一个值的column,第二个(可选的)parameter必须是一个Boolean,默认情况下为false)If set to true it skips null values.(如果设置为true,则跳过null values)

    LAST_VALUE

        This takes at most two parameters.(这最多需要两个parameters) The first parameter is the column for which you want the last value, the second (optional) parameter must be a boolean which is false by default. If set to true it skips null values.(The first parameter是您想要最后一个值的column,第二个参数(可选)必须是一个boolean,默认情况下为false。如果设置为true,则跳过null values)

    2. The OVER clause(从句)

    OVER with standard aggregates ( aggregates函数配合OVER从句):

        -    COUNT

        -    SUM

        -    MIN

        -    MAX

        -    AVG

        OVER with a PARTITION BY statement with one or more partitioning columns of any primitive datatype(使用带有any primitive datatype[任何原始数据类型]one or more partitioning columnsPARTITION BY 语句结束。).

        OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns of any datatype.(OVER with PARTITION BY  ORDER BY,使用one or more partitioning and/or ordering columns of any datatype)

        OVER with a window specification.(OVER with a window 规范.) Windows can be defined separately in a WINDOW clause.(Windows可以在WINDOW clause中单独定义。) Window specifications support the following formats:(Window规范支持the following formats)

    这里是Window规范的格式

    UNBOUNDED:无限的

    PRECEDING:在前的

    FOLLOWING:后面的

    (ROWS | RANGE)
    BETWEEN 
    (UNBOUNDED | [num]) PRECEDING 
    AND 
    ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
    
     
    (ROWS | RANGE)
    BETWEEN 
    CURRENT ROW 
    AND 
    (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
    
     
    (ROWS | RANGE)
    BETWEEN 
    [num] FOLLOWING 
    AND 
    (UNBOUNDED | [num]) FOLLOWING

        When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(range between unbounded preceding and current).(ORDER BY没有指定Window从句时时,WINDOW规范默认为RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(range between unbounded preceding and current))

        When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.(ORDER BYWINDOW子句都丢失时,WINDOW规范默认为ROWUNBOUNDED PRECEDINGUNBOUNDED FOLLOWING)

        The OVER clause supports the following functions, but it does not support a window with them (see HIVE-4797):(OVER子句支持以下函数,但它不支持带有它们的window(参见hive-4797))

        Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.(排名功能:RankNTileDenseRankCumeDistPercentRank)

        Lead and Lag functions.

    3. Analytics functions

        - RANK(rank)

        ROW_NUMBER(row_number)

        DENSE_RANK(denser_rank)

        CUME_DIST(cume_dist)

        PERCENT_RANK(percent_rank)

        NTILE(ntile)

    4. Distinct support in Hive 2.1.0 and later(Hive 2.1.0及更高版本中的Distinct support)(see HIVE-9534)

        Distinct is supported for aggregation functions including SUM, COUNT and AVG, which aggregate over the distinct values within each partition.(包含SUMCOUNTAVG在内的聚合函数支持DISTIC,这些函数在每个partition内的the distinct values上进行聚合。) Current implementation has the limitation that no ORDER BY or window specification can be supported in the partitioning clause for performance reason. (当前实现的限制是,由于性能原因,the partitioning clause中不支持ORDER BYwindow规范。)The supported syntax is as follows.(支持的语法如下所示。)

    COUNT(DISTINCT a) OVER (PARTITION BY c)


    ORDER BY and window specification is supported for distinct in Hive 2.2.0 (see 
    HIVE-13453). (Hive2.2.0中支持ORDER BYwindow规范(参见hive-13453))An example is as follows.(一个例子如下。)

    COUNT(DISTINCT a) OVER (PARTITION BY ORDER BY ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

    5. Aggregate functions in OVER clause support in Hive 2.1.0 and later(Aggregate函数在Hive2.1.0和后来的条款支持) (see HIVE-13475)

    Support to reference aggregate functions within the OVER clause has been added. (OVER子句中添加了对引用aggregate函数的支持。)For instance, currently we can use the SUM aggregation function within the OVER clause as follows.(例如,目前我们可以在OVER子句中使用SUM聚合函数,如下所示。)

    SELECT rank() OVER (ORDER BY sum(b))

    FROM T

    GROUP BY a;

    Examples

    This section provides examples of how to use the Hive QL windowing and analytics functions in SELECT statements. (本节提供了如何在SELECT语句中使用Hive QL windowing and analytics functions的示例。)See HIVE-896 for additional examples.

    PARTITION BY with one partitioning column, no ORDER BY or window specification

    SELECT a, COUNT(b) OVER (PARTITION BY c)

    FROM T;

    PARTITION BY with two partitioning columns, no ORDER BY or window specification

    SELECT a, COUNT(b) OVER (PARTITION BY c, d)

    FROM T;

    PARTITION BY with one partitioning column, one ORDER BY column, and no window specification

    SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d)

    FROM T;

    PARTITION BY with two partitioning columns, two ORDER BY columns, and no window specification

    SELECT a, SUM(b) OVER (PARTITION BY c, d ORDER BY e, f)

    FROM T;

    PARTITION BY with partitioning, ORDER BY, and window specification

    SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

    FROM T;

    SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)

    FROM T;

    SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)

    FROM T;

    SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)

    FROM T;

     

        There can be multiple OVER clauses in a single query.(a single query中可以有多个OVER clauses) A single OVER clause only applies to the immediately preceding function call. (一个OVER子句仅适用于前面的函数调用)In this example, the first OVER clause applies to COUNT(b) and the second OVER clause applies to SUM(b)(在本例中,第一个OVER子句适用于COUNT(b),第二个OVER子句适用于SUM(b)):

    SELECT 

     a,

     COUNT(b) OVER (PARTITION BY c),

     SUM(b) OVER (PARTITION BY c)

    FROM T;

     

    Aliases can be used as well, with or without the keyword AS:(Aliases也可以使用,有或不带关键字AS,如下所示:)

    SELECT 

     a,

     COUNT(b) OVER (PARTITION BY c) AS b_count,

     SUM(b) OVER (PARTITION BY c) b_sum

    FROM T;

    WINDOW clause

    SELECT a, SUM(b) OVER w

    FROM T

    WINDOW w AS (PARTITION BY c ORDER BY d ROWS UNBOUNDED PRECEDING);

    LEAD using default 1 row lead and not specifying default value

    SELECT a, LEAD(a) OVER (PARTITION BY b ORDER BY C)

    FROM T;

    LAG specifying a lag of 3 rows and default value of 0

    SELECT a, LAG(a, 30) OVER (PARTITION BY b ORDER BY C)

    FROM T;

    Distinct counting for each partition

    SELECT a, COUNT(distinct a) OVER (PARTITION BY b)

    FROM T;

  • 相关阅读:
    go 从入门到精通(二)基本数据类型和操作符
    python爬虫番外篇(一)进程,线程的初步了解
    Go从入门到精通(一)go语言初识
    python爬虫从入门到放弃(九)之 实例爬取上海高级人民法院网开庭公告数据
    python爬虫从入门到放弃(八)之 Selenium库的使用
    python爬虫从入门到放弃(七)之 PyQuery库的使用
    python爬虫从入门到放弃(六)之 BeautifulSoup库的使用
    python爬虫从入门到放弃(五)之 正则的基本使用
    python爬虫从入门到放弃(三)之 Urllib库的基本使用
    openstack网络基本概念(转)
  • 原文地址:https://www.cnblogs.com/grj001/p/12225720.html
Copyright © 2011-2022 走看看