zoukankan      html  css  js  c++  java
  • 使用clickhouse实现开窗函数 row/rank_number 和 lag lead

    ROW_NUMBER实现

    如何在ClickHouse中实现ROW_NUMBER OVER 和DENSE_RANK OVER等同效果的查询,它们在一些其他数据库中可用于RANK排序。

    CH中并没有直接提供对应的开窗函数,需要利用一些特殊函数变相实现,主要会用到下面几个数组函数,它们分别是:

    arrayEnumerate
    arrayEnumerateDense
    arrayEnumerateUniq

    这些函数均接受一个数组作为输入参数,并返回数组中元素出现的位置,例如:

    SELECT    
     arrayEnumerate([10, 20, 30, 10, 40]) AS row_number,     
     arrayEnumerateDense([10, 20, 30, 10, 40]) AS dense_rank,    
     arrayEnumerateUniq([10, 20, 30, 10, 40]) AS uniq_rank
     
    ┌─row_number──┬─dense_rank──┬─uniq_rank───┐
    │ [1,2,3,4,5][1,2,3,1,4][1,1,1,2,1] │
    └─────────────┴─────────────┴─────────────┘

    数据格式:

    我们的目标是实现开窗查询:

    ROW_NUMBER() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerate
    DENSE_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateDense
    UNIQ_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateUniq

    代码如下

    SELECT     
    customer_id ,     
    groupArray(loan_dt) AS loan_dt, 
    groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
    groupArray(due_days) AS due_days,    
    groupArray(loan_id) AS loan_id,    
    arrayEnumerate(loan_id) AS row_number,     
    arrayEnumerateDense(loan_id) AS dense_rank,     
    arrayEnumerateUniq(loan_id) AS uniq_rank
    FROM (    
        SELECT   *  FROM res_report.xxx_loan     ORDER BY loan_dt ,loan_id  
    )
    GROUP BY customer_id

     数组展开,利用ARRAY JOIN将数组展开,并按照customer_id 、loan_id  列排序:

    SELECT 
    customer_id
    ,loan_id
    ,loan_dt
    ,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt
    ,due_days
    ,row_number 
    ,dense_rank
    ,uniq_rank
    from 
    (
    SELECT     
    customer_id ,     
    groupArray(loan_dt) AS loan_dt, 
    groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
    groupArray(due_days) AS due_days,    
    groupArray(loan_id) AS loan_id,    
    arrayEnumerate(loan_id) AS row_number,     
    arrayEnumerateDense(loan_id) AS dense_rank,     
    arrayEnumerateUniq(loan_id) AS uniq_rank
    FROM (    
        SELECT   *  FROM res_report.xxx_loan     ORDER BY loan_dt ,loan_id  
    )
    GROUP BY customer_id
    )
    ARRAY JOIN  
        loan_dt,
        loan_id, 
        end_date,
        due_days,
        row_number,     
        dense_rank,     
        uniq_rank
    ORDER BY     
    customer_id ASC,     
    row_number ASC ,     
    dense_rank ASC

     技巧:因为end_date可能为空值,会导致array长度不一致。报错。需要用特数值填充然后最后再转换回来。

    lag/lead实现:

    neighbor(column, offset[, default_value])
     
    The result of the function depends on the affected data blocks and the order of data in the block.
    If you make a subquery with ORDER BY and call the function from outside the subquery, you can get the expected result.
     
    Parameters
     
    column — A column name or scalar expression.
    offset — The number of rows forwards or backwards from the current row of column. Int64.
    default_value — Optional. The value to be returned if offset goes beyond the scope of the block. Type of data blocks affected.
    Returned values
     
    Value for column in offset distance from current row if offset value is not outside block bounds.
    Default value for column if offset value is outside block bounds. If default_value is given, then it will be used.
    Type: type of data blocks affected or default value type.
     
     
    参考:
    https://clickhouse.tech/docs/en/sql-reference/functions/other-functions/

    代码如下

    SELECT 
    customer_id
    ,loan_id
    ,loan_dt
    ,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt
    ,due_days
    ,row_number 
    ,dense_rank
    ,uniq_rank
    ,if(neighbor(row_number , 1)<>1,neighbor(loan_dt , 1),null)  as lead_loan_dt
    ,if(row_number<>1,neighbor(end_dt, -1),null)  as lag_end_dt
    from 
    (
    SELECT     
    customer_id ,     
    groupArray(loan_dt) AS loan_dt, 
    groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
    groupArray(due_days) AS due_days,    
    groupArray(loan_id) AS loan_id,    
    groupArray(loan_tot) AS loan_tot, 
    arrayEnumerate(loan_id) AS row_number,     
    arrayEnumerateDense(loan_id) AS dense_rank,     
    arrayEnumerateUniq(loan_id) AS uniq_rank
    FROM (    
        SELECT   *  FROM res_report.xxx_loan     ORDER BY loan_dt ,loan_id  
    )
    GROUP BY customer_id
    )
    ARRAY JOIN  
        loan_dt,
        loan_id, 
        end_date,
        due_days,
        row_number,     
        dense_rank,     
        uniq_rank
    ORDER BY     
    customer_id ASC,     
    row_number ASC ,     
    dense_rank ASC

    发现有一点问题:就是最后一行lead的时候会出现异常值1970-01-01年值的问题。 

    SELECT 
    customer_id
    ,loan_id
    ,loan_dt
    ,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt
    ,due_days
    ,row_number as row_num
    ,dense_rank
    ,uniq_rank
    ,if(neighbor(row_num , 1)>1,neighbor(loan_dt , 1),null)  as lead_loan_dt
    ,if(row_num<>1,neighbor(end_dt, -1),null)  as lag_end_dt
    from 
    (
    SELECT     
    customer_id ,     
    groupArray(loan_dt) AS loan_dt, 
    groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
    groupArray(due_days) AS due_days,    
    groupArray(loan_id) AS loan_id,    
    arrayEnumerate(loan_id) AS row_number,     
    arrayEnumerateDense(loan_id) AS dense_rank,     
    arrayEnumerateUniq(loan_id) AS uniq_rank
    FROM (    
        SELECT   *  FROM res_report.ipeso_loan     ORDER BY loan_dt ,loan_id  
    )
    GROUP BY customer_id
    )
    ARRAY JOIN  
        loan_dt,
        loan_id, 
        end_date,
        due_days,
        row_number,     
        dense_rank,     
        uniq_rank
    ORDER BY     
    customer_id ASC,     
    row_number ASC ,     
    dense_rank ASC
  • 相关阅读:
    oracle乱码问题
    fopen 參数具体解释
    图像特征提取方法:Bag-of-words
    事务应用-运行多条SQL语句
    八大排序算法总结
    svn经常使用命令具体解释(非常全,非常有用)
    android layout属性介绍
    ubuntu/linux mint 创建proc文件的三种方法(四)
    在归档模式下删除非系统文件的恢复
    解决使用DevExpress开发错误:未将对象引用设置到对象的实例
  • 原文地址:https://www.cnblogs.com/wqbin/p/14189943.html
Copyright © 2011-2022 走看看