ROW_NUMBER实现
如何在ClickHouse中实现ROW_NUMBER OVER 和DENSE_RANK OVER等同效果的查询,它们在一些其他数据库中可用于RANK排序。
CH中并没有直接提供对应的开窗函数,需要利用一些特殊函数变相实现,主要会用到下面几个数组函数,它们分别是:
arrayEnumerate
arrayEnumerateDense
arrayEnumerateUniq
这些函数均接受一个数组作为输入参数,并返回数组中元素出现的位置,例如:
SELECT arrayEnumerate([10, 20, 30, 10, 40]) AS row_number, arrayEnumerateDense([10, 20, 30, 10, 40]) AS dense_rank, arrayEnumerateUniq([10, 20, 30, 10, 40]) AS uniq_rank ┌─row_number──┬─dense_rank──┬─uniq_rank───┐ │ [1,2,3,4,5] │ [1,2,3,1,4] │ [1,1,1,2,1] │ └─────────────┴─────────────┴─────────────┘
数据格式:
我们的目标是实现开窗查询:
ROW_NUMBER() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerate
DENSE_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateDense
UNIQ_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateUniq
代码如下
SELECT customer_id , groupArray(loan_dt) AS loan_dt, groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date, groupArray(due_days) AS due_days, groupArray(loan_id) AS loan_id, arrayEnumerate(loan_id) AS row_number, arrayEnumerateDense(loan_id) AS dense_rank, arrayEnumerateUniq(loan_id) AS uniq_rank FROM ( SELECT * FROM res_report.xxx_loan ORDER BY loan_dt ,loan_id ) GROUP BY customer_id
数组展开,利用ARRAY JOIN将数组展开,并按照customer_id 、loan_id 列排序:
SELECT customer_id ,loan_id ,loan_dt ,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt ,due_days ,row_number ,dense_rank ,uniq_rank from ( SELECT customer_id , groupArray(loan_dt) AS loan_dt, groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date, groupArray(due_days) AS due_days, groupArray(loan_id) AS loan_id, arrayEnumerate(loan_id) AS row_number, arrayEnumerateDense(loan_id) AS dense_rank, arrayEnumerateUniq(loan_id) AS uniq_rank FROM ( SELECT * FROM res_report.xxx_loan ORDER BY loan_dt ,loan_id ) GROUP BY customer_id ) ARRAY JOIN loan_dt, loan_id, end_date, due_days, row_number, dense_rank, uniq_rank ORDER BY customer_id ASC, row_number ASC , dense_rank ASC
技巧:因为end_date可能为空值,会导致array长度不一致。报错。需要用特数值填充然后最后再转换回来。
lag/lead实现:
neighbor(column, offset[, default_value]) The result of the function depends on the affected data blocks and the order of data in the block. If you make a subquery with ORDER BY and call the function from outside the subquery, you can get the expected result. Parameters column — A column name or scalar expression. offset — The number of rows forwards or backwards from the current row of column. Int64. default_value — Optional. The value to be returned if offset goes beyond the scope of the block. Type of data blocks affected. Returned values Value for column in offset distance from current row if offset value is not outside block bounds. Default value for column if offset value is outside block bounds. If default_value is given, then it will be used. Type: type of data blocks affected or default value type. 参考: https://clickhouse.tech/docs/en/sql-reference/functions/other-functions/
代码如下
SELECT customer_id ,loan_id ,loan_dt ,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt ,due_days ,row_number ,dense_rank ,uniq_rank ,if(neighbor(row_number , 1)<>1,neighbor(loan_dt , 1),null) as lead_loan_dt ,if(row_number<>1,neighbor(end_dt, -1),null) as lag_end_dt from ( SELECT customer_id , groupArray(loan_dt) AS loan_dt, groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date, groupArray(due_days) AS due_days, groupArray(loan_id) AS loan_id, groupArray(loan_tot) AS loan_tot, arrayEnumerate(loan_id) AS row_number, arrayEnumerateDense(loan_id) AS dense_rank, arrayEnumerateUniq(loan_id) AS uniq_rank FROM ( SELECT * FROM res_report.xxx_loan ORDER BY loan_dt ,loan_id ) GROUP BY customer_id ) ARRAY JOIN loan_dt, loan_id, end_date, due_days, row_number, dense_rank, uniq_rank ORDER BY customer_id ASC, row_number ASC , dense_rank ASC
发现有一点问题:就是最后一行lead的时候会出现异常值1970-01-01年值的问题。
SELECT customer_id ,loan_id ,loan_dt ,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt ,due_days ,row_number as row_num ,dense_rank ,uniq_rank ,if(neighbor(row_num , 1)>1,neighbor(loan_dt , 1),null) as lead_loan_dt ,if(row_num<>1,neighbor(end_dt, -1),null) as lag_end_dt from ( SELECT customer_id , groupArray(loan_dt) AS loan_dt, groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date, groupArray(due_days) AS due_days, groupArray(loan_id) AS loan_id, arrayEnumerate(loan_id) AS row_number, arrayEnumerateDense(loan_id) AS dense_rank, arrayEnumerateUniq(loan_id) AS uniq_rank FROM ( SELECT * FROM res_report.ipeso_loan ORDER BY loan_dt ,loan_id ) GROUP BY customer_id ) ARRAY JOIN loan_dt, loan_id, end_date, due_days, row_number, dense_rank, uniq_rank ORDER BY customer_id ASC, row_number ASC , dense_rank ASC