zoukankan      html  css  js  c++  java
  • clickhouse-(09)- row_number() over (partition by)的几种实现方法

    ClickHouse row_number() over (partition by)的几种实现方法

    hive中有row_number() over (partition by)函数,可以一句SQL实现想要的排序,在ClickHouse中有很多种实现方式,本篇就介绍一下几种方法。
    

    目录

    1.row_number排序

    2.row_number排序后取出rank=1的结果

    3.特殊场景

    1.row_number排序

    HIVE中写法:

    select number,
           row_number() over (partition by number order by time desc) as rank
      from table a
     GROUP BY number
    

    ClickHouse写法:

    select number,
           groupArray(time) AS arr_val,
           arrayEnumerate(arr_val) as row_number
      from (select distinct orderid as number,
                   toDate(operatetime) as time
              from table
             order by time desc
             ) a
     GROUP BY number
    

    2.row_number排序后取出rank=1的结果

    hive写法:

    select orderid
      from (select orderid,
                   row_number() over(partition by orderid order by datachange_lasttime desc) as row_num
              from table
             where d = '${CurrentDate}'
             ) a
     where row_num = 1;
    

    ClickHouse写法:

    方法1:利用groupArray

    select orderid, 
           groupArray(1)(datachange_lasttime) as dates
      from (select orderid, 
                   datachange_lasttime
              from table
             ORDER BY orderid, datachange_lasttime desc
            ) a
     group by orderid
    

    方法2:利用max函数实现倒序,如果正序使用min函数即可

    select orderid,
           max(datachange_lasttime) as datachange_lasttime
      from table
     group by orderid
    

    方法3:利用rowNumberInAllBlocks函数

    select orderid, status
      from (select orderid, status, rowNumberInAllBlocks() as rank
              from (select orderid, status, datachange_lasttime
                      from table
                     order by orderid, datachange_lasttime desc
                     ) a
           ) b LIMIT 1 BY orderid
    

    方法4:利用arrayEnumerate函数

    select orderid
      from (select orderid,
                   groupArray(datachange_lasttime) AS arr_val,
                   arrayEnumerate(arr_val) as row_number
              from (select orderid, datachange_lasttime
                      from table
                     order by datachange_lasttime desc
                     ) a
             GROUP BY number
             ) b
     where row_number = 1
    

    3.特殊场景
    要求:

    对于以下场景,需要按照orderid分组,按照日期倒序,取最新一条,若日期一致,则随机取一条作为结果即可

    hive写法:

    select orderid  from (select orderid,               status,               row_number() over(partition by orderid order by datachange_lasttime desc) as row_num          from table         where d = '${CurrentDate}'         ) as b where row_num = 1
    

    ClickHouse写法:

    通过上面的案例,我们很容易想到,把上面的结果作为一个子表,与原表进行关联,只是这样关联,随便举一个关联的写法:

    select a.orderid as orderid_a, a.status as status  from olap_htlmaindb.tmp_ord_orders_status_s_pre a inner join (select orderid, groupArray(1)(datachange_lasttime) as dates               from (select orderid, datachange_lasttime                       from table                      ORDER BY orderid, datachange_lasttime desc                     ) a              group by orderid) b    on a.orderid = b.orderid   and cast(a.datachange_lasttime as String) = cast(b.dates [ 1 ] as String) 
    

    这里我们是先把符合要求的orderid和时间取出来,再回去关联,取出需要的列,因为这些函数都有一个缺点是只能有partition by的字段和排序字段,不能有其他字段,所以要返回关联,所以上面四种方法,ininer join原表,都不能解决上面案例的问题。

    这里就想到了LIMIT 1 BY这个方法,这个方法其实是最有效的,如下:

    select orderid,        status,        datachange_lasttime  from table order by orderid, datachange_lasttime desc  LIMIT 1 BY orderid
    
  • 相关阅读:
    SQL SERVER 性能优化四: 创建分区表
    SQL SERVER 性能优化一: 数据库中几百万数据查询优化
    SQL SERVER 性能优化二: 数据库初始值大小及增长方式设置
    C++连接SQL SERVER 数据库方式
    利用作业定时收缩SQL SERVER数据库
    jQuery easyUI datagrid 增加求和统计行
    jquery easyui DataGrid 数据表格 属性
    jQuery easyui中获取datagrid某一列的值之和
    Maven + Springboot + redis 配置
    mac os Catalina beta Jetbrains idea闪退解决方案
  • 原文地址:https://www.cnblogs.com/weijiqian/p/14853264.html
Copyright © 2011-2022 走看看