zoukankan      html  css  js  c++  java
  • Spark sql -- Spark sql中的窗口函数和对应的api

    一、窗口函数种类

    1. ranking 排名类
    2. analytic 分析类
    3. aggregate 聚合类
    Function Type SQL DataFrame API Description
     Ranking  rank   rank rank值可能是不连续的
     Ranking  dense_rank  denseRank rank值一定是连续的
     Ranking  percent_rank   percentRank 相同的分组中 (rank -1) / ( count(score) - 1 )
     Ranking  ntile  ntile 将同一组数据循环的往n个桶中放,返回对应的桶的index,index从1开始
     Ranking  row_number  rowNumber 很单纯的行号,类似excel的行号
     Analytic   cume_dist  cumeDist  
     Analytic   first_value   firstValue 相同的分组中最小值
     Analytic   last_value   lastValue 相同的分组中最大值
     Analytic   lag  lag 取前n行数据
     Analytic   lead  lead 取后n行数据
     Aggregate   min min 最小值
     Aggregate   max max 最大值
     Aggregate   sum sum 求和
     Aggregate   avg avg 求平均

    二、具体用法如下

    count(...) over(partition by ... order by ...)--求分组后的总数。
    sum(...) over(partition by ... order by ...)--求分组后的和。
    max(...) over(partition by ... order by ...)--求分组后的最大值。
    min(...) over(partition by ... order by ...)--求分组后的最小值。
    avg(...) over(partition by ... order by ...)--求分组后的平均值。
    rank() over(partition by ... order by ...)--rank值可能是不连续的。
    dense_rank() over(partition by ... order by ...)--rank值是连续的。
    first_value(...) over(partition by ... order by ...)--求分组内的第一个值。
    last_value(...) over(partition by ... order by ...)--求分组内的最后一个值。
    lag() over(partition by ... order by ...)--取出前n行数据。  
    lead() over(partition by ... order by ...)--取出后n行数据。
    ratio_to_report() over(partition by ... order by ...)--Ratio_to_report() 括号中就是分子,over() 括号中就是分母。
    percent_rank() over(partition by ... order by ...)--

    三、实际例子

    案例数据:/root/score.json/score.json,学生名字、课程、分数

    {"name":"A","lesson":"Math","score":100}
    {"name":"B","lesson":"Math","score":100}
    {"name":"C","lesson":"Math","score":99}
    {"name":"D","lesson":"Math","score":98}
    {"name":"A","lesson":"E","score":100}
    {"name":"B","lesson":"E","score":99}
    {"name":"C","lesson":"E","score":99}
    {"name":"D","lesson":"E","score":98}
    select
    name,lesson,score,
    ntile(2) over (partition by lesson order by score desc ) as ntile_2,
    ntile(3) over (partition by lesson order by score desc ) as ntile_3,
    row_number() over (partition by lesson order by score desc ) as row_number,
    rank() over (partition by lesson order by score desc ) as rank,
    dense_rank() over (partition by lesson order by score desc ) as dense_rank, 
    percent_rank() over (partition by lesson order by score desc ) as percent_rank 
    from score 
    order by lesson,name,score

    输出结果完全一样,如下表所示

    namelessonscorentile_2ntile_3row_numberrankdense_rankpercent_rank
    A E 100 1 1 1 1 1 0.0
    B E 99 1 1 2 2 2 0.3333333333333333
    C E 99 2 2 3 2 2 0.3333333333333333
    D E 98 2 3 4 4 3 1.0
    A Math 100 1 1 1 1 1 0.0
    B Math 100 1 1 2 1 1 0.0
    C Math 99 2 2 3 3 2 0.6666666666666666
    D Math 98 2 3 4 4 3 1.0

    参考:

    spark sql中的窗口函数

    over(partition by) 函数

    =================================================================================

    原创文章,转载请务必将下面这段话置于文章开头处(保留超链接)。
    本文转发自程序媛说事儿,原文链接https://www.cnblogs.com/abc8023/p/10910741.html

    =================================================================================

  • 相关阅读:
    百度图片
    在线人数统计
    mysql简易导入excel
    asp.net 导出excel带图片
    C# 正则验证
    js生成随机数
    YQL获取天气
    取html里的img和去html标签
    客户端信息获得《转》
    使用ASP.NET上传图片汇总
  • 原文地址:https://www.cnblogs.com/abc8023/p/10910741.html
Copyright © 2011-2022 走看看