zoukankan      html  css  js  c++  java
  • hive 之 行转列两种方法

    一、方法1:with

    2017年月度 各店铺 单客户平均消费(店铺消费/店铺unique客户数量)
    环比:每个月和上一个月比

    with
    tab as (
        select 
            date,store_id,avgCusPrice-lastAvgPrice priceDiff
        from
        (
            select 
                date,store_id,avgCusPrice,
                lag(avgCusPrice,1,0) over(partition by store_id order by date) lastAvgPrice
            from 
            (
                select totalprice/cusCount avgCusPrice,date,store_id
                from 
                (
                    select count(distinct customer_name) cusCount,sum(price) totalprice,date,store_id
                    from 
                    (
                        select customer_name,price
                        ,from_unixtime(unix_timestamp(date,'yyyy-MM-dd'),'yyyy-MM') date,store_id
                        from transaction_details
                    )T
                    group by date,store_id
                )T 
            )T 
        )T 
    )
    select 
        t4.store_id,
        t4.priceDiff as diff4,
        t5.priceDiff as diff5,
        t6.priceDiff as diff6
    from (
        select 
            store_id,priceDiff
        from tab
        where
            date = "2017-04"
    )t4
    inner join tab t5 on t4.store_id=t5.store_id and t5.date="2017-05"
    inner join tab t6 on t4.store_id=t6.store_id and t6.date="2017-06"

    二、方法2:case when

    例1

    select store_id,
    max(case date when'2017-04' then priceDiff else null end) as `4月diff`,
    max(case date when'2017-05' then priceDiff else null end) as `5月diff`,
    max(case date when'2017-06' then priceDiff else null end) as `6月diff`,
    max(case date when'2017-07' then priceDiff else null end) as `7月diff`
    from 
    (
    select 
    date,store_id,avgCusPrice-lastAvgPrice priceDiff
    from
    (
    select 
    date,store_id,avgCusPrice,
    lag(avgCusPrice,1,0) over(partition by store_id order by date) lastAvgPrice
    from 
    (
    select totalprice/cusCount avgCusPrice,date,store_id
    from 
    (
    select count(distinct customer_name) cusCount,sum(price) totalprice,date,store_id
    from 
    (
    select customer_name,price
    ,from_unixtime(unix_timestamp(date,'yyyy-MM-dd'),'yyyy-MM') date,store_id
    from transaction_details
    )T
    group by date,store_id
    )T 
    )T 
    )T 
    )T 
    group by store_id

    例2 

    stu_id, question_id, score分为 1 0 0.5,对应 right,error,half。需要根据stu_id求他们每人right,error,half的题目集合

    思路1

    select
    student_id,
    concat_ws('',collect_list(t.right)) right,
    concat_ws('',collect_list(t.half)) half,
    concat_ws('',collect_list(t.error)) error
    from 
    (
    select
    student_id,
    case when score=1.0 then concat_ws(",",collect_list(question_id)) else null end right,
    case when score=0.5 then concat_ws(",",collect_list(question_id)) else null end half,
    case when score=0.0 then concat_ws(",",collect_list(question_id)) else null end error
    from ex_exam_record
    group by student_id,score
    ) t
    group by student_id;

    思路2

    select 
    student_id,
    concat_ws(',',collect_list(`right`)) as `right`,
    concat_ws(',',collect_list(`half`)) as `half`,
    concat_ws(',',collect_list(`error`)) as `error`
    from 
    (
    select 
    student_id,
    max(case score when 0 then question_id else null end) as `error`,
    max(case score when 1 then question_id else null end) as `right`,
    max(case score when 0.5 then question_id else null end) as `half`
    from ex_exam_record 
    group by student_id,question_id
    )T
    group by student_id

    select 
    ddd.d_date,
    sum(case when datediff('2018-10-20',ddd.d_date)<=1 then 1 else 0 end) two_count
    from dw_sales_source.dwd_fact_sales_order dfo
    inner join dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
    inner join dwd_dim_customer ddc on dfo.customer_sk = ddc.customer_sk
    inner join dwd_dim_product ddp on dfo.product_sk = ddp.product_sk
    where ddd.d_date>='2018-10-19' and ddd.d_date<='2018-10-20'
    group by 
    ddd.d_date;
    
    select 
    ddd.d_date,
    case when datediff('2018-10-20',ddd.d_date)<=1 then 1 else 0 end two_count # 近两天的order次数
    from dw_sales_source.dwd_fact_sales_order dfo
    inner join dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
    inner join dwd_dim_customer ddc on dfo.customer_sk = ddc.customer_sk
    inner join dwd_dim_product ddp on dfo.product_sk = ddp.product_sk
    where ddd.d_date>='2018-10-19' and ddd.d_date<='2018-10-20'
    group by 
    ddd.d_date;

    case when 后的d_date在group by 后面,“10-19 1”“10-19 1”分组加起来“10-19 出现次数”

    思路3

    只groupby stuid,是否就能够分类出对应的question_id集合?

    答:不行

    三、方法3:Lateral view explode

    val friends = spark.read.format("csv").option("header","true").load("hdfs://192.168.56.111:9000/party/data/user_friends.csv")

    方法1:lateral view explode之 hive   

    friends.createOrReplaceTempView("friends")
    spark.sql("""
    select 
    count(1)
    from
    (
    select distinct user,friends_1
    from friends 
    LATERAL VIEW explode(split(friends,' '))friends as friends_1 
    where friends_1 is not null and friends_1 !="" // hive在split(空格)的时候会把两个空格中间视为""
    )T
    """).show()
    >>> 30386387

    方法2:explode之 sparkSQL-API

    friends.select($"user".alias("user_id"),explode(split($"friends"," ")).alias("friends_1")).filter("friends_1 is not null").distinct.count
    >>> 30386387

    结果都是:

  • 相关阅读:
    如何设计一个安全的对外接口?
    Java架构师必须知道的 6 大设计原则
    Spring Boot 如何快速改造老项目?
    Java 类中可以覆盖静态方法吗?
    Dubbo 的心跳设计,值得学习!
    在滴滴和头条干了 2 年后端开发,太真实…
    别乱提交代码了,看下大厂 Git 提交规范是怎么做的!
    3种骚操作,教你查看 Java 字节码!
    sysbench测试服务器性能
    mysql DDL时出现的锁等待状态
  • 原文地址:https://www.cnblogs.com/sabertobih/p/13589760.html
Copyright © 2011-2022 走看看