  • Hive study notes

    1. Basic Hive SQL

    Create the test data tables:

    Article table: stores a passage of text in a single column.

    create table article (
      sentence STRING
    )
    row format delimited fields terminated by '\n';
    
    LOAD DATA LOCAL INPATH '/home/hejunhong/wc.log' OVERWRITE INTO TABLE article;

    (1) Word count with Hive

    Approach 1 (subquery + explode):
    select word, count(*)
    from (
      select explode(split(sentence, '\t')) as word from article
    ) t
    group by word;
    
    Approach 2 (lateral view explode):
    select t.word, count(t.word)
    from (
      select word
      from article
      lateral view explode(split(sentence, '\t')) a as word
    ) t
    group by t.word;
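
    Both queries produce the same word counts; the lateral view form is easier to extend with extra columns. As a minimal sketch (the table name wordcount_result is a hypothetical choice, not from the original notes), the result can be persisted:

    -- sketch: persist the word counts into a table (hypothetical table name)
    create table wordcount_result as
    select word, count(*) as cnt
    from article
    lateral view explode(split(sentence, '\t')) a as word
    group by word;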

    (2) Classic row/column transformation for analysis: unpivot columns into rows

    Sample data:
    2018-01    211    984
    2018-02    333    999
    2018-03    111    222

    Table DDL:
    create table rowtocol (
      dt_month string,
      valid_num int,
      unvalid_num int
    )
    row format delimited fields terminated by '\t';
    LOAD DATA LOCAL INPATH '/opt/data/row_col.txt' OVERWRITE INTO TABLE rowtocol;

     Required output:

    add_t.type    add_t.num
    bene_idno    211
    bene_moble    984
    bene_idno    333
    bene_moble    999
    bene_idno    111
    bene_moble    222
    select add_t.type, add_t.num
    from rowtocol a
    lateral view explode(
      str_to_map(concat('bene_idno=', valid_num, '&bene_moble=', unvalid_num), '&', '=')
    ) add_t as type, num;
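
    An alternative sketch for the same unpivot using Hive's stack() UDTF, which avoids building and re-parsing the intermediate map string (same table and column names as above):

    -- sketch: unpivot with stack() instead of explode(str_to_map(...))
    select stack(2,
                 'bene_idno', valid_num,
                 'bene_moble', unvalid_num) as (type, num)
    from rowtocol;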

    Hint for a similar case:
    Given a row like this:
    num1 num2 num3 num4 num5 num6
    100 2333 111 1223 8990 9000
    and you want to turn it into
    num1   100
    num2   ..
    num3   ..
    num4   ..
    num5   ..
    num6   9000

    you can try (a complete sketch follows below):
    lateral view explode(str_to_map(concat('num1=',num1,'&num2=',num2),'&','=')) add_t as field, num
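
    A complete sketch of that hint, assuming the six columns sit in a table named nums (hypothetical table name):

    -- sketch: unpivot num1..num6 into (field, num) pairs; the table name nums is hypothetical
    select add_t.field, add_t.num
    from nums
    lateral view explode(
      str_to_map(
        concat('num1=', num1, '&num2=', num2, '&num3=', num3,
               '&num4=', num4, '&num5=', num5, '&num6=', num6),
        '&', '=')
    ) add_t as field, num;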



    (3) Classic functions: working with time

    Sample data:
    user id | item id | rating the user gave the item | timestamp

    udata.user_id udata.item_id udata.rating udata.timestamp
    196 242 3 881250949
    186 302 3 891717742
    22 377 1 878887116
    244 51 2 880606923
    166 346 1 886397596
    298 474 4 884182806
    115 265 2 881171488
    253 465 5 891628467
    305 451 3 886324817
    6 86 3 883603013

    create table udata (
      user_id string,
      item_id string,
      rating string,
      `timestamp` string
    )
    row format delimited fields terminated by '\t';
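    The original notes do not show the load step for udata; a minimal sketch, assuming the u.data file sits at /opt/data/u.data (hypothetical path):

    LOAD DATA LOCAL INPATH '/opt/data/u.data' OVERWRITE INTO TABLE udata;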
    1. In the recommendation data, find the most recent and the earliest timestamps.
    select max(`timestamp`) as max_t, min(`timestamp`) as min_t from bigdata.udata;
    Result: max_t = 893286638, min_t = 874724710
    Use the most recent point (893286638) as the time reference.
    
    2. How many days apart are two time points?
    select (cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (60*60*24) as diff_days from udata;
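    The same day difference can also be sketched with Hive's built-in date functions, since the timestamps are unix seconds:

    -- sketch: day difference via from_unixtime + to_date + datediff
    select datediff(to_date(from_unixtime(893286638)),
                    to_date(from_unixtime(cast(`timestamp` as bigint)))) as diff_days
    from udata
    limit 10;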
    3. With 893286638 as the reference point, collect every user's purchase-time offsets (in days) to see each user's behavior over time.
    select user_id, collect_list(cast(days as int)) as day_list
    from (
      select user_id,
             (cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (60*60*24) as days
      from udata
    ) t
    group by user_id
    limit 10;
    
    
    Notes on the result:
    1. If a user's actions cluster heavily within the same day relative to the reference point, that can indicate order brushing (fake orders); a detection sketch follows the sample output below.
    2. Looking at when the data points occur can also drive data-cleaning rules.
    Sample output (user_id, day_list):
    100 [22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22]
    101    [186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186]
    102    [3,51,51,110,51,110,51,3,3,51,145,51,51,51,51,72,115,51,115,51,110,51,201,51,3,51,51,51,3,51,51,51,51,110,51,51,51,51,164,51,52,177,51,51,51,115,51,3,50,51,51,3,51,51,51,51,201,51,51,51,3,51,51,51,51,110,51,110,51,51,110,3,51,3,3,51,3,51,51,51,51,51,51,115,51,51,51,51,51,51,51,51,51,51,51,51,110,3,3,51,97,51,3,51,72,110,51,51,51,45,51,51,3,201,51,51,3,3,110,51,94,51,51,110,110,51,115,51,51,51,51,3,51,3,51,51,110,51,51,51,115,115,51,51,51,51,51,3,51,164,110,115,51,51,51,3,110,3,51,51,21,201,51,51,3,51,51,3,3,51,72,3,57,3,3,51,51,51,94,115,51,3,51,51,3,51,51,51,51,51,3,51,51,51,3,51,51,160,3,51,51,87,110,51,110,45,59,51,51,51,51,51,110,115,51,51]
    103    [148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148]
    104    [55,56,56,56,55,55,55,55,55,56,55,55,55,56,55,55,55,55,55,55,55,56,55,55,56,55,55,55,56,55,56,56,56,56,55,56,55,55,55,55,55,55,55,56,56,56,56,55,56,55,56,55,56,55,55,55,56,55,55,55,55,55,55,55,55,55,56,56,56,56,55,56,55,55,56,55,55,55,55,55,56,55,56,55,55,56,56,56,56,55,56,55,56,56,55,56,55,55,55,55,55,55,55,55,56,56,56,55,55,55,56]
    105    [47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47]
    106    [136,136,136,136,108,137,136,136,136,137,136,136,108,137,136,108,53,136,136,136,136,136,136,108,136,108,136,136,136,108,136,136,136,136,108,136,136,137,108,136,136,136,136,136,137,108,136,136,136,137,136,108,136,136,136,136,136,136,137,136,136,108,136,136]
    107    [23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23]
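
    As referenced in the note above, a minimal sketch of a cleaning rule that flags users whose actions pile up on a single day; the threshold of 20 actions is an illustrative assumption:

    -- sketch: count actions per user per day offset and flag suspicious concentrations
    select user_id, day_offset, count(*) as actions_in_day
    from (
      select user_id,
             cast((cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (60*60*24) as int) as day_offset
      from udata
    ) t
    group by user_id, day_offset
    having count(*) > 20   -- illustrative threshold
    limit 10;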
    
    4. A user has many actions; for behavior analysis, the more recent an action, the more useful it is.

    Building on the time distribution above:

    (1) Why introduce a time-decay function: if t is the distance in days from the reference time, the most recent point has t = 0 and the weight peaks at 1; the further back a record lies, the less reference value it has, so its weighted score gets smaller.

    (2) exp(x) is simply e raised to the power x.

    (3) exp(-t/2) is e raised to the power -t/2: the larger t, the smaller the value, so the weight decays exponentially from its peak exp(0) = 1 at t = 0.

    select user_id,
           sum(exp(-(cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (60*60*24) / 2) * rating) as score
    from bigdata.udata
    group by user_id
    limit 10;

    Result (user_id, score):
    1      3.26938641750186E-8
    10     1.3899514053917838E-32
    100    0.0028427420371960263
    101    4.9919669370351064E-39
    102    15.147722144199362
    103    4.771115073346258E-31
    104    2.15626106001131E-10
    105    4.541247668782543E-9
    106    1.2297890524212914E-11
    107    5.459575349110719E-4

    Interpretation: in exp(-t/2), the smaller t is (the closer an action is to the reference time), the closer its weight is to 1, so recent actions contribute the most to the sum. A larger score therefore means the user's behavior is recent and more informative. You can then, for example, take the top 100 users by this score as recommendation candidates; in this sample, user 102 stands out with by far the highest score.
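
    A sketch of turning this into a ranked candidate list, keeping the top 100 users by decayed score (the output table name top_decay_users is hypothetical):

    -- sketch: top 100 users by time-decayed rating score (hypothetical table name)
    create table top_decay_users as
    select user_id,
           sum(exp(-(cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (60*60*24) / 2) * rating) as score
    from bigdata.udata
    group by user_id
    order by score desc
    limit 100;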

    (4) Other examples: an order dataset

    1. Table and field descriptions
    aisles.csv  departments.csv  order_products__prior.csv  order_products__train.csv  orders.csv  products.csv
    1) aisles: aisle (shelf) IDs, the second-level category; a dimension table
    aisle_id,aisle
    1,prepared soups salads
    2,specialty cheeses
    3,energy granola bars
    4,instant foods
    5,marinades meat preparation
    6,other
    7,packaged meat
    8,bakery desserts
    9,pasta sauce
    
    2) departments: departments, e.g. kitchen-related categories; the first-level category; a dimension table
    department_id,department
    1,frozen
    2,other
    3,bakery
    4,produce
    5,alcohol
    6,international
    7,beverages
    8,pets
    9,dry goods pasta
    
    3) orders: the orders table (a behavior/fact table in Hive)
    eval_set: prior = historical behavior;
    train = training set (a product the test user has already bought);
    test = the set we ultimately predict on (which products a user is likely to buy)

    order_number: sequence number of this user's orders, reflecting their chronological order
    order_dow: day of week (dow) the order was placed
    order_hour_of_day: hour of the day the order was placed (0-23)
    days_since_prior_order: days between an order and the previous one (the first order has no value)
    
    order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
    2539329,1,prior,1,2,08,
    2398795,1,prior,2,3,07,15.0
    473747,1,prior,3,3,12,21.0
    2254736,1,prior,4,4,07,29.0
    431534,1,prior,5,4,15,28.0
    3367565,1,prior,6,2,07,19.0
    550135,1,prior,7,1,09,20.0
    3108588,1,prior,8,1,14,14.0
    2295261,1,prior,9,1,16,0.0
    
    4) order_products__prior (about 500 MB) and order_products__train
    One order expands into multiple product records (e.g. products 33120, 28985), like an explode
    (behavior/fact tables in Hive)
    add_to_cart_order: the position at which the product was added to the cart
    reordered: whether the product has been purchased again before (0/1 boolean)
    order_id,product_id,add_to_cart_order,reordered  
    2,33120,1,1
    2,28985,2,1
    2,9327,3,0
    2,45918,4,1
    2,30035,5,0
    2,17794,6,1
    2,40141,7,1
    2,1819,8,1
    2,43668,9,0
    
    5) products: the product catalog (a dimension table once loaded into Hive)
    product_id,product_name,aisle_id,department_id
    1,Chocolate Sandwich Cookies,61,19
    2,All-Seasons Salt,104,13
    3,Robust Golden Unsweetened Oolong Tea,94,7
    4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1
    5,Green Chile Anytime Sauce,5,13
    6,Dry Nose Oil,11,11
    7,Pure Coconut Water With Orange,98,7
    8,Cut Russet Potatoes Steam N Mash,116,1
    9,Light Strawberry Blueberry Yogurt,120,16
    
    u.data (the file behind udata in section (3)):
    user id | item id | rating | timestamp, separated by \t
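
    The queries below reference an orders table and a priors table (order_products__prior) that the notes never create; a minimal DDL sketch, assuming the CSVs keep their header row and live under /opt/data/ (hypothetical paths):

    create table orders (
      order_id string,
      user_id string,
      eval_set string,
      order_number int,
      order_dow string,
      order_hour_of_day string,
      days_since_prior_order string
    )
    row format delimited fields terminated by ','
    tblproperties ("skip.header.line.count"="1");
    LOAD DATA LOCAL INPATH '/opt/data/orders.csv' OVERWRITE INTO TABLE orders;

    create table priors (
      order_id string,
      product_id string,
      add_to_cart_order int,
      reordered int
    )
    row format delimited fields terminated by ','
    tblproperties ("skip.header.line.count"="1");
    LOAD DATA LOCAL INPATH '/opt/data/order_products__prior.csv' OVERWRITE INTO TABLE priors;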

    (1) How many orders does each user have? (orders table)

    select user_id, count(order_id) as order_cnt
    from orders
    group by user_id
    order by order_cnt desc;
    (2) On average, how many products does each user buy per order?
    Idea: total products across a user's orders / number of orders
    (order_id, product_id come from the order_products tables)

    Example: I place 2 orders today, one with 10 products and one with 4 products:
    (10 + 4) / 2 = 7
      a. First, use the priors table to count how many products each order contains

      select order_id, count(1) as prod_cnt
      from priors
      group by order_id
      limit 10;
      
      b. Join priors with orders on order_id to attach each order's product count to its user
      (the inner limit 10000 only keeps a sample of orders while developing)
      select user_id, prod_cnt
      from orders od
      join (
        select order_id, count(1) as prod_cnt
        from priors
        group by order_id
        limit 10000
      ) pro
      on od.order_id = pro.order_id
      limit 10;
      
      c. Sum up: how many products each user has bought in total
      select user_id, sum(prod_cnt) as sum_prods
      from orders od
      join (
        select order_id, count(1) as prod_cnt
        from priors
        group by order_id
        limit 10000
      ) pro
      on od.order_id = pro.order_id
      group by user_id
      limit 10;
      
      d. Compute the average
      select user_id,
             sum(prod_cnt) / count(1) as sc_prod,
             avg(prod_cnt) as avg_prod
      from (select * from orders where eval_set = 'prior') od   -- restrict to prior orders; otherwise non-prior orders are counted as 0
      join (
        select order_id, count(1) as prod_cnt
        from priors
        group by order_id
        limit 10000
      ) pro
      on od.order_id = pro.order_id
      group by user_id
      limit 10;
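
      The inner limit 10000 above only samples data while developing; a sketch of the full computation persisted into a table (the table name user_avg_products is hypothetical):

      -- sketch: average products per order for every user (hypothetical table name)
      create table user_avg_products as
      select od.user_id,
             avg(pro.prod_cnt) as avg_prod
      from (select order_id, user_id from orders where eval_set = 'prior') od
      join (
        select order_id, count(1) as prod_cnt
        from priors
        group by order_id
      ) pro
      on od.order_id = pro.order_id
      group by od.user_id;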
     (3) Distribution of each user's orders across the days of the week (pivot rows into columns)
      user_id  dow0  dow1  dow2  dow3  dow4 ... dow6
        1       0     0     1     2     2        0
      select 
      user_id,
      sum(case order_dow when '0' then 1 else 0 end) as dow0,
      sum(case order_dow when '1' then 1 else 0 end) as dow1,
      sum(case order_dow when '2' then 1 else 0 end) as dow2,
      sum(case order_dow when '3' then 1 else 0 end) as dow3,
      sum(case order_dow when '4' then 1 else 0 end) as dow4,
      sum(case order_dow when '5' then 1 else 0 end) as dow5,
      sum(case order_dow when '6' then 1 else 0 end) as dow6
      from orders
      group by user_id
      limit 10;
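
      The same pivot can be written a little more compactly with Hive's if() function; a sketch:

      -- sketch: day-of-week pivot using if() instead of case when
      select user_id,
             sum(if(order_dow = '0', 1, 0)) as dow0,
             sum(if(order_dow = '1', 1, 0)) as dow1,
             sum(if(order_dow = '2', 1, 0)) as dow2,
             sum(if(order_dow = '3', 1, 0)) as dow3,
             sum(if(order_dow = '4', 1, 0)) as dow4,
             sum(if(order_dow = '5', 1, 0)) as dow5,
             sum(if(order_dow = '6', 1, 0)) as dow6
      from orders
      group by user_id
      limit 10;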

     
