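1. Basic Hive SQL
Create the test tables.
Article table: stores a passage of text in a single column.
-- A tab delimiter is assumed here; with ' ' as the field delimiter only the first word of each line would be loaded into sentence.
create table article (
  sentence STRING
) row format delimited fields terminated by '\t';
LOAD DATA LOCAL INPATH '/home/hejunhong/wc.log' OVERWRITE INTO TABLE article;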
(1) Word count in Hive
1. Using explode in a subquery:
select word, count(*) as cnt
from (
  select explode(split(sentence, ' ')) as word
  from article
) t
group by word;
2. Using lateral view:
select t.word, count(t.word) as cnt
from (
  select word
  from article
  lateral view explode(split(sentence, ' ')) a as word
) t
group by t.word;
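To check the logic of the word count outside Hive, here is a minimal Python sketch of the same three steps (split, explode, group by). The two sentences are made-up sample rows, not the contents of wc.log.

```python
from collections import Counter

# Hypothetical rows of the article table (one sentence per row).
sentences = ["hello hive hello", "hive sql"]

# explode(split(sentence, ' ')): one output row per word.
words = [word for sentence in sentences for word in sentence.split(" ")]

# group by word + count(*).
word_counts = Counter(words)

for word, cnt in sorted(word_counts.items()):
    print(word, cnt)
```

Both Hive queries above should produce the same word/count pairs for the same input, because lateral view explode and explode in a subquery generate the same one-row-per-word intermediate table.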
(2) Classic columns-to-rows (unpivot) transformation
Sample data:
2018-01 211 984
2018-02 333 999
2018-03 111 222
Table creation SQL:
create table rowtocol (
  dt_month string,
  valid_num int,
  unvalid_num int
) row format delimited fields terminated by ' ';
LOAD DATA LOCAL INPATH '/opt/data/row_col.txt' OVERWRITE INTO TABLE rowtocol;
The data must be converted into the following form:
add_t.type add_t.num
bene_idno 211
bene_moble 984
bene_idno 333
bene_moble 999
bene_idno 111
bene_moble 222
Query:
select add_t.type, add_t.num
from rowtocol a
lateral view explode(
  str_to_map(concat('bene_idno=', valid_num, '&bene_moble=', unvalid_num), '&', '=')
) add_t as type, num;
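Hint:
Suppose a row looks like this:
num1 num2 num3 num4 num5 num6
100 2333 111 1223 8990 9000
and it should become:
num1 100
num2 ..
num3 ..
num4 ..
num5 ..
num6 9000
You can try the following, extending the concat to cover every column:
lateral view explode(str_to_map(concat('num1=',num1,'&num2=',num2,'&num3=',num3,'&num4=',num4,'&num5=',num5,'&num6=',num6),'&','=')) add_t as field,num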
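The concat + str_to_map + explode trick can be traced step by step in plain Python. This is only a sketch: the str_to_map function below is a hypothetical re-implementation that imitates the Hive built-in of the same name, and the row is the example row from the hint.

```python
def str_to_map(text, pair_delim="&", kv_delim="="):
    """Imitate Hive's str_to_map: split into pairs, then each pair into key and value."""
    result = {}
    for pair in text.split(pair_delim):
        key, value = pair.split(kv_delim, 1)
        result[key] = value
    return result

# The single wide row from the hint.
row = {"num1": 100, "num2": 2333, "num3": 111, "num4": 1223, "num5": 8990, "num6": 9000}

# concat('num1=', num1, '&num2=', num2, ...): encode all columns into one string.
encoded = "&".join(f"{name}={value}" for name, value in row.items())

# lateral view explode(map): one (field, num) row per map entry.
unpivoted = list(str_to_map(encoded).items())

for field, num in unpivoted:
    print(field, num)
```

Note that the values come back as strings, which matches Hive, where str_to_map returns a map of string to string; cast num back to int if arithmetic is needed afterwards.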
(3) Classic functions: time calculations
Sample data:
Columns: user ID, item ID, rating given to the item, time (Unix timestamp in seconds)
udata.user_id udata.item_id udata.rating udata.timestamp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
create table udata (
  user_id string,
  item_id string,
  rating string,
  `timestamp` string
) row format delimited fields terminated by '\t';
1. In the recommendation data, find the most recent and the earliest timestamps:
select max(`timestamp`) as max_t, min(`timestamp`) as min_t from bigdata.udata;
Result: 893286638 874724710
The most recent timestamp, 893286638, is used as the reference point.
2. Compute how many days lie between each record and the reference point:
select (cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (60*60*24) as diff_days
from udata;
3. Inspect when each user acted. With 893286638 as the reference point, list for every user how many days before it each record was made, which shows the user's purchase frequency:
select user_id, collect_list(cast(days as int)) as day_list
from (
  select user_id,
         (cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (60*60*24) as days
  from udata
) t
group by user_id
limit 10;
What the result shows:
(1) If a user's records are heavily concentrated on the same day relative to the reference point, this can be used to judge whether the user is placing fake orders (order brushing).
(2) The time distribution of the data can be used to define data-cleaning rules.
100 [22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22]
101 [186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186]
102 [3,51,51,110,51,110,51,3,3,51,145,51,51,51,51,72,115,51,115,51,110,51,201,51,3,51,51,51,3,51,51,51,51,110,51,51,51,51,164,51,52,177,51,51,51,115,51,3,50,51,51,3,51,51,51,51,201,51,51,51,3,51,51,51,51,110,51,110,51,51,110,3,51,3,3,51,3,51,51,51,51,51,51,115,51,51,51,51,51,51,51,51,51,51,51,51,110,3,3,51,97,51,3,51,72,110,51,51,51,45,51,51,3,201,51,51,3,3,110,51,94,51,51,110,110,51,115,51,51,51,51,3,51,3,51,51,110,51,51,51,115,115,51,51,51,51,51,3,51,164,110,115,51,51,51,3,110,3,51,51,21,201,51,51,3,51,51,3,3,51,72,3,57,3,3,51,51,51,94,115,51,3,51,51,3,51,51,51,51,51,3,51,51,51,3,51,51,160,3,51,51,87,110,51,110,45,59,51,51,51,51,51,110,115,51,51]
103 [148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148]
104 [55,56,56,56,55,55,55,55,55,56,55,55,55,56,55,55,55,55,55,55,55,56,55,55,56,55,55,55,56,55,56,56,56,56,55,56,55,55,55,55,55,55,55,56,56,56,56,55,56,55,56,55,56,55,55,55,56,55,55,55,55,55,55,55,55,55,56,56,56,56,55,56,55,55,56,55,55,55,55,55,56,55,56,55,55,56,56,56,56,55,56,55,56,56,55,56,55,55,55,55,55,55,55,55,56,56,56,55,55,55,56]
105 [47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47]
106 [136,136,136,136,108,137,136,136,136,137,136,136,108,137,136,108,53,136,136,136,136,136,136,108,136,108,136,136,136,108,136,136,136,136,108,136,136,137,108,136,136,136,136,136,137,108,136,136,136,137,136,108,136,136,136,136,136,136,137,136,136,108,136,136]
107 [23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23]
4. A user has many behavior records, and for behavior analysis the most recent records are the most useful.
Building on the time distribution above:
(1) Purpose of the time-decay function: let t be the number of days between a record and the reference time. The most recent record has a distance of 0 days and gets the peak weight of 1. The further back a record lies, the lower its reference value, so after multiplying the rating by the weight its score is lower.
(2) exp(x) is e raised to the power x.
(3) exp(-t/2) is e raised to the power -t/2. The larger t is, the smaller the value. The curve is an exponential decay whose highest point, e to the power 0 (that is, 1), is reached at t = 0.
[Figure: curve of the decay function exp(-t/2)]
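select user_id,
       sum(exp(-(cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (60*60*24) / 2) * rating) as score
from bigdata.udata
group by user_id
limit 10;
Interpreting the result: exp(-t/2) equals 1 when t is 0 (the record lies at the reference time) and shrinks toward 0 as t grows, so recent ratings contribute the most to the sum. The larger the summed value, the more reference value the user's behavior data has. In the end you might, for example, select the top 100 users by score for recommendation. Among the users shown below, user 102 has the highest score, followed by user 100.
Result:
1 3.26938641750186E-8
10 1.3899514053917838E-32
100 0.0028427420371960263
101 4.9919669370351064E-39
102 15.147722144199362
103 4.771115073346258E-31
104 2.15626106001131E-10
105 4.541247668782543E-9
106 1.2297890524212914E-11
107 5.459575349110719E-4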
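The decay weighting is easier to sanity-check with a few numbers. The sketch below mirrors the query's arithmetic in Python; the function names are illustrative and not part of Hive, and the reference timestamp is the max(`timestamp`) value reported earlier.

```python
import math

REFERENCE_TS = 893286638        # most recent timestamp, used as the reference point
SECONDS_PER_DAY = 60 * 60 * 24

def days_before_reference(ts):
    """Days between a record's Unix timestamp and the reference point."""
    return (REFERENCE_TS - ts) / SECONDS_PER_DAY

def decay_weight(days):
    """exp(-t/2): 1 at t = 0, decreasing toward 0 as t grows."""
    return math.exp(-days / 2)

def user_score(events):
    """Sum of decay-weighted ratings; events is an iterable of (timestamp, rating)."""
    return sum(decay_weight(days_before_reference(ts)) * rating for ts, rating in events)

# A rating made at the reference time keeps its full value;
# one made 22 days earlier is weighted by exp(-11), roughly 1.67e-5.
print(decay_weight(0), decay_weight(22))
```

This is consistent with the result list: user 100's records all lie 22 days before the reference point, so that user's score is tiny, while user 102 has many records only 3 days old and scores far higher. The divisor 2 controls how fast the weight decays; a larger divisor would let older records count for more.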
4. Other cases
1. Table field descriptions
Data files: aisles.csv departments.csv order_products__prior.csv order_products__train.csv orders.csv products.csv
1) aisles: the aisle, i.e. the shelf number (second-level category); dimension table
aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
3,energy granola bars
4,instant foods
5,marinades meat preparation
6,other
7,packaged meat
8,bakery desserts
9,pasta sauce
2) departments: the department, e.g. kitchen goods (first-level category); dimension table
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
3) orders: the order table (a behavior table in Hive)
eval_set: prior = historical behavior; train = training data (one of the products already purchased by a user in test); test = the data set we ultimately want to predict, i.e. which products each user is likely to buy
order_number: the sequence number of this user's order, reflecting the order in which the orders were placed
order_dow: day of week on which the order was placed
order_hour_of_day: hour of the day (24-hour clock)
days_since_prior_order: number of days between this order and the user's previous order (note: empty for the first order)
order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
2254736,1,prior,4,4,07,29.0
431534,1,prior,5,4,15,28.0
3367565,1,prior,6,2,07,19.0
550135,1,prior,7,1,09,20.0
3108588,1,prior,8,1,14,14.0
2295261,1,prior,9,1,16,0.0
4) order_products__prior (500M) and order_products__train: the products of each order, e.g. one order's records (33120, 28985) exploded into one row per product (behavior tables in Hive)
add_to_cart_order: the position at which the product was added to the cart
reordered: whether the product was purchased again (boolean)
order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
2,45918,4,1
2,30035,5,0
2,17794,6,1
2,40141,7,1
2,1819,8,1
2,43668,9,0
5) products: lives in the database (a dimension table if loaded into Hive)
product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7
4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1
5,Green Chile Anytime Sauce,5,13
6,Dry Nose Oil,11,11
7,Pure Coconut Water With Orange,98,7
8,Cut Russet Potatoes Steam N Mash,116,1
9,Light Strawberry Blueberry Yogurt,120,16
u.data: user id | item id | rating | timestamp, tab-delimited (\t)
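2. Example queries
(1) How many orders does each user have (orders table)?
select user_id, count(order_id) as order_cnt
from orders
group by user_id
order by order_cnt desc;
(2) How many products does each user buy per order on average?
Using order_id and product_id from the prior/train tables: total number of products in the user's orders / number of orders. For example, if I placed 2 orders today, one with 10 products and the other with 4, the average is (10 + 4) / 2 = 7.
a. First use the priors table to count the products in each order (10 and 4 in the example):
select order_id, count(1) as prod_cnt
from priors
group by order_id
limit 10;
b. Join priors with orders on order_id to bring the per-order product count to each user (matching product counts with user_id):
-- limit 10000 restricts the subquery to 10000 orders; remove it for complete results
select user_id, prod_cnt
from orders od
join (
  select order_id, count(1) as prod_cnt
  from priors
  group by order_id
  limit 10000
) pro on od.order_id = pro.order_id
limit 10;
c. Sum: how many products each user bought in total:
select user_id, sum(prod_cnt) as sum_prods
from orders od
join (
  select order_id, count(1) as prod_cnt
  from priors
  group by order_id
  limit 10000
) pro on od.order_id = pro.order_id
group by user_id
limit 10;
d. Average:
-- orders whose eval_set is not 'prior' would be counted as 0, so keep only prior orders
select user_id,
       sum(prod_cnt) / count(1) as sc_prod,
       avg(prod_cnt) as avg_prod
from (select * from orders where eval_set = 'prior') od
join (
  select order_id, count(1) as prod_cnt
  from priors
  group by order_id
  limit 10000
) pro on od.order_id = pro.order_id
group by user_id
limit 10;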
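The count, join, and average steps can be traced on a tiny data set. The following Python sketch uses made-up rows (users u1 and u2 and their orders are hypothetical) chosen to reproduce the (10 + 4) / 2 = 7 example from the text.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows.
orders = [(1, "u1"), (2, "u1"), (3, "u2")]            # (order_id, user_id)
priors = ([(1, p) for p in range(10)]                 # order 1 has 10 products
          + [(2, p) for p in range(4)]                # order 2 has 4 products
          + [(3, 7)])                                 # order 3 has 1 product

# Step a: count(1) ... group by order_id.
prod_cnt = Counter(order_id for order_id, _ in priors)

# Step b: inner join orders with the per-order counts on order_id.
per_user = defaultdict(list)
for order_id, user_id in orders:
    if order_id in prod_cnt:
        per_user[user_id].append(prod_cnt[order_id])

# Steps c and d: sum(prod_cnt) / count(1) per user.
avg_prod = {user: sum(counts) / len(counts) for user, counts in per_user.items()}
print(avg_prod)
```

Because the join is an inner join, an order with no rows in priors simply drops out, which is why the Hive query filters orders to eval_set = 'prior' before joining.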
4) Distribution of each user's orders over the days of the week (rows to columns, i.e. a pivot)
user_id,dow0,dow1,dow2,dow3,dow4...dow6
1 0 0 1 2 2 0
select user_id,
       sum(case order_dow when '0' then 1 else 0 end) as dow0,
       sum(case order_dow when '1' then 1 else 0 end) as dow1,
       sum(case order_dow when '2' then 1 else 0 end) as dow2,
       sum(case order_dow when '3' then 1 else 0 end) as dow3,
       sum(case order_dow when '4' then 1 else 0 end) as dow4,
       sum(case order_dow when '5' then 1 else 0 end) as dow5,
       sum(case order_dow when '6' then 1 else 0 end) as dow6
from orders
group by user_id
limit 10;
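The sum(case ... end) pivot amounts to keeping seven counters per user. A minimal Python sketch of the same idea follows; it uses only the nine sample orders rows listed for user 1 in the field descriptions, so the counts cover that sample rather than the full data set.

```python
from collections import defaultdict

# (user_id, order_dow) taken from the nine sample rows of orders.csv shown earlier.
orders = [(1, 2), (1, 3), (1, 3), (1, 4), (1, 4), (1, 2), (1, 1), (1, 1), (1, 1)]

# One list of seven counters (dow0..dow6) per user,
# equivalent to sum(case order_dow when 'k' then 1 else 0 end).
dow_counts = defaultdict(lambda: [0] * 7)
for user_id, dow in orders:
    dow_counts[user_id][dow] += 1

for user_id, counts in dow_counts.items():
    print(user_id, *counts)
```

In Hive the seven case expressions are written out explicitly because the set of output columns must be known when the query is compiled.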