zoukankan html css js c++ java

hive 在统计汇总时候，如何去重

create  table xxxx.test_collect_list_set(

first_level_directory int ,
second_level_directory int ,
third_level_directory int ,
order_id int 

)

ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '	' 
  NULL DEFINED AS '' 
STORED AS ORC
tblproperties ('orc.compress'='SNAPPY');


insert overwrite into xxxx.test_collect_list_set
values
(1,2,3,1),
(1,2,3,1),
(1,2,3,4),
(1,2,4,1),
(1,2,4,5)

;


select first_level_directory,second_level_directory,collect_set(set_list) from ( 
select first_level_directory,
       second_level_directory,
       third_level_directory ,
       concat_ws(',',collect_set(order_id ))  set_list 
from xxxx.test_collect_list_set
group by first_level_directory,second_level_directory,third_level_directory
) a
GROUP by first_level_directory,second_level_directory

Error:
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hadoop.io.IntWritable


drop table xxxx.test_collect_list_set ;



SQL :
select first_level_directory,second_level_directory,udf(concat_ws(',',collect_set(set_list))) as set_list,udaf(udaf_set) as udaf_set from ( 
select first_level_directory,
       second_level_directory,
       third_level_directory , 
       concat_ws(',',collect_set(cast(order_id  as string))) set_list ,
	   udaf(order_id)  udaf_set 
from xxxx.test_collect_list_set
group by first_level_directory,second_level_directory,third_level_directory
) a
GROUP by first_level_directory,second_level_directory;

Result :
1	2	1,4,1,5	["1,4","1,5"]

TIps :

I want to remove the duplicted data count from the back 
the data should be become to as froming 
1,4,1,5   ---> 1,4,5    (do not care about the sort)

May be the bast mothod is write a udf funciton using java method to remove he duplicted data 

The other way is more complicated which suppliey by my colleges ,more info please refer to 
https://blog.csdn.net/inte_sleeper/article/details/7196114

My another colleges give me this sql ,it works fine alone ,but if we add more columns it will give some errors 

select collect_set(col) from (
select explode(split('1,2,3,4,1,2',',')) as col from test 
) a;


Also we can using udaf to finish this work ,as talked with college it seems more easy for sql developers


errors like :
UDTF's are not supported outside the SELECT clause, nor nested in expressions

查看全文

相关阅读:
用POP动画模拟真实秒钟摆动效果
 解析苹果的官方例子LazyTableImages实现图片懒加载原理
 支持xcode6的缓动函数Easing以及使用示例
 [转] iOS 动画库 Pop 和 Canvas 各自的优势和劣势是什么？
NSJSONSerialization能够处理的JSONData
[翻译] USING GIT IN XCODE [6] 在XCODE中使用GIT[6]
[翻译] USING GIT IN XCODE [5] 在XCODE中使用GIT[5]
[翻译] USING GIT IN XCODE [4] 在XCODE中使用GIT[4]
[翻译] USING GIT IN XCODE [3] 在XCODE中使用GIT[3]
【转】断点继传

原文地址：https://www.cnblogs.com/TendToBigData/p/10501185.html