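collect_set(x): aggregate function that gathers the values of x from multiple rows into a single array, with duplicates removed.
collect_list(x): aggregate function that gathers the values of x from multiple rows into a single array, keeping duplicates.
concat_ws: concatenation function; once multiple rows have been collected into one field, it joins the values into a single string with a separator.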
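UDF (User-Defined Function): an ordinary user-defined function; it operates on the values of a single row at a time.
UDAF (User-Defined Aggregate Function): a user-defined aggregate function; it operates on multiple rows and returns a single result. The SUM() and AVG() functions commonly used in SQL are aggregate functions of the same kind.
UDTF (User-Defined Table-Generating Function): addresses the need to take one row as input and produce multiple rows as output (one-to-many mapping).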
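lateral view is used together with UDTFs such as explode (often fed by split, which is an ordinary function that returns an array) to turn one row into multiple rows; the resulting rows can then be aggregated. lateral view first applies the UDTF to each row of the base table, the UDTF expands that row into zero or more rows, and lateral view then joins the output rows back to the source row, producing a virtual table that supports aliases. In the example below, lateral view explode(subdinates) adTable as aa, adTable is the alias of the virtual table and aa is the alias of its column.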
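explode(ARRAY): produces one row for each element of the array.
explode(MAP): produces one row for each key-value pair of the map, with the key in one column and the value in another.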
| CREATE TABLE `employees`( |
| `name` string, |
| `salary` float, |
| `subdinates` array<string>, |
| `deducation` map<string,float>, |
| `address` struct<street:string,city:string,state:string,zip:int>) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/employees' |
| TBLPROPERTIES ( |
| 'creator'='tianyongtao', |
| 'last_modified_by'='root', |
| 'last_modified_time'='1521447397', |
| 'numFiles'='0', |
| 'numRows'='0', |
| 'rawDataSize'='0', |
| 'totalSize'='0', |
| 'transient_lastDdlTime'='1521447397') |
+----------------------------------------------------------------------+--+
Handling Array-type columns
0: jdbc:hive2://192.168.53.122:10000/default> select name,subdinates from employees;
+---------------+-------------------------+--+
| name | subdinates |
+---------------+-------------------------+--+
| tianyongtao | ["wang","ZHANG","LIU"] |
| wangyangming | ["ma","zhong"] |
+---------------+-------------------------+--+
2 rows selected (0.301 seconds)
0: jdbc:hive2://192.168.53.122:10000/default> select name,aa from employees lateral view explode(subdinates) adTable as aa;
+---------------+--------+--+
| name | aa |
+---------------+--------+--+
| tianyongtao | wang |
| tianyongtao | ZHANG |
| tianyongtao | LIU |
| wangyangming | ma |
| wangyangming | zhong |
+---------------+--------+--+
5 rows selected (0.312 seconds)
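The exploded rows can be aggregated like any other rows. A minimal sketch, reusing the employees table and the aliases from the query above; the output column name subordinate_count is only illustrative and does not come from the original notes:

```sql
-- Count how many subordinates each employee has.
-- adTable / aa are the same virtual-table and column aliases used above;
-- subordinate_count is a hypothetical output name.
select name, count(aa) as subordinate_count
from employees
lateral view explode(subdinates) adTable as aa
group by name;
```

With the two rows shown earlier, this should return 3 for tianyongtao and 2 for wangyangming.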
Handling Map-type columns
0: jdbc:hive2://192.168.53.122:10000/default> select deducation from employees;
+---------------------------------+--+
| deducation |
+---------------------------------+--+
| {"aaa":10.0,"bb":5.0,"CC":8.0} |
| {"aaa":6.0,"bb":12.0} |
+---------------------------------+--+
2 rows selected (0.315 seconds)
0: jdbc:hive2://192.168.53.122:10000/default> select explode(deducation) as (aa,bb) from employees;
+------+-------+--+
| aa | bb |
+------+-------+--+
| aaa | 10.0 |
| bb | 5.0 |
| CC | 8.0 |
| aaa | 6.0 |
| bb | 12.0 |
+------+-------+--+
5 rows selected (0.314 seconds)
0: jdbc:hive2://192.168.53.122:10000/default> select name,aa,bb from employees lateral view explode(deducation) mtable as aa,bb;
+---------------+------+-------+--+
| name | aa | bb |
+---------------+------+-------+--+
| tianyongtao | aaa | 10.0 |
| tianyongtao | bb | 5.0 |
| tianyongtao | CC | 8.0 |
| wangyangming | aaa | 6.0 |
| wangyangming | bb | 12.0 |
+---------------+------+-------+--+
5 rows selected (0.347 seconds)
0: jdbc:hive2://192.168.53.122:10000/default> select name,aa,bb,cc from employees lateral view explode(deducation) mtable as aa,bb lateral view explode(subdinates) adTable as cc;
+---------------+------+-------+--------+--+
| name | aa | bb | cc |
+---------------+------+-------+--------+--+
| tianyongtao | aaa | 10.0 | wang |
| tianyongtao | aaa | 10.0 | ZHANG |
| tianyongtao | aaa | 10.0 | LIU |
| tianyongtao | bb | 5.0 | wang |
| tianyongtao | bb | 5.0 | ZHANG |
| tianyongtao | bb | 5.0 | LIU |
| tianyongtao | CC | 8.0 | wang |
| tianyongtao | CC | 8.0 | ZHANG |
| tianyongtao | CC | 8.0 | LIU |
| wangyangming | aaa | 6.0 | ma |
| wangyangming | aaa | 6.0 | zhong |
| wangyangming | bb | 12.0 | ma |
| wangyangming | bb | 12.0 | zhong |
+---------------+------+-------+--------+--+
13 rows selected (0.305 seconds)
Struct-type columns:
0: jdbc:hive2://192.168.53.122:10000/default> select name,address.street,address.city,address.state from employees;
+---------------+---------+-----------+----------+--+
| name | street | city | state |
+---------------+---------+-----------+----------+--+
| tianyongtao | HENAN | LUOHE | LINYING |
| wangyangming | hunan | changsha | NULL |
+---------------+---------+-----------+----------+--+
2 rows selected (0.309 seconds)
collect_set(): deduplicates and aggregates the values of a column, producing an Array-type column.
0: jdbc:hive2://192.168.53.122:10000/default> select * from cust;
+------------------+-----------+----------------+--+
| cust.custname | cust.sex | cust.nianling |
+------------------+-----------+----------------+--+
| tianyt_touch100 | 1 | 50 |
| wangwu | 1 | 85 |
| zhangsan | 1 | 20 |
| liuqin | 0 | 56 |
| wangwu | 0 | 47 |
| liuyang | 1 | 32 |
| hello | 0 | 100 |
| mahuateng | 1 | 1001 |
| tianyt_touch100 | 1 | 50 |
| wangwu | 1 | 85 |
| zhangsan | 1 | 20 |
| liuqin | 0 | 56 |
| wangwu | 0 | 47 |
| nihao | 1 | 5 |
| liuyang | 1 | 32 |
| hello | 0 | 100 |
| mahuateng | 1 | 1001 |
| nihao | 1 | 5 |
+------------------+-----------+----------------+--+
scala> hcon.sql("select sex,collect_set(nianling) from gamedw.cust group by sex").show
+---+---------------------+
|sex|collect_set(nianling)|
+---+---------------------+
| 1| [85, 5, 20, 50, 3...|
| 0| [100, 56, 47]|
+---+---------------------+
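For comparison, collect_list keeps duplicates where collect_set removes them, and concat_ws can flatten either array into a delimited string. A sketch against the same cust table, assuming nianling might not be a string column (hence the cast, since concat_ws expects strings or an array of strings); the output column names are illustrative:

```sql
-- all_ages keeps duplicates, distinct_ages is deduplicated and joined with commas.
-- all_ages / distinct_ages are hypothetical names, not part of the original notes.
select sex,
       collect_list(nianling) as all_ages,
       concat_ws(',', collect_set(cast(nianling as string))) as distinct_ages
from gamedw.cust
group by sex;
```

The order of elements inside the collected arrays is not guaranteed, so it should not be relied on.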
0: jdbc:hive2://192.168.53.122:10000/default> select * from cityinfo;
+----------------+---------------------------------------------------------------+--+
| cityinfo.city | cityinfo.districts |
+----------------+---------------------------------------------------------------+--+
| shenzhen | longhua,futian,baoan,longgang,dapeng,guangming,nanshan,luohu |
| qingdao | shinan,lichang,jimo,jiaozhou,huangdao,laoshan |
+----------------+---------------------------------------------------------------+--+
0: jdbc:hive2://192.168.53.122:10000/default> select city,area from cityinfo lateral view explode(split(districts,",")) areatable as area;
+-----------+------------+--+
| city | area |
+-----------+------------+--+
| shenzhen | longhua |
| shenzhen | futian |
| shenzhen | baoan |
| shenzhen | longgang |
| shenzhen | dapeng |
| shenzhen | guangming |
| shenzhen | nanshan |
| shenzhen | luohu |
| qingdao | shinan |
| qingdao | lichang |
| qingdao | jimo |
| qingdao | jiaozhou |
| qingdao | huangdao |
| qingdao | laoshan |
+-----------+------------+--+
14 rows selected (0.479 seconds)
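Going the other way, the exploded rows can be folded back into one delimited string per city with collect_list and concat_ws. This is a sketch of the round trip, not something taken from the original session; the subquery alias t is arbitrary:

```sql
-- Explode the comma-separated districts, then rebuild one string per city.
select city, concat_ws(',', collect_list(area)) as districts
from (
  select city, area
  from cityinfo
  lateral view explode(split(districts, ',')) areatable as area
) t
group by city;
```

Because collect_list does not guarantee element order, the rebuilt string may list the districts in a different order than the original column.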
Given the data below, compute the running maximum and the running sum up to each month:
scala> hcon.sql("select * from gamedw.visists order by custid,monthid").show
+------+-------+-----+
|custid|monthid|times|
+------+-------+-----+
| 1| 201801| 25|
| 1| 201801| 10|
| 1| 201802| 35|
| 1| 201802| 7|
| 1| 201803| 52|
| 1| 201805| 6|
| 2| 201801| 32|
| 2| 201801| 1|
| 2| 201802| 10|
| 2| 201802| 18|
| 2| 201803| 91|
| 2| 201804| 6|
| 2| 201804| 4|
| 2| 201805| 31|
+------+-------+-----+
scala> hcon.sql("select custid,b.monthid,sum(times),max(times) from gamedw.visists a inner join (select distinct monthid from gamedw.visists) b on a.monthid<=b.monthid group by custid,b.monthid order by custid,b.monthid").show
+------+-------+----------+----------+
|custid|monthid|sum(times)|max(times)|
+------+-------+----------+----------+
| 1| 201801| 35| 25|
| 1| 201802| 77| 35|
| 1| 201803| 129| 52|
| 1| 201804| 129| 52|
| 1| 201805| 135| 52|
| 2| 201801| 33| 32|
| 2| 201802| 61| 32|
| 2| 201803| 152| 91|
| 2| 201804| 162| 91|
| 2| 201805| 193| 91|
+------+-------+----------+----------+
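The same running totals can also be computed with window functions instead of a self-join. A sketch under the assumption that the Hive or Spark SQL version in use supports windowing; the names month_times, month_max, running_sum and running_max are illustrative:

```sql
-- First aggregate per customer and month, then accumulate within each customer.
select custid,
       monthid,
       sum(month_times) over (partition by custid order by monthid
                              rows between unbounded preceding and current row) as running_sum,
       max(month_max)   over (partition by custid order by monthid
                              rows between unbounded preceding and current row) as running_max
from (
  select custid, monthid, sum(times) as month_times, max(times) as month_max
  from gamedw.visists
  group by custid, monthid
) m
order by custid, monthid;
```

One difference from the join version: a month in which a customer has no rows (201804 for custid 1 in the data above) produces no output row here, whereas the join against the distinct month list carries the previous totals forward.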
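When joining, put the smaller table on the left: Hive buffers the earlier tables of a join and streams the last one.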