  • Hive, Part 2

    Because of a Spark-on-Hive problem that prevents inserting data, Spark is used to run the Hive operations for now.

    Inserting data into a partitioned table

    hive> show partitions customers;
    OK
    partition
    city=beijing
    city=shenzhen
    Time taken: 0.071 seconds, Fetched: 2 row(s)

    Check HDFS:

    drwxr-xr-x   - root supergroup          0 2018-04-13 11:50 /user/hive/warehouse/gamedw.db/cust
    -rwxr-xr-x   1 root supergroup         96 2018-04-11 17:14 /user/hive/warehouse/gamedw.db/cust/customers.txt
    -rwxr-xr-x   1 root supergroup         10 2018-04-13 11:50 /user/hive/warehouse/gamedw.db/cust/part-00000-ebb85cb3-73e4-4d7b-aa07-8f007c8887bb-c000

    drwx-wx-wx   - root supergroup          0 2018-03-29 16:35 /user/hive/warehouse/gamedw.db/customers
    drwx-wx-wx   - root supergroup          0 2018-03-29 16:35 /user/hive/warehouse/gamedw.db/customers/city=beijing
    -rwx-wx-wx   1 root supergroup         80 2018-03-29 16:35 /user/hive/warehouse/gamedw.db/customers/city=beijing/customers.txt
    drwx-wx-wx   - root supergroup          0 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen
    -rwx-wx-wx   1 root supergroup         80 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen/customers.txt

    Insert data:

    scala> hivecon.sql("insert into gamedw.customers partition(city='guangzhen')select * from gamedw.cust ").show
    18/04/13 14:14:44 WARN hive.log: Updating partition stats fast for: customers  
    18/04/13 14:14:44 WARN hive.log: Updated size to 106
    ++
    ||
    ++
    ++

    scala> hivecon.sql("show partitions gamedw.customers ").show
    +--------------+
    |     partition|
    +--------------+
    |  city=beijing|
    |city=guangzhen|
    | city=shenzhen|
    +--------------+

    Then check HDFS again:

    drwx-wx-wx   - root supergroup          0 2018-04-13 14:14 /user/hive/warehouse/gamedw.db/customers
    drwx-wx-wx   - root supergroup          0 2018-03-29 16:35 /user/hive/warehouse/gamedw.db/customers/city=beijing
    -rwx-wx-wx   1 root supergroup         80 2018-03-29 16:35 /user/hive/warehouse/gamedw.db/customers/city=beijing/customers.txt
    drwx-wx-wx   - root supergroup          0 2018-04-13 14:14 /user/hive/warehouse/gamedw.db/customers/city=guangzhen
    -rwx-wx-wx   1 root supergroup         96 2018-04-13 14:14 /user/hive/warehouse/gamedw.db/customers/city=guangzhen/part-00000-8513f59a-0fb6-45f8-bf87-1d27254d9b74-c000
    -rwx-wx-wx   1 root supergroup         10 2018-04-13 14:14 /user/hive/warehouse/gamedw.db/customers/city=guangzhen/part-00001-8513f59a-0fb6-45f8-bf87-1d27254d9b74-c000
    drwx-wx-wx   - root supergroup          0 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen
    -rwx-wx-wx   1 root supergroup         80 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen/customers.txt

    Because table cust's data consists of two files, the insert automatically creates a new partition in the customers table, and two new files appear in HDFS:

    [root@host ~]# hdfs dfs -cat /user/hive/warehouse/gamedw.db/customers/city=guangzhen/part-00001-8513f59a-0fb6-45f8-bf87-1d27254d9b74-c000
    nihao 1 5
    [root@host ~]# hdfs dfs -cat /user/hive/warehouse/gamedw.db/customers/city=guangzhen/part-00000-8513f59a-0fb6-45f8-bf87-1d27254d9b74-c000
    tianyt_touch100 1 50
    wangwu 1 85
    zhangsan 1 20
    liuqin 0 56
    wangwu 0 47
    liuyang 1 32
    hello 0 100

    Insert more data the same way:

    scala> hivecon.sql("insert overwrite table  gamedw.customers partition(city='wuhan')select * from gamedw.cust where nianling<>5").show
    18/04/13 14:31:13 WARN hive.log: Updating partition stats fast for: customers  
    18/04/13 14:31:13 WARN hive.log: Updated size to 96
    ++
    ||
    ++
    ++

    Check HDFS:

    drwx-wx-wx   - root supergroup          0 2018-04-13 14:31 /user/hive/warehouse/gamedw.db/customers/city=wuhan
    -rwx-wx-wx   1 root supergroup         96 2018-04-13 14:31 /user/hive/warehouse/gamedw.db/customers/city=wuhan/part-00000-d3a1ad44-e0d6-4a1a-9dd0-b42933085b46-c000
    -rwx-wx-wx   1 root supergroup          0 2018-04-13 14:31 /user/hive/warehouse/gamedw.db/customers/city=wuhan/part-00001-d3a1ad44-e0d6-4a1a-9dd0-b42933085b46-c000

    [root@host ~]# hdfs dfs -text /user/hive/warehouse/gamedw.db/customers/city=wuhan/part-00001-d3a1ad44-e0d6-4a1a-9dd0-b42933085b46-c000
    [root@host ~]# hdfs dfs -text /user/hive/warehouse/gamedw.db/customers/city=wuhan/part-00000-d3a1ad44-e0d6-4a1a-9dd0-b42933085b46-c000
    tianyt_touch100 1 50
    wangwu 1 85
    zhangsan 1 20
    liuqin 0 56
    wangwu 0 47
    liuyang 1 32
    hello 0 100

    Notice that although one file's worth of data was filtered out completely, the insert still produces an empty file for it.

    Adding two partitions at once:

    scala> hivecon.sql("from gamedw.cust cst insert overwrite table  gamedw.customers partition(city='nanjing') select *   where cst.nianling<>5  insert overwrite table  gamedw.customers partition(city='luohe') select *   where cst.nianling=5 ").show

    18/04/13 15:16:38 WARN hive.log: Updating partition stats fast for: customers
    18/04/13 15:16:38 WARN hive.log: Updated size to 96
    18/04/13 15:16:39 WARN hive.log: Updating partition stats fast for: customers
    18/04/13 15:16:39 WARN hive.log: Updated size to 10
    ++
    ||
    ++
    ++

    hive> show partitions customers;
    OK
    partition
    city=beijing
    city=guangzhen
    city=luohe
    city=nanjing
    city=shenzhen
    city=wuhan
    Time taken: 0.086 seconds, Fetched: 6 row(s)

    hive> select * from customers where city='luohe' or city='nanjing';
    OK
    customers.custname      customers.sex   customers.age   customers.city
    nihao   1       5       luohe
    tianyt_touch100 1       50      nanjing
    wangwu  1       85      nanjing
    zhangsan        1       20      nanjing
    liuqin  0       56      nanjing
    wangwu  0       47      nanjing
    liuyang 1       32      nanjing
    hello   0       100     nanjing
    Time taken: 1.037 seconds, Fetched: 8 row(s)

    drwx-wx-wx   - root supergroup          0 2018-04-13 15:16 /user/hive/warehouse/gamedw.db/customers/city=luohe
    -rwx-wx-wx   1 root supergroup          0 2018-04-13 15:16 /user/hive/warehouse/gamedw.db/customers/city=luohe/part-00000-0315be67-7d9a-451a-bdea-09f18f02b97c-c000
    -rwx-wx-wx   1 root supergroup         10 2018-04-13 15:16 /user/hive/warehouse/gamedw.db/customers/city=luohe/part-00001-0315be67-7d9a-451a-bdea-09f18f02b97c-c000
    drwx-wx-wx   - root supergroup          0 2018-04-13 15:16 /user/hive/warehouse/gamedw.db/customers/city=nanjing
    -rwx-wx-wx   1 root supergroup         96 2018-04-13 15:16 /user/hive/warehouse/gamedw.db/customers/city=nanjing/part-00000-4e959b24-fff2-45c0-a3d7-f4136d7ea96c-c000
    -rwx-wx-wx   1 root supergroup          0 2018-04-13 15:16 /user/hive/warehouse/gamedw.db/customers/city=nanjing/part-00001-4e959b24-fff2-45c0-a3d7-f4136d7ea96c-c000

    Both partitions were added successfully in a single statement.

    All of the insert-select statements above are static partition inserts;

    Dynamic partition inserts:

    scala> hivecon.sql("insert overwrite table gamedw.customers partition(city) select custname,sex,nianling,'kaifeng' as city from gamedw.cust").show
    org.apache.spark.SparkException: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
      at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:301)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
      at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
      at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
      at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
      at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:691)
      ... 48 elided

    hive> set hive.exec.dynamic.partition.mode=nonstrict;    (setting this in the Hive CLI has no effect here; it must be set in Spark)

    scala> hivecon.sql("set hive.exec.dynamic.partition.mode=nonstrict").show

    +--------------------+---------+
    |                 key|    value|
    +--------------------+---------+
    |hive.exec.dynamic...|nonstrict|
    +--------------------+---------+

    scala> hivecon.sql("insert overwrite table gamedw.customers partition(city) select custname,sex,nianling,'kaifeng' as city from gamedw.cust").show

    18/04/13 15:33:46 WARN hive.log: Updating partition stats fast for: customers
    18/04/13 15:33:46 WARN hive.log: Updated size to 106

    ++
    ||
    ++
    ++

    hive> select * from customers where city='kaifeng';
    OK
    customers.custname      customers.sex   customers.age   customers.city
    tianyt_touch100 1       50      kaifeng
    wangwu  1       85      kaifeng
    zhangsan        1       20      kaifeng
    liuqin  0       56      kaifeng
    wangwu  0       47      kaifeng
    liuyang 1       32      kaifeng
    hello   0       100     kaifeng
    nihao   1       5       kaifeng
    Time taken: 0.161 seconds, Fetched: 8 row(s)

    Bingo, the dynamic insert succeeded.

    Creating a table and loading data in a single statement

    scala> hivecon.sql("create table cust_copy as  select custname,sex,nianling from gamedw.cust").show

    18/04/13 15:46:38 WARN metastore.HiveMetaStore: Location: hdfs://localhost:9000/user/hive/warehouse/cust_copy specified for non-external table:cust_copy

    ++
    ||
    ++
    ++

    Check HDFS:

    drwx-wx-wx   - root supergroup          0 2018-04-13 15:58 /user/hive/warehouse/gamedw.db/cust_copy
    -rwx-wx-wx   1 root supergroup         96 2018-04-13 15:58 /user/hive/warehouse/gamedw.db/cust_copy/part-00000-8e3d26e5-5b38-4308-b631-9815f6730acc-c000
    -rwx-wx-wx   1 root supergroup         10 2018-04-13 15:58 /user/hive/warehouse/gamedw.db/cust_copy/part-00001-8e3d26e5-5b38-4308-b631-9815f6730acc-c000

    drwxr-xr-x   - root supergroup          0 2018-04-13 11:50 /user/hive/warehouse/gamedw.db/cust
    -rwxr-xr-x   1 root supergroup         96 2018-04-11 17:14 /user/hive/warehouse/gamedw.db/cust/customers.txt
    -rwxr-xr-x   1 root supergroup         10 2018-04-13 11:50 /user/hive/warehouse/gamedw.db/cust/part-00000-ebb85cb3-73e4-4d7b-aa07-8f007c8887bb-c000

    Exporting data:

    If the table's data files happen to be in the format you need, simply copy the files from HDFS;

    otherwise:

    insert overwrite local directory '/root/custtmp' select * from  tablename;
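
    If a custom delimiter is needed on export, Hive 0.11+ also accepts a row format clause here; a minimal sketch (the comma delimiter is just an illustration):

    insert overwrite local directory '/root/custtmp'
    row format delimited fields terminated by ','
    select * from tablename;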

    Queries

    Loading data into the employees table:

    CREATE TABLE `employees`(
      `name` string,
      `salary` float,
      `subdinates` array<string>,
      `deducation` map<string,float>,
      `address` struct<street:string,city:string,state:string,zip:int>)

    [root@host tmpdata]# cat -A employees.txt     
    tianyongtao^A100^Awang^BZHANG^BLIU^Aaaa^C10^Bbb^C5^BCC^C8^AHENAN^BLUOHE^BLINYING^BTAICHEN$
    wangyangming^A200^Ama^Bzhong^Aaaa^C6^Bbb^C12^Ahunan^Bchangsha$

    [root@host tmpdata]# hdfs dfs -put employees.txt /user/hive/warehouse/gamedw.db/employees

    hive> select *  from employees;
    OK
    employees.name  employees.salary        employees.subdinates    employees.deducation    employees.address
    tianyongtao     100.0   ["wang","ZHANG","LIU"]  {"aaa":10.0,"bb":5.0,"CC":8.0}  {"street":"HENAN","city":"LUOHE","state":"LINYING","zip":null}
    wangyangming    200.0   ["ma","zhong"]  {"aaa":6.0,"bb":12.0}   {"street":"hunan","city":"changsha","state":null,"zip":null}
    Time taken: 0.14 seconds, Fetched: 2 row(s) 

    hive> select name,salary,subdinates[0] subdinate1,deducation["bb"] bb,address.city from employees;
    OK
    name    salary  subdinate1      bb      city
    tianyongtao     100.0   wang    5.0     LUOHE
    wangyangming    200.0   ma      12.0    changsha
    Time taken: 0.134 seconds, Fetched: 2 row(s)
    Array indexes are 0-based; map elements are accessed by key; a struct element is referenced as struct_name.attribute_name, e.g. address.city above.

     hive> select name,salary,subdinates[2] subdinate1,deducation["cc"] cc,address.city from employees;
    OK
    name    salary  subdinate1      cc      city
    tianyongtao     100.0   LIU     NULL    LUOHE
    wangyangming    200.0   NULL    NULL    changsha
    Time taken: 0.123 seconds, Fetched: 2 row(s)

    When an index has no matching element, NULL is returned; the same goes for map keys (note the lookup is case-sensitive: deducation["cc"] is NULL even though deducation["CC"] exists).

    hive> select *  from employees where address.city='LUOHE';
    OK
    employees.name  employees.salary        employees.subdinates    employees.deducation    employees.address
    tianyongtao     100.0   ["wang","ZHANG","LIU"]  {"aaa":10.0,"bb":5.0,"CC":8.0}  {"street":"HENAN","city":"LUOHE","state":"LINYING","zip":null}
    Time taken: 0.126 seconds, Fetched: 1 row(s)
    hive> select *  from employees where deducation["bb"]=12;
    OK
    employees.name  employees.salary        employees.subdinates    employees.deducation    employees.address
    wangyangming    200.0   ["ma","zhong"]  {"aaa":6.0,"bb":12.0}   {"street":"hunan","city":"changsha","state":null,"zip":null}
    Time taken: 0.125 seconds, Fetched: 1 row(s)

    Double quotes used inside an HQL statement must be escaped in Spark, because the statement itself is a double-quoted Scala string:

    scala> hivecon.sql("select upper(name) Uname,lower(name) lname,salary-deducation["aaa"]-deducation["bb"] sal from gamedw.employees").show
    +------------+------------+-----+
    |       Uname|       lname|  sal|
    +------------+------------+-----+
    | TIANYONGTAO| tianyongtao| 85.0|
    |WANGYANGMING|wangyangming|182.0|
    +------------+------------+-----+

    hive> select upper(name) Uname,lower(name) lname,salary-deducation["aaa"]-deducation["bb"] sal from employees;
    OK
    uname   lname   sal
    TIANYONGTAO     tianyongtao     85.0
    WANGYANGMING    wangyangming    182.0
    Time taken: 0.124 seconds, Fetched: 2 row(s)

    Functions

    scala> hivecon.sql("  select round(5.66) round,floor(8.55) floor,ceil(8.55) ceil,rand() rand,exp(8) exp,ln(16) ln,log10(100) log10,log2(5) log2,power(10,2) power,sqrt(100) sqrt,bin(5) bin,hex(10) hex,hex(9) hex1,abs(-100) abs,pmod(1,5) pmod,negative(-10) negative").show
    +-----+-----+----+------------------+------------------+-----------------+-----+-----------------+-----+----+---+---+----+---+----+--------+
    |round|floor|ceil|              rand|               exp|               ln|log10|             log2|power|sqrt|bin|hex|hex1|abs|pmod|negative|
    +-----+-----+----+------------------+------------------+-----------------+-----+-----------------+-----+----+---+---+----+---+----+--------+
    |    6|    8|   9|0.4827320089704821|2980.9579870417283|2.772588722239781|  2.0|2.321928094887362|100.0|10.0|101|  A|   9|100|   1|      10|
    +-----+-----+----+------------------+------------------+-----------------+-----+-----------------+-----+----+---+---+----+---+----+--------+

    The mathematical constant e is the base of the natural logarithm. It is sometimes called Euler's number, after the Swiss mathematician Euler; like π and the imaginary unit i, it is one of the most important constants in mathematics.

    Its value to 100 decimal places is approximately: e ≈ 2.71828 18284 59045 23536 02874 71352 66249 77572 47093 69995 95749 66967 62772 40766 30353 54759 45713 82178 52516 64274.

    The Hive function exp(double d) returns e raised to the power d:

    hive> select exp(1) exp1,exp(2) exp2,exp(8) exp8;
    OK
    exp1    exp2    exp8
    2.718281828459045       7.38905609893065        2980.9579870417283
    Time taken: 0.545 seconds, Fetched: 1 row(s)

    The Hive function ln(double d) returns the natural (base-e) logarithm of d:

    hive> select ln(7.38905609893065);
    OK
    _c0
    2.0
    Time taken: 0.136 seconds, Fetched: 1 row(s)

    log10(double d) returns the base-10 logarithm of d; log2(double d) returns the base-2 logarithm of d; log(double base, double d) returns the logarithm of d in the given base:

    hive> select log10(100),power(10,2),log2(8),power(2,3),log(2,16);
    OK
    _c0     _c1     _c2     _c3     _c4
    2.0     100.0   3.0     8.0     4.0
    Time taken: 0.164 seconds, Fetched: 1 row(s)

    bin(bigint i) returns the binary representation of i as a string:

    hive> select bin(7),bin(5);
    OK
    _c0     _c1
    111     101
    Time taken: 0.147 seconds, Fetched: 1 row(s)

    hex(bigint i) returns the hexadecimal representation of i as a string.

    Note: for an integer argument, hex returns the number's hexadecimal representation; for a string argument, it returns the hexadecimal representation of the string's bytes.

    hive> select hex(15),hex(9),hex(10),hex(31);
    OK
    _c0     _c1     _c2     _c3
    F       9       A       1F
    Time taken: 0.141 seconds, Fetched: 1 row(s)
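
    A quick sketch of the string case (ASCII 'a' is 0x61, so the expected result is 616263):

    hive> select hex('abc');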

     conv(BIGINT num, int from_base, int to_base)

    converts num from base from_base to base to_base:

    hive> select conv(17,10,16);
    OK
    _c0
    11
    Time taken: 0.156 seconds, Fetched: 1 row(s)

    Syntax: pmod(int a, int b), pmod(double a, double b)

    Return type: int or double

    Description: returns the positive remainder of a divided by b:

    hive> select pmod(8,3);
    OK
    _c0
    2
    Time taken: 0.549 seconds, Fetched: 1 row(s)

    negative(n) returns the negation of n

    positive(n) returns n unchanged

    hive> select positive(-10),positive(10),negative(-10),negative(10);
    OK
    _c0     _c1     _c2     _c3
    -10     10      10      -10
    Time taken: 0.158 seconds, Fetched: 1 row(s)

    hive> select e(),pi();
    OK
    _c0     _c1
    2.718281828459045       3.141592653589793
    Time taken: 0.144 seconds, Fetched: 1 row(s)

    Aggregate functions

    scala> hivecon.sql("select city,count(1),round(avg(age))  from gamedw.customers group by city").show
    +---------+--------+----------------------------------+
    |     city|count(1)|round(avg(CAST(age AS BIGINT)), 0)|
    +---------+--------+----------------------------------+
    |    wuhan|       7|                              56.0|
    |  qingdao|       8|                              49.0|
    |  beijing|       6|                              48.0|
    |    luohe|       1|                               5.0|
    |guangzhen|       8|                              49.0|
    |  nanjing|       7|                              56.0|
    |  kaifeng|       8|                              49.0|
    | shenzhen|       6|                              48.0|
    +---------+--------+----------------------------------+

    count(*) counts all rows, including rows that contain NULL values

    count(expr) counts only the rows where expr evaluates to non-NULL
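
    A minimal sketch of the difference (assuming a table where some age values are NULL; in the sample data above none are):

    hive> select count(*), count(age) from gamedw.customers;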

    scala> hivecon.sql("select sum(age),sum(distinct age)  from gamedw.customers").show
    +--------+-----------------+                                                   
    |sum(age)|sum(DISTINCT age)|
    +--------+-----------------+
    |    2550|              395|
    +--------+-----------------+

    scala> hivecon.sql("select sum(age),sum(distinct age),avg(age),avg(distinct age)  from gamedw.customers").show
    +--------+-----------------+--------+-----------------+                        
    |sum(age)|sum(DISTINCT age)|avg(age)|avg(DISTINCT age)|
    +--------+-----------------+--------+-----------------+
    |    2550|              395|    50.0|           49.375|
    +--------+-----------------+--------+-----------------+

    Table-generating functions:

    explode accepts an array or a map as its argument

    hive> select * from employees;
    OK
    employees.name  employees.salary        employees.subdinates    employees.deducation    employees.address
    tianyongtao     100.0   ["wang","ZHANG","LIU"]  {"aaa":10.0,"bb":5.0,"CC":8.0}  {"street":"HENAN","city":"LUOHE","state":"LINYING","zip":null}
    wangyangming    200.0   ["ma","zhong"]  {"aaa":6.0,"bb":12.0}   {"street":"hunan","city":"changsha","state":null,"zip":null}
    Time taken: 0.114 seconds, Fetched: 2 row(s)
    hive> select  explode(subdinates) from employees;
    OK
    col
    wang
    ZHANG
    LIU
    ma
    zhong
    Time taken: 0.118 seconds, Fetched: 5 row(s)
    hive> select  explode(subdinates),explode(deducation) from employees;
    FAILED: SemanticException 1:28 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'deducation'
    hive> select explode(deducation) from employees;
    OK
    key     value
    aaa     10.0
    bb      5.0
    CC      8.0
    aaa     6.0
    bb      12.0
    Time taken: 0.115 seconds, Fetched: 5 row(s)

    hive> select explode(address) from employees;
    FAILED: UDFArgumentException explode() takes an array or a map as a parameter

    json

    hive> select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}', '$.store.fruit.type');
    OK
    _c0
    ["apple","pear"]
    Time taken: 0.147 seconds, Fetched: 1 row(s)

    parse_url

     hive> select parse_url('http://data.zqgame.com:3689/Report/ad_main_bydate_shouyou.aspx?adcode=256004296705&gameid=3004&sdate=20180101&edate=20180123','QUERY');
    OK
    _c0
    adcode=256004296705&gameid=3004&sdate=20180101&edate=20180123
    Time taken: 0.143 seconds, Fetched: 1 row(s)

    hive> select parse_url('http://data.zqgame.com:3689/Report/ad_main_bydate_shouyou.aspx?adcode=256004296705&gameid=3004&sdate=20180101&edate=20180123','QUERY','gameid');
    OK
    _c0
    3004
    Time taken: 0.139 seconds, Fetched: 1 row(s)
    hive> select parse_url('http://data.zqgame.com:3689/Report/ad_main_bydate_shouyou.aspx?adcode=256004296705&gameid=3004&sdate=20180101&edate=20180123','HOST');
    OK
    _c0
    data.zqgame.com
    Time taken: 0.14 seconds, Fetched: 1 row(s)
    hive> select parse_url('http://data.zqgame.com:3689/Report/ad_main_bydate_shouyou.aspx?adcode=256004296705&gameid=3004&sdate=20180101&edate=20180123','PATH');
    OK
    _c0
    /Report/ad_main_bydate_shouyou.aspx
    Time taken: 0.146 seconds, Fetched: 1 row(s)

    其他内置函数

    hive> select instr('aggggaaeg','aa'),length('aaaggg1111'),locate('aa','aggggaaeg'),lower('aaaA'),lcase('aaaA'),upper('aggefR');
    OK
    _c0     _c1     _c2     _c3     _c4     _c5
    6       10      6       aaaa    aaaa    AGGEFR
    Time taken: 0.14 seconds, Fetched: 1 row(s)

    hive> select cast('10' as int),ascii('a'),ascii(1),concat('how','are','you'),concat_ws(' ','how','are','you'),find_in_set('a','dgeage'),find_in_set('a','ggg,a,b'),format_number(20.25466,2);
    OK
    _c0     _c1     _c2     _c3     _c4     _c5     _c6     _c7
    10      97      49      howareyou       how are you     0       2       20.25

     scala> hivecon.sql("select lpad('aaaa',10,'bb'),lpad('aaaa',5,'bb'),ltrim(' aafdf'),repeat('55',5),space(10)").show
    +------------------+-----------------+-------------+-------------+----------+
    |lpad(aaaa, 10, bb)|lpad(aaaa, 5, bb)|ltrim( aafdf)|repeat(55, 5)| space(10)|
    +------------------+-----------------+-------------+-------------+----------+
    |        bbbbbbaaaa|            baaaa|        aafdf|   5555555555|          |
    +------------------+-----------------+-------------+-------------+----------+

    hive> select * from employees;
    OK
    employees.name  employees.salary        employees.subdinates    employees.deducation    employees.address
    tianyongtao     100.0   ["wang","ZHANG","LIU"]  {"aaa":10.0,"bb":5.0,"CC":8.0}  {"street":"HENAN","city":"LUOHE","state":"LINYING","zip":null}
    wangyangming    200.0   ["ma","zhong"]  {"aaa":6.0,"bb":12.0}   {"street":"hunan","city":"changsha","state":null,"zip":null}
    Time taken: 0.103 seconds, Fetched: 2 row(s)
    hive> select name,size(subdinates),size(deducation) from employees;
    OK
    name    c1      c2
    tianyongtao     3       3
    wangyangming    2       2
    Time taken: 0.106 seconds, Fetched: 2 row(s)

    Converting a string to a map

    hive> select str_to_map("gameid-100,name-'jzry'",',','-')
        > ;
    OK
    _c0
    {"gameid":"100","name":"'jzry'"}
    Time taken: 0.146 seconds, Fetched: 1 row(s)

    hive> select from_unixtime( unix_timestamp(),'yyyyMMdd'), from_unixtime( unix_timestamp()),unix_timestamp(),unix_timestamp('2018-04-18','yyyy-MM-dd');
    unix_timestamp(void) is deprecated. Use current_timestamp instead.
    unix_timestamp(void) is deprecated. Use current_timestamp instead.
    unix_timestamp(void) is deprecated. Use current_timestamp instead.
    OK
    _c0     _c1     _c2     _c3
    20180418        2018-04-18 15:13:37     1524035617      1523980800
    Time taken: 0.178 seconds, Fetched: 1 row(s)

    hive> select current_timestamp(),year(current_timestamp()),month(current_timestamp()),day(current_timestamp()),hour(current_timestamp()),second(current_timestamp());
    OK
    _c0     _c1     _c2     _c3     _c4     _c5
    2018-04-18 15:17:11.882 2018    4       18      15      11
    Time taken: 0.133 seconds, Fetched: 1 row(s)

    hive> select weekofyear(current_timestamp()),date_add(current_timestamp(),2),datediff('2018-01-01',current_timestamp());
    OK
    _c0     _c1     _c2
    16      2018-04-20      -107
    Time taken: 0.139 seconds, Fetched: 1 row(s)

    Limiting to the first three rows works just as in MySQL:

    hive> select * from cust limit 3;
    OK
    cust.custname   cust.sex        cust.nianling
    tianyt_touch100 1       50
    wangwu  1       85
    zhangsan        1       20
    Time taken: 0.109 seconds, Fetched: 3 row(s)

    Nested queries

    hive> from (select  *  from cust) e select e.nianling where e.nianling>50;
    OK
    e.nianling
    85
    56
    100
    Time taken: 0.109 seconds, Fetched: 3 row(s)

    case when works essentially as in MySQL:

    hive> select custname,case when nianling>80 then 'old' else 'young' end ifold  from cust;
    OK
    custname        ifold
    tianyt_touch100 young
    wangwu  old
    zhangsan        young
    liuqin  young
    wangwu  young
    liuyang young
    hello   old
    nihao   young
    Time taken: 0.097 seconds, Fetched: 8 row(s)

    like and Rlike

    hive> select * from cust where custname like '%nihao%';
    OK
    cust.custname   cust.sex        cust.nianling
    nihao   1       5
    Time taken: 0.104 seconds, Fetched: 1 row(s)

    hive> select * from cust where custname Rlike '.*(wang|nihao).*';
    OK
    cust.custname   cust.sex        cust.nianling
    wangwu  1       85
    wangwu  0       47
    nihao   1       5
    Time taken: 0.089 seconds, Fetched: 3 row(s)

    In the regular expression after Rlike, a dot (.) matches any single character, * repeats the preceding element (the dot in the example above) zero or more times, and (x|y) matches either x or y.

    group by / having: same as in MySQL

    inner join ... on: same as in MySQL

    When three tables a, b, c are joined, Hive normally produces one MapReduce job per join pair, evaluated left to right: a and b are joined in one MapReduce job, and that job's result is then joined with c in a second MapReduce job.

    Join optimization

    When three or more tables are joined and every join uses the same join key, only one MapReduce job is produced;
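
    A hedged sketch with hypothetical tables t1, t2, t3 joined on the same key, which Hive can compile into a single MapReduce job:

    select a.custname, b.col1, c.col2
    from t1 a
    join t2 b on a.custname = b.custname
    join t3 c on a.custname = c.custname;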

    Hive also assumes that the last table in the join is the largest: when joining each row, it tries to cache the other tables and stream through the last one. The tables in a query should therefore be listed in increasing order of size from left to right.

    Hive also provides a hint to tell the query optimizer which table is the big one: select /*+streamtable(b)*/ ..., where b is the largest table.

     scala> hivecon.sql("select /*+streamtable(b)*/ a.sexname,b.custname  from gamedw.tbsex a inner join gamedw.cust b on a.id=b.sex").show

    left outer join

    right outer join

    left semi join: returns the rows of the left table for which the right table satisfies the on condition
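
    A sketch using the document's own tables (only left-table columns may appear in the select list of a left semi join):

    select a.sexname
    from gamedw.tbsex a
    left semi join gamedw.cust b on (a.id = b.sex);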

    Hive does not support right semi join.

    A join without an on clause produces a Cartesian product.

    map-side join: if all but one of the joined tables are small, the small tables can be loaded into memory while the mappers stream through the largest table. Hive can then perform the join on the map side, matching rows against the in-memory tables one by one and skipping the reduce phase that the join would otherwise need. To trigger this optimization, add a hint: select /*+mapjoin(d)*/ ..., where d is the small table to load into memory.

    scala> hivecon.sql("select /*+mapjoin(a)*/ a.sexname,b.custname  from gamedw.tbsex a inner join gamedw.cust b on a.id=b.sex").show

    order by vs. sort by

    order by behaves as in other SQL dialects (a single global sort); sort by only sorts the data within each reducer, i.e. a local sort. This guarantees each reducer's output is ordered, which can make a subsequent global sort more efficient.
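
    A side-by-side sketch on the document's cust table:

    -- global sort across all data
    select * from gamedw.cust order by nianling;
    -- local sort within each reducer only
    select * from gamedw.cust sort by nianling;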

    distribute by together with sort by

    distribute by controls how map output is divided among the reducers. All data transferred in a MapReduce job is organized as key-value pairs, and Hive must use this mechanism internally when it translates statements into MapReduce jobs.

    By default, the MapReduce framework hashes the map output keys and distributes the key-value pairs evenly across the reducers. Unfortunately, this means that with sort by alone, the output ranges of different reducers will overlap noticeably. If we want records with the same sex processed together, distribute by guarantees that records with the same sex value go to the same reducer, and sort by then orders the data within each reducer as desired.

    scala> hivecon.sql("select * from gamedw.cust distribute by sex  sort by sex").show

    Hive requires distribute by to come before sort by.


    scala> hivecon.sql("select * from gamedw.cust cluster by sex").show

    cluster by is equivalent to distribute by + sort by on the same column

    Type conversion

    A binary value can be cast to a string, and a string can be cast to binary.
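
    A round-trip sketch (casting a string to binary and back):

    hive> select cast(cast('hive' as binary) as string);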

    Bucketed tables: for each table or partition, Hive can further organize the data into buckets, a finer-grained division of the data range. Bucketing is done on a specific column: Hive hashes the column value and takes the remainder modulo the number of buckets to decide which bucket a record is stored in. The benefits are higher query-processing efficiency and more efficient sampling.

    The CLUSTERED BY clause specifies which column to bucket on and how many buckets to divide the data into.

    hive> create table tb_bucket(id int,name string)  clustered by(id)  sorted by(name) into 5 buckets ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY ' ' ;
    OK
    Time taken: 0.22 seconds

    scala> hivecon.sql("insert overwrite table gamedw.tb_bucket select nianling,custname from gamedw.cust").show
    ++
    ||
    ++
    ++

    hive> select * from tb_bucket;
    OK
    tb_bucket.id    tb_bucket.name
    50      tianyt_touch100
    85      wangwu
    20      zhangsan
    56      liuqin
    47      wangwu
    32      liuyang
    100     hello
    1001    mahuateng
    5       nihao

    drwx-wx-wx   - root supergroup          0 2018-04-20 14:38 /user/hive/warehouse/gamedw.db/tb_bucket
    -rwx-wx-wx   1 root supergroup         82 2018-04-20 14:38 /user/hive/warehouse/gamedw.db/tb_bucket/part-00000-b5147761-6fcf-4810-aa9b-2e0cbead8eee-c000
    -rwx-wx-wx   1 root supergroup         15 2018-04-20 14:38 /user/hive/warehouse/gamedw.db/tb_bucket/part-00001-b5147761-6fcf-4810-aa9b-2e0cbead8eee-c000
    -rwx-wx-wx   1 root supergroup          8 2018-04-20 14:38 /user/hive/warehouse/gamedw.db/tb_bucket/part-00002-b5147761-6fcf-4810-aa9b-2e0cbead8eee-c000

    [root@host ~]# hdfs dfs -cat /user/hive/warehouse/gamedw.db/tb_bucket/part-00000-b5147761-6fcf-4810-aa9b-2e0cbead8eee-c000
    50 tianyt_touch100
    85 wangwu
    20 zhangsan
    56 liuqin
    47 wangwu
    32 liuyang
    100 hello
    [root@host ~]# hdfs dfs -cat /user/hive/warehouse/gamedw.db/tb_bucket/part-00001-b5147761-6fcf-4810-aa9b-2e0cbead8eee-c000
    1001 mahuateng
    [root@host ~]# hdfs dfs -cat /user/hive/warehouse/gamedw.db/tb_bucket/part-00002-b5147761-6fcf-4810-aa9b-2e0cbead8eee-c000
    5 nihao

    Sampling queries

    We can sample with rand(), a function that returns a random value:

    hive> select * from tb_bucket tablesample(bucket 3  out of 5 on rand()) s;
    OK
    s.id    s.name
    20      zhangsan
    32      liuyang
    100     hello
    5       nihao
    Time taken: 0.099 seconds, Fetched: 4 row(s)
    hive> select * from tb_bucket tablesample(bucket 3  out of 5 on rand()) s;
    OK
    s.id    s.name
    Time taken: 0.102 seconds
    hive> select * from tb_bucket tablesample(bucket 3  out of 5 on rand()) s;
    OK
    s.id    s.name
    50      tianyt_touch100
    Time taken: 0.103 seconds, Fetched: 1 row(s)

    We can also bucket on a specified column instead of rand(); then running the same statement n times returns the same result each time (here bucket 3 of 5 on id selects the rows where id % 5 == 2, i.e. 47 and 32):

    hive> select * from tb_bucket tablesample(bucket 3  out of 5 on id) s;
    OK
    s.id    s.name
    47      wangwu
    32      liuyang
    Time taken: 0.111 seconds, Fetched: 2 row(s)
    hive> select * from tb_bucket tablesample(bucket 3  out of 5 on id) s;
    OK
    s.id    s.name
    47      wangwu
    32      liuyang
    Time taken: 0.101 seconds, Fetched: 2 row(s)
    hive> select * from tb_bucket tablesample(bucket 3  out of 5 on id) s;
    OK
    s.id    s.name
    47      wangwu
    32      liuyang
    Time taken: 0.129 seconds, Fetched: 2 row(s)

    The table-sampling syntax in Hive is:

    table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

    In the statement, the denominator (y) is the number of buckets the data will be hashed into, and the numerator (x) identifies the bucket that is selected.

    We can also use this statement to sample tables that are not bucketed:

    hive> select * from t_name tablesample(bucket 3  out of 5 on rand()) s;
    OK
    s.id    s.name  s.level
    4       堂哥            2
    6       堂妹            3
    Time taken: 0.108 seconds, Fetched: 2 row(s)
    hive> select * from t_name tablesample(bucket 3  out of 5 on rand()) s;
    OK
    s.id    s.name  s.level
    1       爷爷            1
    2       大爷            1
    3       叔叔            1
    Time taken: 0.102 seconds, Fetched: 3 row(s)

    Block sampling

    This is another way to sample: by percentage, by row count, or by data size (b/k/m suffixes for bytes, kilobytes, megabytes);

    hive> select * from t_name tablesample(10 percent) s;
    OK
    s.id    s.name  s.level
    1       爷爷            1
    2       大爷            1
    Time taken: 0.096 seconds, Fetched: 2 row(s)

    hive> select * from t_name tablesample(5 rows) s;
    OK
    s.id    s.name  s.level
    1       爷爷            1
    2       大爷            1
    3       叔叔            1
    4       堂哥            2
    5       堂姐            2
    Time taken: 0.11 seconds, Fetched: 5 row(s)

    hive> select * from t_name tablesample(5m) s;
    OK
    s.id    s.name  s.level
    1       爷爷            1
    2       大爷            1
    3       叔叔            1
    4       堂哥            2
    5       堂姐            2
    6       堂妹            3
    1       爷爷            1
    2       大爷            1
    3       叔叔            1
    4       堂哥            2
    5       堂姐            2
    6       堂妹            3
    Time taken: 0.098 seconds, Fetched: 12 row(s)
    hive> select * from t_name tablesample(5k) s;
    OK
    s.id    s.name  s.level
    1       爷爷            1
    2       大爷            1
    3       叔叔            1
    4       堂哥            2
    5       堂姐            2
    6       堂妹            3
    1       爷爷            1
    2       大爷            1
    3       叔叔            1
    4       堂哥            2
    5       堂姐            2
    6       堂妹            3
    Time taken: 0.2 seconds, Fetched: 12 row(s)
    hive> select * from t_name tablesample(5b) s;
    OK
    s.id    s.name  s.level
    1       爷爷            1
    Time taken: 0.153 seconds, Fetched: 1 row(s)

    union all

    merges the results of two or more queries; the number of columns and their corresponding data types must match

    scala> hivecon.sql("select custname,sex from gamedw.cust1 union all select custname,nianling from gamedw.cust").show

    HiveQL views

    Views can reduce query complexity and can restrict data through condition-based filtering.

    A view can specify its column names: create view viewname(fieldname, ...) as ...; the (fieldname, ...) list can also be omitted.

    hive> select * from employees;
    OK
    employees.name  employees.salary        employees.subdinates    employees.deducation    employees.address
    tianyongtao     100.0   ["wang","ZHANG","LIU"]  {"aaa":10.0,"bb":5.0,"CC":8.0}  {"street":"HENAN","city":"LUOHE","state":"LINYING","zip":null}
    wangyangming    200.0   ["ma","zhong"]  {"aaa":6.0,"bb":12.0}   {"street":"hunan","city":"changsha","state":null,"zip":null}
    Time taken: 0.114 seconds, Fetched: 2 row(s)

    hive> create view view_emp(aa,bb,city,name) as  select deducation['aaa'],deducation['bb'],address.city,name from employees;
    OK
    _c0     _c1     city    name
    Time taken: 0.106 seconds
    hive> select * from view_emp;
    OK
    view_emp.aa     view_emp.bb     view_emp.city   view_emp.name
    10.0    5.0     LUOHE   tianyongtao
    6.0     12.0    changsha        wangyangming
    Time taken: 0.142 seconds, Fetched: 2 row(s)

    Indexes

    Hive's indexing is limited, but it can still speed up some operations. A table's index data is stored in another table; the explain command shows whether a statement uses an index.

    When there would be too many logical partitions, an index becomes an alternative to partitioning. Indexes consume extra storage space, and building them consumes compute resources.

    hive> create index ix_t_name_id on table t_name(id)  as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'   with deferred rebuild in table t_name_index;
    OK
    Time taken: 0.37 seconds

    ----------------------

    create index ix_t_name_id     -- the index name

    on table t_name(id)           -- t_name is the table being indexed

    as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'   -- the handler class that implements the index

    with deferred rebuild         -- leave the index empty until it is explicitly rebuilt

    in table t_name_index;        -- the table that will hold the index data

    Loading the index data (rebuilding the index)

    alter index ix_t_name_id on t_name rebuild;

    Viewing a table's indexes

    hive> show index on t_name;
    OK
    idx_name        tab_name        col_names       idx_tab_name    idx_type        comment
    ix_t_name_id            t_name                  id                      t_name_index            compact            
    Time taken: 0.036 seconds, Fetched: 1 row(s)

    Dropping an index

    hive> drop index if exists ix_t_name_id on t_name;
    OK
    Time taken: 2.467 seconds
    hive> show index on t_name;
    OK
    idx_name        tab_name        col_names       idx_tab_name    idx_type        comment
    Time taken: 0.03 seconds

    Schema design

    A table partitioned by day is one such pattern; Hive supports day-based partitioning.

    Partitioning can optimize queries, but it can also backfire: too many partitions with too many files per partition can exceed the NameNode's capacity, because the NameNode keeps the metadata for every file in the filesystem in memory.

    An ideal partitioning scheme therefore does not create too many partitions and directories, and the files in each directory should be large, ideally several multiples of the filesystem block size.

    A time-based partitioning strategy should choose a time granularity whose data accumulates to an appropriate size: the number of partitions then grows evenly over time, and the files in each partition are several multiples of the block size.

    This keeps partitions large enough to optimize the data throughput of typical queries. Also consider whether the chosen granularity will still be suitable in the future.

    Another approach is to use two levels of partitioning on different dimensions, e.g. time + region.

    Unique keys and normalization

    Relational databases typically use unique keys, indexes, and normalization to store data sets. Hive has no primary keys; joins should be minimized, typically by denormalizing the data, and the complex types array, map, and struct help store one-to-many data in a single row.

    Multiple passes over the same data

    Several aggregations can be produced from one data source without rescanning it for each one:

    hive> from table1
        > insert overwrite table table2 select * where action='xx1'
        > insert overwrite table table3 select * where action='xx2';

    Partitioning every table

    In an ETL process, if the intermediate tables are also partitioned, multiple partitions' data can be processed at once; the downside is having to manage the intermediate tables and delete old partitions. A sketch of the pattern follows.
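
    A hedged sketch (mid_table, source_table, and the partition values are hypothetical):

    -- load one day's data into a day-partitioned intermediate table
    insert overwrite table mid_table partition (dateid=20180223)
    select userid, log from source_table where dateid=20180223;

    -- drop an old partition once downstream steps no longer need it
    alter table mid_table drop if exists partition (dateid=20180101);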

    Bucketed table data storage

    Partitioning offers a way to segregate data and optimize queries; bucketing is a technique for decomposing data into more manageable parts.

    Suppose a table were partitioned by date and user ID: with many users this produces an enormous number of partitions, possibly exceeding the filesystem's capacity. If instead we partition by date and bucket by user ID, the column values are hashed into a fixed number of buckets, and records with the same user ID are usually stored in the same bucket. To populate the table correctly, we should first set a property that forces Hive to use the right number of reducers when initializing the target table's buckets.

    Create the table:

    hive> show create table tb_bucket;
    OK
    createtab_stmt
    CREATE TABLE `tb_bucket`(
      `userid` int,
      `log` string)
    PARTITIONED BY (
      `dateid` int)
    CLUSTERED BY (
      userid)
    INTO 5 BUCKETS
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES (
      'field.delim'=' ',
      'line.delim'=' ',
      'serialization.format'=' ')
    STORED AS INPUTFORMAT
      'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket'
    TBLPROPERTIES (
      'transient_lastDdlTime'='1524555904')
    Time taken: 0.02 seconds, Fetched: 22 row(s)

    Load the data

    hive> set hive.enforce.bucketing=true;

    load data local inpath '/root/tmpdata/userlog.txt' into table tb_bucket partition(dateid=20180223);

    drwx-wx-wx   - root supergroup          0 2018-04-24 15:53 /user/hive/warehouse/gamedw.db/tb_bucket
    drwx-wx-wx   - root supergroup          0 2018-04-24 15:54 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223
    -rwx-wx-wx   1 root supergroup     162285 2018-04-24 15:54 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/userlog.txt

    Sampling check

    hive> select * from gamedw.tb_bucket tablesample(bucket 5  out of 5 on userid) where userid=1285962;
    OK
    tb_bucket.userid        tb_bucket.log   tb_bucket.dateid
    Time taken: 0.135 seconds
    hive> select * from gamedw.tb_bucket tablesample(bucket 4  out of 5 on userid) where userid=1285962;
    OK
    tb_bucket.userid        tb_bucket.log   tb_bucket.dateid
    Time taken: 0.13 seconds
    hive> select * from gamedw.tb_bucket tablesample(bucket 3  out of 5 on userid) where userid=1285962;
    OK
    tb_bucket.userid        tb_bucket.log   tb_bucket.dateid
    1285962 219     20180223
    1285962 0       20180223
    1285962 29      20180223
    1285962 16      20180223
    1285962 13      20180223
    1285962 0       20180223
    1285962 8       20180223
    1285962 11      20180223
    1285962 8       20180223
    1285962 5       20180223
    1285962 6       20180223
    1285962 7       20180223
    1285962 1       20180223
    1285962 6       20180223
    1285962 14      20180223
    1285962 26      20180223
    1285962 32      20180223
    1285962 3       20180223
    1285962 1       20180223
    1285962 4       20180223
    1285962 165     20180223
    Time taken: 0.123 seconds, Fetched: 21 row(s)
    hive> select * from gamedw.tb_bucket tablesample(bucket 2  out of 5 on userid) where userid=1285962;
    OK
    tb_bucket.userid        tb_bucket.log   tb_bucket.dateid
    Time taken: 0.107 seconds
    hive> select * from gamedw.tb_bucket tablesample(bucket 1  out of 5 on userid) where userid=1285962;
    OK
    tb_bucket.userid        tb_bucket.log   tb_bucket.dateid
    Time taken: 0.139 seconds

    An advantage of bucketing: the number of buckets is fixed and does not fluctuate with the data, which makes sampling reliable.

    Adding a column to a table

    alter table tbname add columns(colname type);
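
    A concrete instance (the column is hypothetical):

    hive> alter table gamedw.cust add columns(city string);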

    Almost always use compression

    Tuning

    Use explain

    explain extended outputs even more information
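
    For example (outputs omitted):

    hive> explain select count(*) from gamedw.cust;
    hive> explain extended select count(*) from gamedw.cust;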

    Join optimization: put the largest table last, or mark it with /*+streamtable(tbname)*/, etc.

    Local mode: when the input data is very small, Hive can run all tasks on a single local machine, which noticeably shortens execution time:

     set hive.exec.mode.local.auto=true;

    Parallel execution

    hive> set hive.exec.parallel=true;

    Strict mode
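
    Strict mode is controlled by hive.mapred.mode; in strict mode, for example, a query on a partitioned table must filter on a partition column, order by requires a limit, and Cartesian-product joins are rejected:

    hive> set hive.mapred.mode=strict;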

    Tuning the number of mappers and reducers
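
    Commonly adjusted properties (the values here are only illustrative):

    hive> set hive.exec.reducers.bytes.per.reducer=256000000;  -- input bytes per reducer
    hive> set hive.exec.reducers.max=20;                       -- upper bound on reducers
    hive> set mapred.reduce.tasks=10;                          -- fix the reducer count for a job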

    Indexes

    Dynamic partition tuning
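
    The relevant properties (values illustrative):

    hive> set hive.exec.dynamic.partition=true;
    hive> set hive.exec.dynamic.partition.mode=nonstrict;
    hive> set hive.exec.max.dynamic.partitions=1000;          -- total dynamic partitions allowed
    hive> set hive.exec.max.dynamic.partitions.pernode=100;   -- limit per mapper/reducer node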

    Archiving partitions

    Hive supports a storage format called HAR (Hadoop Archive). Like a tar file in HDFS, a HAR is a single file that can contain many files and directories. For a partition whose aging files are rarely accessed, a huge file count forces the NameNode to pay a very high cost to manage those files. Archiving a partition's files into one large file that Hive can still access relieves pressure on the NameNode; the drawbacks are that HAR queries are not very efficient, and HAR is not compressed, so it saves no space.

    First copy Hadoop's archive jar into the $HIVE_HOME/auxlib directory:

    [root@host lib]# mkdir $HIVE_HOME/auxlib
    [root@host lib]# cp hadoop-archives-2.7.4.jar $HIVE_HOME/auxlib
    [root@host lib]# pwd
    /root/hadoop/hadoop-2.7.4/share/hadoop/tools/lib

    View the partition's files before archiving:

    drwx-wx-wx   - root supergroup          0 2018-04-25 11:10 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223
    -rwx-wx-wx   1 root supergroup        190 2018-04-25 11:04 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/log.txt
    -rwx-wx-wx   1 root supergroup     162285 2018-04-24 15:54 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/userlog.txt

    Restart Hive

    hive> set hive.archive.enabled=true;
    hive> alter table tb_bucket archive partition (dateid=20180223);
    intermediate.archived is hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223_INTERMEDIATE_ARCHIVED
    intermediate.original is hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223_INTERMEDIATE_ORIGINAL
    Creating data.har for hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223
    in hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/.hive-staging_hive_2018-04-25_11-18-41_670_3770144429856467164-1/-ext-10000/partlevel
    Please wait... (this may take a while)
    Moving hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/.hive-staging_hive_2018-04-25_11-18-41_670_3770144429856467164-1/-ext-10000/partlevel to hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223_INTERMEDIATE_ARCHIVED
    Moving hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223 to hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223_INTERMEDIATE_ORIGINAL
    Moving hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223_INTERMEDIATE_ARCHIVED to hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223
    OK
    Time taken: 3.562 seconds

    View the archived files

    drwxr-xr-x   - root supergroup          0 2018-04-25 11:18 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223
    drwxr-xr-x   - root supergroup          0 2018-04-25 11:18 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/data.har
    -rw-r--r--   1 root supergroup          0 2018-04-25 11:18 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/data.har/_SUCCESS
    -rw-r--r--   5 root supergroup        380 2018-04-25 11:18 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/data.har/_index
    -rw-r--r--   5 root supergroup         22 2018-04-25 11:18 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/data.har/_masterindex
    -rw-r--r--   1 root supergroup     162475 2018-04-25 11:18 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/data.har/part-0

    The following command extracts the files from the HAR and puts them back in HDFS:

    hive> alter table tb_bucket unarchive partition (dateid=20180223);

    View the files (there is now an extra empty directory, .hive-staging_hive_2018-04-25_11-18-41_670_3770144429856467164-1):

    drwx-wx-wx   - root supergroup          0 2018-04-25 11:21 /user/hive/warehouse/gamedw.db/tb_bucket
    drwxr-xr-x   - root supergroup          0 2018-04-25 11:21 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223
    drwxr-xr-x   - root supergroup          0 2018-04-25 11:21 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/.hive-staging_hive_2018-04-25_11-18-41_670_3770144429856467164-1
    -rw-r--r--   1 root supergroup        190 2018-04-25 11:21 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/log.txt
    -rw-r--r--   1 root supergroup     162285 2018-04-25 11:21 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/userlog.txt

    Functions

    User-defined functions (UDFs) are a powerful way for users to extend HQL. In ETL, a process often consists of multiple steps, and the Hive language offers several ways to pipe one step's input to the next, producing many outputs within a single query.

    Discovering and describing functions

    hive> show functions;    -- lists the names of all functions loaded in the current Hive session, both built-in and user-loaded

    desc function <function name>: shows a brief description of the function:

    hive> desc function trim;
    OK
    tab_name
    trim(str) - Removes the leading and trailing space characters from str
    Time taken: 0.01 seconds, Fetched: 1 row(s)

    hive> desc function extended trim;
    OK
    tab_name
    trim(str) - Removes the leading and trailing space characters from str
    Example:
      > SELECT trim('   facebook  ') FROM src LIMIT 1;
      'facebook'
    Time taken: 0.005 seconds, Fetched: 4 row(s)


    scala> hivecon.sql("select max(userid) from gamedw.tb_bucket ").show
    18/04/25 15:34:01 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
    18/04/25 15:34:01 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
    18/04/25 15:34:01 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
    +-----------+
    |max(userid)|
    +-----------+
    |    2265738|
    +-----------+


    scala> hivecon.sql("select userid,count(1) from gamedw.tb_bucket group by userid").show
    18/04/25 15:35:22 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
    18/04/25 15:35:22 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
    18/04/25 15:35:22 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
    +-------+--------+
    | userid|count(1)|
    +-------+--------+
    |2230227|       5|
    |2253575|       4|
    |2265183|       1|
    |2265258|       3|
    |2265307|       1|
    |2265351|       3|
    |2140713|       2|
    |2192466|       2|
    |2264948|       1|
    |2218390|       5|
    |2232595|       5|
    |2265050|       1|
    |2230117|      20|
    |1330951|       8|
    |2265318|       2|
    |2223714|       1|
    |2265630|       1|
    |2234815|      11|
    |1914825|       7|
    |2261768|       5|
    +-------+--------+
    only showing top 20 rows

    Table-generating functions

    hive> select explode(subdinates) from employees;
    OK
    col
    wang
    ZHANG
    LIU
    ma
    zhong
    Time taken: 0.112 seconds, Fetched: 5 row(s)

    hive> select name, explode(subdinates) from employees;    -- other columns cannot be produced alongside the UDTF
    FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions

    Hive does provide lateral view for this; a lateral view requires a view alias and names for the generated columns:

    hive> select name, sub from employees lateral view explode(subdinates) subview as sub;
    OK
    name    sub
    tianyongtao     wang
    tianyongtao     ZHANG
    tianyongtao     LIU
    wangyangming    ma
    wangyangming    zhong
    Time taken: 0.1 seconds, Fetched: 5 row(s)

    streaming

    Streaming offers another way to process data. In a streaming job, the Hadoop streaming API opens an I/O channel to an external process; data is fed to that process, which reads from standard input and writes its results to standard output, and those results are returned to the streaming API job.

    Streaming is usually less efficient than an equivalent UDF or a customized InputFormat: the serialization and deserialization through the pipe is typically inefficient, and the whole program is harder to debug in the usual ways.

    Hive provides several syntaxes for using streaming, including map(), reduce(), and transform().

    Identity transformation

    The most basic streaming job is the identity operation: the shell command /bin/cat echoes whatever data is passed to it, so it performs the identity transformation.


    scala> hivecon.sql("select transform(custname,nianling) using '/bin/cat' as name,age from gamedw.cust where nianling>50").show
    +---------+----+
    |     name| age|
    +---------+----+
    |   wangwu|  85|
    |   liuqin|  56|
    |    hello| 100|
    |mahuateng|1001|
    +---------+----+

    Changing types

    The output can be cast to other data types:

    scala> hivecon.sql("select transform(custname,nianling) using '/bin/cat' as (name STRING,age  DOUBLE) from gamedw.cust where nianling>50").show
    +---------+------+
    |     name|   age|
    +---------+------+
    |   wangwu|  85.0|
    |   liuqin|  56.0|
    |    hello| 100.0|
    |mahuateng|1001.0|
    +---------+------+

    Projection transformation

    Use the cut command to extract or project specific fields:

    scala> hivecon.sql("select transform(custname,sex,nianling) using '/bin/cut -f1' as name,sex,age  from gamedw.cust where nianling>50").show
    +---------+----+----+
    |     name| sex| age|
    +---------+----+----+
    |   wangwu|null|null|
    |   liuqin|null|null|
    |    hello|null|null|
    |mahuateng|null|null|
    +---------+----+----+

    Manipulative transformations
    scala> hivecon.sql("select transform(custname,sex,nianling) using '/bin/cat' as name,sex,age  from gamedw.cust where nianling>50").show
    +---------+---+----+
    |     name|sex| age|
    +---------+---+----+
    |   wangwu|  1|  85|
    |   liuqin|  0|  56|
    |    hello|  0| 100|
    |mahuateng|  1|1001|
    +---------+---+----+

    scala> hivecon.sql("select transform(custname,sex,nianling) using '/bin/sed s/0/5/' as name,sex,age  from gamedw.cust where nianling>50").show
    +---------+---+----+
    |     name|sex| age|
    +---------+---+----+
    |   wangwu|  1|  85|
    |   liuqin|  5|  56|
    |    hello|  5| 100|
    |mahuateng|  1|1501|
    +---------+---+----+

    The default format for new tables is textfile, which is easy to share with other tools and easy to view and edit; compared with binary files, text takes more space.

    Creating a table stored as sequencefile, a binary key-value file format:

    hive> create table cust00(custname string,sex int,nianling int)stored as sequencefile;
    OK
    Time taken: 0.139 seconds

    csv: comma-separated values; tsv: tab-separated values

    xml udf

    XML is unstructured data, which makes Hive a powerful database platform for XML processing. One of the many reasons is that the complexity and resource cost of parsing and processing large XML documents make Hadoop well suited as an XML database platform: Hadoop can process XML documents in parallel, so Hive becomes an excellent tool for XML-related data. HiveQL natively supports accessing XML's nested elements and values, and further allows joins on nested fields, values, and attributes. XPath, the XML Path Language, can serve as the XML query language; with XPath, Hive can extract data from XML into the Hive system for further processing. XPath models an XML document as a tree of nodes.
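
    Hive ships with XPath UDFs; a small sketch (expected results shown as comments):

    hive> select xpath('<a><b>b1</b><b>b2</b></a>', 'a/b/text()');   -- ["b1","b2"]
    hive> select xpath_string('<a><b>bb</b></a>', 'a/b');            -- bb
    hive> select xpath_int('<a><b>3</b></a>', 'a/b');                -- 3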

  • Original article: https://www.cnblogs.com/playforever/p/8819814.html