zoukankan      html  css  js  c++  java
  • hive常用函数和建表

    1、分区表:
        在hdfs中显示为文件夹
    
        优化手段之一:避免全表扫描:select * from xxx where province='beijing'
    
    
    元数据:关系型数据库
    数据库:文件夹
    表:文件夹
    分区:文件夹
    
    
    添加分区:
        alter table xxx add partition(province='beijing',city='beijing')
    
    动态插入分区:将未分区表插入到分区表
        //设置动态分区非严格模式
        set hive.exec.dynamic.partition.mode=nonstrict
        insert into xxx partition(province,city) select * from xxx;
    
    
    join:设置Map端join或reduce端join
        join暗示(hint):写在sql语句中,声明小表,需要mapjoin
            /*+MAP JOIN(user0) */        //声明将哪个表加入到内存
    
            select /*+MAPJOIN(user0) */ a.id,a.name,b.orderno,b.price from user0 a left outer join orders b on a.id=b.uid;
            
    
        自动map端join:
            默认开启状态
              <property>
                <name>hive.auto.convert.join</name>
                <value>true</value>
                <description>优化手段之一:自动将reduce端join变为map端join,基于文件大小</description>
              </property>
        
            关闭自动map端join:
               set hive.auto.convert.join=false
    
    
    bucket:桶表
    =====================================
    
        创建桶表,相当于在MR中指定分区数(2)+Hash分区
    
        数据结构是文件
    
        如果分区过多会导致相当数量的文件夹产生
    
        所以可以指定表以文件分桶形式进行存储
    
        桶数是在建表时产生,当以id进行分桶时,会将id以hash形式分配到不同桶中
    
    创建桶表:
        create table user1(id int, name string, age int, province string, city string)
        clustered by(province) into 2 buckets
        row format delimited fields terminated by ' ';
    
    动态插入数据:
        insert into user1 select * from user0;
    
    
    创建分区分桶表:
        分区:province+city
        分桶:id
    
        create table user2(id int, name string, age int ) partitioned by (province string , city string)
        clustered by(id) into 2 buckets
        row format delimited fields terminated by ' ';
    
    动态插入数据:
        set  hive.exec.dynamic.partition.mode=nonstrict ;
        insert into user2 partition(province,city) select * from user0;
    
    在分区表中插入多条数据
        INSERT into TABLE user2 PARTITION (province='henan', city='kaifeng') select id,name,age FROM user0;
    
    在分区表中插入单条数据
        INSERT into TABLE user2 PARTITION (province='henan', city='kaifeng') select 0,'jerry',20;
    
    对于    000000_0_copy_1文件,每次进行插入操作数字都会累加
    
    
    
    常见函数:
    ==================================
        select size(arr) from xxx;            //集合(map array)的长度
        select array_contains(arr,'str') from xxx;    //判断集合中是否含有指定元素,返回boolean
        select arr[0] from xxx;
        select array(1,2,3) as arr;            //通过select方式创建array
        select map(1,'tom',2,'tomas',3,'tomson') as map;//通过select方式创建map
        select struct('tom,'tomas','tomson') as str;    //通过select方式创建struct    //默认key col1,col2, col3
        select named_struct('id',1,'name','tomas') as str;    //自定义key
        select current_date();
        select current_timestamp();
    
        select date_format('2018-11-22','yyyy/MM/dd');    //转换格式
        select cast(date_format(current_date,'yyyy') as int);
        select cast(date_format(current_timestamp,'yyyy') as int);
        select from_unixtime(123456789,'yyyyMMdd');    //将时间戳转换成时间格式,最多到秒
    
    
    条件语句:if case
    =====================================
        if相当于三元运算符:
            select id, if( id<5 , 'true', 'false') from user0;    //如果id<5, 返回true,否则false
    
        case when then else end:
             select age, case when age<30 then 'young' 
             when age<=45 and age>30 then 'middle' 
             else 'old' end 
             from user0;
            
    Hive的文件格式(FileFormat):
    =====================================
        
    
    设置文件格式:
        1、create table t1(id int) stored as textfile;
        2、alter table t1 set fileformat textfile;
        3、set hive.default.fileformat=textfile
    
    其他文件格式的数据加载:
        1、textfile可以通过直接load的方式进行加载
        2、其他格式要先在textfile格式表中通过insert into xx select 方式加载数据
    
        TextFile:
            问题:text文件可以压缩,但是需要选择可切割的压缩方式
            不然会变成一个map在单个节点进行运行
    
        SequenceFile:序列文件,可以通过如下配置指定压缩类型
            set hive.exec.compress.output=true
            set io.seqfile.compression.type=block
    
        
        然而,Text和SeqFile都是行级存储格式,对于hive的投影查询没有有效的优化措施。(select id,name from xx where id > 10)
    
    
    列式存储:
    -----------------------------------
        RCFile:
            和SeqFile类似,扁平化文件包括二进制的K-v,将文件水平切割成行组,一个或多个组存储在hdfs文件中
            在行组中,RCFile以列为单位存储所有的行组
            文件可切割,而且可以跳过列来读取数据
    
        ORCFile
            优化(optimize)后的RC文件,
            SeqFile块压缩是以1M为单位进行压缩
            RCFile         4M
            ORCFile         256M
    
            和RCFile的区别:RCFile通过存储元数据的方式得到数据类型
                        ORCFile通过指定的编码器了解数据类型,以便优化压缩
            
            ORCFile还存储数据的统计:Min、Max、Sum、Count
    
    
        parquet
            和ORC文件类型设计有类似之处,但是,parquet文件支持的Hadoop生态系统组件比ORCFile要多
            比ORCFile更好的支持嵌套结构的数据
    
    
    比较几种文件的性能:
        
    text:    create table text(id string, name string, pass string, email string, nick string) row format delimited
        fields terminated by '	'
        stored as textfile;
    
    seqFile:create table seq(id string, name string, pass string, email string, nick string) row format delimited
        fields terminated by '	'
        stored as sequencefile;
    
    RCFile:    create table rcfile(id string, name string, pass string, email string, nick string) row format delimited
        fields terminated by '	'
        stored as rcfile;
    
    ORC:    create table orcfile(id string, name string, pass string, email string, nick string) row format delimited
        fields terminated by '	'
        stored as orcfile;
    
    parquet:create table parquet(id string, name string, pass string, email string, nick string) row format delimited
        fields terminated by '	'
        stored as parquet;
        
    
    加载数据到text表:load data local inpath 'd_users.txt' into table text;
    
    将数据分别通过insert into table xxx select * from text命令,将文件分别导入到其他表 
    
    
    
    text:        444.48M        count(*) 8307711    88.225s        
    seqFile:    562.7M                    66.581s
    RCFile:        232M                    43.123s
    ORCFile:    244M                    41.309s
    Parquet:    538.7M                    OOM        //使用广泛
    
    
    SerDe:
    ======================================
        Serialize && DeSerialize
        hive通过此类技术将其字段映射到col中
    
        hive处理hdfs数据的基本过程:
            1、读取数据
            2、通过指定的InputFormat来确定recordReader
            3、通过Serde映射记录各个字段到表中的Col
    
        Serde和FileFormat的关系:
            SerDe是在文件格式下的对记录中的字段进行处理和映射的技术
    
        OpenCSVSerDe:    处理","分隔的csv文件类型
        LazySimpleSerde:先将所有的文件类型设置为string,然后将其设置为指定的格式
        JSONSerDe:    专门负责处理json串
        
        使用方法:
            0、将json-serde-1.3.8-jar-with-dependencies.jar发送到${HIVE_HOME}/lib下,并重启hive
    
            1、创建表:
            create table user_log(`_location` string, `_ip` string, `_action` string, `_uid` string, `_timestamp` string)
            row format serde 'org.openx.data.jsonserde.JsonSerDe'
            stored as textfile;
    
            注意:字段必须要和json中的key相对应
                  字段如果以特殊符号开头,要用``反引号括起来
    
    
            2、加载表:
                load data local inpath '1.json' into table user_log;
    
            3、select * from user_log;
    
    
    
    窗口函数:
    =============================================
        开窗函数:
            over(sort by age rows between 1 preceding and 1 following)    //行(row)级别的界定
            over(sort by age range between 10 preceding and 10 following)    //range对数值范围进行界定
        
        数值范围包括:
            n preceding    //之前 unbounded preceding(之前无限)
            current row    //当前行
            n following    //之后行 unbounded following(之后无限)
    
        select name, age , sum(age)over(sort by age range between 10 preceding and 10 following) from user1;    //先排序再计算
        select name, age , sum(age)over(rows between 1 preceding and 1 following) from user1;    //值与其上下两行进行相加
        select name, age , province ,sum(age)over(partition by province) from user_a;        //partition by [order by]
    
        
    
    8    tom1    60    beijing    beijing
    9    Tom2    15    henan    kaifeng
    10    tom3    20    hebei    xiongan
    11    tom4    25    henan    xinyang
    12    tom7    30    henan    kaifeng
    13    tom6    35    beijing    beijing
    14    tom30    40    henan    xinyang
    15    tom100    45    beijing    beijing
    16    jerry45    50    hebei    shijiazhuang
    17    jerry20    55    hebei    langfang
    18    jerry7    60    tianjin    tianjin
    
    
    分析函数
    =====================================================
        rank()        //并列跳跃    113
        //取得每个省份年龄最大的三个人
        select name, province, age , rank()over(partition by province order by age desc) from user_a;
    
        dense_rank()    //不跳跃    112
        select name, province, age , dense_rank()over(partition by province order by age desc) from user_a;
    
        row_number()    //顺序        123
        select name, province, age , row_number()over(partition by province order by age desc) from user_a;
    
        
        
        ntile(n)        //分组切片(三六九等)
        //将年龄分成三份
        select name, province, age , ntile(3)over(partition by province order by age desc) from user_a;
    
    
        first_value(age)    //取第一个值
        //取得所有省份的最大年龄和最小年龄
        select name, province, age , first_value(age)over(partition by province order by age desc) as big,
        first_value(age)over(partition by province order by age) as small from user_a;
    
    
    lead和lag:
        统计连续2月活跃的用户
    
    
    
    id    name    date
    1    tom    2018-01-01
    1    tom    2018-02-01
    1    tom    2018-02-03
    2    tomson    
  • 相关阅读:
    Oracle EXTRACT()函数与to_char() 函数
    Java内部类
    SQL 之 Group By
    Android LayoutInflater布局填充器
    JS 图片转Base64
    C# 事件与委托的区别
    AngularJS的循环输出
    jquery实现button倒计时
    重新理解B/S和C/S的区别
    HashMap与HashTable
  • 原文地址:https://www.cnblogs.com/zyde/p/9225228.html
Copyright © 2011-2022 走看看