zoukankan      html  css  js  c++  java
  • Hive 特殊分隔符处理

    HIVE特殊分隔符处理

    Hive对文件中的分隔符默认情况下只支持单字节分隔符,,默认单字符是01。当然你也可以在创建表格时指定数据的分割符号。但是如果数据文件中的分隔符是多字符的,如下图:

    01||zhangsan

    02||lisi

    03||wangwu

     

    补充:hive读取数据的机制

    1、首先用inputformat的一个具体的实现类读取文件数据,返回一条条的记录(可以是行,或者是你逻辑中的“行”)

    2、然后利用SerDe<默认:org.apache.hadoop.hive.serde2.LazySimpleSerDe>的一个具体的实现类,对上面返回的一条条记录进行字段切割

     

    使用RegexSerDe通过正则表达式来抽取字段

    1、建表

    create table t_bi_reg(id string,name string)

    row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'

    with serdeproperties(

    'input.regex'='(.*)\|\|(.*)',

    'output.format.string'='%1$s%2$s'

    )

    stored as textfile;

     

    2、加载数据

    01||zhangsan

    02||lisi

    03||wangwu

     

    load data local inpath '/root/lianggang.txt' into table t_bi_reg;

    3、查询

    hive> select * from t_bi_reg;

    OK

    01      zhangsan

    02      lisi

    03      wangwu

     

     

    通过自定义inputformat解决特殊分隔符问题

    其原理是在inputformat读取行的时候将数据中的“多字节分隔符”替换为hive默认的分隔符(ctrl+A 亦即 01)或用于替代的单字符分隔符。以便hive在serde操作的时候按照默认的单字节分隔符进行字段抽取

    package cn.gec.bigdata.hive.inputformat;

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;

    import org.apache.hadoop.io.Text;

    import org.apache.hadoop.mapred.FileSplit;

    import org.apache.hadoop.mapred.InputSplit;

    import org.apache.hadoop.mapred.JobConf;

    import org.apache.hadoop.mapred.LineRecordReader;

    import org.apache.hadoop.mapred.RecordReader;

    import org.apache.hadoop.mapred.Reporter;

    import org.apache.hadoop.mapred.TextInputFormat;

     

    public class BiDelimiterInputFormat extends TextInputFormat {

             @Override

             public RecordReader<LongWritable, Text> getRecordReader(

     

             InputSplit genericSplit, JobConf job, Reporter reporter)

             throws IOException {

                       reporter.setStatus(genericSplit.toString());

                       MyDemoRecordReader reader = new MyDemoRecordReader(new LineRecordReader(job, (FileSplit) genericSplit));

                       // BiRecordReader reader = new BiRecordReader(job, (FileSplit)genericSplit);

                       return reader;

             }

     

             public static class MyDemoRecordReader implements RecordReader<LongWritable, Text> {

                       LineRecordReader reader;

                       Text text;

                       public MyDemoRecordReader(LineRecordReader reader) {

                                this.reader = reader;

                                text = reader.createValue();

                       }

                       @Override

                       public void close() throws IOException {

                                reader.close();

                       }

                       @Override

                       public LongWritable createKey() {

                                return reader.createKey();

                       }

                       @Override

                       public Text createValue() {

                                return new Text();

                       }

     

                       @Override

                       public long getPos() throws IOException {

                                return reader.getPos();

                       }

     

                       @Override

                       public float getProgress() throws IOException {

                                return reader.getProgress();

                       }

     

                       @Override

                       public boolean next(LongWritable key, Text value) throws IOException {  

                                boolean next = reader.next(key, text);

                                if(next){

                                         String replaceText = text.toString().replaceAll("\|\|", "\|");

                                         value.set(replaceText);

                                }

                                return next;         

                       }

             }

     

    }

     

    1.打包成jar,放到$HIVE_HOME/lib下

    2.建表指明自定义的inputformat

    create table t_lianggang(id string,name string)

    row format delimited

    fields terminated by '|'

    stored as inputformat 'cn.gec.bigdata.hive.inputformat.BiDelimiterInputFormat'

    outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

     

    3.加载数据

    01||zhangsan

    02||lisi

    03||wangwu

     

    load data local inpath '/root/lianggang.txt' into table t_lianggang;

     

    4.查询

    hive> select * from t_lianggang;

    OK

    01      zhangsan

    02      lisi

    03      wangwu

     

     

     

  • 相关阅读:
    android 14 进度条和拖动条
    android 13 5种click事件不同实现方式 比较
    android 12 click事件的不同实现方式
    android 11 模拟onclick 事件
    android 10 事件
    android 09
    android 08 AndroidManifest.xml
    android 07 22 23没看
    Linux常用命令last的使用方法详解
    Linux TOP命令 按内存占用排序和按CPU占用排序
  • 原文地址:https://www.cnblogs.com/Transkai/p/10628508.html
Copyright © 2011-2022 走看看