zoukankan      html  css  js  c++  java
  • Hive 特殊分隔符处理

    HIVE特殊分隔符处理

    Hive对文件中的分隔符默认情况下只支持单字节分隔符,,默认单字符是01。当然你也可以在创建表格时指定数据的分割符号。但是如果数据文件中的分隔符是多字符的,如下图:

    01||zhangsan

    02||lisi

    03||wangwu

     

    补充:hive读取数据的机制

    1、首先用inputformat的一个具体的实现类读取文件数据,返回一条条的记录(可以是行,或者是你逻辑中的“行”)

    2、然后利用SerDe<默认:org.apache.hadoop.hive.serde2.LazySimpleSerDe>的一个具体的实现类,对上面返回的一条条记录进行字段切割

     

    使用RegexSerDe通过正则表达式来抽取字段

    1、建表

    create table t_bi_reg(id string,name string)

    row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'

    with serdeproperties(

    'input.regex'='(.*)\|\|(.*)',

    'output.format.string'='%1$s%2$s'

    )

    stored as textfile;

     

    2、加载数据

    01||zhangsan

    02||lisi

    03||wangwu

     

    load data local inpath '/root/lianggang.txt' into table t_bi_reg;

    3、查询

    hive> select * from t_bi_reg;

    OK

    01      zhangsan

    02      lisi

    03      wangwu

     

     

    通过自定义inputformat解决特殊分隔符问题

    其原理是在inputformat读取行的时候将数据中的“多字节分隔符”替换为hive默认的分隔符(ctrl+A 亦即 01)或用于替代的单字符分隔符。以便hive在serde操作的时候按照默认的单字节分隔符进行字段抽取

    package cn.gec.bigdata.hive.inputformat;

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;

    import org.apache.hadoop.io.Text;

    import org.apache.hadoop.mapred.FileSplit;

    import org.apache.hadoop.mapred.InputSplit;

    import org.apache.hadoop.mapred.JobConf;

    import org.apache.hadoop.mapred.LineRecordReader;

    import org.apache.hadoop.mapred.RecordReader;

    import org.apache.hadoop.mapred.Reporter;

    import org.apache.hadoop.mapred.TextInputFormat;

     

    public class BiDelimiterInputFormat extends TextInputFormat {

             @Override

             public RecordReader<LongWritable, Text> getRecordReader(

     

             InputSplit genericSplit, JobConf job, Reporter reporter)

             throws IOException {

                       reporter.setStatus(genericSplit.toString());

                       MyDemoRecordReader reader = new MyDemoRecordReader(new LineRecordReader(job, (FileSplit) genericSplit));

                       // BiRecordReader reader = new BiRecordReader(job, (FileSplit)genericSplit);

                       return reader;

             }

     

             public static class MyDemoRecordReader implements RecordReader<LongWritable, Text> {

                       LineRecordReader reader;

                       Text text;

                       public MyDemoRecordReader(LineRecordReader reader) {

                                this.reader = reader;

                                text = reader.createValue();

                       }

                       @Override

                       public void close() throws IOException {

                                reader.close();

                       }

                       @Override

                       public LongWritable createKey() {

                                return reader.createKey();

                       }

                       @Override

                       public Text createValue() {

                                return new Text();

                       }

     

                       @Override

                       public long getPos() throws IOException {

                                return reader.getPos();

                       }

     

                       @Override

                       public float getProgress() throws IOException {

                                return reader.getProgress();

                       }

     

                       @Override

                       public boolean next(LongWritable key, Text value) throws IOException {  

                                boolean next = reader.next(key, text);

                                if(next){

                                         String replaceText = text.toString().replaceAll("\|\|", "\|");

                                         value.set(replaceText);

                                }

                                return next;         

                       }

             }

     

    }

     

    1.打包成jar,放到$HIVE_HOME/lib下

    2.建表指明自定义的inputformat

    create table t_lianggang(id string,name string)

    row format delimited

    fields terminated by '|'

    stored as inputformat 'cn.gec.bigdata.hive.inputformat.BiDelimiterInputFormat'

    outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

     

    3.加载数据

    01||zhangsan

    02||lisi

    03||wangwu

     

    load data local inpath '/root/lianggang.txt' into table t_lianggang;

     

    4.查询

    hive> select * from t_lianggang;

    OK

    01      zhangsan

    02      lisi

    03      wangwu

     

     

     

  • 相关阅读:
    2.Android之按钮Button和编辑框EditText学习
    《DSP using MATLAB》Problem 3.8
    《DSP using MATLAB》Problem 3.7
    《DSP using MATLAB》Problem 3.6
    《DSP using MATLAB》Problem 3.5
    《DSP using MATLAB》Problem 3.4
    《DSP using MATLAB》Problem 3.3
    《DSP using MATLAB》Problem 3.2
    《DSP using MATLAB》Problem 3.1
    《DSP using MATLAB》Problem 2.20
  • 原文地址:https://www.cnblogs.com/Transkai/p/10628508.html
Copyright © 2011-2022 走看看