  • Overview of SerDe in Hive

    I. Background

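    1. When processes communicate remotely, they can send each other data of any type, but whatever the type, it travels over the network as a binary sequence. The sender must convert an object into a byte sequence before it can be transmitted, which is called object serialization; the receiver must restore the byte sequence into an object, which is called object deserialization.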
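
    The round trip described above can be sketched with the JDK's built-in java.io serialization. This is only a conceptual illustration: Hadoop and Hive use their own Writable mechanism rather than java.io serialization, and the Visit class and its fields are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationRoundTrip {

    // Hypothetical record type, used only for this illustration.
    static class Visit implements Serializable {
        private static final long serialVersionUID = 1L;
        final long time;
        final int userId;

        Visit(long time, int userId) {
            this.time = time;
            this.userId = userId;
        }
    }

    // Serialization: object -> byte sequence (what the sender does).
    static byte[] serialize(Visit visit) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(visit);
        }
        return bytes.toByteArray();
    }

    // Deserialization: byte sequence -> object (what the receiver does).
    static Visit deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Visit) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] wire = serialize(new Visit(1234567891012L, 123456));
        Visit copy = deserialize(wire);
        System.out.println(copy.time + " " + copy.userId);
    }
}
```

    The byte array stands in for what would travel over the network; the receiver rebuilds an equal object from it.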

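    2. Hive deserialization turns a key/value record into the values of the individual columns of a Hive table.

    3. Hive can load data into a table without converting it first, which saves a great deal of time when processing massive data sets.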

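    II. Technical Details

    1. SerDe is short for Serializer/Deserializer; its purpose is to serialize and deserialize data.

    2. When creating a table, users can specify either a custom SerDe or one of the SerDes bundled with Hive. The SerDe can define the table's columns and the data type of each column.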

    CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
      [(col_name data_type [COMMENT col_comment], ...)]
      [COMMENT table_comment]
      [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
      [CLUSTERED BY (col_name, col_name, ...)
        [SORTED BY (col_name [ASC|DESC], ...)]
        INTO num_buckets BUCKETS]
      [ROW FORMAT row_format]
      [STORED AS file_format]
      [LOCATION hdfs_path]

    To create a table with a specific SerDe, use the ROW FORMAT row_format clause. For example:

    a. Add the jar. In the Hive client, enter: hive> add jar /run/serde_test.jar;

    Or, from the Linux shell, run: ${HIVE_HOME}/bin/hive --auxpath /run/serde_test.jar

    b. Create the table: create table serde_table row format serde 'hive.connect.TestDeserializer';

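    3. Write the deserializer class TestDeserializer. It implements the three methods of the Deserializer interface:

    a) Initialization: initialize(Configuration conf, Properties tbl).

    b) Deserialization, which takes a Writable and returns an Object: deserialize(Writable blob).

    c) Getting the inspector for the Object returned by deserialize(Writable blob): getObjectInspector().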

    public interface Deserializer {

        /**
         * Initialize the HiveDeserializer.
         * @param conf System properties
         * @param tbl table properties
         * @throws SerDeException
         */
        public void initialize(Configuration conf, Properties tbl) throws SerDeException;

        /**
         * Deserialize an object out of a Writable blob.
         * In most cases, the return value of this function will be constant since the function
         * will reuse the returned object.
         * If the client wants to keep a copy of the object, the client needs to clone the
         * returned value by calling ObjectInspectorUtils.getStandardObject().
         * @param blob The Writable object containing a serialized object
         * @return A Java object representing the contents in the blob.
         */
        public Object deserialize(Writable blob) throws SerDeException;

        /**
         * Get the object inspector that can be used to navigate through the internal
         * structure of the Object returned from deserialize(...).
         */
        public ObjectInspector getObjectInspector() throws SerDeException;

    }
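
    The Javadoc for deserialize(...) warns that the returned object may be reused between calls. The pitfall can be sketched in plain Java without any Hive dependency. ReusingDeserializer below is a hypothetical stand-in, not a Hive class, and the plain list copy merely plays the role that ObjectInspectorUtils.getStandardObject() plays in the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReusedRowDemo {

    // Hypothetical stand-in for a deserializer that reuses a single row object.
    static class ReusingDeserializer {
        private final List<Object> row = new ArrayList<>();

        List<Object> deserialize(String line) {
            row.clear();
            row.addAll(Arrays.asList(line.split("\t")));
            return row; // the same instance is returned on every call
        }
    }

    public static void main(String[] args) {
        ReusingDeserializer deserializer = new ReusingDeserializer();

        List<Object> first = deserializer.deserialize("a\t1");
        // Copy the row before the next call if its contents must be kept.
        List<Object> firstCopy = new ArrayList<>(first);

        List<Object> second = deserializer.deserialize("b\t2");

        System.out.println(first == second); // both names refer to one object
        System.out.println(first);           // now holds the second row
        System.out.println(firstCopy);       // still holds the first row
    }
}
```

    A caller that buffers rows without copying them would end up with many references to the last row read, which is why the interface documents the need to clone.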

    The following deserializer class splits one line of data into the four Hive table columns time, userid, host, and path:

    package hive.connect;

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.serde2.Deserializer;
    import org.apache.hadoop.hive.serde2.SerDeException;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class TestDeserializer implements Deserializer {

        // Column names and their object inspectors, in table column order.
        private static List<String> FieldNames = new ArrayList<String>();
        private static List<ObjectInspector> FieldNamesObjectInspectors = new ArrayList<ObjectInspector>();

        static {
            FieldNames.add("time");
            FieldNamesObjectInspectors.add(ObjectInspectorFactory
                    .getReflectionObjectInspector(Long.class,
                            ObjectInspectorOptions.JAVA));
            FieldNames.add("userid");
            FieldNamesObjectInspectors.add(ObjectInspectorFactory
                    .getReflectionObjectInspector(Integer.class,
                            ObjectInspectorOptions.JAVA));
            FieldNames.add("host");
            FieldNamesObjectInspectors.add(ObjectInspectorFactory
                    .getReflectionObjectInspector(String.class,
                            ObjectInspectorOptions.JAVA));
            FieldNames.add("path");
            FieldNamesObjectInspectors.add(ObjectInspectorFactory
                    .getReflectionObjectInspector(String.class,
                            ObjectInspectorOptions.JAVA));
        }

        @Override
        public Object deserialize(Writable blob) {
            try {
                if (blob instanceof Text) {
                    String line = ((Text) blob).toString();
                    if (line == null)
                        return null;
                    // Input fields are tab-separated: time, userid, url.
                    String[] field = line.split("\t");
                    if (field.length != 3) {
                        return null;
                    }
                    List<Object> result = new ArrayList<Object>();
                    URL url = new URL(field[2]);
                    Long time = Long.valueOf(field[0]);
                    Integer userid = Integer.valueOf(field[1]);
                    // The URL is split into the host and path columns.
                    result.add(time);
                    result.add(userid);
                    result.add(url.getHost());
                    result.add(url.getPath());
                    return result;
                }
            } catch (MalformedURLException | NumberFormatException e) {
                // A malformed URL or a non-numeric time/userid yields a null row.
                e.printStackTrace();
            }
            return null;
        }

        @Override
        public ObjectInspector getObjectInspector() throws SerDeException {
            // Describes the row as a struct of the four columns declared above.
            return ObjectInspectorFactory.getStandardStructObjectInspector(
                    FieldNames, FieldNamesObjectInspectors);
        }

        @Override
        public void initialize(Configuration arg0, Properties arg1)
                throws SerDeException {
            // No per-table setup is needed; the columns are fixed in the static block.
        }

    }
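
    The parsing step inside deserialize(...) can be exercised on its own, without Hive or Hadoop on the classpath. The sketch below mirrors that logic but takes a plain String instead of a Text; the class name ParseLineCheck and the method parse are hypothetical names introduced for this illustration, and the sketch assumes the fields are separated by tabs, as the split("\t") call requires.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class ParseLineCheck {

    // Mirrors the parsing in TestDeserializer.deserialize, on a plain String.
    // Returns [time, userid, host, path], or null if the line cannot be parsed.
    static List<Object> parse(String line) {
        String[] field = line.split("\t");
        if (field.length != 3) {
            return null;
        }
        try {
            URL url = new URL(field[2]);
            List<Object> result = new ArrayList<>();
            result.add(Long.valueOf(field[0]));
            result.add(Integer.valueOf(field[1]));
            result.add(url.getHost());
            result.add(url.getPath());
            return result;
        } catch (MalformedURLException | NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String line = "1234567891012\t123456\t"
                + "http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF";
        System.out.println(parse(line));
        // prints [1234567891012, 123456, wiki.apache.org, /hadoop/Hive/LanguageManual/UDF]
    }
}
```

    A line with the wrong number of fields, a non-numeric time or userid, or a string that is not a URL produces null rather than a partial row.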

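    Test with Hive table data on HDFS. The following is one test record (the code expects the three fields to be separated by tabs):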

    1234567891012 123456 http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF

    hive> add jar /run/jar/merg_hua.jar;
    Added /run/jar/merg_hua.jar to class path

    hive> create table serde_table row format serde 'hive.connect.TestDeserializer';
    Found class for hive.connect.TestDeserializer
    OK
    Time taken: 0.028 seconds

    hive> describe serde_table;
    OK
    time bigint from deserializer
    userid int from deserializer
    host string from deserializer
    path string from deserializer
    Time taken: 0.042 seconds

    hive> select * from serde_table;
    OK
    1234567891012 123456 wiki.apache.org /hadoop/Hive/LanguageManual/UDF
    Time taken: 0.039 seconds

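    III. Summary

    1. To create a Hive table that uses a custom SerDe, write a class that implements Deserializer and specify it with the ROW FORMAT clause of the CREATE TABLE command.

    2. When processing massive data sets, if the data format matches the table structure, Hive's deserialization can read the data directly with no conversion step, which saves a great deal of time.

    Reposted from: http://dajuezhao.javaeye.com/blog/795190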


  • Original article: https://www.cnblogs.com/java20130722/p/3206981.html