zoukankan      html  css  js  c++  java
  • Hive面试题(5)UDF,UDTF(二)UDTF

    1.udtf介绍及编写

    1.1.介绍

    HIVE中udtf可以将一行转成一行多列,也可以将一行转成多行多列,使用频率较高。本篇文章通过实际案例剖析udtf的编写及使用方法和原理。阅读本篇文章前请先阅读UDF编写

    测试数据

    drop table if exists test;
      create table test
      (
        ind int,
        col string,
        col1 string
      ) ;
      insert into test values (1,'a,b,c','1,2');
      insert into test values (2,'j,k',null);
      insert into test values (3,null,null) ;

    对第一行需要输出如下结果:

    IndKeyValue
    1 a 1
    1 b 2
    1 c Null

    其它行都要输出类似数据,如果输入数据为null,则没输出。

    1.2udtf编写

    编写UDTF(User-Defined Table-Generating Functions),需要继承GenericUDTF类,类中部分代码如下:

    /**
       * A Generic User-defined Table Generating Function (UDTF)
       *
       * Generates a variable number of output rows for a single input row. Useful for
       * explode(array)...
       */
      public abstract class GenericUDTF {
      ​
        public StructObjectInspector initialize(StructObjectInspector argOIs)
              throws UDFArgumentException {
            List<? extends StructField> inputFields = argOIs.getAllStructFieldRefs();
            ObjectInspector[] udtfInputOIs = new ObjectInspector[inputFields.size()];
            for (int i = 0; i < inputFields.size(); i++) {
              udtfInputOIs[i] = inputFields.get(i).getFieldObjectInspector();
            }
            return initialize(udtfInputOIs);
        }
        
        /**
           * Give a set of arguments for the UDTF to process.
           *
           * @param args
           *          object array of arguments
           */
        public abstract void process(Object[] args) throws HiveException;
      ​
        /**
           * Called to notify the UDTF that there are no more rows to process.
           * Clean up code or additional forward() calls can be made here.
           */
        public abstract void close() throws HiveException;
      }

    继承GenericUDTF需要实现以上方法,其中initialize方法和UDF中类似,主要是判断输入类型并确定返回的字段类型。process方法对udft函数输入的每一样进行操作,通过调用forward方法返回一行或多行数据。close方法在process调用结束后调用,用于进行其它一些额外操作,只执行一次。

    package com.practice.hive.udtf;
      ​
      import java.util.List;
      ​
      import com.google.common.collect.Lists;
      import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
      import org.apache.hadoop.hive.ql.metadata.HiveException;
      import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
      import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
      import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
      import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
      import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
      ​
      /**
       * @author liufeifei
       * @date 2018/06/20
       */
      public class ArrToMapUDTF extends GenericUDTF {
      ​
          private String[] obj = new String[2];
      ​
          /**
           * 返回类型为 String,string
           *
           * @param argOIs
           * @return
           * @throws UDFArgumentException
           */
          @Override
          public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
              List<String> colName = Lists.newLinkedList();
              colName.add("key");
              colName.add("value");
              List<ObjectInspector> resType = Lists.newLinkedList();
              resType.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
              resType.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
              // 返回分别为列名 和 列类型
              return ObjectInspectorFactory.getStandardStructObjectInspector(colName, resType);
          }
      ​
          @Override
          public void process(Object[] args) throws HiveException {
              if(args[0] == null) {
                  return;
              }
              String arg1 = args[0].toString();
      ​
              String[] arr1 = arg1.split(",");
              String[] arr2 = null;
              if(args[1] != null) {
                  arr2 = args[1].toString().split(",");
              }
      ​
              for(int i = 0; i < arr1.length ; i++ ) {
                  obj[0] = arr1[i];
                  if(arr2 != null && arr2.length > i) {
                      obj[1] = arr2[i];
                  } else {
                      obj[1] = null;
                  }
                  forward(obj);
              }
          }
      ​
          @Override
          public void close() throws HiveException {
          }
      }

    本文来自博客园,作者:秋华,转载请注明原文链接:https://www.cnblogs.com/qiu-hua/p/14179564.html

  • 相关阅读:
    Kotlin Coroutines不复杂, 我来帮你理一理
    Refresh design pattern
    Android App安装包瘦身计划
    Google IO 2019 Android 太长不看版
    Effective Java读书笔记完结啦
    探究高级的Kotlin Coroutines知识
    移动应用中的非功能性(跨职能)需求
    Android程序员的Flutter学习笔记
    如何正确使用Espresso来测试你的Android程序
    MVP模式, 开源库mosby的使用及代码分析
  • 原文地址:https://www.cnblogs.com/qiu-hua/p/14179564.html
Copyright © 2011-2022 走看看