1、UDF
package com.example.hive.udf; import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.io.Text; public final class Lower extends UDF { public Text evaluate(final Text s) { if (s == null) { return null; } return new Text(s.toString().toLowerCase()); } }
add jar my_jar.jar;
create temporary function my_lower as 'com.example.hive.udf.Lower';
主要描述了实现一个udf的过程,首先自然是实现一个UDF函数,然后编译为jar并加入到hive的classpath中,最后创建一个临时变量名字让hive中调用。
2、UDAF
package org.apache.hadoop.hive.contrib.udaf.example; import org.apache.hadoop.hive.ql.exec.UDAF; import org.apache.hadoop.hive.ql.exec.UDAFEvaluator; /** * This is a simple UDAF that calculates average. * * It should be very easy to follow and can be used as an example for writing * new UDAFs. * * Note that Hive internally uses a different mechanism (called GenericUDAF) to * implement built-in aggregation functions, which are harder to program but * more efficient. * */ public final class UDAFExampleAvg extends UDAF { /** * The internal state of an aggregation for average. * * Note that this is only needed if the internal state cannot be represented * by a primitive. * * The internal state can also contains fields with types like * ArrayList<String> and HashMap<String,Double> if needed. */ public static class UDAFAvgState { private long mCount; private double mSum; } /** * The actual class for doing the aggregation. Hive will automatically look * for all internal classes of the UDAF that implements UDAFEvaluator. */ public static class UDAFExampleAvgEvaluator implements UDAFEvaluator { UDAFAvgState state; public UDAFExampleAvgEvaluator() { super(); state = new UDAFAvgState(); init(); } /** * Reset the state of the aggregation. */ public void init() { state.mSum = 0; state.mCount = 0; } /** * Iterate through one row of original data. * * The number and type of arguments need to the same as we call this UDAF * from Hive command line. * * This function should always return true. */ public boolean iterate(Double o) { if (o != null) { state.mSum += o; state.mCount++; } return true; } /** * Terminate a partial aggregation and return the state. If the state is a * primitive, just return primitive Java classes like Integer or String. */ public UDAFAvgState terminatePartial() { // This is SQL standard - average of zero items should be null. return state.mCount == 0 ? null : state; } /** * Merge with a partial aggregation. * * This function should always have a single argument which has the same * type as the return value of terminatePartial(). */ public boolean merge(UDAFAvgState o) { if (o != null) { state.mSum += o.mSum; state.mCount += o.mCount; } return true; } /** * Terminates the aggregation and return the final result. */ public Double terminate() { // This is SQL standard - average of zero items should be null. return state.mCount == 0 ? null : Double.valueOf(state.mSum / state.mCount); } } private UDAFExampleAvg() { // prevent instantiation } }
关于UDAF开发注意点:
1.需要import org.apache.hadoop.hive.ql.exec.UDAF以及org.apache.hadoop.hive.ql.exec.UDAFEvaluator,这两个包都是必须的
2.函数类需要继承UDAF类,内部类Evaluator实现UDAFEvaluator接口
3.Evaluator需要实现 init、iterate、terminatePartial、merge、terminate这几个函数
1)init函数类似于构造函数,用于UDAF的初始化
2)iterate接收传入的参数,并进行内部的轮转。其返回类型为boolean
3)terminatePartial无参数,其为iterate函数轮转结束后,返回乱转数据,iterate和terminatePartial类似于hadoop的Combiner
4)merge接收terminatePartial的返回结果,进行数据merge操作,其返回类型为boolean
5)terminate返回最终的聚集函数结果