zoukankan      html  css  js  c++  java
  • Pig group用法举例

        group语句可以把具有相同键值的数据聚合在一起,与SQL中的group操作有着本质的区别,在SQL中group by字句创建的组必须直接注入一个或多个聚合函数。在Pig Latin中group和聚合函数之间没有直接的关系。
        group关键字正如它字面所表达的:将包含了特定的键所对应的值的所有记录封装到一个bag中,之后,用户可以将这个结果传递给一个聚合函数或者使用它做其他一些处理。
     
        触发reduce阶段
     
    数据文件内容如下:
    [hadoop@vm1 ~]$ cat orders.data 
    1       apple   30    x
    2       apple   50    x
    3       banana  30    y
    4       pear    20    y
    5       banana  10    y
    [hadoop@vm1 ~]$ 
    

      

     
    加载数据并分组
    data = load '/orders.data' as (orderid:int, fruit:chararray, amount:int);
    grpd = group data by fruit;
    

       

    查看分组后的数据模式
    分组后的数据只有两个字段:group(分组字段)、数据(列名是被分组的数据集别名,数据是所有数据组成的bag。
    describe grpd;
    grpd: {group: chararray,data: {(orderid: int,fruit: chararray,amount: int)}}
    

      

    查看分组数据
    dump grpd;
    (pear,{(4,pear,20)})
    (apple,{(2,apple,50),(1,apple,30)})
    (banana,{(5,banana,10),(3,banana,30)})
    

       

    使用聚合函数对分组后的结果集进行处理:
    dump grpd;
    (pear,{(4,pear,20)})
    (apple,{(2,apple,50),(1,apple,30)})
    (banana,{(5,banana,10),(3,banana,30)})
    
    group data by $0+$1;
    

       

     
    对多个键分组
    分组后的数据有两个字段,一个是别名是group的tuple,一个是聚合了本组数据的bag
    group data by (filed1, field2)
    orders = load '/orders.data' as (orderid:int, fruit:chararray, amount:int, type:chararray);
    grpd = group orders by (fruit, type);
    describe grpd;
    grpd: {group: (fruit: chararray,type: chararray),orders: {(orderid: int,fruit: chararray,amount: int,type: chararray)}}
    dump grpd;
    ((pear,y),{(4,pear,20,y)})
    ((apple,x),{(2,apple,50,x),(1,apple,30,x)})
    ((banana,y),{(5,banana,10,y),(3,banana,30,y)})
    

      

    sums = foreach grpd generate group, SUM(orders.amount);
    dump sums;
    ((pear,y),20)
    ((apple,x),80)
    ((banana,y),40)
    

      

    sums2 = foreach grpd generate group.$0, group.$1, SUM(orders.amount);
    dump sums2;
    (pear,y,20)
    (apple,x,80)
    (banana,y,40
    

       

    group all 将数据集的所有数据放到一个分组里
    grpd = group orders all;
    describe grpd;
    grpd: {group: chararray,orders: {(orderid: int,fruit: chararray,amount: int,type: chararray)}}
    dump grpd;
    (all,{(5,banana,10,y),(4,pear,20,y),(3,banana,30,y),(2,apple,50,x),(1,apple,30,x)})
    

       

    co-group多个数据集group
    A = LOAD 'data1' AS (owner:chararray,pet:chararray);
    DUMP A;
    (Alice,turtle)
    (Alice,goldfish)
    (Alice,cat)
    (Bob,dog)
    (Bob,cat)
    B = LOAD 'data2' AS (friend1:chararray,friend2:chararray);
    DUMP B;
    (Cindy,Alice)
    (Mark,Alice)
    (Paul,Bob)
    (Paul,Jane)
    X = COGROUP A BY owner, B BY friend2;
    DESCRIBE X;
    X: {group: chararray,A: {owner: chararray,pet: chararray},B: {friend1: chararray,friend2: chararray}}
    DUMP X;
    (Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
    (Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
    (Jane,{},{(Paul,Jane)})
    

       

    partition by parallel n
    A = LOAD 'input_data'; 
    B = GROUP A BY $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner PARALLEL 2;
    

      

    SimpleCustomPartitioner:
    public class SimpleCustomPartitioner extends Partitioner <PigNullableWritable, Writable> { 
         //@Override 
        public int getPartition(PigNullableWritable key, Writable value, int numPartitions) { 
            if(key.getValueAsPigType() instanceof Integer) { 
                int ret = (((Integer)key.getValueAsPigType()).intValue() % numPartitions); 
                return ret; 
           } 
           else { 
                return (key.hashCode()) % numPartitions; 
            } 
        } 
    }
    

       

    NULL值处理
    NULL是一个特殊的分组key,所有key是null的tuple都会被聚集到一组里。
     
     
  • 相关阅读:
    CUDA C Best Practices Guide 在线教程学习笔记 Part 1
    0_Simple__simpleCallback
    0_Simple__simpleAtomicIntrinsics + 0_Simple__simpleAtomicIntrinsics_nvrtc
    0_Simple__simpleAssert + 0_Simple__simpleAssert_nvrtc
    0_Simple__matrixMulDrv
    0_Simple__matrixMulCUBLAS
    0_Simple__matrixMul + 0_Simple__matrixMul_nvrtc
    0_Simple__inlinePTX + 0_Simple__inlinePTX_nvrtc
    0_Simple__fp16ScalarProduct
    0_Simple__cudaOpenMP
  • 原文地址:https://www.cnblogs.com/lishouguang/p/4559593.html
Copyright © 2011-2022 走看看