  • Hadoop Pig study notes

    A1 = LOAD '/luo/lzttxt01.txt' AS (col1:chararray,col2:int,col3:int,col4:int,col5:double,col6:double);
    B1 = GROUP A1 BY (col2,col3,col4);
    C1 = FOREACH B1 GENERATE FLATTEN(group),AVG(A1.col5),AVG(A1.col6); -- A1 here is the bag named A1 inside B1; each B1 record (one per group) carries its own bag of A1 tuples
    STORE C1 INTO '/output1';


    A = LOAD '/luo/txt.txt' AS (col1:chararray,col2:int,col3:int,col4:int,col5:double,col6:double);
    B = GROUP A ALL;
    C = FOREACH B GENERATE COUNT(A.col2);
    DUMP C;

    A1 = LOAD '/lu/lzttxt01.txt' AS (col1:chararray,col2:int,col3:int,col4:int,col5:double,col6:double);
    B1 = GROUP A1 BY (col2,col3,col4);
    C1 = FOREACH B1 {D = DISTINCT A1.col6; GENERATE group ,COUNT(D);};
    DUMP C1;

    A = LOAD '/lu1/b01.txt' AS (col1:int,col2:int,col3:int,col4:chararray,col5:chararray);
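    B = STREAM A THROUGH `awk '{if($4=="=") print $1" "$2" "$3" 9999 "$5;else print $0}'`; -- STREAM ... THROUGH ... pipes each record through a shell command: when the fourth column is "=", it is replaced with 9999; otherwise the line is printed unchanged
    DUMP B;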

    To run a script file such as aa.pig, the file must be on the local filesystem (under root in this setup). Run it with pig -x mapreduce /luo/aa.pig, or with pig -x local test.pig.


    A = LOAD '/lu1/a.txt' AS (acol1:chararray,acol2:int,acol3:int);
    B = LOAD '/lu1/c.txt' AS (bol1:int,bol2:chararray,bol3:int);
    C = COGROUP A BY acol1,B BY bol2; -- COGROUP groups several relations at once, each by its own key field
    DUMP C;


    A = LOAD '/lzt02/aa01.txt' AS (a:int,b:int);
    B = LOAD '/lzt02/aa02.txt' AS (c:int,d:int);
    C = UNION A,B;
    D = GROUP C BY $0; -- group by the first column
    E = FOREACH D GENERATE FLATTEN(group),SUM(C.$1); -- $1 is the second column
    DUMP E;

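    A1 = LOAD '/lzt02/aa03.txt' AS (a11:int,b11:chararray); -- keep rows whose b11 fits "*//*.qq.com/*"
    B1 = FILTER A1 BY b11 matches '.*//.*\\.qq\\.com/.*';
    C1 = FILTER B1 BY ((chararray)a11 matches '\\d'); -- a11 is an int, so it is cast to chararray before matching
    DUMP B1;
    In the pattern, . matches any character, * means any number of repetitions, \. is an escaped literal ., and / is just the / character. Note that inside a quoted Pig string you write \\. to get the regex \.
    Likewise, \d matches a digit in a regex, and inside quotes it must be written '\\d'.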
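
The filter above can be checked outside Pig with a small Python sketch. It assumes that Pig's matches behaves like Java's String.matches, meaning the whole value has to match the pattern rather than a substring. The function name and the sample URLs are hypothetical and are not taken from aa03.txt.

```python
import re

# The same regular expression as in the FILTER, written as a Python raw string.
PATTERN = re.compile(r".*//.*\.qq\.com/.*")


def pig_matches(value: str) -> bool:
    """Mimic Pig's `matches`: the entire string must match the pattern."""
    return PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    for url in ("http://www.qq.com/index.html",
                "http://news.qqXcom/",   # no literal dot before "com"
                "http://qq.com/"):       # no dot in front of "qq"
        print(url, pig_matches(url))
```

Note that the pattern requires a literal dot in front of qq, so a bare http://qq.com/ is rejected; whether that is intended depends on the data.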

    1. Declaring the loaded data as a tuple
    A = LOAD 'luo.txt' AS (T:tuple(col1:int,col2:int,col3:int,col4:chararray,col5:chararray)); -- for data stored in the form (1,2,3,5,2,4)

    2. Using a parameter
    STORE A INTO '$output_dir'; -- $output_dir is substituted from the command line
    pig -param output_dir="/home/my_outdir" my_pig_script.pig
    3. Loading data from several directories:
    A = LOAD '/abc/201{0,1}';
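    4. Dividing two integers, to get either the integer quotient or a floating-point result
    Integer quotient: (float)(col1/col2)    Floating-point result: (float)col1/col2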
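
The difference between the two casts comes down to when the division happens. The sketch below mimics it in Python, assuming Pig divides two ints the way Java does (truncating toward zero). The function names are hypothetical, and division by zero is not handled.

```python
def integer_quotient(col1: int, col2: int) -> float:
    """Mimic (float)(col1/col2): integer division first, then the cast."""
    quotient = abs(col1) // abs(col2)
    if (col1 < 0) != (col2 < 0):
        quotient = -quotient  # truncate toward zero, as Java int division does
    return float(quotient)


def float_quotient(col1: int, col2: int) -> float:
    """Mimic (float)col1/col2: cast first, then floating-point division."""
    return float(col1) / col2


if __name__ == "__main__":
    print(integer_quotient(7, 2))  # fractional part is lost before the cast
    print(float_quotient(7, 2))    # fractional part is kept
```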
    5. SUBSTRING
    B = FOREACH A GENERATE SUBSTRING(date,0,4);
    6. Concatenation with CONCAT
    A = LOAD '1.txt' AS (col1:chararray,col2:int);
    B = FOREACH A GENERATE CONCAT(col1,(chararray)col2); -- to concatenate more than two fields, nest the calls: CONCAT(a,CONCAT(b,c))
    7. Using JOIN: count the rows that two tables have in common
    A = LOAD '/lzt02/aa01.txt' AS (a:int,b:int);
    B = LOAD '/lzt02/aa02.txt' AS (c:int,d:int);
    C = JOIN A BY a,B BY d;
    D = DISTINCT C;
    E = GROUP D ALL;
    F = FOREACH E GENERATE COUNT(D);

    8. Using the ternary operator "?:"
    B = FOREACH A GENERATE col1,((col2 is null)?-1:col2),col3;


    A = LOAD '1.txt' AS (a:int,b:tuple(x:int,y:int)); -- for data like 2,(3,5)
    B = FOREACH A GENERATE a,FLATTEN(b);
    G = GROUP B BY a; -- the original omitted the GROUP step; grouping by a is an assumption
    C = FOREACH G GENERATE group,SUM(B.x) AS s;
    D = FOREACH C GENERATE group,((s is null)?-1L:s);

    9. For each distinct value of the first column, count how many rows have 3 and how many have 6 in the second column
    A = LOAD '/lzt02/aa01.txt' AS (a:int,b:int);
    B = GROUP A BY a;
    C = FOREACH B {
    D = FILTER A BY b==3; -- A here is the bag A inside B
    E = FILTER A BY b==6;
    GENERATE group,COUNT(D),COUNT(E);
    }
    DUMP C;
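
What the nested FOREACH computes can be sketched in plain Python: rows are bucketed by the first column, and each bucket is filtered twice. The sample rows and the function name are hypothetical and only illustrate the shape of the result.

```python
from collections import defaultdict


def count_3_and_6(rows):
    """Per value of a, count the rows with b == 3 and the rows with b == 6."""
    groups = defaultdict(list)          # plays the role of GROUP A BY a
    for a, b in rows:
        groups[a].append(b)
    return {
        a: (sum(1 for b in bs if b == 3),   # COUNT(D)
            sum(1 for b in bs if b == 6))   # COUNT(E)
        for a, bs in groups.items()
    }


if __name__ == "__main__":
    sample = [(1, 3), (1, 6), (1, 3), (2, 6), (2, 5)]  # hypothetical (a, b) rows
    print(count_3_and_6(sample))
```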

    10. Left outer join
    A = LOAD '/lzt02/aa01.txt' AS (a:int,b:int);
    B = LOAD '/lzt02/aa02.txt' AS (c:int,d:int);
    C = JOIN A BY a LEFT OUTER,B BY d; -- C holds every field of both tables and follows LEFT JOIN semantics
    D = DISTINCT C;
    E = GROUP D ALL;
    F = FOREACH E GENERATE COUNT(D);

    11. Rows that exist in table A but not in table B
    A = LOAD '/lzt02/aa01.txt' AS (a1:int,b1:int);
    B = LOAD '/lzt02/aa02.txt' AS (a1:int,b1:int);
    C = JOIN A BY a1 left outer,B BY a1;
    D = FILTER C BY (B::a1 is null);
    E = FOREACH D GENERATE A::a1 AS a1,A::b1 AS b1;
    DUMP E;
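
The left outer join followed by a null filter is an anti-join. A minimal Python sketch of the same idea follows, with hypothetical sample rows and a hypothetical function name. It assumes, as in Pig, that a null key never matches anything, so A rows with a null key are kept.

```python
def rows_only_in_a(a_rows, b_rows):
    """Return the rows of A whose first field has no match among B's first fields."""
    b_keys = {a1 for a1, _ in b_rows if a1 is not None}  # null keys never join
    return [row for row in a_rows if row[0] not in b_keys]


if __name__ == "__main__":
    table_a = [(1, 10), (2, 20), (3, 30)]  # hypothetical contents of aa01.txt
    table_b = [(2, 99), (4, 0)]            # hypothetical contents of aa02.txt
    print(rows_only_in_a(table_a, table_b))
```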

    12. How many rows there are for each combination, given data such as:
    1 9
    2 4
    1 9
    2 4

    A = LOAD '/lzt02/aa01.txt' AS (a1:int,b1:int);
    B = GROUP A BY (a1,b1);
    C = FOREACH B GENERATE group,COUNT(A);
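
For the four sample rows listed above, grouping by both columns and counting can be reproduced with a Counter. This is only a sketch of what the GROUP ... COUNT pair is expected to return; the function name is hypothetical.

```python
from collections import Counter


def count_combinations(rows):
    """Count how often each (a1, b1) pair occurs, like GROUP BY (a1,b1) + COUNT."""
    return Counter(rows)


if __name__ == "__main__":
    sample = [(1, 9), (2, 4), (1, 9), (2, 4)]  # the sample rows from the notes
    for pair, n in count_combinations(sample).items():
        print(pair, n)
```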

    13. A null string and an empty string are not necessarily the same thing
    B = FILTER A BY (a1 is not null AND (SIZE(a1)>0L));

    14. Count how many times a given substring occurs in a string

    B = STREAM A THROUGH `awk -F "luo" '{print NF-1}'` AS (column_count:int);
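
The awk trick works because splitting a line on the separator "luo" yields one more field than there are separators, so NF-1 is the number of non-overlapping occurrences. The Python sketch below shows the same counting; the function name is hypothetical. One difference worth keeping in mind: for an empty line awk reports NF as 0 and so prints -1, whereas this sketch returns 0.

```python
def count_occurrences(line: str, needle: str = "luo") -> int:
    """Count non-overlapping occurrences of needle by splitting on it."""
    return len(line.split(needle)) - 1


if __name__ == "__main__":
    for text in ("luoabcluo", "no match here", "aluobluocluo"):  # hypothetical inputs
        print(text, count_occurrences(text))
```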

  • Original article: https://www.cnblogs.com/luo-mao/p/5872429.html