  • A simple data-processing example with Pig

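      1. The Pig data model

        Bag: a table (a collection of tuples)

        Tuple: a row, i.e. a record

        Field: an attribute (one value within a tuple)

        Pig does not require the tuples in a bag to have the same number of fields or the same field types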
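
      As a rough analogy only (this is plain Python, not Pig), a bag can be pictured as a list of tuples whose lengths and element types are free to differ. The sample values below are made up for illustration, and a real Pig bag is unordered, unlike a Python list.

```python
# Plain-Python analogy for Pig's data model (not Pig code).
# bag -> list, tuple -> Python tuple, field -> one element of the tuple.
# Note: a real Pig bag is an unordered collection; a list is used here for simplicity.
bag = [
    (20130001, 80, 90),    # three int fields
    (20130002, "absent"),  # two fields, one of them a string (hypothetical value)
    (20130003,),           # a single field
]

for record in bag:
    field_types = [type(field).__name__ for field in record]
    print(len(record), field_types)
```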

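      2. Common Pig Latin statements

        1) LOAD: specifies how the data is loaded

        2) FOREACH: iterates over the tuples and applies some processing to each one

        3) FILTER: filters rows by a condition

        4) DUMP: prints the result to the screen

        5) STORE: saves the result to a file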

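      3. A simple example:

        Suppose we have a score sheet containing a student number, a Chinese score, and a math score, with the fields separated by |, as follows: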

    20130001|80|90
    20130002|85|96
    20130003|60|70
    20130004|74|86
    20130005|65|98

      1) Upload the file from the local file system to Hadoop

    [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put /home/coder/score.txt in

      Check that the upload succeeded:

    [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
    Found 1 items
    -rw-r--r--   2 coder supergroup         75 2013-04-20 14:33 /user/coder/in/score.txt

      2) Load the raw data with LOAD

    grunt> scores = LOAD 'hdfs://h1:9000/user/coder/in/score.txt' USING PigStorage('|') AS (num:int,Chinese:int,Math:int);

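      Input file: 'hdfs://h1:9000/user/coder/in/score.txt'

      Relation (bag) name: scores

      When tuples are read from the input file, the fields of each line are split on |

      Each tuple has three fields: the student number (num), the Chinese score (Chinese), and the math score (Math), all of type int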
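
      To make the effect of the delimiter and the declared schema concrete, here is a minimal plain-Python sketch of the same parsing step (not Pig code). The helper names `to_int` and `load_scores` and the malformed third sample line are assumptions made for illustration. Turning an uncastable value into None mirrors my understanding that Pig yields null when a cast to the declared type fails.

```python
from typing import List, Optional, Tuple

def to_int(text: str) -> Optional[int]:
    """Cast one field to int; return None when the cast fails."""
    try:
        return int(text)
    except ValueError:
        return None

def load_scores(lines: List[str], delimiter: str = "|") -> List[Tuple[Optional[int], ...]]:
    """Split each line on the delimiter and cast the first three fields to int."""
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        # Pad short records with empty strings so every row has exactly three fields.
        parts = (parts + [""] * 3)[:3]
        rows.append(tuple(to_int(part) for part in parts))
    return rows

# The last line is a hypothetical malformed record, not part of the original data.
sample = ["20130001|80|90", "20130002|85|96", "20130003|60|x"]
scores = load_scores(sample)
print(scores)
```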

      3) Inspect the schema of the relation

    grunt> DESCRIBE scores;
    scores: {num: int,Chinese: int,Math: int}

      4) Suppose we need to filter out the record whose student number is 20130005

    grunt> filter_scores = FILTER scores BY num != 20130005;

      View the records that remain after filtering

    grunt> dump filter_scores;
    (20130001,80,90)
    (20130002,85,96)
    (20130003,60,70)
    (20130004,74,86)
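
      The FILTER step keeps every tuple for which the condition is true. A plain-Python equivalent over the same five records (a sketch for illustration, not Pig code) might look like this:

```python
# The five records from score.txt as (num, Chinese, Math) tuples.
scores = [
    (20130001, 80, 90),
    (20130002, 85, 96),
    (20130003, 60, 70),
    (20130004, 74, 86),
    (20130005, 65, 98),
]

# Keep only the rows whose student number is not 20130005,
# mirroring: FILTER scores BY num != 20130005
filter_scores = [row for row in scores if row[0] != 20130005]

for row in filter_scores:
    print(row)
```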

      5) Compute each student's total score

    grunt> totalScore = FOREACH scores GENERATE num,Chinese+Math;

      View the result:

    grunt> dump totalScore;
    (20130001,170)
    (20130002,181)
    (20130003,130)
    (20130004,160)
    (20130005,163)
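
      FOREACH ... GENERATE produces one output tuple per input tuple. The same projection and addition can be sketched in plain Python (for illustration only, not Pig code), which also makes it easy to check the totals shown above:

```python
# The five records from score.txt as (num, Chinese, Math) tuples.
scores = [
    (20130001, 80, 90),
    (20130002, 85, 96),
    (20130003, 60, 70),
    (20130004, 74, 86),
    (20130005, 65, 98),
]

# One output tuple per input tuple: (num, Chinese + Math),
# mirroring: FOREACH scores GENERATE num, Chinese+Math
total_score = [(num, chinese + math) for num, chinese, math in scores]

for row in total_score:
    print(row)
```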

      

      6) Write each student's total score to a file

    grunt> store totalScore into 'hdfs://h1:9000/user/coder/out/result' using PigStorage('|');

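      View the result (the * glob also matches the _logs directory, which is why cat ends with "Source must be a file."):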

    [coder@h1 ~]$ hadoop dfs -ls /user/coder/out/result
    Found 2 items
    drwxr-xr-x   - coder supergroup          0 2013-04-20 15:54 /user/coder/out/result/_logs
    -rw-r--r--   2 coder supergroup         65 2013-04-20 15:54 /user/coder/out/result/part-m-00000
    [coder@h1 ~]$ ^C
    [coder@h1 ~]$ hadoop dfs -cat /user/coder/out/result/*
    20130001|170
    20130002|181
    20130003|130
    20130004|160
    20130005|163
    cat: Source must be a file.
    [coder@h1 ~]$ 
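
      What the STORE step with PigStorage('|') writes can be imitated locally: each tuple becomes one line, with its fields joined by the delimiter. The sketch below is plain Python rather than Pig; the `store` helper and the temporary directory are assumptions for illustration, and only the file name part-m-00000 is borrowed from the listing above.

```python
import os
import tempfile

total_score = [
    (20130001, 170),
    (20130002, 181),
    (20130003, 130),
    (20130004, 160),
    (20130005, 163),
]

def store(rows, path, delimiter="|"):
    """Write one line per tuple, joining the fields with the delimiter."""
    # newline="\n" keeps line endings identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for row in rows:
            out.write(delimiter.join(str(field) for field in row) + "\n")

out_dir = tempfile.mkdtemp()  # local stand-in for the HDFS output directory
part_file = os.path.join(out_dir, "part-m-00000")
store(total_score, part_file)

with open(part_file, encoding="utf-8") as f:
    content = f.read()
print(content, end="")
```

      The resulting file is 65 bytes long (5 lines of 13 bytes each), which agrees with the size shown for part-m-00000 in the directory listing.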

      Here is another small example:

      Suppose we have a batch of files in the following format:

    zhangsan#123456#zhangsan@qq.com
    lisi#434dfdds#lisi@126.com
    wangwu#ffere233#wangwu@163.com
    zhouliu#fgrtr43#zhouliu@139.com

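      Each record has three fields: account name, password, and email address, separated by #. The goal is to extract the email addresses from these files.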

      

      1) Upload the file to Hadoop

    [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put data.txt in
    [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
    Found 1 items
    -rw-r--r--   2 coder supergroup        122 2013-04-24 20:34 /user/coder/in/data.txt
    [coder@h1 hadoop-0.20.2]$ 

      2) Load the raw data file

    grunt> T_A = LOAD '/user/coder/in/data.txt' using PigStorage('#') as (username:chararray,password:chararray,email:chararray);

      3) Extract the email field

    grunt> T_B = FOREACH T_A GENERATE email;

      4) Write the result to a file (with no USING clause, Pig falls back to its default PigStorage)

    grunt> STORE T_B INTO '/user/coder/out/email';

      5) View the result

    [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -cat /user/coder/out/email/*
    zhangsan@qq.com
    lisi@126.com
    wangwu@163.com
    zhouliu@139.com
    cat: Source must be a file.
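
      For comparison, the whole second example (split on #, keep only the third field) fits in a few lines of plain Python. This is a sketch of the same logic rather than Pig code; the variable names `t_a` and `t_b` simply echo the relation names used above.

```python
# The four sample records from data.txt.
lines = [
    "zhangsan#123456#zhangsan@qq.com",
    "lisi#434dfdds#lisi@126.com",
    "wangwu#ffere233#wangwu@163.com",
    "zhouliu#fgrtr43#zhouliu@139.com",
]

# Mirrors LOAD ... USING PigStorage('#'): split each line into (username, password, email).
t_a = [tuple(line.split("#")) for line in lines]

# Mirrors FOREACH T_A GENERATE email: keep a one-field tuple per record.
t_b = [(email,) for _username, _password, email in t_a]

for (email,) in t_b:
    print(email)
```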

        

  • Original article: https://www.cnblogs.com/luxh/p/3032717.html