zoukankan      html  css  js  c++  java
  • Pig On Mac

    Install

    首先是 Mac OS 下的安装

    1
    2
    
     export JAVA_HOME=$(/usr/libexec/java_home)
     brew install pig
    

    Run

    Pig 运行分为两种模式,如果需要在本地调试的话,可以使用 shell 模式。

    通过运行下面的 command 就行了

    Shell mode

    1
    
     pig -x local
    

    Count Words

    下面我们用个简单的统计单词次数的例子做进入 pig 世界的 hello world。

    首先我们在网上随便找一篇文章做实验。

    word.txt

    1
    2
    3
    
    Thanks again for the great answers and links! Some people comment that it is hard to satisfy the criteria because core algorithms are
    so pervasive that it's hard to point to a specific use. I see the difficulty. But I think it is worthwhile to come up with specific
    examples because in my experience telling people: "Look, algorithms are important because they are just about everywhere!" does not work
    

    接下来我们进入 shell 模式,一行行输入下面的语句来看结果。

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    
    input_file =  load 'words.txt' as (line);
    
    /* TOKENIZE: split line into word column */
    words = FOREACH input_file GENERATE FLATTEN(TOKENIZE(line)) as word;
    
    grpd = GROUP words by word;
    
    cntd = FOREACH grpd GENERATE group, COUNT(words);
    
    /* print result */
    dump cntd;
    

    最后键入 dump cntd 的时候可以看到单词数目已经统计出来了

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    
    (answers,1)
    (because,3)
    (comment,1)
    (people:,1)
    (satisfy,1)
    (telling,1)
    (criteria,1)
    (examples,1)
    (specific,2)
    (important,1)
    (pervasive,1)
    (algorithms,2)
    (experience,1)
    (worthwhile,1)
    (difficulty.,1)
    (everywhere!,1)
    

    More Complicate Example

    Pig 作为一简单实用的 hadoop 操作语言,同 SQL 的语法类似,支持 join, filter, group by 等操作.

    下面我们用个更复杂的例子来看看这门语言的有趣的地方。

    我们首先伪造一部分数据,这些数据以空格分开
    - 第一行代表用户id
    - 第二行 type: 其中 p 代表用户看过改页面,c 代表用户点击广告
    - 第三行 用户看过的url

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    
        user1 p news.21cn.com/social/daqian/2008/05/29/4777194_1.shtml
        user2 c www.6wei.net/dianshiju/????xa1xe9|????do=index
        user1 p www.shanziba.com/
        user1 p download.it168.com/18/1805/13947/13947_3.shtml
        user2 p you.video.sina.com.cn/b/5924814-1246200450.html
        user3 c www.shanziba.com/
        user1 c download.it168.com/18/1805/13947/13947_3.shtml
        user3 p you.video.sina.com.cn/b/5924814-1246200450.html
        user1 c 
        user3 p
    

    首先我们想统计每个用户在我们的log 中发生了多少次行为。

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    
    Users = LOAD 'server_log.txt' USING PigStorage(' ') as (user ,type ,url) ;
    
    /* filter bad log */
    Fltrd = FILTER Users by url is not null;
    
    Grpd = GROUP Fltrd by user;
    
    Cntd = foreach Grpd generate FLATTEN(group), COUNT(Fltrd.user);
    
    DUMP Cntd;
    

    输出如下:

    1
    2
    3
    
    (user1,4)
    (user2,2)
    (user3,2)
    

    如果我们想更进一步,查看每个用户发生了多少次click 和多少次 page view. 则稍显麻烦。

    首先我们要把page event 和 click event 分开,这可以通过 pig 的 split 实现。

    接着针对分开的 P_EVENT 和 C_EVENT 做 Group

    最后在使用 Join 命令把 Cntd_P 和 Cntd_C 按用户 join 起来。

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    
    Users = LOAD 'server_log.txt' USING PigStorage(' ') as (user ,type ,url) ;
    
    Fltrd = FILTER Users by url is not null;
    
    SPLIT Fltrd INTO P_EVENT if type == 'p', 
                     C_EVENT if type == 'c'; 
    
    Grpd_P = GROUP P_EVENT by user;
    Grpd_C = GROUP C_EVENT by user;
    
    Cntd_P = foreach Grpd_P generate FLATTEN(group) as group_p,COUNT(P_EVENT.user) as p_count;
    Cntd_C = foreach Grpd_C generate FLATTEN(group) as group_c, COUNT(C_EVENT.user) as c_count;
    
    
    Jnd = JOIN Cntd_P BY group_p, Cntd_C BY group_c; 
    
    Cntd_P_C = FOREACH Jnd GENERATE Cntd_P::group_p, Cntd_P::p_count,Cntd_C::c_count;
    
    
    DUMP Cntd_P_C;
    

    Tips

    总体来看 pig 作为一门类 SQL 语言,其灵活性和方便性在处理较为简单的大数据任务时,相比传统的 hadoop job 有着不可比拟的优势。

    但 pig 也有缺点,比如 debug 信息不明确等。

    在日常写 pig 脚本时,可以通过 Describe 的方式来查看当前结果的结构来方便编码。

  • 相关阅读:
    关于:nth-children 的几点总结
    JQ常用知识点总结(笔记篇)————
    ajax的数据处理
    快捷小技巧
    javascript获取select,checkbox,radio的值
    面试题
    canvas基础
    python中的线程之semaphore信号量
    PHP中$_POST和$_GET的用法
    php中echo、print、print_r、var_dump、var_export区别
  • 原文地址:https://www.cnblogs.com/nateriver520/p/3443081.html
Copyright © 2011-2022 走看看