  • Big Data Notes (17): Installing and Configuring Pig, and the Pig Data Model

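     I. Introduction to Pig; installing and configuring Pig

    1. Pig was originally developed at Yahoo and later donated to Apache.
    2. Language: Pig Latin, which resembles SQL.
    3. It acts as a translator: Pig Latin ---> MapReduce (or Spark).
    4. Installation and configuration
    (1) tar -zxvf pig-0.17.0.tar.gz -C ~/training/
    (2) Set the environment variables: vi ~/.bash_profile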

    PIG_HOME=/root/training/pig-0.17.0
    export PIG_HOME
    
    PATH=$PIG_HOME/bin:$PATH
    export PATH

    Two configuration (run) modes
    (1) Local mode: operates on files in the local Linux file system
    Start: pig -x local
    Log: Connecting to hadoop file system at: file:///


    (2) Cluster mode: connects to HDFS
    Set an environment variable that points to the directory containing the Hadoop configuration files

    PIG_CLASSPATH=/root/training/hadoop-2.7.3/etc/hadoop
    export PIG_CLASSPATH

    Start: pig
    Log: Connecting to hadoop file system at: hdfs://bigdata11:9000

    II. Common Pig commands: working with HDFS
    ls, cd, cat, mkdir, pwd
    copyFromLocal (upload), copyToLocal (download)
    sh: run an operating-system command
    register, define =====> used with Pig user-defined functions (UDFs)

    III. Pig's data model (important) ----> Apache Storm stream computing
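
As a rough orientation for this section: Pig's data model has four kinds of values, namely atoms (scalar fields), tuples (ordered sets of fields), bags (collections of tuples) and maps (key/value pairs). The sketch below is a Python analogue, not Pig code; the variable names and the choice of Python containers are assumptions made for illustration.

```python
# Rough Python analogue of Pig's data model (illustrative only, not Pig code).

# field (atom): a single scalar value
ename = "MARTIN"

# tuple: an ordered set of fields, comparable to one row
emp_tuple = (7654, "MARTIN", "SALESMAN", 30)

# bag: a collection of tuples, comparable to a table; a relation is an outer bag
emp_bag = [
    (7654, "MARTIN", "SALESMAN", 30),
    (7934, "MILLER", "CLERK", 10),
]

# map: key/value pairs; in Pig the keys are chararray
emp_map = {"ename": "MARTIN", "job": "SALESMAN"}

# A field may itself hold a bag; this mirrors the shape produced by `group`.
grouped = (30, [(7654, "MARTIN", "SALESMAN", 30)])
```

The nested `grouped` value is the shape to keep in mind when reading the output of `group` in example (5).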

    IV. Analyzing and processing data with Pig Latin statements
    1. The Hadoop HistoryServer needs to be running
    mr-jobhistory-daemon.sh start historyserver
    http://192.168.157.11:19888/jobhistory

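    2. Common Pig Latin statements
    (*) load: loads data into a bag (table)
    (*) foreach: like a loop; processes each tuple (row) in a bag
    (*) filter: equivalent to where
    (*) group ... by: grouping
    (*) join: joins
    (*) generate: extracts columns (used inside foreach)
    (*) union: set operation (Pig Latin has no built-in intersect operator)
    (*) Output: dump prints directly to the screen
    store writes to HDFS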

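    Note: some statements trigger computation and some do not. In Pig, dump and store trigger execution; load, foreach, filter, group, join and union only build up the plan.
    Spark operators (API methods) behave similarly: a Transformation does not trigger computation,
    an Action does.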
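
The deferred-execution idea can be illustrated with Python generators. This is only an analogy and not how Pig is implemented; the helper names `load`, `keep_dept` and `dump` and the two sample rows are hypothetical.

```python
# Analogy for lazy execution: building the pipeline does no work,
# consuming it (the "dump") does.
log = []

def load(rows):
    # Generator: nothing runs until something iterates over it.
    for row in rows:
        log.append("read")
        yield row

def keep_dept(rows, deptno):
    # Lazy filter, also a generator expression.
    return (r for r in rows if r[1] == deptno)

def dump(rows):
    # Consuming the pipeline is what triggers the work.
    return list(rows)

emp = load([("KING", 10), ("SMITH", 20)])
emp10 = keep_dept(emp, 10)
reads_before_dump = len(log)   # no rows have been read yet
result = dump(emp10)
reads_after_dump = len(log)    # both rows were read during dump
```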

    3. Example. Sample record: 7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30
    (1) Load the employee data into a table
    emp = load '/scott/emp.csv';

    Inspect the table's schema
    describe emp; ---> Schema for emp unknown.

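    (2) Load the employee data into a table, specifying the field names and types of each tuple
    emp = load '/scott/emp.csv' as(empno,ename,job,mgr,hiredate,sal,comm,deptno);
    Default data type: bytearray
    Default field delimiter: tab, so a comma-separated file such as emp.csv must be loaded with using PigStorage(',')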

    emp = load '/scott/emp.csv' as(empno:int,ename:chararray,job:chararray,mgr:int,hiredate:chararray,sal:int,comm:int,deptno:int);

    emp = load '/scott/emp.csv' using PigStorage(',') as(empno:int,ename:chararray,job:chararray,mgr:int,hiredate:chararray,sal:int,comm:int,deptno:int);

    Create a department table
    dept = load '/scott/dept.csv' using PigStorage(',') as(deptno:int,dname:chararray,loc:chararray);
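
To make the effect of the delimiter and the typed `as(...)` clause concrete, here is a minimal Python sketch of what such a load conceptually does to one line. The `SCHEMA` list and `parse_line` helper are hypothetical names; mapping an empty field to `None` stands in for Pig's null and is a simplification.

```python
# Field names and types copied from the emp schema in the notes.
SCHEMA = [("empno", int), ("ename", str), ("job", str), ("mgr", int),
          ("hiredate", str), ("sal", int), ("comm", int), ("deptno", int)]

def parse_line(line, delimiter=","):
    """Split one line on the delimiter and cast each field by the schema."""
    fields = line.split(delimiter)
    row = {}
    for (name, cast), raw in zip(SCHEMA, fields):
        # Empty fields become None, standing in for a Pig null.
        row[name] = cast(raw) if raw != "" else None
    return row

martin = parse_line("7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30")
king = parse_line("7839,KING,PRESIDENT,,1981/11/17,5000,,10")
```

With the sample records, `martin["sal"]` is an integer and KING's missing `mgr` and `comm` come back as `None`.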

    (3) Query employee information: employee number, name, salary
    SQL: select empno,ename,sal from emp;
    PL: 

    emp3 = foreach emp generate empno,ename,sal;

    (4) Query employee information, sorted by monthly salary
    SQL: select * from emp order by sal;
    PL: 

    emp4 = order emp by sal;
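
For readers more familiar with Python, examples (3) and (4) correspond roughly to a projection and a sort. This is an analogue on three hand-picked rows from the emp data, not a description of how Pig executes the statements.

```python
# Three rows taken from the emp data shown in the notes.
emp = [
    {"empno": 7654, "ename": "MARTIN", "sal": 1250, "deptno": 30},
    {"empno": 7839, "ename": "KING", "sal": 5000, "deptno": 10},
    {"empno": 7369, "ename": "SMITH", "sal": 800, "deptno": 20},
]

# (3) foreach emp generate empno,ename,sal  -> keep only three columns
emp3 = [(e["empno"], e["ename"], e["sal"]) for e in emp]

# (4) order emp by sal  -> ascending sort on the sal field
emp4 = sorted(emp, key=lambda e: e["sal"])
```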


    (5) Grouping: find the maximum salary in each department
    SQL: select deptno,max(sal) from emp group by deptno;
    PL: Step 1: group

    emp51 = group emp by deptno;

    Schema:
    emp51: {group: int,
    emp: {(empno: int,ename: chararray,job: chararray,mgr: int,hiredate: chararray,sal: int,comm: int,deptno: int)}}

    Data:
    (10,{(7934,MILLER,CLERK,7782,1982/1/23,1300,,10),
    (7839,KING,PRESIDENT,,1981/11/17,5000,,10),
    (7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)})

    (20,{(7876,ADAMS,CLERK,7788,1987/5/23,1100,,20),
    (7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20),
    (7369,SMITH,CLERK,7902,1980/12/17,800,,20),
    (7566,JONES,MANAGER,7839,1981/4/2,2975,,20),
    (7902,FORD,ANALYST,7566,1981/12/3,3000,,20)})

    (30,{(7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30),
    (7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30),
    (7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30),
    (7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30),
    (7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30),
    (7900,JAMES,CLERK,7698,1981/12/3,950,,30)})

    Step 2: compute the maximum salary in each department

    emp52 = foreach emp51 generate group,MAX(emp.sal);
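
The two steps can be checked by hand with a short Python analogue. The (empno, sal, deptno) triples are taken from the grouped data listed above; the Python structure itself is an illustrative assumption and not Pig's execution model.

```python
from collections import defaultdict

# (empno, sal, deptno) for the 14 employees listed in the grouped output above.
emp = [
    (7934, 1300, 10), (7839, 5000, 10), (7782, 2450, 10),
    (7876, 1100, 20), (7788, 3000, 20), (7369, 800, 20),
    (7566, 2975, 20), (7902, 3000, 20),
    (7844, 1500, 30), (7499, 1600, 30), (7698, 2850, 30),
    (7654, 1250, 30), (7521, 1250, 30), (7900, 950, 30),
]

# Step 1: group emp by deptno -> one bag of tuples per department
emp51 = defaultdict(list)
for row in emp:
    emp51[row[2]].append(row)

# Step 2: foreach group, take MAX over the sal field of the bag
emp52 = sorted((deptno, max(r[1] for r in bag)) for deptno, bag in emp51.items())
```

For this data the maxima are 5000, 3000 and 2850 for departments 10, 20 and 30.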


    (6) Query the employees in department 10
    SQL: select * from emp where deptno=10;
    PL: 

    emp6 = filter emp by deptno==10;

    Note: equality is written with two equals signs (==)

    (7) Multi-table query
    Query employee information: employee name, department name
    SQL: select e.ename,d.dname from emp e,dept d where e.deptno=d.deptno;
    PL: 

    emp71 = join dept by deptno,emp by deptno;
    emp72 = foreach emp71 generate dept::dname,emp::ename;
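
The join followed by a projection can be pictured as a lookup by deptno. In this Python sketch the department names are assumed sample values (dept.csv is not shown in the notes), and the dictionary lookup is an illustration of an inner equi-join, not Pig's join strategy.

```python
# Assumed sample contents for dept (deptno, dname); not shown in the notes.
dept = [(10, "ACCOUNTING"), (20, "RESEARCH"), (30, "SALES")]

# (ename, deptno) for three employees from the emp data.
emp = [("KING", 10), ("SMITH", 20), ("MARTIN", 30)]

# join dept by deptno, emp by deptno: match rows with equal deptno
dname_by_deptno = {deptno: dname for deptno, dname in dept}

# foreach ... generate dept::dname, emp::ename
emp72 = [
    (dname_by_deptno[deptno], ename)
    for ename, deptno in emp
    if deptno in dname_by_deptno   # inner join: unmatched rows are dropped
]
```

The `dept::dname` notation in Pig plays the role of the table alias in SQL: it says which input relation a field came from after the join.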


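    (8) Set operations. In a relational database such as Oracle, every set taking part in a set operation must have the same number of columns with matching types
    Employees in departments 10 and 20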
    SQL: select * from emp where deptno=10
    union
    select * from emp where deptno=20;

    PL: 

    emp10 = filter emp by deptno==10;
    emp20 = filter emp by deptno==20;
    emp10_20 = union emp10,emp20;
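
One difference worth knowing: SQL's union removes duplicate rows, while Pig's union keeps them (a separate distinct is needed to remove them). Here the two filters are disjoint, so the results agree. A minimal Python sketch of the distinction, using four sample rows and list concatenation as a stand-in for Pig's union:

```python
# (ename, deptno) sample rows from the emp data.
emp = [("KING", 10), ("SMITH", 20), ("MARTIN", 30), ("CLARK", 10)]

emp10 = [e for e in emp if e[1] == 10]   # filter emp by deptno==10
emp20 = [e for e in emp if e[1] == 20]   # filter emp by deptno==20

# Like Pig's union: plain concatenation, duplicates are kept.
emp10_20 = emp10 + emp20

# Union of a relation with itself shows the difference from SQL UNION.
overlap = emp10 + emp10            # 4 rows, each employee twice
deduped = sorted(set(overlap))     # what SQL UNION would return
```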

    (9) Implementing WordCount in Pig Latin
    ① Load the data
    mydata = load '/data/data.txt' as (line:chararray);

    ② Split each line into words
    words = foreach mydata generate flatten(TOKENIZE(line)) as word;

    ③ Group by word
    grpd = group words by word;

    ④ Count the words in each group
    cntd = foreach grpd generate group,COUNT(words);

    ⑤ Print the result
    dump cntd;
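
The same five steps in a short Python sketch, useful for checking expected counts on a small input. The two input lines are hypothetical (the contents of /data/data.txt are not shown), and `str.split()` only approximates TOKENIZE, which also splits on a few punctuation characters.

```python
from collections import Counter

# Hypothetical file contents; /data/data.txt itself is not shown in the notes.
lines = ["I love Beijing", "I love China"]

# ② flatten(TOKENIZE(line)): one word per output row
words = [w for line in lines for w in line.split()]

# ③ + ④ group words by word, then COUNT each group
cntd = Counter(words)

# ⑤ dump: print (word, count) pairs
for word, count in sorted(cntd.items()):
    print((word, count))
```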

  • Original post: https://www.cnblogs.com/lingluo2017/p/8654203.html