  • Some Pig examples (the syntax I use most often)

    In Pig, DUMP and STORE each launch their own MapReduce job; the two are not executed together.

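    1: Using glob patterns in the load path:

    LOAD '/user/wizad/data/wizad/raw/2014-0{6,7-0,7-1,7-2,7-3,8}*/3_1/adwords*'

    Or define the path as a parameter. This one is correct:

    %default cleanedLog /user/wizad/data/wizad/cleaned/2014-11-{0[3-9],1[0-8]}/*/part*

    whereas this one is wrong:

    %default cleanedLog /user/wizad/data/wizad/cleaned/2014-11-{0[3-9],[10-18]}/*/part*

    Running hadoop fs -ls /user/wizad/data/wizad/cleaned/2014-11-{0[3-9],[10-18]}/ shows that [10-18] does not work, so 1[0-8] has to be used instead. The reason is that [] matches exactly one character, so a range inside it can only cover single digits. When I tried 0[10-18] it matched only the two files 01 and 08, and 0[100-108] matched the three files 10, 11 and 18. Writing it as {[10-18]} behaves the same way.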

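    Note: load paths do not accept full regular expressions, only the glob syntax that Hadoop itself supports. Hadoop 2.0 supports:

    ?    matches any single character

    *    matches zero or more characters

    [abc] or [^abc]    matches one character in (or not in) the set

    [a-z] or [^a-z]    matches one character in (or not in) the range

    \c    escape: removes any special meaning of the character c. (\d for the digits 0 to 9 is regex syntax; in a glob, write [0-9].)

    {ab,cd}    matches either ab or cd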
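The bracket behaviour described above can be reproduced outside Hadoop. The sketch below uses Python's fnmatch as a stand-in for Hadoop's glob matching, on the assumption that both treat [...] as a single-character class; the day names 01 to 30 are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical day directories 01..30, standing in for the 2014-11-* folders.
days = [f"{d:02d}" for d in range(1, 31)]

def matching(pattern):
    """Return the day names selected by a bracket/wildcard pattern."""
    return [d for d in days if fnmatchcase(d, pattern)]

# [10-18] is the class {'1', '0'-'1', '8'}: one character out of 0, 1, 8.
print(matching("0[10-18]"))
# A single character can never equal a two-digit name.
print(matching("[10-18]"))
# A literal 1 followed by one digit from 0 to 8.
print(matching("1[0-8]"))
```

The first call selects only 01 and 08, which is the result reported for 0[10-18], and only the 1[0-8] form covers days 10 through 18.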

    2: A few simple uses of FILTER:

    Filter by value

    FILTER clickDate_all BY log_type == '2';

    FILTER mapping_table BY mapping_ad_network_id == '3' AND mapping_type == '5';

    test = FILTER allRow BY (ad_id == '14997' OR ad_id == '14998' OR ad_id == '14999') AND log_type == 2;

    test = FILTER allRow BY (INDEXOF(ad_id,'14997') == 0 OR INDEXOF(ad_id,'14998') == 0 OR INDEXOF(ad_id,'14999') == 0) AND log_type == 2;

    Combined with the SIZE function

    FILTER count_imei BY (SIZE(cimei) > 14 AND SIZE(cimei) < 17);
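The two ad_id filters above are not equivalent: == keeps exact matches only, while INDEXOF(...) == 0 keeps every value that starts with the given text. A minimal Python sketch of the difference, where index_of is a hypothetical helper that mimics INDEXOF with str.find and the sample ids are made up:

```python
def index_of(text, needle):
    """Mimic INDEXOF(text, needle): first position of needle, or -1 if absent."""
    return text.find(needle)

ad_ids = ["14997", "149971", "914997", "14998"]  # made-up sample values

# ad_id == '14997'
exact = [a for a in ad_ids if a == "14997"]
# INDEXOF(ad_id, '14997') == 0
prefix = [a for a in ad_ids if index_of(a, "14997") == 0]

print(exact)
print(prefix)
```

The prefix form also lets through longer ids such as 149971, so it only suits fields where that is intended.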

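    2: Regular expressions

    FILTER cimei2 BY NOT (cimei MATCHES '^[0-9]*$');

    FILTER cmac2 BY cmac MATCHES '[0-9A-F]{2}:[0-9A-F]{2}:[0-9A-F]{2}:[0-9A-F]{2}:[0-9A-F]{2}:[0-9A-F]{2}';

    MATCHES uses Java regular expressions and has to match the whole string, so the pattern takes no surrounding / delimiters. [0-9A-F] stands for one uppercase hex digit.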
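Because MATCHES has to cover the whole value, wrapping the pattern in / delimiters makes it fail on every real MAC address. The sketch below emulates that with Python's re.fullmatch; pig_matches is a hypothetical helper, and the pattern is a compact form of the MAC-address expression.

```python
import re

def pig_matches(value, pattern):
    """Emulate Pig's MATCHES: the pattern has to cover the entire value."""
    return re.fullmatch(pattern, value) is not None

# Six groups of two uppercase hex digits separated by colons.
MAC = r"[0-9A-F]{2}(:[0-9A-F]{2}){5}"

print(pig_matches("00:1A:2B:3C:4D:5E", MAC))              # well-formed address
print(pig_matches("00:1A:2B:3C:4D:5E", "/" + MAC + "/"))  # slashes become literal characters
print(pig_matches("86012345678901", r"^[0-9]*$"))         # digits only
print(pig_matches("8601234567890a", r"[0-9]*"))           # trailing letter breaks the full match
```

The anchors ^ and $ are harmless but redundant under whole-string matching.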

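    3: Sorting

    ORDER province_count BY $2 DESC;

    Note on ordering input that spans several files, e.g. part00000 and part00001 on HDFS: in my runs the ORDER produced a single output file, because the final merge ran in a single reducer, so the result can be one very large file. Adding a PARALLEL clause to the ORDER statement asks Pig for more reducers.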

    4: CONCAT

    Can be used to generate a standalone column, for example a label column placed in front of a count.

    FOREACH origin_cleaned_data GENERATE CONCAT('<-_','->') AS cou, guid, log_type;

    read_social_14 = FOREACH metadata_social_14 GENERATE CONCAT('14','=='), guid_social;

    all_id = FOREACH allRow GENERATE id, CONCAT('_','-') AS cc;

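    5: Handling empty values: replace an empty value with 'unknown'.

         Using the conditional expression "(condition) ? a : b" directly on a column:

    origin_historical = FOREACH origin_cleaned_data GENERATE wizad_ad_id, guid, log_type,

    ((province_region_id == '') ? 'unknown' : province_region_id);

    Also note: to test for null in Pig, use IS NULL (or IS NOT NULL). Comparing with == null or != null does not work as a test, because any comparison involving null itself evaluates to null.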
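The empty-string test above does not catch null values. The sketch below emulates this in Python, assuming Pig's documented null rules (a comparison with null yields null, and a conditional with a null condition yields null); None plays the role of null, and all function names are hypothetical.

```python
def pig_eq(a, b):
    """Pig-style ==: the result is None (null) when either side is null."""
    if a is None or b is None:
        return None
    return a == b

def bincond(cond, if_true, if_false):
    """Pig-style (cond ? a : b): a null condition yields null."""
    if cond is None:
        return None
    return if_true if cond else if_false

def fill_unknown(region):
    # Mirrors ((province_region_id == '') ? 'unknown' : province_region_id)
    return bincond(pig_eq(region, ""), "unknown", region)

def fill_unknown_null_safe(region):
    # Mirrors ((province_region_id IS NULL OR province_region_id == '') ? 'unknown' : province_region_id)
    is_missing = region is None or region == ""
    return bincond(is_missing, "unknown", region)

for value in ["", "31", None]:
    print(repr(value), repr(fill_unknown(value)), repr(fill_unknown_null_safe(value)))
```

Only the variant with the explicit IS NULL check turns a null region into 'unknown'.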

    6: Splitting into subsets by value:

     SPLIT geelyTuiGuang INTO android IF (os_id == 1), ios IF (os_id == 2);

     SPLIT ios INTO ios6 IF (INDEXOF(os_version,'7') != 0), ios7 IF (INDEXOF(os_version,'7') == 0);

    SPLIT allCleaned INTO log_42 IF (

    ((chararray)$34 == '1' OR (chararray)$34 == '2' OR (chararray)$34 == '3' OR (chararray)$34 == '1' OR (chararray)$34 == '4')

    AND

    (INDEXOF((chararray)$35,'.') > 0)

    AND

    ((chararray)$36 == '1' OR (chararray)$36 == '')

    ),

    log_43 IF (

    ((chararray)$34 == '1' OR (chararray)$34 == '2')

    AND

    ((chararray)$35 == '1' OR (chararray)$35 == '2' OR (chararray)$35 == '3' OR (chararray)$35 == '1' OR (chararray)$35 == '4')

    AND

    (INDEXOF((chararray)$36,'.') > 0)

    );
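SPLIT tests every row against every condition independently, so a row can end up in several outputs or in none. A rough Python emulation of the first two SPLIT statements, with made-up rows and a hypothetical split helper:

```python
def split(rows, **conditions):
    """Emulate SPLIT: test every row against every condition independently."""
    outputs = {name: [] for name in conditions}
    for row in rows:
        for name, cond in conditions.items():
            if cond(row):
                outputs[name].append(row)
    return outputs

rows = [
    {"os_id": 1, "os_version": "4.4"},
    {"os_id": 2, "os_version": "7.1"},
    {"os_id": 2, "os_version": "6.1.7"},
    {"os_id": 3, "os_version": "8.0"},  # matches neither branch, so it is dropped
]

by_os = split(rows, android=lambda r: r["os_id"] == 1, ios=lambda r: r["os_id"] == 2)
by_version = split(
    by_os["ios"],
    ios6=lambda r: r["os_version"].find("7") != 0,  # INDEXOF(os_version,'7') != 0
    ios7=lambda r: r["os_version"].find("7") == 0,  # INDEXOF(os_version,'7') == 0
)

print(by_os)
print(by_version)
```

The ios6 branch is really "every version that does not start with 7", so it would also collect versions 5 or 8 if they occur in the data.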

    7: Replacing values with the REPLACE function

     FOREACH ios6 GENERATE imei, mac_address AS cmac, REPLACE(idfa,'null','');

    8: Filtering through a stream

     en_guid = STREAM duimei THROUGH `awk -F"," '{if($3 == "null") print $1","$2","; else print $0}'`;
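For readers less used to awk, the rule in the streamed script can be written out as follows. This is a Python paraphrase for illustration only, not something Pig runs, and blank_null_third_field is a made-up name.

```python
def blank_null_third_field(line):
    """Same rule as the awk script: if field 3 is the text "null", emit
    fields 1 and 2 followed by an empty third field; otherwise pass the
    line through unchanged."""
    fields = line.split(",")
    if len(fields) >= 3 and fields[2] == "null":
        return f"{fields[0]},{fields[1]},"
    return line

for sample in ["a,b,null", "a,b,c", "a,b,null,d", "a,b"]:
    print(blank_null_third_field(sample))
```

As in the awk version, any fields after the third are dropped when the third field is "null".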

    9: Explicit casts:

    cleaned_data_42 = FOREACH log_42 GENERATE

    (chararray)$1  AS wizad_ad_id:chararray,

    (chararray)$2  AS guid:chararray,

    (chararray)$6  AS log_type:chararray,

    (chararray)$18 AS imei:chararray,

    (chararray)$22 AS idfa:chararray,

    (chararray)$23 AS mac_address:chararray;

    10: The built-in function REGEX_EXTRACT, which uses a regular expression:

    allAdId = FOREACH allRow GENERATE REGEX_EXTRACT((chararray)$3,'(.*) (.*)',1) AS time, REGEX_EXTRACT((chararray)$0,'(.*)_(.*)',1) AS adn, $6 AS ad_id;

    allAdId = FOREACH allRow GENERATE REGEX_EXTRACT(create_time,'(.*) (.*)',1) AS time, ad_id;
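The patterns above rely on greedy matching: the first (.*) takes as much as it can, so the split happens at the last space or underscore. A small Python emulation, where regex_extract is a hypothetical helper and the sample strings are assumptions about what the fields look like:

```python
import re

def regex_extract(text, pattern, index):
    """Emulate REGEX_EXTRACT: return capture group `index`, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(index) if m else None

print(regex_extract("2014-11-03 12:34:56", "(.*) (.*)", 1))  # date part
print(regex_extract("2014-11-03 12:34:56", "(.*) (.*)", 2))  # time part
print(regex_extract("a_b_c", "(.*)_(.*)", 1))                # greedy: splits at the last underscore
print(regex_extract("no-separator", "(.*)_(.*)", 1))         # no match
```

If a field can contain more than one separator, a stricter first group such as ([^_]*) splits at the first one instead.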

    11: SUBSTRING(aa,0,n) extracts the characters at positions 0 through n-1:

    split jn_data into same_prov if (SUBSTRING(province,0,2) == SUBSTRING(province_ad,0,2)), diff_prov if (SUBSTRING(province,0,2) != SUBSTRING(province_ad,0,2));

    Extracting the minute from a timestamp so it can be used in calculations

    log_data = foreach click_log generate log_type, guid, ip, SUBSTRING(create_time,0,13) as time, SUBSTRING(create_time,14,16) as minute2, os_id, os_version, device_type;
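SUBSTRING takes a start index and an exclusive stop index, the same convention as Python slicing. The sketch below assumes create_time looks like 'yyyy-MM-dd HH:mm:ss'; that format is a guess, since the source does not show a sample value.

```python
def substring(text, start, stop):
    """Emulate SUBSTRING(text, start, stop): characters start..stop-1."""
    return text[start:stop]

create_time = "2014-11-03 12:34:56"  # assumed timestamp layout

hour_bucket = substring(create_time, 0, 13)  # date plus hour
minute = substring(create_time, 14, 16)      # the two minute digits

print(hour_bucket)
print(minute)
```

Index 13 is the colon after the hour, which is why the minute starts at 14.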

    12: Using ABS to find times within 5 minutes of each other:

    minute_compare = foreach join_data generate log_type, cookie_id, guid, (int)minute1, (int)minute2, time_extract::os_version, log_data::os_version;

    same_users = filter minute_compare by (ABS(minute1-minute2) <= 5);
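Comparing bare minute values is only meaningful for rows from the same hour, which the join in item 14 guarantees by joining on the hour-level time field. One consequence is that events a few minutes apart across an hour boundary are never paired. A Python sketch of the combined logic, again assuming 'yyyy-MM-dd HH:mm:ss' timestamps and using a hypothetical helper name:

```python
def within_five_minutes(t1, t2):
    """Pair two timestamps the way the join plus ABS filter does."""
    hour1, minute1 = t1[:13], int(t1[14:16])
    hour2, minute2 = t2[:13], int(t2[14:16])
    if hour1 != hour2:
        # The join on the hour-level `time` field never brings these rows together.
        return False
    return abs(minute1 - minute2) <= 5

print(within_five_minutes("2014-11-03 12:10:00", "2014-11-03 12:14:59"))
print(within_five_minutes("2014-11-03 12:10:00", "2014-11-03 12:16:00"))
print(within_five_minutes("2014-11-03 12:58:00", "2014-11-03 13:01:00"))
```

The last pair is three minutes apart in real time but falls into different hour buckets, so it is not reported.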

    13: Counting rows

    grp_diff_city = group diff_city all;

    count_diff_city = foreach grp_diff_city generate COUNT_STAR($1);

    dump count_diff_city;
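COUNT_STAR counts every tuple in the grouped bag, while COUNT skips tuples whose first field is null. The sketch below emulates both in Python with None as null; the bag contents are made up and the function names are illustrative.

```python
def count_star(bag):
    """Emulate COUNT_STAR: every tuple counts."""
    return len(bag)

def count(bag):
    """Emulate COUNT: tuples whose first field is null are skipped."""
    return sum(1 for row in bag if row[0] is not None)

# What GROUP ... ALL would put into the single bag ($1): made-up rows.
bag = [("hangzhou", 3), (None, 5), ("ningbo", 1)]

print(count_star(bag))
print(count(bag))
```

COUNT_STAR is the one to use when rows with a missing leading field still need to be counted.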

    14: JOIN BY multiple columns (fields)

    join_data = join time_extract by (ip,time,os_id), log_data by (ip,time,os_id);

    The key fields are compared pairwise, in order from left to right.
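A multi-column join behaves like a join on one composite key built from the fields in the listed order. A minimal hash-join sketch in Python with made-up rows; it also assumes Pig's inner-join rule that rows with a null key field match nothing.

```python
from collections import defaultdict

def inner_join(left, right, keys):
    """Inner join on a composite key made of the `keys` fields, in order."""
    index = defaultdict(list)
    for r in right:
        index[tuple(r[k] for k in keys)].append(r)
    joined = []
    for l in left:
        key = tuple(l[k] for k in keys)
        if any(v is None for v in key):
            continue  # assumption: null key fields never match in an inner join
        for r in index.get(key, []):
            joined.append((l, r))
    return joined

time_extract = [
    {"ip": "1.1.1.1", "time": "2014-11-03 12", "os_id": 1, "minute1": "10"},
    {"ip": "1.1.1.1", "time": "2014-11-03 13", "os_id": 1, "minute1": "02"},
    {"ip": None, "time": "2014-11-03 12", "os_id": 2, "minute1": "40"},
]
log_data = [
    {"ip": "1.1.1.1", "time": "2014-11-03 12", "os_id": 1, "minute2": "12"},
    {"ip": None, "time": "2014-11-03 12", "os_id": 2, "minute2": "41"},
]

pairs = inner_join(time_extract, log_data, ["ip", "time", "os_id"])
print(pairs)
```

Only the first row of each input pairs up, because all three fields must agree.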


  • Original article: https://www.cnblogs.com/cl1024cl/p/6205416.html