zoukankan      html  css  js  c++  java
  • pig对null的处理(实际,对空文本处理为两种取值null或‘’)

    pig对文本null的处理很特殊。会处理成两种null,还会处理成''这种空值。
    比如,读name,age,sex日志信息。name取值处理,如果记录为“,,,”这样,会将name取值为null,如果记录为“,19,男”则name会处理为''。同样是空值,pig读取后的取值却不一样。所以一定要小心。


    pig读取日志信息,遇到取值为空的字段会处理为两种,一种取值为'',另一种为null。
    具体例子:读取日志中倒数第4个字段(全部为空,两个逗号之间无值",,"),
    pig读入后处理为两种值(''和 null),


    1日志中空处理为null:
    (5,148,b84daa9b-194e-4c4c-9595-ce4bfabca918,605378805132617404,2014-11-05 18:31:05,2014-11-05 18:31:05,1,62052,2,,,,,,,,239.130.237.121,2,-1,,,-1,e15b6c6675c6d6e8eb7851ccc866608787daeadd,b84daa9b-194e-4c4c-9595-ce4bfabca918,02:00:00:00:00:00,-991608703440210811,,,,,75061,200,,2,2,1,7.0,,,,,,)
    (5,148,b84daa9b-194e-4c4c-9595-ce4bfabca918,605378805132617404,2014-11-05 18:31:05,2014-11-05 18:31:05,2,62052,2,,,,,,,,239.130.237.121,2,-1,,,-1,e15b6c6675c6d6e8eb7851ccc866608787daeadd,b84daa9b-194e-4c4c-9595-ce4bfabca918,02:00:00:00:00:00,-991608703440210811,,,,,75061,200,,2,2,1,7.0,,,,,,)


    2日志中的空处理为'':
    (3,90,864616028213476,1412364855586,2014-08-25 15:07:42,,1,14999,2,,,,,,460,00,112.5.236.229,2,864616028213476,3ff1c154fb35073a,,864616028213476|3ff1c154fb35073a,,,,864616028213476|3ff1c154fb35073a,,,,,311,35,-1,,1,3,2.x,1,91,,35.0,105.0,132012121230123)
    (5,148,ddeb5f0f-09a7-456e-a9dc-5fb5e96c5453,682937329735483418,2014-11-04 20:08:37,2014-11-04 20:08:37,1,62052,2,,,,,,,,160.35.136.117,1,-1,,,-1,e72da4be06382bd0826be09927f650ca2570add9,ddeb5f0f-09a7-456e-a9dc-5fb5e96c5453,02:00:00:00:00:00,-3733654770696849299,,,,,66454,206,,2,2,1,7.1,,,,38.878998,-76.9898,032010032322002)
    (3,90,864616028213476,1412364855586,2014-08-25 15:07:42,,2,14999,2,,,,,,460,00,112.5.236.229,2,864616028213476,3ff1c154fb35073a,,864616028213476|3ff1c154fb35073a,,,,864616028213476|3ff1c154fb35073a,,,,,311,35,-1,,1,3,2.x,1,91,,35.0,105.0,132012121230123)
    (5,148,ddeb5f0f-09a7-456e-a9dc-5fb5e96c5453,682937329735483418,2014-11-04 20:08:37,2014-11-04 20:08:37,2,62052,2,,,,,,,,160.35.136.117,1,-1,,,-1,e72da4be06382bd0826be09927f650ca2570add9,ddeb5f0f-09a7-456e-a9dc-5fb5e96c5453,02:00:00:00:00:00,-3733654770696849299,,,,,66454,206,,2,2,1,7.1,,,,38.878998,-76.9898,032010032322002)


    处理代码如下:
    --citylevel report analysis:pig -p date=2014-07-30 -p year=2014 -p file_path=/user/wizad/test -f
    SET job.name 'test_citylevel_reporth_istorical';
    SET job.priority HIGH;


    --REGISTER piggybank.jar;
    REGISTER wizad-etl-udf-0.1.jar;


    --DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
    DEFINE SequenceFileLoader com.XXX.xxx.etl.pig.SequenceFileCSVLoader();




    %default Cleaned_Log /user/wizad/test/wizad/cleaned/2014-10*/*/part*


    %default AD_Data /user/wizad/data/wizad/metadata/ad/part*
    %default Campaign_Data /user/wizad/data/wizad/metadata/campaign/part*
    %default Region_Template /user/wizad/data/wizad/metadata/region_template/part-m-00000
    %default Addtion_Data /user/wizad/data/report/region_addition/addition_data.txt


    %default Industry_Path $file_path/report/historical/citylevel/$year/industry
    %default Industry_Path $file_path/report/historical/citylevel/$year/industry
    %default Industry_SUM $file_path/report/historical/citylevel/$year/industry_sum
    %default Industry_TMP $file_path/report/historical/citylevel/$year/industry_tmp


    %default Industry_Brand_Path $file_path/report/historical/citylevel/$year/industry_brand
    %default Industry_Brand_SUM $file_path/report/historical/citylevel/$year/industry_brand_sum
    %default Industry_Brand_TMP $file_path/report/historical/citylevel/$year/industry_brand_tmp


    %default ALL_Path $file_path/report/historical/citylevel/$year/all
    %default ALL_SUM $file_path/report/historical/citylevel/$year/all_sum
    %default ALL_TMP $file_path/report/historical/citylevel/$year/all_tmp


    %default output_path /user/wizad/tmp/result




    --origin_cleaned_data = LOAD '$Cleaned_Log' USING PigStorage(',')
    origin_cleaned_data = LOAD '$Cleaned_Log' USING SequenceFileLoader 
    AS (ad_network_id:chararray,
        wizad_ad_id:chararray,
        guid:chararray,
        id:chararray,
        create_time:chararray,
        action_time:chararray,
        log_type:chararray, 
        ad_id:chararray,
        positioning_method:chararray,
        location_accuracy:chararray,
        lat:chararray, 
        lon:chararray,
        cell_id:chararray,
        lac:chararray,
        mcc:chararray,
        mnc:chararray,
        ip:chararray,
        connection_type:chararray,
        imei:chararray,
        android_id:chararray,
        android_advertising_id:chararray,
        udid:chararray,
        openudid:chararray,
        idfa:chararray,
        mac_address:chararray,
        uid:chararray,
        density:chararray,
        screen_height:chararray,
        screen_chararray,
        user_agent:chararray,
        app_id:chararray,
        app_category_id:chararray,
        device_model_id:chararray,
        carrier_id:chararray,
        os_id:chararray,
        device_type:chararray,
        os_version:chararray,
        country_region_id:chararray,
        province_region_id:chararray,
        city_region_id:chararray,
        ip_lat:chararray,
        ip_lon:chararray,
        quadkey:chararray);


    my_test1 = filter origin_cleaned_data by guid == 'b84daa9b-194e-4c4c-9595-ce4bfabca918';
    dump my_test1;
    describe my_test1;
    --store my_test into '$output_path/mytest' using PigStorage(',');


    my_test2 = filter origin_cleaned_data by guid == '864616028213476' or guid == 'ddeb5f0f-09a7-456e-a9dc-5fb5e96c5453';
    dump my_test2;
    describe my_test2;
    --store my_test into '$output_path/mytest' using PigStorage(',');


    --将第2种空取值''过滤为unknown
    unknown_data = FOREACH origin_cleaned_data GENERATE wizad_ad_id,guid,log_type,
    ((city_region_id == '') ? 'unknown' : city_region_id) AS city_region_id;  --(wizad_ad_id,guid,log_type,city_region_id)
    --将第1种空取值null过滤为isnull
    null_data =  FOREACH origin_cleaned_data GENERATE wizad_ad_id,guid,log_type,
    ((city_region_id is NULL) ? 'isnull' : city_region_id) AS city_region_id;  --(wizad_ad_id,guid,log_type,city_region_id)


    --看看unknown和isnull的数据
    all_unknown = filter unknown_data by city_region_id == 'unknown';
    dump all_unknown;
    --store all_unknown into '$output_path/unknown' using PigStorage(',');
    all_null = filter null_data by city_region_id == 'isnull';
    dump all_null;
    --store all_null into '$output_path/isnull' using PigStorage(',');


    --把两种都过滤为no_use
    origin_historical = FOREACH origin_cleaned_data GENERATE wizad_ad_id,guid,log_type,
    ((city_region_id == '') or (city_region_id == null) or (city_region_id is null) ? 'no_use' : city_region_id) AS city_region_id;  --(wizad_ad_id,guid,log_type,city_region_id)
    dump origin_historical;
    describe origin_historical;




    两种数据分别的结果如下:
    unknown数据:
    (90,864616028213476,1,unknown)
    (90,862624024878336,1,unknown)
    (90,990001402489819,1,unknown)
    (90,862343020727070,1,unknown)
    (201,1ff90f55-f5cd-4b2a-9357-5bde0e3ff526,1,unknown)
    (201,c3916c92-a70c-4d34-babd-d3fc021cf642,1,unknown)
    (201,00:c6:10:dd:81:17,1,unknown)
    (201,88:53:95:da:9e:03,1,unknown)
    ......


    而null数据:
    (148,b84daa9b-194e-4c4c-9595-ce4bfabca918,1,isnull)
    (148,13fbe940-7cd0-44a1-b637-a0df8ea83621,1,isnull)
    (148,b84daa9b-194e-4c4c-9595-ce4bfabca918,2,isnull)

    (148,13fbe940-7cd0-44a1-b637-a0df8ea83621,2,isnull)




    看了一些资料,pig对null处理的一些总结,部分引自网络:


    Pig数据流的语言,擅长于处理纯文本信息,在模式定义方面提供了更大的灵活性,包括数据类型的选择。这和传统的关系型数据库是有本质区别的。(传统的RDBMS必须要事先定义严格的表结构)

    缺点,太过灵活。

    对文本为空时,处理也比较特殊。如果Pig的函数有4种类型:计算函数(Evalfunction)比如MAX,筛选函数(Filterfunction)如ISEMPTY,加载函数(loadfunction),存储函数(stroefunction)如PigStorage函数。

    这些函数对空值有着特殊处理:

    1,首先加载时(加载/存储函数:PigStorage, BinStorage, BinaryStorage, TextLoader, PigDump),传统的关系数据库写入一个类型不符的数据到数据库,比如字符‘a’写到定义为int的字段,会报错。
    Pig不同,如load时,字符‘a’传到在模式定义中int字段,pig会用空值(null)取代之,同时会输出提示信息,但不终止语句执行。本质是大数据集有一定比例的损坏数据,是一种常见情况。此时我们可以通过如下语句进行筛选:
    corrupt_records= FILTER records BY temperature is not null;
    或者用split语句:
    Splitrecords into good_one if temperature is not null, bad_one is temperature is null;



    2,其他函数遇到空值的特殊处理。pig函数空值的殊处理:
    比较操作(==,!=,,<,=,<=),matches,算数操作(+,-,*,/ ,包含%,?,CASE)中,如果有一个操作数为空,那么结果为空。 


    CAST 操作:将一个null数据从一个数据类型转换到另一个数据类型,结果为空 


    AVG,MIN,MAX,SUM,COUNT :这几个操作将忽略空值 

    SIZE :任意计算的对象为null,结果也为null 

    CONCAT :任意一个字表达式为空,结果为空

    tuple(.) or map(#):如果 被引用的对象为空,那么结果为空 。 


    FILTER 操作:filter的表达式为空,不会拒绝操作。如b = filter a by X!=5 ,如果X为空,!X也为空,X!=5 将为空,那么filter将不会处理这一行数据。
     
    三元操作符 ? :如果一个bool表达式的结果为null,结果将为空。 
       null == 2 ? 1 : 4 -- returns null
       2 == 2 ? 1 : 'fred' -- type error; both values must be of the same type

    COUNT_STAR ,不过滤null数据 

    以下操作会产生null :

    1、除0 

    2、用户的UDFs 

    3、引用一个不存在的字段 

    4、引用一个map中不存在的字段 

    5、引用一个tuple中不存在的字段 

    6、load不存在的数据时产生null,空字符串不会被load,会被替换成null null可以作为一个常量使用 。 

    7、load时数据类型不匹配产生null 



    GROUP /COGROUP/JOIN: 用group处理一个关系时,一个关系中的null会被聚集在一起当做一个null处理 。 当cogroup来处理多个关系是,如果有key为空的情况时,多个关系之间的的空是不一样的,会被分别当做不同的null key来处理。 


    如数据 : 

    a :

      1 5 4

      3 6 

    b :

      1 7

      2 8
      10


    JOIN【inner】如:join中空和空是匹配不上的,会被过滤掉 。在join之前过滤出key为空的数据 ,有助于提高join的速度。 a = load './t1.txt' as (a1:int,a2:int); b = load './t2.txt' as (b1:int,b2:int); c= join a by a1,b by b1 ; dump c ; (1,5,1,7) 

    JOIN【outer】 d = join a by a1 left,b by b1 ; dump d ; (1,5,1,7)
    (3,6,,)
    (,4,,) 
    d = join a by a1 right,b by b1; dump d ; (1,5,1,7)
    (,,2,8)
    (,,,10) 
    d = join a by a1 full,b by b1; dump d ; (1,5,1,7)
    (,,2,8)
    (3,6,,)
    (,4,,)
    (,,,10) 



  • 相关阅读:
    31. Ubuntu15.04系统中如何启用、禁用客人会话
    dpkg安装deb缺少依赖包的解决方法
    C语言宏中"#"和"##"的用法
    编译android6.0错误recipe for target 'out/host/linux-x86/obj/lib/libart.so' failed
    Android api level对照表
    Android 如何判断CPU是32位还是64位
    vim map nmap(转)
    vim配置及插件安装管理(超级详细)
    Java多线程总结(二)锁、线程池
    Python快速教程目录(转)
  • 原文地址:https://www.cnblogs.com/cl1024cl/p/6205411.html
Copyright © 2011-2022 走看看