  • Pig casts (chararray to int): how leading zeros are handled, e.g. converting '01' to 1

    Casts supported by Pig

    Pig Latin supports casts as shown in this table.

     

    from / to  bag    tuple  map    int    long   float  double  chararray  bytearray  boolean
    bag        -      error  error  error  error  error  error   error      error      error
    tuple      error  -      error  error  error  error  error   error      error      error
    map        error  error  -      error  error  error  error   error      error      error
    int        error  error  error  -      yes    yes    yes     yes        error      error
    long       error  error  error  yes    -      yes    yes     yes        error      error
    float      error  error  error  yes    yes    -      yes     yes        error      error
    double     error  error  error  yes    yes    yes    -       yes        error      error
    chararray  error  error  error  yes    yes    yes    yes     -          error      yes
    bytearray  yes    yes    yes    yes    yes    yes    yes     yes        -          yes
    boolean    error  error  error  error  error  error  error   yes        error      -

    (- marks the same-type cell, which the table leaves blank.)


    According to the table, casting a chararray to int is allowed.
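
    To make the table easier to query, here is a minimal Python sketch that encodes it as a lookup. The helper names allowed_targets and can_cast are hypothetical and exist only for this illustration; they are not part of Pig.

```python
# Cast matrix transcribed from the table above; helper names are hypothetical.
NUMERIC = ["int", "long", "float", "double"]
TYPES = ["bag", "tuple", "map"] + NUMERIC + ["chararray", "bytearray", "boolean"]

def allowed_targets(source):
    """Return the set of types that `source` can be cast to, per the table."""
    if source in NUMERIC:
        # Numeric types cast to the other numeric types and to chararray.
        return (set(NUMERIC) | {"chararray"}) - {source}
    if source == "chararray":
        return set(NUMERIC) | {"boolean"}
    if source == "bytearray":
        # bytearray casts to every other type.
        return set(TYPES) - {"bytearray"}
    if source == "boolean":
        return {"chararray"}
    # bag, tuple, and map cannot be cast to any other type.
    return set()

def can_cast(source, target):
    # Same-type cells are blank in the table, so they are reported as False here.
    return target in allowed_targets(source)

print(can_cast("chararray", "int"))   # True
print(can_cast("int", "bytearray"))   # False
```

    The chararray-to-int entry is the one the rest of this post tests.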

    So does '001' become 1 once it is cast to int?

    Here is a test.

    The data file test_file.txt contains:

    1011,012
    0111,100
    0011,010
    0001,010
    1010,001

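    Load test_file.txt, take the first three characters of the first column (a chararray), and cast them to int.

    The second column is loaded directly as int, to see how its leading zeros are handled.

    Pig code: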

    %default testFile /user/wizad/test/lmj/test_file.txt

    test_data = LOAD '$testFile' USING PigStorage(',')
    AS
    (str1:chararray,
     number:int
    );
     
    dump test_data;
    my_result = foreach test_data generate (int)SUBSTRING(str1,0,3);
    dump my_result;
    describe my_result;

    --myts = sample g_log 0.0001;
    --myts = limit g_log 10;
    --dump myts;
    --STORE myts INTO '/user/wizad/tmp/my' USING PigStorage(',');


    Results:

    dump test_data:

    (1011,12)
    (0111,100)
    (0011,10)
    (0001,10)
    (1010,1)

    dump my_result:
    (101)
    (11)
    (1)
    (0)
    (101)

    As the output shows, leading zeros are handled without any problem.
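
    The same experiment can be replayed outside Pig. The sketch below is plain Python, not Pig: it assumes that Python's int() on a decimal string is a fair stand-in for Pig's (int) cast, which matches the results shown above. The function name first_three_as_int is hypothetical.

```python
# Replays the test above in plain Python; int() stands in for Pig's (int) cast.
rows = ["1011,012", "0111,100", "0011,010", "0001,010", "1010,001"]

def first_three_as_int(str1):
    """Mimic (int)SUBSTRING(str1, 0, 3): take characters 0..2, then parse."""
    return int(str1[0:3])

parsed = [line.split(",") for line in rows]
# Second column read directly as an integer, first column kept as a string.
test_data = [(str1, int(number)) for str1, number in parsed]
my_result = [first_three_as_int(str1) for str1, _ in parsed]

print(test_data)  # [('1011', 12), ('0111', 100), ('0011', 10), ('0001', 10), ('1010', 1)]
print(my_result)  # [101, 11, 1, 0, 101]
```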


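    By the way:
    When debugging or inspecting data, prefer STORE to DUMP wherever possible. If you must use DUMP, use only DUMP. Before I dump anything, I always apply LIMIT 10 first so that only 10 records are dumped.

    The reason is that DUMP disables multi-query execution in some cases, which is likely to slow the script down.
    For example:
    DUMP script: because of the DUMP, the script runs as two separate jobs. The first executes A > B > DUMP and the second executes A > B > C > STORE.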
    A = LOAD 'input' AS (x, y, z);
    B = FILTER A BY x > 5;
    DUMP B;
    C = FOREACH B GENERATE y, z;
    STORE C INTO 'output';
    STORE script: multi-query execution runs the whole script as a single job and produces two outputs, output1 and output2.
    A = LOAD 'input' AS (x, y, z);
    B = FILTER A BY x > 5;
    STORE B INTO 'output1';
    C = FOREACH B GENERATE y, z;
    STORE C INTO 'output2';
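
    The difference between the two scripts can be pictured with a toy model. This Python sketch is not Pig's planner. It only assumes that a DUMP ends the current batch and runs as a job of its own, while consecutive STOREs share one job. The names lineage and plan_jobs are hypothetical.

```python
def lineage(alias, parents):
    """Return the chain of aliases from the source relation down to `alias`."""
    chain = []
    while alias is not None:
        chain.append(alias)
        alias = parents.get(alias)
    return list(reversed(chain))

def plan_jobs(parents, outputs):
    """Toy model: group (command, alias) outputs, in script order, into jobs."""
    jobs, batch = [], []
    for cmd, alias in outputs:
        chain = " > ".join(lineage(alias, parents) + [cmd])
        if cmd == "DUMP":
            # Assumption: a DUMP flushes any pending STOREs, then runs alone.
            if batch:
                jobs.append(batch)
                batch = []
            jobs.append([chain])
        else:
            batch.append(chain)
    if batch:
        jobs.append(batch)
    return jobs

parents = {"A": None, "B": "A", "C": "B"}
print(plan_jobs(parents, [("DUMP", "B"), ("STORE", "C")]))
# [['A > B > DUMP'], ['A > B > C > STORE']]
print(plan_jobs(parents, [("STORE", "B"), ("STORE", "C")]))
# [['A > B > STORE', 'A > B > C > STORE']]
```

    In the model, the DUMP script recomputes A > B twice across two jobs, while the STORE script needs only one job.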


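    At work I need to compare the ip, time, and os fields of two logs to decide whether their records come from the same user. For time, two records count as the same user only if they fall within 5 minutes of each other. So I added a small preprocessing step that splits each timestamp into an hour-level prefix and a minute field: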

    time_extract = foreach cookie_data generate cookie_id,ip,SUBSTRING(time,0,13) as time,SUBSTRING(time,14,16) as minute1,os_id,os_version;

    log_data = foreach click_log generate log_type,guid,ip,SUBSTRING(create_time,0,13) as time,SUBSTRING(create_time,14,16) as minute2,os_id,os_version;
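
    What the two SUBSTRING calls extract is easiest to see on a sample value. The Python sketch below assumes timestamps of the form "YYYY-MM-DD HH:MM:SS". The post does not state the format, so both the sample value and the name split_time are hypothetical.

```python
def split_time(ts):
    """Mimic SUBSTRING(ts, 0, 13) and SUBSTRING(ts, 14, 16) on one timestamp."""
    hour_prefix = ts[0:13]   # date plus hour, used later as the join key
    minute = ts[14:16]       # two-character minute, still a string at this point
    return hour_prefix, minute

# Assumed timestamp layout; the real logs may differ.
print(split_time("2014-11-03 12:34:56"))  # ('2014-11-03 12', '34')
```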

    Then join the two relations on time, which compares the timestamps down to the hour:

    join_data = join time_extract by (ip,time,os_id), log_data by (ip,time,os_id);

    Among the records that fall in the same hour, find those within 5 minutes of each other:

    minute_compare = foreach join_data generate log_type,cookie_id,guid,(int)minute1,(int)minute2,time_extract::os_version,log_data::os_version;
    same_users = filter minute_compare by (ABS(minute1-minute2) <= 5);

    That is, keep the pairs whose minutes differ by at most 5 in absolute value.
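
    The join and the minute filter can be sketched in plain Python as follows. This is an illustration of the logic, not a replacement for the Pig script: the function name same_users, the reduced tuple layouts, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def same_users(cookie_rows, log_rows, window=5):
    """Pair cookie and log records that share (ip, hour, os_id) and whose
    minutes differ by at most `window`.

    cookie_rows: (cookie_id, ip, hour, minute1, os_id)
    log_rows:    (guid, ip, hour, minute2, os_id)
    """
    # Index the cookie side by the join key (ip, hour, os_id).
    index = defaultdict(list)
    for cookie_id, ip, hour, minute1, os_id in cookie_rows:
        index[(ip, hour, os_id)].append((cookie_id, int(minute1)))
    matches = []
    for guid, ip, hour, minute2, os_id in log_rows:
        for cookie_id, minute1 in index.get((ip, hour, os_id), []):
            if abs(minute1 - int(minute2)) <= window:
                matches.append((cookie_id, guid))
    return matches

# Made-up sample rows for illustration only.
cookie_rows = [("c1", "10.0.0.1", "2014-11-03 12", "07", "1"),
               ("c2", "10.0.0.2", "2014-11-03 12", "50", "1")]
log_rows = [("g1", "10.0.0.1", "2014-11-03 12", "09", "1"),
            ("g2", "10.0.0.2", "2014-11-03 12", "58", "1"),
            ("g3", "10.0.0.1", "2014-11-03 13", "01", "1")]
print(same_users(cookie_rows, log_rows))  # [('c1', 'g1')]
```

    Like the Pig join it mirrors, this sketch only pairs records within the same clock hour, so records at 12:59 and 13:01 are not matched even though they are 2 minutes apart.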


    The complete code is as follows:

    SET job.name 'mapping_from_mobile_to_pc';
    SET job.priority HIGH;

    REGISTER piggybank.jar;
    DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();

    %default cleanedLog /user/wizad/data/wizad/cleaned/2014-11-{0[3-9],1[0-8]}/*/part*
    %default cookieLog /user/wizad/tmp/ip_cookie.txt 

    %default output_path1 /user/wizad/tmp/mapping_cookie
    %default output_path2 /user/wizad/tmp/edition_compare

    cookie_data = LOAD '$cookieLog' USING PigStorage(',')
    AS(cookie_id:chararray,
       ip:chararray,
       time:chararray,
       os_id:chararray,
       os_version:chararray
    );
    --test = limit cookie_data 100;
    --dump test;

    time_extract = foreach cookie_data generate cookie_id,ip,SUBSTRING(time,0,13) as time,SUBSTRING(time,14,16) as minute1,os_id,os_version;
    describe time_extract;

    origin_cleaned_data = LOAD '$cleanedLog' USING SequenceFileLoader 
    AS (ad_network_id:chararray,
        wizad_ad_id:chararray,
        guid:chararray,
        id:chararray,
        create_time:chararray,
        action_time:chararray,
        log_type:chararray, 
        ad_id:chararray,
        positioning_method:chararray,
        location_accuracy:chararray,
        lat:chararray, 
        lon:chararray,
        cell_id:chararray,
        lac:chararray,
        mcc:chararray,
        mnc:chararray,
        ip:chararray,
        connection_type:chararray,
        imei:chararray,
        android_id:chararray,
        android_advertising_id:chararray,
        udid:chararray,
        openudid:chararray,
        idfa:chararray,
        mac_address:chararray,
        uid:chararray,
        density:chararray,
        screen_height:chararray,
        screen_width:chararray,
        user_agent:chararray,
        app_id:chararray,
        app_category_id:chararray,
        device_model_id:chararray,
        carrier_id:chararray,
        os_id:chararray,
        device_type:chararray,
        os_version:chararray,
        country_region_id:chararray,
        province_region_id:chararray,
        city_region_id:chararray,
        ip_lat:chararray,
        ip_lon:chararray,
        quadkey:chararray);

    click_log = filter origin_cleaned_data by log_type=='2'; 
    log_data = foreach click_log generate log_type,guid,ip,SUBSTRING(create_time,0,13) as time,SUBSTRING(create_time,14,16) as minute2,os_id,os_version,device_type;

    --join_data = join time_extract by ip, log_data by ip;
    --join_data = join time_extract by (ip,time), log_data by (ip,time);
    join_data = join time_extract by (ip,time,os_id), log_data by (ip,time,os_id);

    minute_compare = foreach join_data generate log_type,cookie_id,guid,(int)minute1,(int)minute2,time_extract::os_version,log_data::os_version;
    same_users = filter minute_compare by (ABS(minute1-minute2) <= 5);

    --dump same_users;
    describe same_users;

    mapping_cookie_id = foreach same_users generate cookie_id,guid;
    uniq_cookie_guid = distinct mapping_cookie_id;

    store uniq_cookie_guid INTO '$output_path1' USING PigStorage(',');

    os_edition = foreach same_users generate cookie_id,guid,SUBSTRING(time_extract::os_version,0,2) as os_ut,SUBSTRING(log_data::os_version,0,2) as os_mdm;
    same_os_edition = filter os_edition by (os_ut == os_mdm) or (os_ut == 'x') or (os_mdm == 'x');

    dump same_os_edition;
    cookie_guid_with_edition = foreach same_os_edition generate cookie_id,guid;
    uniq_c_g_editions = distinct cookie_guid_with_edition;

    store uniq_c_g_editions INTO '$output_path2' USING PigStorage(',');
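
    The final OS-version filter in the script compares only the first two characters of each os_version and lets 'x' match anything. A minimal Python sketch of that predicate follows. The post does not say what 'x' stands for, so this sketch simply treats it as a wildcard, as the filter does. The function name and sample values are hypothetical.

```python
def same_os_edition(os_version_ut, os_version_mdm):
    """Mimic the same_os_edition filter on two os_version strings."""
    os_ut = os_version_ut[0:2]    # like SUBSTRING(os_version, 0, 2)
    os_mdm = os_version_mdm[0:2]
    # Equal prefixes match; 'x' on either side matches anything.
    return os_ut == os_mdm or os_ut == "x" or os_mdm == "x"

print(same_os_edition("4.4.2", "4.1"))  # True, both prefixes are '4.'
print(same_os_edition("4.4.2", "7.1"))  # False
print(same_os_edition("x", "7.1"))      # True
```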





  • Original post: https://www.cnblogs.com/cl1024cl/p/6205399.html