  • [Spark][Python]Mapping Single Rows to Multiple Pairs

    Mapping Single Rows to Multiple Pairs
    Goal:

    Take data of the following form:

    Input Data

    00001 sku010:sku933:sku022
    00002 sku912:sku331
    00003 sku888:sku022:sku010:sku594
    00004 sku411


    and convert it into the form below,
    where each key is paired separately with every value it carries:

    (00001,sku010)
    (00001,sku933)
    (00001,sku022)

    ...
    (00002,sku912)
    (00002,sku331)
    (00003,sku888)

    This is what is meant by Mapping Single Rows to Multiple Pairs.
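
Before involving Spark, the transformation itself can be sketched in plain Python. This is only an illustration of the row-to-pairs mapping described above. The helper name `explode_row` and the in-memory `rows` list are assumptions made for this sketch and are not part of the original post.

```python
def explode_row(line):
    """Split a 'key sku:sku:...' line into a list of (key, sku) pairs."""
    # The key and the colon-separated SKU list are separated by one space.
    key, skus = line.split(" ", 1)
    # Emit one pair per SKU, repeating the key each time.
    return [(key, sku) for sku in skus.split(":")]


# Sample rows copied from the "Input Data" listing.
rows = [
    "00001 sku010:sku933:sku022",
    "00002 sku912:sku331",
    "00003 sku888:sku022:sku010:sku594",
    "00004 sku411",
]

# Flatten the per-row lists into a single list of pairs.
pairs = [pair for row in rows for pair in explode_row(row)]

for pair in pairs:
    print(pair)
```

Each input row yields as many pairs as it has SKUs, so the four sample rows produce ten pairs in total.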

    The steps are as follows:

    [training@localhost ~]$ vim act001.txt
    [training@localhost ~]$
    [training@localhost ~]$ cat act001.txt
    00001 ku010:sku933:sku022
    00002 sku912:sku331
    00003 sku888:sku022:sku010:sku594
    00004 sku411
    [training@localhost ~]$ hdfs dfs -put act001.txt
    [training@localhost ~]$
    [training@localhost ~]$ hdfs dfs -cat act001.txt
    00001 ku010:sku933:sku022
    00002 sku912:sku331
    00003 sku888:sku022:sku010:sku594
    00004 sku411
    [training@localhost ~]$

    The IPython session below begins at In [6]. It assumes that mydata was created in an earlier step from the uploaded file, for example with sc.textFile("act001.txt"). Note also that the first line of act001.txt was typed as ku010 rather than sku010, which is why ku010 appears in the output.

    In [6]: mydata01=mydata.map(lambda line: line.split(" "))

    In [7]: type(mydata01)
    Out[7]: pyspark.rdd.PipelinedRDD

    In [8]: mydata02=mydata01.map(lambda fields: (fields[0],fields[1]))

    In [9]: type(mydata02)
    Out[9]: pyspark.rdd.PipelinedRDD

    In [10]:

    In [11]: mydata03 = mydata02.flatMapValues(lambda skus: skus.split(":"))

    In [12]: type(mydata03)
    Out[12]: pyspark.rdd.PipelinedRDD

    In [13]: mydata03.take(1)
    Out[13]: [(u'00001', u'ku010')]
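
The key step in the session is flatMapValues, which keeps each key and emits one pair per element returned by the function. The plain-Python model below contrasts it with mapValues. It is a rough sketch of the semantics only, not Spark's implementation, and the names `map_values`, `flat_map_values`, and `keyed` are hypothetical.

```python
def map_values(pairs, f):
    """Model of mapValues: apply f to each value, keep one pair per input pair."""
    return [(key, f(value)) for key, value in pairs]


def flat_map_values(pairs, f):
    """Model of flatMapValues: f returns an iterable; emit one pair per element."""
    return [(key, item) for key, value in pairs for item in f(value)]


def split_skus(skus):
    """Split a colon-separated SKU string into a list."""
    return skus.split(":")


# (key, raw SKU string) pairs, shaped like the elements of mydata02.
keyed = [("00002", "sku912:sku331"), ("00004", "sku411")]

nested = map_values(keyed, split_skus)     # values stay grouped in lists
flat = flat_map_values(keyed, split_skus)  # one (key, sku) pair per SKU

print(nested)
print(flat)
```

With mapValues the result would still hold one list per key, so the flattening variant is what produces the one-pair-per-SKU shape seen in Out[13].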

  • Original article: https://www.cnblogs.com/gaojian/p/7603900.html