  • [Spark][Python]Mapping Single Rows to Multiple Pairs

    Mapping Single Rows to Multiple Pairs
    Goal:

    Take data like the following,

    Input Data

    00001 sku010:sku933:sku022
    00002 sku912:sku331
    00003 sku888:sku022:sku010:sku594
    00004 sku411


    and convert it into this form,
    where each key is paired with each of its values, one pair per line:

    (00001,sku010)
    (00001,sku933)
    (00001,sku022)

    ...
    (00002,sku912)
    (00002,sku331)
    (00003,sku888)

    This is what is meant by Mapping Single Rows to Multiple Pairs.
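
To make the target shape concrete, the same transformation can be sketched in plain Python, without Spark. The helper name `row_to_pairs` is hypothetical and not part of the original post, and the sketch assumes a single space separates the key from the colon-separated values.

```python
# Plain-Python sketch (no Spark needed) of mapping one row to many pairs.
# row_to_pairs is a hypothetical helper, not a Spark API.

def row_to_pairs(line):
    """Turn '00001 sku010:sku933' into [('00001', 'sku010'), ('00001', 'sku933')]."""
    # Assumption: the key and the SKU list are separated by a single space.
    key, skus = line.split(" ", 1)
    return [(key, sku) for sku in skus.split(":")]


rows = [
    "00001 sku010:sku933:sku022",
    "00002 sku912:sku331",
    "00003 sku888:sku022:sku010:sku594",
    "00004 sku411",
]

# Flatten the per-row lists into one list of (key, sku) pairs.
pairs = [pair for line in rows for pair in row_to_pairs(line)]

for pair in pairs:
    print(pair)
```

Each input row contributes as many pairs as it has SKUs, so the four sample rows yield ten pairs in total.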

    The steps are as follows:

    [training@localhost ~]$ vim act001.txt
    [training@localhost ~]$
    [training@localhost ~]$ cat act001.txt
    00001 ku010:sku933:sku022
    00002 sku912:sku331
    00003 sku888:sku022:sku010:sku594
    00004 sku411
    [training@localhost ~]$ hdfs dfs -put act001.txt
    [training@localhost ~]$
    [training@localhost ~]$ hdfs dfs -cat act001.txt
    00001 ku010:sku933:sku022
    00002 sku912:sku331
    00003 sku888:sku022:sku010:sku594
    00004 sku411
    [training@localhost ~]$
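
The IPython session starts at `In [6]`, so the step that creates `mydata` is not shown. One plausible reconstruction, stated here as an assumption, is that it was loaded with `sc.textFile("act001.txt")`, where `sc` is the SparkContext that the pyspark shell provides. The sketch below wraps the whole pipeline in a hypothetical function `load_pairs`; the helper names `to_key_value` and `split_skus` are also invented for illustration.

```python
# Sketch of the full pipeline, assuming `sc` is the SparkContext from the
# pyspark shell. load_pairs, to_key_value and split_skus are hypothetical
# names, not part of the original session.

def to_key_value(line):
    # "00001 sku010:sku933" -> ("00001", "sku010:sku933")
    key, skus = line.split(" ", 1)
    return (key, skus)


def split_skus(skus):
    # "sku010:sku933" -> ["sku010", "sku933"]
    return skus.split(":")


def load_pairs(sc, path="act001.txt"):
    # Assumption: the file sits in the user's HDFS home directory,
    # as it would after `hdfs dfs -put act001.txt`.
    mydata = sc.textFile(path)
    # flatMapValues keeps each key and emits one pair per returned value.
    return mydata.map(to_key_value).flatMapValues(split_skus)
```

In the pyspark shell, `load_pairs(sc).take(3)` would return the first three pairs. Splitting with `split(" ", 1)` combines the two `map` steps of the session into one.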

    In [6]: mydata01=mydata.map(lambda line: line.split(" "))

    In [7]: type(mydata01)
    Out[7]: pyspark.rdd.PipelinedRDD

    In [8]: mydata02=mydata01.map(lambda fields: (fields[0],fields[1]))

    In [9]: type(mydata02)
    Out[9]: pyspark.rdd.PipelinedRDD

    In [10]:

    In [11]: mydata03 = mydata02.flatMapValues(lambda skus: skus.split(":"))

    In [12]: type(mydata03)
    Out[12]: pyspark.rdd.PipelinedRDD

    In [13]: mydata03.take(1)
    Out[13]: [(u'00001', u'ku010')]
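
To see why `flatMapValues` is used here rather than `mapValues`, the two can be emulated in plain Python on a list of pairs. The functions `map_values` and `flat_map_values` are stand-ins written for this illustration; they mimic the behaviour of the RDD methods but are not Spark code.

```python
# Plain-Python emulation of the two RDD methods, for illustration only.

def map_values(pairs, f):
    # One output pair per input pair; the value becomes f(value).
    return [(k, f(v)) for k, v in pairs]


def flat_map_values(pairs, f):
    # One output pair per element of f(value); the key is repeated.
    return [(k, x) for k, v in pairs for x in f(v)]


def split_skus(skus):
    return skus.split(":")


data = [("00002", "sku912:sku331"), ("00004", "sku411")]

nested = map_values(data, split_skus)     # values stay grouped in a list
flat = flat_map_values(data, split_skus)  # one (key, sku) pair per SKU

print(nested)
print(flat)
```

With `map_values` each key still carries a list of SKUs, whereas `flat_map_values` produces the one-pair-per-SKU layout this post is after.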

  • Original article: https://www.cnblogs.com/gaojian/p/7603900.html