  • [Spark][Python] groupByKey example

     Spark Python index page

    A continuation of the [Spark][Python] sortByKey example:

    In [29]: mydata003.collect()

    Out[29]:
    [[u'00001', u'sku933'],
    [u'00001', u'sku022'],
    [u'00001', u'sku912'],
    [u'00001', u'sku331'],
    [u'00002', u'sku010'],
    [u'00003', u'sku888'],
    [u'00004', u'sku411']]

    In [30]: mydata005=mydata003.groupByKey()

    In [32]: mydata005.count()
    Out[32]: 4

    In [33]: mydata005.collect()
    Out[33]:
    [(u'00004', <pyspark.resultiterable.ResultIterable at 0x7fcebe436b10>),
    (u'00001', <pyspark.resultiterable.ResultIterable at 0x7fcebe436850>),
    (u'00003', <pyspark.resultiterable.ResultIterable at 0x7fcebe436050>),
    (u'00002', <pyspark.resultiterable.ResultIterable at 0x7fcebe4361d0>)]
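The `ResultIterable` objects in Out[33] wrap the values grouped under each key, so `collect()` alone does not show them. In the session, one way to see them would probably be `mydata005.mapValues(list).collect()`. The sketch below reproduces the same grouping in plain Python so that it runs without Spark; `group_by_key` is a hypothetical helper written for illustration, not part of the PySpark API, and the rows are copied from Out[29].

```python
from collections import defaultdict

# Rows copied from Out[29]; each row is a [key, value] pair.
rows = [
    ["00001", "sku933"],
    ["00001", "sku022"],
    ["00001", "sku912"],
    ["00001", "sku331"],
    ["00002", "sku010"],
    ["00003", "sku888"],
    ["00004", "sku411"],
]


def group_by_key(pairs):
    """Hypothetical stand-in for groupByKey().mapValues(list):
    gather every value that shares a key into one list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


grouped = group_by_key(rows)
print(len(grouped))       # 4 distinct keys, matching count() in Out[32]
print(grouped["00001"])   # the four SKUs recorded under key 00001
```

Unlike this local sketch, Spark does not guarantee the order of keys or of the values inside each group, which is why Out[33] lists the keys in a different order from the input.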


    So, for key-value pairs such as these:

    (00004,sku411)
    (00003,sku888)
    (00003,sku022)
    (00003,sku010)
    (00003,sku594)
    (00002,sku912)

    in theory they end up in a form like this:

    (00002,[sku912,sku331])
    (00001,[sku022,sku010,sku933])
    (00003,[sku888,sku022,sku010,sku594])
    (00004,[sku411])

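    How can we print them all in the format below? I think it takes a function that treats the value of each RDD row as a list and then iterates over it.
    (To be written next time.)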

    00002
    sku912
    sku331

    00001
    sku022
    sku010
    sku933

    00003
    sku888
    sku022
    sku010
    sku594

    00004
    sku411
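The post leaves the printing step for later. As a minimal sketch of the idea described above (treat each row's value as a list and iterate over it), the hypothetical helper `format_groups` below takes `(key, values)` pairs shaped like the result of `collect()` on a grouped RDD. The `sample` list is a stand-in built from the grouped form shown earlier, not real Spark output.

```python
def format_groups(grouped_pairs):
    """Render (key, values) pairs as a key line followed by one value per line,
    with a blank line between groups."""
    blocks = []
    for key, values in grouped_pairs:
        lines = [key]
        lines.extend(values)  # values may be any iterable, not only a list
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Stand-in for mydata005.collect(), using the grouped form shown earlier.
sample = [
    ("00002", ["sku912", "sku331"]),
    ("00001", ["sku022", "sku010", "sku933"]),
    ("00003", ["sku888", "sku022", "sku010", "sku594"]),
    ("00004", ["sku411"]),
]

print(format_groups(sample))
```

In the session itself, `print(format_groups(mydata005.collect()))` should behave the same way, assuming each `ResultIterable` can be iterated like a list. Collecting to the driver is only reasonable for small RDDs like this one.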

  • Original article: https://www.cnblogs.com/gaojian/p/7612896.html