  • groupByKey, reduceByKey, and sortByKey in Spark

    groupByKey collects all values that share the same key into a single sequence:

    [("hello",1), ("world",1), ("hello",1), ("fly",1), ("hello",1), ("world",1)] --> [("hello",(1,1,1)),("world",(1,1)),("fly",(1))]
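    To make the grouping semantics concrete, here is a minimal plain-Python sketch. It is not Spark code: `group_by_key` is a hypothetical helper that only mimics what `groupByKey` returns on a single machine, and the order of keys in a real RDD result is not guaranteed.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Plain-Python stand-in for RDD.groupByKey (hypothetical helper, not a Spark API)."""
    groups = defaultdict(list)
    for key, value in pairs:
        # Every value is kept; nothing is aggregated at this stage.
        groups[key].append(value)
    return dict(groups)

pairs = [("hello", 1), ("world", 1), ("hello", 1),
         ("fly", 1), ("hello", 1), ("world", 1)]
grouped = group_by_key(pairs)
print(grouped)
```

    Note that the grouped result still holds every original value, which is why it can become large for keys with many records.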

    reduceByKey merges the values that share the same key by applying the given function to them:

    [("hello",1), ("world",1), ("hello",1), ("fly",1), ("hello",1), ("world",1)]  add--> [("hello",3),("world",2),("fly",1)]
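    The reduction can be sketched in plain Python as follows. `reduce_by_key` is a hypothetical helper, not a Spark API; it assumes, as `reduceByKey` does, that the function passed in is associative and commutative, so the order in which values are combined does not matter.

```python
from operator import add

def reduce_by_key(pairs, func):
    """Plain-Python stand-in for RDD.reduceByKey (hypothetical helper, not a Spark API)."""
    totals = {}
    for key, value in pairs:
        # Fold each new value into the running result for its key.
        totals[key] = func(totals[key], value) if key in totals else value
    return totals

pairs = [("hello", 1), ("world", 1), ("hello", 1),
         ("fly", 1), ("hello", 1), ("world", 1)]
reduced = reduce_by_key(pairs, add)
print(reduced)
```

    Unlike grouping, only one running value per key is held at any time.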

    sortByKey sorts the pairs by key, in ascending order by default:

    [(3,"hello"),(2,"world"),(1,"fly")]  -->   [(1,"fly"),(2,"world"),(3,"hello")]
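    A plain-Python sketch of the ordering is shown below. `sort_by_key` is a hypothetical helper; its `ascending` flag mirrors the `ascending` parameter of PySpark's `sortByKey`, which can be set to `False` for descending order.

```python
def sort_by_key(pairs, ascending=True):
    """Plain-Python stand-in for RDD.sortByKey (hypothetical helper, not a Spark API)."""
    # Only the key (first element) takes part in the comparison.
    return sorted(pairs, key=lambda kv: kv[0], reverse=not ascending)

pairs = [(3, "hello"), (2, "world"), (1, "fly")]
asc_result = sort_by_key(pairs)
desc_result = sort_by_key(pairs, ascending=False)
print(asc_result)
print(desc_result)
```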

    Comparing groupByKey, reduceByKey, and sortByKey:

    from pyspark import SparkConf, SparkContext
    from operator import add
    
    conf = SparkConf()
    sc = SparkContext(conf=conf)
    
    
    def func_by_key():
        data = [
            "hello world", "hello fly", "hello world",
            "hello fly", "hello fly", "hello fly"
        ]
        data_rdd = sc.parallelize(data)
        # Split each line into words and pair every word with a count of 1.
        word_rdd = data_rdd.flatMap(lambda s: s.split(" ")).map(lambda x: (x, 1))
    
        # groupByKey returns an iterable per key; convert it to a list to print it.
        group_by_key_rdd = word_rdd.groupByKey()
        print("groupByKey:{}".format(group_by_key_rdd.mapValues(list).collect()))
        print("groupByKey mapValues(len):{}".format(
            group_by_key_rdd.mapValues(len).collect()
        ))
    
        # reduceByKey sums the counts for each word directly.
        reduce_by_key_rdd = word_rdd.reduceByKey(add)
        print("reduceByKey:{}".format(reduce_by_key_rdd.collect()))
    
        # Swap (word, count) to (count, word) so that sortByKey orders by count.
        print("sortByKey:{}".format(reduce_by_key_rdd.map(
            lambda x: (x[1], x[0])
        ).sortByKey().collect()))
    
    func_by_key()
    sc.stop()

    """

    result:

    groupByKey:[('fly', [1, 1, 1, 1]), ('world', [1, 1]), ('hello', [1, 1, 1, 1, 1, 1])]
    groupByKey mapValues(len):[('fly', 4), ('world', 2), ('hello', 6)]
    reduceByKey:[('fly', 4), ('world', 2), ('hello', 6)]
    sortByKey:[(2, 'world'), (4, 'fly'), (6, 'hello')]

    """

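    The results show that applying mapValues(len) to the output of groupByKey gives the same counts as reduceByKey (every value here is 1, so the length of each group equals its sum). In other words, if the values for each key are only going to be aggregated after grouping, use reduceByKey directly: it combines values within each partition before the shuffle, whereas groupByKey sends every value across the network. sortByKey sorts by key; to sort by value, swap the key and value in each pair and then sort.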
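    The difference in shuffle volume can be illustrated with a small plain-Python simulation. This is a sketch under stated assumptions, not Spark code: the two-partition layout and the helpers `combine_locally` and `merge` are hypothetical, and real partitioning depends on the cluster configuration. The last step applies the swap-then-sort approach to order the counts by value.

```python
from collections import defaultdict
from operator import add

# Hypothetical layout: the six input lines split across two partitions.
lines = [
    ["hello world", "hello fly", "hello world"],
    ["hello fly", "hello fly", "hello fly"],
]
partitions = [[(w, 1) for s in part for w in s.split(" ")] for part in lines]

def merge(pairs, func):
    """Fold values that share a key into one result per key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

def combine_locally(partition, func):
    """Pre-aggregate one partition, as reduceByKey does before the shuffle."""
    return list(merge(partition, func).items())

# groupByKey-style: every record crosses the shuffle unchanged.
shuffled_group = [pair for part in partitions for pair in part]
grouped = defaultdict(list)
for key, value in shuffled_group:
    grouped[key].append(value)
counts_via_group = {key: len(values) for key, values in grouped.items()}

# reduceByKey-style: at most one record per key per partition is shuffled.
shuffled_reduce = [pair for part in partitions
                   for pair in combine_locally(part, add)]
counts_via_reduce = merge(shuffled_reduce, add)

print(len(shuffled_group), len(shuffled_reduce))

# Sort by value: swap to (count, word), then sort on the new key.
swapped = [(count, word) for word, count in counts_via_reduce.items()]
sorted_by_count = sorted(swapped, key=lambda kv: kv[0])
print(sorted_by_count)
```

    Both paths end with the same counts, but the grouped path moves twelve records in this layout while the pre-combined path moves five. In PySpark itself, `sortBy` with a key function such as `lambda kv: kv[1]` is an alternative to swapping the pair.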

  • Original post: https://www.cnblogs.com/FG123/p/9746830.html