  • Basic RDD commands

    1 Creating an RDD

    intRDD=sc.parallelize([3,1,2,5,6])
    intRDD.collect()
    [3, 1, 2, 5, 6]

    2 Single-RDD transformations

    (1) map

    def addone(x):
        return (x+1)
    intRDD.map(addone).collect()
    [4, 2, 3, 6, 7]

    intRDD.map(lambda x: x+1).collect()
    [4, 2, 3, 6, 7]

    stringRDD=sc.parallelize(['Apple','Orange','Banana','Grape','Apple'])
    stringRDD.map(lambda x:'fruit:'+x).collect()
    ['fruit:Apple', 'fruit:Orange', 'fruit:Banana', 'fruit:Grape', 'fruit:Apple']

    (2) filter

    intRDD.filter(lambda x: x<3).collect()
    [1, 2]
    intRDD.filter(lambda x:1<x and x<5).collect()
    [3, 2]
    stringRDD.filter(lambda x: "ra" in x).collect()
    ['Orange', 'Grape']

    (3) distinct

    intRDD.distinct().collect()
    [1, 5, 2, 6, 3]
    stringRDD.distinct().collect()
    ['Orange', 'Apple', 'Banana', 'Grape']

    (4) randomSplit

    sRDD=intRDD.randomSplit([0.4,0.6])
    sRDD[0].collect()
    [1, 2]
    sRDD[1].collect()
    [3, 5, 6]

    (5) groupBy

    gRDD=intRDD.groupBy(lambda x:'even' if (x%2==0) else 'odd').collect()
    print(gRDD[0][0])
    print(sorted(gRDD[0][1]))
    print(gRDD[1][0])
    print(sorted(gRDD[1][1]))

    even
    [2, 6]
    odd
    [1, 3, 5]
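
The position of each group in the collected list is not guaranteed, so it is usually safer to look groups up by key than by index. The sketch below is a plain-Python analogue of that idea, not Spark code: `group_by` is a hypothetical helper, and `data` simply mirrors the values in intRDD.

```python
from collections import defaultdict

def group_by(items, key_func):
    """Group items into {key: [values]}, keeping input order within each group."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    return dict(groups)

data = [3, 1, 2, 5, 6]  # same values as intRDD (assumption: a local list stands in for the RDD)
groups = group_by(data, lambda x: 'even' if x % 2 == 0 else 'odd')

print(groups['even'])  # [2, 6]
print(groups['odd'])   # [3, 1, 5]
```

With a real RDD, the same lookup-by-key pattern should be available through something like dict(intRDD.groupBy(f).mapValues(list).collect()).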

    3 Transformations on multiple RDDs

    intRDD1=sc.parallelize([3,1,2,5,5])
    intRDD2=sc.parallelize([5,6])
    intRDD3=sc.parallelize([2,7])

    Union: union

    intRDD1.union(intRDD2).union(intRDD3).collect()

    [3, 1, 2, 5, 5, 5, 6, 2, 7]

    Intersection: intersection

    intRDD1.intersection(intRDD2).collect()

    [5]

    Difference: subtract

    intRDD1.subtract(intRDD2).collect()

    [1, 2, 3]

    Cartesian product: cartesian

    intRDD1.cartesian(intRDD2).collect()

    [(3, 5),
     (3, 6),
     (1, 5),
     (1, 6),
     (2, 5),
     (2, 6),
     (5, 5),
     (5, 5),
     (5, 6),
     (5, 6)]
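
To make the duplicate handling of these four operations explicit, here is a plain-Python sketch on local lists with the same values. It is only an analogue under the assumption that lists stand in for RDDs: union keeps duplicates, intersection returns each common value once, subtract keeps left-side elements that are absent from the right side, and cartesian pairs every left element with every right element. Element order in real Spark results may differ.

```python
from itertools import product

a = [3, 1, 2, 5, 5]  # intRDD1
b = [5, 6]           # intRDD2
c = [2, 7]           # intRDD3

b_set = set(b)

union = a + b + c                            # like union: duplicates are kept
intersection = sorted(set(a) & b_set)        # like intersection: no duplicates in the result
subtract = [x for x in a if x not in b_set]  # like subtract: elements of a not found in b
cartesian = list(product(a, b))              # like cartesian: every (x, y) pair

print(union)           # [3, 1, 2, 5, 5, 5, 6, 2, 7]
print(intersection)    # [5]
print(subtract)        # [3, 1, 2]
print(len(cartesian))  # 10
```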

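    Actions

    first() returns the first element
    take(2) returns the first two elements
    takeOrdered(3) sorts in ascending order and returns the first three elements
    takeOrdered(3,key=lambda x:-x) sorts in descending order and returns the first three elements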
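
The actions listed above have no sample output, so the following plain-Python sketch shows what each would be expected to return for the values in intRDD. It uses a local list and `heapq.nsmallest` as stand-ins (an assumption for illustration; no Spark is involved).

```python
import heapq

data = [3, 1, 2, 5, 6]  # same values as intRDD

first = data[0]                                        # like first()
take2 = data[:2]                                       # like take(2)
smallest3 = heapq.nsmallest(3, data)                   # like takeOrdered(3)
largest3 = heapq.nsmallest(3, data, key=lambda x: -x)  # like takeOrdered(3, key=lambda x: -x)

print(first)      # 3
print(take2)      # [3, 1]
print(smallest3)  # [1, 2, 3]
print(largest3)   # [6, 5, 3]
```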

    Statistics

    stats()
    min()
    max()
    stdev()
    count()
    sum()
    mean()
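
As a rough check on what these statistics would yield for intRDD, here is a plain-Python sketch using the standard `statistics` module on a local list. One point worth noting: Spark's stdev() is the population standard deviation (sampleStdev() is the sample version), so the sketch uses `statistics.pstdev`. The `summary` dict is a hypothetical container for illustration, not the StatCounter object that stats() returns.

```python
import statistics

data = [3, 1, 2, 5, 6]  # same values as intRDD

summary = {
    'count': len(data),                 # like count()
    'sum': sum(data),                   # like sum()
    'mean': statistics.mean(data),      # like mean()
    'min': min(data),                   # like min()
    'max': max(data),                   # like max()
    'stdev': statistics.pstdev(data),   # like stdev(): population standard deviation
}

for name, value in summary.items():
    print(name, round(value, 4))
```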

    RDD key-value transformation

    kvRDD1=sc.parallelize([(3,4),(3,6),(5,6),(1,2)])
    kvRDD2=sc.parallelize([(3,8)])

    kvRDD1.collect()
    [(3, 4), (3, 6), (5, 6), (1, 2)]
    kvRDD2.collect()
    [(3, 8)]

    join

    kvRDD1.join(kvRDD2).collect()
    [(3, (4, 8)), (3, (6, 8))]

    leftOuterJoin

    kvRDD1.leftOuterJoin(kvRDD2).collect()

    [(1, (2, None)), (3, (4, 8)), (3, (6, 8)), (5, (6, None))]

    rightOuterJoin

    kvRDD1.rightOuterJoin(kvRDD2).collect()

    [(3, (4, 8)), (3, (6, 8))]

    subtractByKey

    kvRDD1.subtractByKey(kvRDD2).collect()

    [(1, 2), (5, 6)]
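
The four key-based transformations above differ only in which keys survive and how missing matches are filled. A plain-Python sketch of those semantics on local lists of pairs follows. The helper functions are hypothetical, written for illustration, and result order in Spark may differ.

```python
left = [(3, 4), (3, 6), (5, 6), (1, 2)]  # kvRDD1
right = [(3, 8)]                         # kvRDD2

def inner_join(left, right):
    """Pair up values for every key present on both sides."""
    return [(k, (v, w)) for k, v in left for rk, w in right if k == rk]

def left_outer_join(left, right):
    """Keep every left pair; unmatched keys get None on the right."""
    right_keys = {k for k, _ in right}
    unmatched = [(k, (v, None)) for k, v in left if k not in right_keys]
    return inner_join(left, right) + unmatched

def right_outer_join(left, right):
    """Keep every right pair; unmatched keys get None on the left."""
    left_keys = {k for k, _ in left}
    unmatched = [(k, (None, w)) for k, w in right if k not in left_keys]
    return inner_join(left, right) + unmatched

def subtract_by_key(left, right):
    """Keep left pairs whose key does not appear on the right."""
    right_keys = {k for k, _ in right}
    return [(k, v) for k, v in left if k not in right_keys]

print(inner_join(left, right))        # [(3, (4, 8)), (3, (6, 8))]
print(right_outer_join(left, right))  # [(3, (4, 8)), (3, (6, 8))]
print(subtract_by_key(left, right))   # [(5, 6), (1, 2)]
```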

    RDD key-value Action

    key-value first

    kvFirst=kvRDD1.first()
    print(kvFirst[0])
    print(kvFirst[1])

    3
    4

    key count

    kvRDD1.countByKey()

    defaultdict(int, {1: 1, 3: 2, 5: 1})

    create key-value map: collectAsMap

    KV=kvRDD1.collectAsMap()
    KV

    {1: 2, 3: 6, 5: 6}

    print(type(KV))
    print(KV[3])
    <class 'dict'>
    6

    look up the values for a key

    kvRDD1.lookup(3)

    [4, 6]
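
The outputs above show that collectAsMap keeps only one value per key (key 3 maps to 6, not 4), while lookup returns every value for the key. A plain-Python sketch of the three key-value actions on a local list of pairs makes that difference visible. This is an analogue under the assumption that the pairs arrive in the order shown by kvRDD1.collect().

```python
from collections import Counter

pairs = [(3, 4), (3, 6), (5, 6), (1, 2)]  # same pairs as kvRDD1

key_counts = Counter(k for k, _ in pairs)   # like countByKey()
as_map = dict(pairs)                        # like collectAsMap(): a later value replaces an earlier one
lookup_3 = [v for k, v in pairs if k == 3]  # like lookup(3): all values for the key

print(dict(key_counts))  # {3: 2, 5: 1, 1: 1}
print(as_map)            # {3: 6, 5: 6, 1: 2}
print(lookup_3)          # [4, 6]
```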
  • Original source: https://www.cnblogs.com/xzjf/p/9593387.html