  • [spark][python] Spark map processing

    map applies a function to every element of an RDD and returns a new RDD containing the results. Note that map is a lazy transformation: Spark only records it, and the job actually runs when an action such as take() is called (which is why the log output below appears at take(), not at map()).
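    As a plain-Python analogy (no Spark cluster needed), mapping over a list works the same way element-wise; the RDD version in the session below is t_names.map(...):

    ```python
    # Plain-Python analogy for RDD.map: apply a function to every element.
    # The PySpark equivalent would be sc.textFile("names.txt").map(lambda line: line.split(","))
    lines = ["2012,DOMINIC,CAYUGA,M,6", "2012,ADDISON,ONONDAGA,F,14"]

    # Split each CSV line into a list of fields, one output element per input element.
    rows = list(map(lambda line: line.split(","), lines))
    print(rows[0])  # ['2012', 'DOMINIC', 'CAYUGA', 'M', '6']
    ```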

    [training@localhost ~]$ cat names.txt
    Year,First Name,County,Sex,Count
    2012,DOMINIC,CAYUGA,M,6
    2012,ADDISON,ONONDAGA,F,14
    2012,ADDISON,ONONDAGA,F,14
    2012,JULIA,ONONDAGA,F,15
    [training@localhost ~]$ hdfs dfs -put names.txt
    [training@localhost ~]$ hdfs dfs -cat names.txt
    Year,First Name,County,Sex,Count
    2012,DOMINIC,CAYUGA,M,6
    2012,ADDISON,ONONDAGA,F,14
    2012,ADDISON,ONONDAGA,F,14
    2012,JULIA,ONONDAGA,F,15
    [training@localhost ~]$


    In [98]: t_names = sc.textFile("names.txt")
    17/09/24 06:24:22 INFO storage.MemoryStore: Block broadcast_27 stored as values in memory (estimated size 230.5 KB, free 2.3 MB)
    17/09/24 06:24:23 INFO storage.MemoryStore: Block broadcast_27_piece0 stored as bytes in memory (estimated size 21.5 KB, free 2.3 MB)
    17/09/24 06:24:23 INFO storage.BlockManagerInfo: Added broadcast_27_piece0 in memory on localhost:33950 (size: 21.5 KB, free: 208.6 MB)
    17/09/24 06:24:23 INFO spark.SparkContext: Created broadcast 27 from textFile at NativeMethodAccessorImpl.java:-2

    In [99]: rows=t_names.map(lambda line: line.split(","))

    In [100]: rows.take(1)


    17/09/24 06:25:23 INFO mapred.FileInputFormat: Total input paths to process : 1
    17/09/24 06:25:23 INFO spark.SparkContext: Starting job: runJob at PythonRDD.scala:393
    17/09/24 06:25:23 INFO scheduler.DAGScheduler: Got job 15 (runJob at PythonRDD.scala:393) with 1 output partitions
    ... (repetitive DAGScheduler / storage / ContextCleaner INFO lines trimmed) ...
    17/09/24 06:25:24 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/names.txt:0+136
    17/09/24 06:25:24 INFO scheduler.DAGScheduler: ResultStage 15 (runJob at PythonRDD.scala:393) finished in 0.438 s
    17/09/24 06:25:24 INFO scheduler.DAGScheduler: Job 15 finished: runJob at PythonRDD.scala:393, took 1.160085 s
    Out[100]: [[u'Year', u'First Name', u'County', u'Sex', u'Count']]

    In [101]: rows.take(2)
    17/09/24 06:25:29 INFO spark.SparkContext: Starting job: runJob at PythonRDD.scala:393
    ... (repetitive scheduler / storage INFO lines trimmed) ...
    17/09/24 06:25:30 INFO scheduler.DAGScheduler: Job 16 finished: runJob at PythonRDD.scala:393, took 0.408908 s
    Out[101]:
    [[u'Year', u'First Name', u'County', u'Sex', u'Count'],
     [u'2012', u'DOMINIC', u'CAYUGA', u'M', u'6']]
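
    Notice that the CSV header row ("Year", "First Name", ...) comes back as the first element, since map transforms every line including the header. A minimal sketch of dropping it, using plain Python lists in place of the RDD (in PySpark a hypothetical follow-up could be rows.filter(lambda r: r[0] != 'Year')):

    ```python
    # Sketch: split the lines as before, then keep only data rows.
    # Plain Python stands in for the RDD here; the filter predicate is the
    # same one a PySpark rows.filter(...) call would use.
    lines = [
        "Year,First Name,County,Sex,Count",
        "2012,DOMINIC,CAYUGA,M,6",
        "2012,ADDISON,ONONDAGA,F,14",
    ]
    rows = [line.split(",") for line in lines]
    data_rows = [r for r in rows if r[0] != "Year"]  # drop the header row
    print(data_rows[0])  # ['2012', 'DOMINIC', 'CAYUGA', 'M', '6']
    ```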

    In [102]:

    Source:

    https://www.supergloo.com/fieldnotes/apache-spark-transformations-python-examples/

  • Original post: https://www.cnblogs.com/gaojian/p/7588619.html