zoukankan      html  css  js  c++  java
  • [Spark][Python]对HDFS 上的文件,采用绝对路径,来读取获得 RDD

    对HDFS 上的文件,采用绝对路径,来读取获得 RDD:

    In [102]: mydata=sc.textFile("file:/home/training/test.txt")
    17/09/24 06:31:04 INFO storage.MemoryStore: Block broadcast_30 stored as values in memory (estimated size 230.5 KB, free 2.4 MB)
    17/09/24 06:31:04 INFO storage.MemoryStore: Block broadcast_30_piece0 stored as bytes in memory (estimated size 21.5 KB, free 2.5 MB)
    17/09/24 06:31:04 INFO storage.BlockManagerInfo: Added broadcast_30_piece0 in memory on localhost:33950 (size: 21.5 KB, free: 208.6 MB)
    17/09/24 06:31:04 INFO spark.SparkContext: Created broadcast 30 from textFile at NativeMethodAccessorImpl.java:-2

    In [103]: mydata.take(1)
    17/09/24 06:31:09 INFO mapred.FileInputFormat: Total input paths to process : 1
    17/09/24 06:31:09 INFO spark.SparkContext: Starting job: runJob at PythonRDD.scala:393
    17/09/24 06:31:09 INFO scheduler.DAGScheduler: Got job 17 (runJob at PythonRDD.scala:393) with 1 output partitions
    17/09/24 06:31:09 INFO scheduler.DAGScheduler: Final stage: ResultStage 17 (runJob at PythonRDD.scala:393)
    17/09/24 06:31:09 INFO scheduler.DAGScheduler: Parents of final stage: List()
    17/09/24 06:31:09 INFO scheduler.DAGScheduler: Missing parents: List()
    17/09/24 06:31:09 INFO scheduler.DAGScheduler: Submitting ResultStage 17 (PythonRDD[50] at RDD at PythonRDD.scala:43), which has no missing parents
    17/09/24 06:31:09 INFO storage.MemoryStore: Block broadcast_31 stored as values in memory (estimated size 4.8 KB, free 2.5 MB)
    17/09/24 06:31:09 INFO storage.MemoryStore: Block broadcast_31_piece0 stored as bytes in memory (estimated size 3.0 KB, free 2.5 MB)
    17/09/24 06:31:09 INFO storage.BlockManagerInfo: Added broadcast_31_piece0 in memory on localhost:33950 (size: 3.0 KB, free: 208.6 MB)
    17/09/24 06:31:09 INFO spark.SparkContext: Created broadcast 31 from broadcast at DAGScheduler.scala:1006
    17/09/24 06:31:09 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 17 (PythonRDD[50] at RDD at PythonRDD.scala:43)
    17/09/24 06:31:09 INFO scheduler.TaskSchedulerImpl: Adding task set 17.0 with 1 tasks
    17/09/24 06:31:09 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 17.0 (TID 17, localhost, partition 0,PROCESS_LOCAL, 2130 bytes)
    17/09/24 06:31:09 INFO executor.Executor: Running task 0.0 in stage 17.0 (TID 17)
    17/09/24 06:31:09 INFO rdd.HadoopRDD: Input split: file:/home/training/test.txt:0+34
    17/09/24 06:31:10 INFO python.PythonRunner: Times: total = 28, boot = 11, init = 16, finish = 1
    17/09/24 06:31:10 INFO executor.Executor: Finished task 0.0 in stage 17.0 (TID 17). 2158 bytes result sent to driver
    17/09/24 06:31:10 INFO scheduler.DAGScheduler: ResultStage 17 (runJob at PythonRDD.scala:393) finished in 0.344 s
    17/09/24 06:31:10 INFO scheduler.DAGScheduler: Job 17 finished: runJob at PythonRDD.scala:393, took 0.750241 s
    17/09/24 06:31:10 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 17.0 (TID 17) in 348 ms on localhost (1/1)
    17/09/24 06:31:10 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 17.0, whose tasks have all completed, from pool
    Out[103]: [u'This is a test 1']

    In [104]:

  • 相关阅读:
    Java Map遍历方式的选择
    UriMatcher类的addURI()方法
    Java IO流分析整理[转]
    java基础一些注意细节
    java中static变量和方存在内存什么区域
    详细解析Java中抽象类和接口的区别
    mybatis一些记录
    Go语言简介(上)— 语法
    JavaScript相关-深入面向对象
    33个组件5
  • 原文地址:https://www.cnblogs.com/gaojian/p/7588750.html
Copyright © 2011-2022 走看看