  • [Spark][Python] Reading a file by absolute path to obtain an RDD

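    Read a file by its absolute path to obtain an RDD. Note that the file: scheme used in the URI below refers to the local filesystem; a file stored in HDFS would be addressed with an hdfs:// URI instead: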

    In [102]: mydata=sc.textFile("file:/home/training/test.txt")
    17/09/24 06:31:04 INFO storage.MemoryStore: Block broadcast_30 stored as values in memory (estimated size 230.5 KB, free 2.4 MB)
    17/09/24 06:31:04 INFO storage.MemoryStore: Block broadcast_30_piece0 stored as bytes in memory (estimated size 21.5 KB, free 2.5 MB)
    17/09/24 06:31:04 INFO storage.BlockManagerInfo: Added broadcast_30_piece0 in memory on localhost:33950 (size: 21.5 KB, free: 208.6 MB)
    17/09/24 06:31:04 INFO spark.SparkContext: Created broadcast 30 from textFile at NativeMethodAccessorImpl.java:-2

    In [103]: mydata.take(1)
    17/09/24 06:31:09 INFO mapred.FileInputFormat: Total input paths to process : 1
    17/09/24 06:31:09 INFO spark.SparkContext: Starting job: runJob at PythonRDD.scala:393
    17/09/24 06:31:09 INFO scheduler.DAGScheduler: Got job 17 (runJob at PythonRDD.scala:393) with 1 output partitions
    17/09/24 06:31:09 INFO scheduler.DAGScheduler: Final stage: ResultStage 17 (runJob at PythonRDD.scala:393)
    17/09/24 06:31:09 INFO scheduler.DAGScheduler: Parents of final stage: List()
    17/09/24 06:31:09 INFO scheduler.DAGScheduler: Missing parents: List()
    17/09/24 06:31:09 INFO scheduler.DAGScheduler: Submitting ResultStage 17 (PythonRDD[50] at RDD at PythonRDD.scala:43), which has no missing parents
    17/09/24 06:31:09 INFO storage.MemoryStore: Block broadcast_31 stored as values in memory (estimated size 4.8 KB, free 2.5 MB)
    17/09/24 06:31:09 INFO storage.MemoryStore: Block broadcast_31_piece0 stored as bytes in memory (estimated size 3.0 KB, free 2.5 MB)
    17/09/24 06:31:09 INFO storage.BlockManagerInfo: Added broadcast_31_piece0 in memory on localhost:33950 (size: 3.0 KB, free: 208.6 MB)
    17/09/24 06:31:09 INFO spark.SparkContext: Created broadcast 31 from broadcast at DAGScheduler.scala:1006
    17/09/24 06:31:09 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 17 (PythonRDD[50] at RDD at PythonRDD.scala:43)
    17/09/24 06:31:09 INFO scheduler.TaskSchedulerImpl: Adding task set 17.0 with 1 tasks
    17/09/24 06:31:09 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 17.0 (TID 17, localhost, partition 0,PROCESS_LOCAL, 2130 bytes)
    17/09/24 06:31:09 INFO executor.Executor: Running task 0.0 in stage 17.0 (TID 17)
    17/09/24 06:31:09 INFO rdd.HadoopRDD: Input split: file:/home/training/test.txt:0+34
    17/09/24 06:31:10 INFO python.PythonRunner: Times: total = 28, boot = 11, init = 16, finish = 1
    17/09/24 06:31:10 INFO executor.Executor: Finished task 0.0 in stage 17.0 (TID 17). 2158 bytes result sent to driver
    17/09/24 06:31:10 INFO scheduler.DAGScheduler: ResultStage 17 (runJob at PythonRDD.scala:393) finished in 0.344 s
    17/09/24 06:31:10 INFO scheduler.DAGScheduler: Job 17 finished: runJob at PythonRDD.scala:393, took 0.750241 s
    17/09/24 06:31:10 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 17.0 (TID 17) in 348 ms on localhost (1/1)
    17/09/24 06:31:10 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 17.0, whose tasks have all completed, from pool
    Out[103]: [u'This is a test 1']

    In [104]:
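The session above passes a full URI to sc.textFile, and the scheme of that URI decides which filesystem is read. The following is a minimal sketch of that idea, not part of the original session: the helper names, the namenode address namenode:8020, and the HDFS path /user/training/test.txt are hypothetical placeholders. The three-slash form file:///... is the usual URI spelling for a local absolute path.

```python
def local_uri(path):
    """Build a file: URI for an absolute path on the local filesystem."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    return "file://" + path


def hdfs_uri(path, namenode="namenode:8020"):
    """Build an hdfs:// URI. The default namenode address is a
    hypothetical placeholder, not taken from the original post."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    return "hdfs://" + namenode + path


def first_line(sc, uri):
    """Return the first line of the text file at uri as a one-element list.

    sc is an existing SparkContext, such as the one the pyspark shell provides.
    """
    return sc.textFile(uri).take(1)


if __name__ == "__main__":
    print(local_uri("/home/training/test.txt"))
    print(hdfs_uri("/user/training/test.txt"))  # hypothetical HDFS path
```

In the pyspark shell, first_line(sc, local_uri("/home/training/test.txt")) would issue the same textFile and take(1) calls as the session above, while swapping in hdfs_uri(...) would point the read at HDFS, assuming the file exists there.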

  • Original article: https://www.cnblogs.com/gaojian/p/7588750.html