  • [Spark][Python] Example: loading a DataFrame from an Avro file with Spark

    Download the sample file from the following address:
    https://github.com/databricks/spark-avro/raw/master/src/test/resources/episodes.avro
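    Before uploading, it can help to confirm that the download really is an Avro file and not, say, an HTML error page. The sketch below is not part of the original post: it relies only on the fact that Avro object container files start with the four bytes `Obj` followed by 0x01. The function name `looks_like_avro` and the local path are assumptions made for illustration.

```python
import os

# Avro object container files begin with these four bytes: 'O', 'b', 'j', 0x01.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro(path):
    """Return True if the file at `path` starts with the Avro container magic."""
    with open(path, "rb") as fh:
        return fh.read(len(AVRO_MAGIC)) == AVRO_MAGIC


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the download was saved.
    local_path = "episodes.avro"
    if os.path.exists(local_path):
        print(local_path, "looks like Avro:", looks_like_avro(local_path))
    else:
        print(local_path, "not found; download it first")
```

    This only inspects the header; it does not validate the schema or the data blocks.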

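    Upload it to HDFS (no destination is given, so the file lands in the current user's HDFS home directory, /user/training in this session):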
    hdfs dfs -put episodes.avro

    Read it into a DataFrame:
    mydata001=sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")
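    The read call above can be wrapped in a small helper so that the format name is a parameter. This is a sketch, not the post's own code: `load_avro` and its argument names are hypothetical, and it assumes the spark-avro package is already on the classpath (for example via the `--packages` option when starting `pyspark`).

```python
def load_avro(reader_owner, path, fmt="com.databricks.spark.avro"):
    """Load an Avro file into a DataFrame.

    `reader_owner` is any object exposing `.read`: the SQLContext used in
    this post, or a SparkSession on Spark 2.x and later.
    `fmt` defaults to the data source name used in this post.
    """
    return reader_owner.read.format(fmt).load(path)
```

    In the shell this would be called as `mydata001 = load_avro(sqlContext, "episodes.avro")`. On Spark 2.4 and later, where the Avro data source ships as the Apache `spark-avro` module, passing `fmt="avro"` should select that implementation instead.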

    Output from the interactive session:

    In [7]: mydata001=sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")
    17/10/03 07:00:47 INFO avro.AvroRelation: Listing hdfs://localhost:8020/user/training/episodes.avro on driver

    In [8]: type(mydata001)
    Out[8]: pyspark.sql.dataframe.DataFrame

    In [9]: mydata001.count()
    17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 65.5 KB, free 65.5 KB)
    17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 21.4 KB, free 86.9 KB)
    17/10/03 07:01:05 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:40075 (size: 21.4 KB, free: 208.8 MB)
    17/10/03 07:01:05 INFO spark.SparkContext: Created broadcast 3 from count at NativeMethodAccessorImpl.java:-2
    17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 230.4 KB, free 317.3 KB)
    17/10/03 07:01:06 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 21.5 KB, free 338.8 KB)
    17/10/03 07:01:06 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:40075 (size: 21.5 KB, free: 208.8 MB)
    17/10/03 07:01:06 INFO spark.SparkContext: Created broadcast 4 from hadoopFile at AvroRelation.scala:121
    17/10/03 07:01:06 INFO mapred.FileInputFormat: Total input paths to process : 1
    17/10/03 07:01:07 INFO spark.SparkContext: Starting job: count at NativeMethodAccessorImpl.java:-2
    17/10/03 07:01:07 INFO scheduler.DAGScheduler: Registering RDD 16 (count at NativeMethodAccessorImpl.java:-2)
    17/10/03 07:01:07 INFO scheduler.DAGScheduler: Got job 1 (count at NativeMethodAccessorImpl.java:-2) with 1 output partitions
    17/10/03 07:01:07 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (count at NativeMethodAccessorImpl.java:-2)
    17/10/03 07:01:07 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
    17/10/03 07:01:07 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 2)
    17/10/03 07:01:07 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 2 (MapPartitionsRDD[16] at count at NativeMethodAccessorImpl.java:-2), which has no missing parents
    17/10/03 07:01:07 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 11.5 KB, free 350.3 KB)
    17/10/03 07:01:07 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 5.7 KB, free 356.0 KB)
    17/10/03 07:01:07 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:40075 (size: 5.7 KB, free: 208.8 MB)
    17/10/03 07:01:07 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
    17/10/03 07:01:07 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 2 (MapPartitionsRDD[16] at count at NativeMethodAccessorImpl.java:-2)
    17/10/03 07:01:07 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
    17/10/03 07:01:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2249 bytes)
    17/10/03 07:01:07 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
    17/10/03 07:01:07 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/episodes.avro:0+597
    17/10/03 07:01:08 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 2484 bytes result sent to driver
    17/10/03 07:01:08 INFO scheduler.DAGScheduler: ShuffleMapStage 2 (count at NativeMethodAccessorImpl.java:-2) finished in 0.691 s
    17/10/03 07:01:08 INFO scheduler.DAGScheduler: looking for newly runnable stages
    17/10/03 07:01:08 INFO scheduler.DAGScheduler: running: Set()
    17/10/03 07:01:08 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 3)
    17/10/03 07:01:08 INFO scheduler.DAGScheduler: failed: Set()
    17/10/03 07:01:08 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 693 ms on localhost (1/1)
    17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
    17/10/03 07:01:08 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[19] at count at NativeMethodAccessorImpl.java:-2), which has no missing parents
    17/10/03 07:01:08 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 12.6 KB, free 368.5 KB)
    17/10/03 07:01:08 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 6.1 KB, free 374.7 KB)
    17/10/03 07:01:08 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:40075 (size: 6.1 KB, free: 208.8 MB)
    17/10/03 07:01:08 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
    17/10/03 07:01:08 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[19] at count at NativeMethodAccessorImpl.java:-2)
    17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
    17/10/03 07:01:08 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, partition 0,NODE_LOCAL, 1999 bytes)
    17/10/03 07:01:08 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 3)
    17/10/03 07:01:08 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
    17/10/03 07:01:08 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
    17/10/03 07:01:08 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 1666 bytes result sent to driver
    17/10/03 07:01:08 INFO scheduler.DAGScheduler: ResultStage 3 (count at NativeMethodAccessorImpl.java:-2) finished in 0.344 s
    17/10/03 07:01:08 INFO scheduler.DAGScheduler: Job 1 finished: count at NativeMethodAccessorImpl.java:-2, took 1.480495 s
    17/10/03 07:01:08 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 345 ms on localhost (1/1)
    17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
    Out[9]: 8

    In [10]: mydata001.take(1)
    17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 230.1 KB, free 604.8 KB)
    17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 21.4 KB, free 626.2 KB)
    17/10/03 07:01:18 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:40075 (size: 21.4 KB, free: 208.7 MB)
    17/10/03 07:01:18 INFO spark.SparkContext: Created broadcast 7 from take at <ipython-input-10-35862abbc114>:1
    17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 230.5 KB, free 856.7 KB)
    17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 21.5 KB, free 878.2 KB)
    17/10/03 07:01:18 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:40075 (size: 21.5 KB, free: 208.7 MB)
    17/10/03 07:01:18 INFO spark.SparkContext: Created broadcast 8 from take at <ipython-input-10-35862abbc114>:1
    17/10/03 07:01:18 INFO mapred.FileInputFormat: Total input paths to process : 1
    17/10/03 07:01:18 INFO spark.SparkContext: Starting job: take at <ipython-input-10-35862abbc114>:1
    17/10/03 07:01:18 INFO scheduler.DAGScheduler: Got job 2 (take at <ipython-input-10-35862abbc114>:1) with 1 output partitions
    17/10/03 07:01:18 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (take at <ipython-input-10-35862abbc114>:1)
    17/10/03 07:01:18 INFO scheduler.DAGScheduler: Parents of final stage: List()
    17/10/03 07:01:18 INFO scheduler.DAGScheduler: Missing parents: List()
    17/10/03 07:01:18 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[27] at take at <ipython-input-10-35862abbc114>:1), which has no missing parents
    17/10/03 07:01:19 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 5.6 KB, free 883.8 KB)
    17/10/03 07:01:19 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 3.0 KB, free 886.9 KB)
    17/10/03 07:01:19 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:40075 (size: 3.0 KB, free: 208.7 MB)
    17/10/03 07:01:19 INFO spark.SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1006
    17/10/03 07:01:19 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[27] at take at <ipython-input-10-35862abbc114>:1)
    17/10/03 07:01:19 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
    17/10/03 07:01:19 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, partition 0,PROCESS_LOCAL, 2260 bytes)
    17/10/03 07:01:19 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 4)
    17/10/03 07:01:19 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/episodes.avro:0+597
    17/10/03 07:01:19 INFO codegen.GenerateUnsafeProjection: Code generated in 124.624053 ms
    17/10/03 07:01:19 INFO executor.Executor: Finished task 0.0 in stage 4.0 (TID 4). 2237 bytes result sent to driver
    17/10/03 07:01:19 INFO scheduler.DAGScheduler: ResultStage 4 (take at <ipython-input-10-35862abbc114>:1) finished in 0.415 s
    17/10/03 07:01:19 INFO scheduler.DAGScheduler: Job 2 finished: take at <ipython-input-10-35862abbc114>:1, took 0.565858 s
    17/10/03 07:01:19 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 415 ms on localhost (1/1)
    17/10/03 07:01:19 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
    Out[10]: [Row(title=u'The Eleventh Hour', air_date=u'3 April 2010', doctor=11)]

    In [11]:
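    The actual results in the session above (`Out[9]: 8` and the `Row` in `Out[10]`) are easy to miss among the INFO log lines. One way to reduce the noise is to raise the log level on the SparkContext before running the actions. The helper below is a sketch added for illustration; `quiet_spark_logs` is a hypothetical name, and `sc` is assumed to be the SparkContext that the pyspark shell provides.

```python
def quiet_spark_logs(sc, level="WARN"):
    """Ask Spark to log only messages at `level` or above.

    `sc` is expected to be a SparkContext; `setLogLevel` is available
    on it from Spark 1.4 onward.
    """
    sc.setLogLevel(level)
    return level
```

    Calling `quiet_spark_logs(sc)` before `mydata001.count()` should leave mostly the `Out[...]` lines visible, with warnings and errors still shown.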

  • Original source: https://www.cnblogs.com/gaojian/p/7624631.html