  • [Spark][python] An example of opening a JSON file as a DataFrame

    [Spark][python] An example of opening a JSON file as a DataFrame:

    [training@localhost ~]$ cat people.json
    {"name":"Alice","pcode":"94304"}
    {"name":"Brayden","age":30,"pcode":"94304"}
    {"name":"Carla","age":19,"pcoe":"10036"}
    {"name":"Diana","age":46}
    {"name":"Etienne","pcode":"94104"}
    [training@localhost ~]$
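
    Note that people.json is in JSON Lines format: each line is one complete, self-contained JSON object, which is what sqlContext.read.json expects (a single JSON document pretty-printed across multiple lines would not be parsed record by record). Also note the misspelled "pcoe" key on the Carla record; it is left as-is in the data, and since Spark infers the schema as the union of all fields it sees, it will surface as a separate pcoe column alongside pcode.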

    [training@localhost ~]$ hdfs dfs -put people.json

    [training@localhost ~]$ hdfs dfs -cat people.json
    {"name":"Alice","pcode":"94304"}
    {"name":"Brayden","age":30,"pcode":"94304"}
    {"name":"Carla","age":19,"pcoe":"10036"}
    {"name":"Diana","age":46}
    {"name":"Etienne","pcode":"94104"}


    In [1]: sqlContext = HiveContext(sc)

    In [2]: peopleDF = sqlContext.read.json("people.json")

    17/10/01 05:20:22 INFO hive.HiveContext: Initializing execution hive, version 1.1.0
    17/10/01 05:20:22 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
    17/10/01 05:20:22 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
    17/10/01 05:20:23 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
    17/10/01 05:20:23 INFO hive.metastore: Opened a connection to metastore, current connections: 1
    17/10/01 05:20:23 INFO hive.metastore: Connected to metastore.
    17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training
    17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9_resources
    17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
    17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
    17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9/_tmp_space.db
    17/10/01 05:20:23 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
    17/10/01 05:20:23 INFO json.JSONRelation: Listing hdfs://localhost:8020/user/training/people.json on driver
    17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 251.1 KB, free 251.1 KB)
    17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.6 KB, free 272.7 KB)
    17/10/01 05:20:25 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:42171 (size: 21.6 KB, free: 208.8 MB)
    17/10/01 05:20:25 INFO spark.SparkContext: Created broadcast 0 from json at NativeMethodAccessorImpl.java:-2
    17/10/01 05:20:26 INFO mapred.FileInputFormat: Total input paths to process : 1
    17/10/01 05:20:26 INFO spark.SparkContext: Starting job: json at NativeMethodAccessorImpl.java:-2
    17/10/01 05:20:26 INFO scheduler.DAGScheduler: Got job 0 (json at NativeMethodAccessorImpl.java:-2) with 1 output partitions
    17/10/01 05:20:26 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2)
    17/10/01 05:20:26 INFO scheduler.DAGScheduler: Parents of final stage: List()
    17/10/01 05:20:26 INFO scheduler.DAGScheduler: Missing parents: List()
    17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2), which has no missing parents
    17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 277.1 KB)
    17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 279.5 KB)
    17/10/01 05:20:26 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:42171 (size: 2.4 KB, free: 208.8 MB)
    17/10/01 05:20:26 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
    17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2)
    17/10/01 05:20:26 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
    17/10/01 05:20:26 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
    17/10/01 05:20:26 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
    17/10/01 05:20:26 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
    17/10/01 05:20:27 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
    17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
    17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
    17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
    17/10/01 05:20:27 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
    17/10/01 05:20:27 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2354 bytes result sent to driver
    17/10/01 05:20:27 INFO scheduler.DAGScheduler: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2) finished in 0.715 s
    17/10/01 05:20:27 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 667 ms on localhost (1/1)
    17/10/01 05:20:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
    17/10/01 05:20:27 INFO scheduler.DAGScheduler: Job 0 finished: json at NativeMethodAccessorImpl.java:-2, took 1.084685 s
    17/10/01 05:20:27 INFO hive.HiveContext: default warehouse location is /user/hive/warehouse
    17/10/01 05:20:28 INFO hive.HiveContext: Initializing metastore client version 1.1.0 using Spark classes.
    17/10/01 05:20:28 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
    17/10/01 05:20:28 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
    17/10/01 05:20:28 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:42171 in memory (size: 2.4 KB, free: 208.8 MB)
    17/10/01 05:20:28 INFO spark.ContextCleaner: Cleaned accumulator 2
    17/10/01 05:20:30 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
    17/10/01 05:20:30 INFO hive.metastore: Opened a connection to metastore, current connections: 1
    17/10/01 05:20:30 INFO hive.metastore: Connected to metastore.
    17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training
    17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/8c1eba54-7260-4314-abbf-7b7de85bdf0a_resources
    17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
    17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
    17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a/_tmp_space.db
    17/10/01 05:20:30 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.


    In [3]: type(peopleDF)
    Out[3]: pyspark.sql.dataframe.DataFrame

    In [4]:
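
    The original transcript ends here. A plausible next step in the same session (not shown in the source) is to inspect the inferred schema and run a simple query, using standard Spark 1.6 DataFrame methods:

    # Schema inference takes the union of all fields seen across records,
    # so the misspelled "pcoe" key shows up as its own column.
    peopleDF.printSchema()

    # Records missing a field get null in that column.
    peopleDF.show()

    # A simple projection and filter:
    peopleDF.select("name", "age").where(peopleDF.age > 21).show()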

  • Original post: https://www.cnblogs.com/gaojian/p/7617731.html