  • Spark SQL notes (the JSON data source)

    Preparation

    Data file: students.json

    {"id":1, "name":"leo", "age":18}
    {"id":2, "name":"jack", "age":19}
    {"id":3, "name":"marry", "age":17}
    

    HDFS location: hdfs://master:9000/student/2016113012/spark/students.json
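
    Note the file layout: each line is one complete JSON object (the JSON Lines convention), not an element of a single JSON array. Spark's JSON reader parses input line by line, and a record it cannot parse is surfaced in a `_corrupt_record` column rather than aborting the read. The age filter applied later by the SQL query can be sketched in plain Scala against this sample (the `Student` case class is illustrative only, not part of the Spark job):

```scala
// Plain-Scala sketch of the sample records and the teenager filter.
// Student is a hypothetical case class used only for illustration.
case class Student(id: Int, name: String, age: Int)

val students = Seq(
  Student(1, "leo", 18),
  Student(2, "jack", 19),
  Student(3, "marry", 17)
)

// Same predicate as "age > 13 and age < 19" in the SQL below
val teenagers = students.filter(s => s.age > 13 && s.age < 19).map(_.name)
// teenagers == Seq("leo", "marry")
```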

    Scala code

    package wujiadong_sparkSQL
    
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
      * Created by Administrator on 2017/2/12.
      */
    
    //Create a DataFrame by loading a JSON data source
    object JsonOperation {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("JsonOperation")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        //Read the JSON file directly
        val df1 = sqlContext.read.json("hdfs://master:9000/student/2016113012/spark/students.json")
        //Reading via load() requires an explicit format; without one, load() defaults to Parquet
        //sqlContext.read.format("json").load("hdfs://master:9000/student/2016113012/spark/students.json")
        df1.printSchema()
        df1.registerTempTable("t_students")
        val teenagers = sqlContext.sql("select name from t_students where age > 13 and age < 19")
        teenagers.write.parquet("hdfs://master:9000/student/2016113012/teenagers")
    
      }
    
    }
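
    The commented-out `load()` path and the SQL query both have direct DataFrame-API equivalents. A sketch against the same Spark 1.x `SQLContext` (method names as in that era's API):

```scala
// Equivalent to sqlContext.read.json(...): name the format explicitly.
// Without .format("json"), load() would assume Parquet.
val df = sqlContext.read.format("json")
  .load("hdfs://master:9000/student/2016113012/spark/students.json")

// DataFrame-API equivalent of the SQL string used above:
// where/select replace the registerTempTable + sql() round trip.
val teenagers = df.where(df("age") > 13 && df("age") < 19).select("name")
teenagers.show()
```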
    
    

    Submitting to the cluster

    hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.JsonOperation  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
    
    

    Run output

    hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.JsonOperation  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
    17/02/14 10:58:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    17/02/14 10:58:56 INFO Slf4jLogger: Slf4jLogger started
    17/02/14 10:58:56 INFO Remoting: Starting remoting
    17/02/14 10:58:56 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:58268]
    17/02/14 10:58:59 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
    17/02/14 10:59:05 INFO FileInputFormat: Total input paths to process : 1
    17/02/14 10:59:11 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
    17/02/14 10:59:11 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
    17/02/14 10:59:11 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
    17/02/14 10:59:11 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
    17/02/14 10:59:11 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
    root
     |-- age: long (nullable = true)
     |-- id: long (nullable = true)
     |-- name: string (nullable = true)
    
    17/02/14 10:59:18 INFO FileInputFormat: Total input paths to process : 1
    17/02/14 10:59:18 INFO CodecPool: Got brand-new compressor [.gz]
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    17/02/14 10:59:19 INFO FileOutputCommitter: Saved output of task 'attempt_201702141059_0001_m_000000_0' to hdfs://master:9000/studnet/2016113012/teenagers/_temporary/0/task_201702141059_0001_m_000000
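
    To confirm the job's result, the Parquet output directory can be read back with the same `SQLContext`; since Parquet is the default format, no `format()` call is needed. A sketch:

```scala
// Read the job's output back; Parquet needs no explicit format.
val result = sqlContext.read.parquet("hdfs://master:9000/student/2016113012/teenagers")
result.show()
// Given the sample data, the rows should be leo (18) and marry (17)
```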
    
    
    

    Common errors

    Exception in thread "main" java.io.IOException: No input paths specified in job
    
    This is caused by a failure to read the data source, e.g. a mistyped input path.
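
    One way to fail fast with a clearer message is to check that the path exists before reading. A sketch using the Hadoop FileSystem API (the existence check is an addition for illustration, not part of the original code):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val inputPath = "hdfs://master:9000/student/2016113012/spark/students.json"

// Resolve the default FileSystem from the SparkContext's Hadoop config
val fs = FileSystem.get(sc.hadoopConfiguration)
if (!fs.exists(new Path(inputPath))) {
  // Abort with a readable message before Spark schedules any tasks
  sys.error(s"Input path not found: $inputPath")
}
val df = sqlContext.read.json(inputPath)
```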
    
  • Original post: https://www.cnblogs.com/wujiadong2014/p/6516588.html