  • [imooc Hands-on Course] 9. Entering the World of Big Data Spark SQL, Using imooc Log Analysis as an Example

    Ad-hoc queries
    Regular queries

    Load Data
    1) Target abstraction: RDD, DataFrame/Dataset
    2) Source location: local file system, cloud (HDFS/S3)

    Load the data as RDDs
    val masterLog = sc.textFile("file:///home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-hadoop001.out")
    val workerLog = sc.textFile("file:///home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop001.out")
    val allLog = sc.textFile("file:///home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/logs/*out*")

    masterLog.count
    workerLog.count
    allLog.count

    Open problem: how can this data be queried with SQL?

    import org.apache.spark.sql.Row
    val masterRDD = masterLog.map(x => Row(x))
    import org.apache.spark.sql.types._
    val schemaString = "line"

    val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
    val schema = StructType(fields)

    val masterDF = spark.createDataFrame(masterRDD, schema)
    masterDF.show
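
    Once the RDD has been turned into a DataFrame, one way to answer the SQL question is to register it as a temporary view. The sketch below is only an illustration: the sample lines and the names sampleLines, logDF and master_log are hypothetical stand-ins for the real master log, and the session is created explicitly so the snippet also works outside spark-shell.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("log-sql-sketch").getOrCreate()

// Hypothetical sample lines standing in for the contents of the master log.
val sampleLines = Seq(
  "17/03/01 10:00:00 INFO Master: Started daemon",
  "17/03/01 10:00:01 WARN Master: Something looks off",
  "17/03/01 10:00:02 INFO Master: Registering worker"
)

// Same pattern as above: wrap each line in a Row and attach a one-column schema.
val lineRDD = spark.sparkContext.parallelize(sampleLines).map(line => Row(line))
val lineSchema = StructType(Seq(StructField("line", StringType, nullable = true)))
val logDF = spark.createDataFrame(lineRDD, lineSchema)

// Register the DataFrame under a view name so it can be referenced from SQL.
logDF.createOrReplaceTempView("master_log")

// Count the lines that contain the string INFO.
val infoCount = spark
  .sql("SELECT COUNT(*) AS n FROM master_log WHERE line LIKE '%INFO%'")
  .first()
  .getLong(0)
println(infoCount)
```

    The same two steps (createOrReplaceTempView, then spark.sql) should apply unchanged to masterDF.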

    JSON/Parquet
    val usersDF = spark.read.format("parquet").load("file:///home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet")
    usersDF.show

    spark.sql("select * from parquet.`file:///home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet`").show
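
    The two access paths above (the DataFrameReader and SQL run directly against a file path) can be compared on a small file. This is a sketch under assumptions: the rows, the column names and the temporary directory are hypothetical and do not reproduce the contents of the bundled users.parquet.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("parquet-sketch").getOrCreate()

// Hypothetical output location inside a fresh temporary directory (a file:// URI).
val parquetPath = Files.createTempDirectory("parquet-sketch").resolve("users.parquet").toUri.toString

// Hypothetical rows, written out as Parquet so there is something to read back.
val sampleUsers = spark
  .createDataFrame(Seq(("Ann", "red"), ("Bo", "blue")))
  .toDF("name", "favorite_color")
sampleUsers.write.mode("overwrite").parquet(parquetPath)

// Path 1: load through the DataFrameReader.
val viaReader = spark.read.format("parquet").load(parquetPath)

// Path 2: query the files directly from SQL using the parquet.`path` syntax.
val viaSql = spark.sql(s"SELECT name FROM parquet.`$parquetPath` WHERE favorite_color = 'red'")
val redFans = viaSql.collect().map(_.getString(0)).toSeq
println(redFans)
```

    Querying the path directly skips the explicit load and view registration, which is convenient for a quick look at a file.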

    Drill: a big data processing framework

    Reading data from the cloud: HDFS/S3
    val hdfsRDD = sc.textFile("hdfs://path/file")
    val s3RDD = sc.textFile("s3a://bucket/object")
    URI schemes: s3a/s3n (s3a is the newer Hadoop S3 connector, s3n the older one)

    spark.read.format("text").load("hdfs://path/file")
    spark.read.format("text").load("s3a://bucket/object")

    val df = spark.read.format("json").load("file:///home/hadoop/app/spark-2.1.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json")

    df.show
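
    After a JSON file is loaded, Spark infers the schema from the records, so the result can be queried right away. The sketch below assumes a few sample records written to a temporary file; the file, the view name people and the age threshold are hypothetical choices for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("json-sketch").getOrCreate()

// Sample records, one JSON object per line (the layout the json source expects by default).
val sampleJson = Seq(
  """{"name":"Michael"}""",
  """{"name":"Andy", "age":30}""",
  """{"name":"Justin", "age":19}"""
).mkString("\n")

// Write the records to a temporary file so the normal file-based reader can be used.
val jsonPath = Files.createTempFile("people-sketch", ".json")
Files.write(jsonPath, sampleJson.getBytes(StandardCharsets.UTF_8))

val peopleDF = spark.read.format("json").load(jsonPath.toUri.toString)
peopleDF.printSchema() // shows the inferred columns and types

// Records without an age have a null age and do not satisfy the predicate.
peopleDF.createOrReplaceTempView("people")
val adultNames = spark
  .sql("SELECT name FROM people WHERE age > 21")
  .collect()
  .map(_.getString(0))
  .toSeq
println(adultNames)
```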

    TPC-DS

    spark-packages.org

  • Original post: https://www.cnblogs.com/kkxwz/p/8493777.html