  • Spark SQL Example

    A comprehensive case study
    Given the datasets department.json and employee.json, compute the average age and average salary per department and per gender. department.json contains:
    {"id":1,"name":"Tech Department"}
    {"id":2,"name":"Fina Department"}
    {"id":3,"name":"HR Department"}
    employee.json contains:
    {"name":"zhangsan","age":26,"depId":1,"gender":"male","salary":20000}
    {"name":"lisi","age":36,"depId":2,"gender":"female","salary":8500}
    {"name":"wangwu","age":23,"depId":1,"gender":"male","salary":5000}
    {"name":"zhaoliu","age":25,"depId":3,"gender":"male","salary":7000}
    {"name":"marry","age":19,"depId":2,"gender":"female","salary":6600}
    {"name":"Tom","age":36,"depId":1,"gender":"female","salary":5000}
    {"name":"kitty","age":43,"depId":2,"gender":"female","salary":6000}
    For this walkthrough the two files have already been created and uploaded to HDFS; readers need to create them themselves.
    Run the following commands:
    [root@foo2 cloudera]# cd /root/device-report/
    [root@foo2 device-report]# ls
    b.txt  test.sql
    [root@foo2 device-report]# vim department.json
    [root@foo2 device-report]# vim employee.json
    [root@foo2 device-report]# ls
    b.txt  department.json  employee.json  test.sql
    [root@foo2 device-report]# chown hdfs:hdfs department.json
    [root@foo2 device-report]# chown hdfs:hdfs employee.json
    [root@foo2 device-report]# ls
    b.txt  department.json  employee.json  test.sql
    [root@foo2 device-report]# ll
    total 16
    -rw-r--r-- 1 hdfs hdfs  22 Aug 14 10:45 b.txt
    -rw-r--r-- 1 hdfs hdfs 100 Aug 17 16:50 department.json
    -rw-r--r-- 1 hdfs hdfs 474 Aug 17 16:53 employee.json
    -rw-r--r-- 1 hdfs hdfs 237 Aug 14 16:49 test.sql
    [root@foo2 device-report]# mv department.json /var/lib/hadoop-hdfs/device-report/
    [root@foo2 device-report]# mv employee.json /var/lib/hadoop-hdfs/device-report/
    [root@foo2 device-report]# su hdfs
    [hdfs@foo2 device-report]$ clear
    [hdfs@foo2 device-report]$ ls
    b.txt  test.sql
    [hdfs@foo2 device-report]$ cd /var/lib/hadoop-h
    hadoop-hdfs/   hadoop-httpfs/
    [hdfs@foo2 device-report]$ cd /var/lib/hadoop-hdfs/device-report/
    [hdfs@foo2 device-report]$ ls
    b.txt  department.json  employee.json  person.json  test.sql
    [hdfs@foo2 device-report]$ hadoop fs -put department.json /testdata
    [hdfs@foo2 device-report]$ hadoop fs -put employee.json /testdata
    [hdfs@foo2 device-report]$ hadoop fs -ls /testdata
    Found 3 items
    -rw-r--r--   2 hdfs supergroup        100 2018-08-17 16:54 /testdata/department.json
    -rw-r--r--   2 hdfs supergroup        474 2018-08-17 16:55 /testdata/employee.json
    -rw-r--r--   2 hdfs supergroup         71 2018-08-17 16:39 /testdata/person.json
    View the contents:
    [hdfs@foo2 device-report]$ hadoop fs -cat hdfs://192.168.0.106:8020/testdata/department.json
    {"id":1,"name":"Tech Department"}
    {"id":2,"name":"Fina Department"}
    {"id":3,"name":"HR Department"}
    [hdfs@foo2 device-report]$ hadoop fs -cat hdfs://192.168.0.106:8020/testdata/employee.json
    {"name":"zhangsan","age":26,"depId":1,"gender":"male","salary":20000}
    {"name":"lisi","age":36,"depId":2,"gender":"female","salary":8500}
    {"name":"wangwu","age":23,"depId":1,"gender":"male","salary":5000}
    {"name":"zhaoliu","age":25,"depId":3,"gender":"male","salary":7000}
    {"name":"marry","age":19,"depId":2,"gender":"female","salary":6600}
    {"name":"Tom","age":36,"depId":1,"gender":"female","salary":5000}
    {"name":"kitty","age":43,"depId":2,"gender":"female","salary":6000}
     
     
    1). Load the data
    scala> val emp = spark.read.json("hdfs://192.168.0.106:8020/testdata/employee.json")
    emp: org.apache.spark.sql.DataFrame = [age: bigint, depId: bigint ... 3 more fields]
     
    scala> val dep = spark.read.json("hdfs://192.168.0.106:8020/testdata/department.json")
    dep: org.apache.spark.sql.DataFrame = [id: bigint, name: string]
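    Spark infers the schema (column names and types) from the JSON records. An optional sanity check, not part of the original session, is to print the inferred schemas before joining:
    scala> emp.printSchema()

    scala> dep.printSchema()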
    Register them as temporary views:
    scala> emp.createOrReplaceTempView("employee")
     
    scala> dep.createOrReplaceTempView("department")
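    To confirm that both temporary views were registered, the catalog API can list them; this is an optional check and was not part of the original walkthrough:
    scala> spark.catalog.listTables().show()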
     
    2). Using the DataFrame API
    // Note: the join condition between the two tables must use triple equals (===).
    // avg comes from org.apache.spark.sql.functions, which spark-shell imports automatically.
    scala> emp.join(dep, $"id" === $"depId").groupBy(dep("name"), emp("gender")).agg(avg(emp("salary")), avg(emp("age"))).show()
    Result:
    +---------------+------+-----------------+------------------+
    |           name|gender|      avg(salary)|          avg(age)|
    +---------------+------+-----------------+------------------+
    |Tech Department|  male|          12500.0|              24.5|
    |Fina Department|female|7033.333333333333|32.666666666666664|
    |Tech Department|female|           5000.0|              36.0|
    |  HR Department|  male|           7000.0|              25.0|
    +---------------+------+-----------------+------------------+
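    The session above runs in spark-shell, where SparkSession, spark.implicits._ and org.apache.spark.sql.functions._ are already available. Packaged as a standalone application, the same query needs those pieces spelled out; the sketch below is illustrative (the object name and build setup are not from the original post):
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.avg

    // Hypothetical standalone version of the spark-shell steps above
    object DeptGenderStats {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("DeptGenderStats").getOrCreate()
        import spark.implicits._   // enables the $"col" column syntax

        val emp = spark.read.json("hdfs://192.168.0.106:8020/testdata/employee.json")
        val dep = spark.read.json("hdfs://192.168.0.106:8020/testdata/department.json")

        emp.join(dep, $"id" === $"depId")
          .groupBy(dep("name"), emp("gender"))
          .agg(avg(emp("salary")), avg(emp("age")))
          .show()

        spark.stop()
      }
    }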
    3). Using SQL
         scala> spark.sql("select department.name,avg(employee.salary),avg(employee.age) from employee left join department  on employee.depId=department.id group by department.name,employee.gender").show()
    Result:
    +---------------+-----------------+------------------+
    |           name|      avg(salary)|          avg(age)|
    +---------------+-----------------+------------------+
    |Tech Department|          12500.0|              24.5|
    |Fina Department|7033.333333333333|32.666666666666664|
    |Tech Department|           5000.0|              36.0|
    |  HR Department|           7000.0|              25.0|
    +---------------+-----------------+------------------+
     
    The results of 2) and 3) are identical.
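    If the aggregated result needs to be kept rather than just displayed, it can be assigned to a DataFrame and written back to HDFS. A minimal sketch, assuming a hypothetical output path /testdata/dept_gender_stats (not from the original post):
    scala> val result = spark.sql("select department.name, employee.gender, avg(employee.salary) as avg_salary, avg(employee.age) as avg_age from employee left join department on employee.depId=department.id group by department.name, employee.gender")

    scala> result.write.mode("overwrite").json("hdfs://192.168.0.106:8020/testdata/dept_gender_stats")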
     
  • Original article: https://www.cnblogs.com/xiqing/p/9662310.html