zoukankan      html  css  js  c++  java
  • spark-shell 交互式编程

    数据集:

    Tom,DataBase,80

    Tom,Algorithm,50

    Tom,DataStructure,60

    Jim,DataBase,90

    Jim,Algorithm,60

    Jim,DataStructure,80

    ……

    请根据给定的实验数据,在 spark-shell 中通过编程来计算以下内容:

    (1) 该系总共有多少学生:

    val lines = sc.textFile("file:///usr/local/spark/sparksqldata/Data01.txt") 
    val par = lines.map(row=>row.split(",")(0))      
    val distinct_par = par.distinct()  //去重操作 
    distinct_par.count  //取得总数

    答案为265人。

    (2) 该系共开设来多少门课程:

    val lines = sc.textFile("file:///usr/local/spark/sparksqldata/Data01.txt") 
    val par = lines.map(row=>row.split(",")(1))  
    val distinct_par = par.distinct()  
    distinct_par.count

    答案为8门。

    (3) Tom 同学的总成绩平均分是多少:

    val lines = sc.textFile("file:///usr/local/spark/sparksqldata/Data01.txt") 
    val pare = lines.filter(row=>row.split(",")(0)=="Tom") 
    pare.foreach(println) 
    //Tom,DataBase,26 
    //Tom,Algorithm,12 
    //Tom,OperatingSystem,16 
    //Tom,Python,40 
    //Tom,Software,60 
    pare.map(row=>(row.split(",")(0),row.split(",")(2).toInt)).mapValues(x=>(x,1)).reduceByKey((x,y ) => (x._1+y._1,x._2 + y._2)).mapValues(x => (x._1 / x._2)).collect()
    //res9: Array[(String, Int)] = Array((Tom,30)) 

    Tom的平均分为30分。

    (4) 求每名同学的选修的课程门数:

    val lines = sc.textFile("file:///usr/local/spark/sparksqldata/Data01.txt") 
    val pare = lines.map(row=>(row.split(",")(0),row.split(",")(1)))
    pare.mapValues(x => (x,1)).reduceByKey((x,y) => (" ",x._2 + y._2)).mapValues(x => x._2).foreach(println)

    答案为265行。

    (5) 该系 DataBase 课程共有多少人选修:

    val lines = sc.textFile("file:///usr/local/spark/sparksqldata/Data01.txt") 
    val pare = lines.filter(row=>row.split(",")(1)=="DataBase") 
    pare.count 
    res1: Long = 126

    答案为126人。

    (6) 各门课程的平均分是多少:

    val lines = sc.textFile("file:///usr/local/spark/sparksqldata/Data01.txt") 
    val pare = lines.map(row=>(row.split(",")(1),row.split(",")(2).toInt)) 
    pare.mapValues(x=>(x,1)).reduceByKey((x,y) => (x._1+y._1,x._2 + y._2)).mapValues(x => (x._1 / x._2)).collect()
    res0: Array[(String, Int)] = Array((Python,57), (OperatingSystem,54), (CLanguage,50), (Software,50), (Algorithm,48), (DataStructure,47), (DataBase,50), (ComputerNetwork,51))

    答案为: (CLanguage,50) (Python,57) (Software,50) (OperatingSystem,54) (Algorithm,48) (DataStructure,47) (DataBase,50) (ComputerNetwork,51)

    (7)使用累加器计算共有多少人选了 DataBase 这门课:

    val lines = sc.textFile("file:///usr/local/spark/sparksqldata/Data01.txt") 
    val pare = lines.filter(row=>row.split(",")(1)=="DataBase").map(row=>(row.split(",")(1),1)) 
    val accum = sc.longAccumulator("My Accumulator") 
    pare.values.foreach(x => accum.add(x)) 
    accum.value 
    res19: Long = 126 

    答案为126 人。

  • 相关阅读:
    Java中final、finally、finalize的区别
    GC垃圾回收机制详解
    spring ioc Di
    获取不同语言版本的任务状态
    转:系统架构师-基础到企业应用架构
    转:SharePoint【Site Definition 系列】
    转:SharePoint【ECMAScript对象模型系列】
    转:SharePoint【Ribbon系列】
    SharaPoint Farm Administrator密码变换及管理员转换
    转:Programming with Features(操作Feature)
  • 原文地址:https://www.cnblogs.com/yuanxiaochou/p/12290593.html
Copyright © 2011-2022 走看看