  • Joining two files on HDFS with Spark SQL

    order_created.txt   (order number, order creation time)

    10703007267488  2014-05-01 06:01:12.334+01
    10101043505096  2014-05-01 07:28:12.342+01
    10103043509747  2014-05-01 07:50:12.33+01
    10103043501575  2014-05-01 09:27:12.33+01
    10104043514061  2014-05-01 09:03:12.324+01

    order_picked.txt   (order number, order pickup time)

    10703007267488  2014-05-01 07:02:12.334+01
    10101043505096  2014-05-01 08:29:12.342+01
    10103043509747  2014-05-01 10:55:12.33+01

    Upload the two files above to HDFS:

    hadoop fs -put order_created.txt /data/order_created.txt
    hadoop fs -put order_picked.txt /data/order_picked.txt

    Join the two files with Spark SQL:

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveContext._
    
    case class OrderCreated(order_no:String,create_date:String)
    case class OrderPicked(order_no:String,picked_date:String)
    
    // Split each line on the tab character that separates the two columns
    val order_created = sc.textFile("/data/order_created.txt").map(_.split("\t")).map(d => OrderCreated(d(0), d(1)))
    val order_picked = sc.textFile("/data/order_picked.txt").map(_.split("\t")).map(d => OrderPicked(d(0), d(1)))
    
    order_created.registerTempTable("t_order_created")
    order_picked.registerTempTable("t_order_picked")
    
    // Manually set the number of Spark SQL shuffle tasks
    hiveContext.setConf("spark.sql.shuffle.partitions","10")
    hiveContext.sql("select a.order_no, a.create_date, b.picked_date from t_order_created a join t_order_picked b on a.order_no = b.order_no").collect.foreach(println)

    The output is:

    [10101043505096,2014-05-01 07:28:12.342+01,2014-05-01 08:29:12.342+01]
    [10703007267488,2014-05-01 06:01:12.334+01,2014-05-01 07:02:12.334+01]
    [10103043509747,2014-05-01 07:50:12.33+01,2014-05-01 10:55:12.33+01]
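
    The code above uses the Spark 1.x API (`HiveContext`, `registerTempTable`), which has since been deprecated. As a hedged sketch, the same join could be written on Spark 2.x+ with `SparkSession` and temp views; the file paths, tab-separated layout, and the app name `OrderJoin` below are assumptions carried over from the example:

    ```scala
    import org.apache.spark.sql.SparkSession

    // Sketch for Spark 2.x+: SparkSession replaces HiveContext,
    // createOrReplaceTempView replaces registerTempTable.
    val spark = SparkSession.builder().appName("OrderJoin").getOrCreate()
    import spark.implicits._

    case class OrderCreated(order_no: String, create_date: String)
    case class OrderPicked(order_no: String, picked_date: String)

    // Read each file as Dataset[String], then split on the tab separator
    val created = spark.read.textFile("/data/order_created.txt")
      .map(_.split("\t")).map(d => OrderCreated(d(0), d(1)))
    val picked = spark.read.textFile("/data/order_picked.txt")
      .map(_.split("\t")).map(d => OrderPicked(d(0), d(1)))

    created.createOrReplaceTempView("t_order_created")
    picked.createOrReplaceTempView("t_order_picked")

    // Same manual shuffle-partition setting as in the original example
    spark.conf.set("spark.sql.shuffle.partitions", "10")
    spark.sql("""
      select a.order_no, a.create_date, b.picked_date
      from t_order_created a join t_order_picked b
      on a.order_no = b.order_no
    """).show(false)
    ```

    Running this in `spark-shell` against the uploaded files should print the same three joined rows, since only orders present in both tables survive the inner join.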
  • Original post: https://www.cnblogs.com/luogankun/p/4268431.html