Spark SQL Programming: DataSet
Author: Yin Zhengjie (尹正杰)
Copyright notice: This is an original work. Reproduction without permission is prohibited; violators will be held legally responsible.
1. Creating a DataSet
Tip: A Dataset is a strongly typed collection of data, so the corresponding type information must be provided. A concrete example follows.

scala> case class Person(name: String, age: Long)    // define a case class
defined class Person

scala> val caseClassDS = Seq(Person("YinZhengjie", 18)).toDS()    // create a DataSet
caseClassDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

scala> caseClassDS.show    // note that DataSet methods are used much like DataFrame methods
+-----------+---+
|       name|age|
+-----------+---+
|YinZhengjie| 18|
+-----------+---+

scala> caseClassDS.createTempView("person")

scala> spark.sql("select * from person").show
+-----------+---+
|       name|age|
+-----------+---+
|YinZhengjie| 18|
+-----------+---+

scala>
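The REPL example above works because spark-shell pre-imports spark.implicits._ for you. For use in a compiled application, here is a minimal standalone sketch of the same steps; the object name CreateDatasetDemo and the local[*] master are illustrative assumptions, not part of the original post.

import org.apache.spark.sql.SparkSession

object CreateDatasetDemo {
  // Case classes used with Datasets should be defined outside the method
  // so that Spark can derive an Encoder for them.
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CreateDatasetDemo")   // hypothetical app name
      .master("local[*]")             // assumption: run locally for testing
      .getOrCreate()
    import spark.implicits._          // needed for toDS(); spark-shell imports this automatically

    val caseClassDS = Seq(Person("YinZhengjie", 18)).toDS()
    caseClassDS.show()

    caseClassDS.createTempView("person")
    spark.sql("select * from person").show()

    spark.stop()
  }
}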
2. Converting an RDD to a DataSet
scala> case class Person(name: String, age: Long)    // define a case class
defined class Person

scala> val listRDD = sc.makeRDD(List(("YinZhengjie",18),("Jason Yin",20),("Danny",28)))    // create an RDD
listRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[84] at makeRDD at <console>:27

scala> val mapRDD = listRDD.map( t => { Person(t._1, t._2) })    // use the map operator to turn each element of listRDD into a Person object
mapRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[102] at map at <console>:30

scala> val ds = mapRDD.toDS    // convert the RDD to a DataSet
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

scala> ds.show
+-----------+---+
|       name|age|
+-----------+---+
|YinZhengjie| 18|
|  Jason Yin| 20|
|      Danny| 28|
+-----------+---+

scala>
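The same conversion in a standalone application looks roughly like the sketch below; again the object name RddToDatasetDemo is made up for illustration, and import spark.implicits._ supplies the toDS conversion that the shell provides implicitly.

import org.apache.spark.sql.SparkSession

object RddToDatasetDemo {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddToDatasetDemo")   // hypothetical app name
      .master("local[*]")            // assumption: local test run
      .getOrCreate()
    import spark.implicits._         // brings rdd.toDS into scope

    // Equivalent of sc.makeRDD(...) in the shell; sc is spark.sparkContext here.
    val listRDD = spark.sparkContext.makeRDD(
      List(("YinZhengjie", 18), ("Jason Yin", 20), ("Danny", 28)))

    // Map each (name, age) tuple to the strongly typed Person case class,
    // then convert the resulting RDD[Person] to a Dataset[Person].
    val ds = listRDD.map { case (name, age) => Person(name, age) }.toDS()
    ds.show()

    spark.stop()
  }
}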
3. Converting a DataSet to an RDD
scala> ds.show    // view the DataSet's data
+-----------+---+
|       name|age|
+-----------+---+
|YinZhengjie| 18|
|  Jason Yin| 20|
|      Danny| 28|
+-----------+---+

scala> ds
res6: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

scala> ds.rdd    // convert the DataSet back to an RDD
res7: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[26] at rdd at <console>:29

scala> res7.collect    // view the RDD's data
res8: Array[Person] = Array(Person(YinZhengjie,18), Person(Jason Yin,20), Person(Danny,28))

scala>
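As a sketch of this reverse direction in application code (the object name DatasetToRddDemo is again hypothetical), note that .rdd hands back a typed RDD[Person], so no Row unwrapping is needed, unlike DataFrame.rdd which yields RDD[Row].

import org.apache.spark.sql.SparkSession

object DatasetToRddDemo {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetToRddDemo")   // hypothetical app name
      .master("local[*]")            // assumption: local test run
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Person("YinZhengjie", 18),
                 Person("Jason Yin", 20),
                 Person("Danny", 28)).toDS()

    // .rdd returns RDD[Person]: the element type is preserved,
    // whereas DataFrame.rdd would return RDD[Row].
    val personRDD = ds.rdd
    personRDD.collect().foreach(println)   // prints Person(YinZhengjie,18), etc.

    spark.stop()
  }
}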