zoukankan      html  css  js  c++  java
  • DataFrame的创建

    DataFrame的创建
    从Spark2.0以上版本开始,Spark使用全新的SparkSession接口替代Spark1.6中的SQLContext及HiveContext接口来实现其对数据加载、转换、处理等功能。SparkSession实现了SQLContext及HiveContext所有功能
    SparkSession支持从不同的数据源加载数据,并把数据转换成DataFrame,并且支持把DataFrame转换成SQLContext自身中的表,然后使用SQL语句来操作数据。SparkSession亦提供了HiveQL以及其他依赖于Hive的功能的支持
    可以通过如下语句创建一个SparkSession对象:

    scala> import org.apache.spark.sql.SparkSession
    scala> val spark=SparkSession.builder().getOrCreate()

    在创建DataFrame之前,为了支持RDD转换为DataFrame及后续的SQL操作,需要通过import语句(即import spark.implicits._)导入相应的包,启用隐式转换。
    在创建DataFrame时,可以使用spark.read操作,从不同类型的文件中加载数据创建DataFrame,例如:

    spark.read.json("people.json"):读取people.json文件创建DataFrame;在读取本地文件或HDFS文件时,要注意给出正确的文件路径;
    spark.read.parquet("people.parquet"):读取people.parquet文件创建DataFrame;
    spark.read.csv("people.csv"):读取people.csv文件创建DataFrame。

    在“/usr/local/spark/examples/src/main/resources/”这个目录下,这个目录下有两个样例数据people.json和people.txt。 people.json文件的内容如下:
    {"name":"Michael"}
    {"name":"Andy", "age":30}
    {"name":"Justin", "age":19}
    people.txt文件的内容如下:
    Michael, 29
    Andy, 30
    Justin, 19

    scala> import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.SparkSession
    
    scala> val spark=SparkSession.builder().getOrCreate()
    spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2bdab835
    
    //使支持RDDs转换为DataFrames及后续sql操作
    scala> import spark.implicits._
    import spark.implicits._
    
    scala> val df = spark.read.json("file:///usr/local/spark/examples/src/main/resources/people.json")
    df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
    
    scala> df.show()
    +----+-------+
    | age| name|
    +----+-------+
    |null|Michael|
    | 30| Andy|
    | 19| Justin|
    +----+-------+

    DataFrame的保存

    可以使用spark.write操作,把一个DataFrame保存成不同格式的文件,例如,把一个名称为df的DataFrame保存到不同格式文件中,方法如下:

    df.write.json("people.json“)
    df.write.parquet("people.parquet“)
    df.write.csv("people.csv")

    下面从示例文件people.json中创建一个DataFrame,然后保存成csv格式文件,代码如下:

    scala> val peopleDF = spark.read.format("json").load("file:///usr/local/spark/examples/src/main/resources/people.json")
    scala> peopleDF.select("name", "age").write.format("csv").save("file:///usr/local/spark/mycode/sql/newpeople.csv")

    DataFrame的常用操作

    //打印模式信息
    scala> df.printSchema()
    root
    |-- age: long (nullable = true)
    |-- name: string (nullable = true)
    
    //选择多列
    scala> df.select(df("name"),df("age"+1).show)
    
    //条件过滤
    scala> df.filter(df("age") > 20).show()
    
    //分组聚合
    scala> df.groupBy("age").count().show()
    
    //排序
    scala> df.sort(df("age").desc).show()
    
    //多列排序
    scala> df.sort(df.("age").desc,df("name").asc).show()
    
    //对列进行重命名
    scala> df.select(df("name").as("username"),df("age")).show()

    在“/usr/local/spark/examples/src/main/resources/”目录下,有个Spark安装时自带的样例数据people.txt,其内容如下:
    Michael, 29
    Andy, 30
    Justin, 19
    现在要把people.txt加载到内存中生成一个DataFrame,并查询其中的数据
    在利用反射机制推断RDD模式时,需要首先定义一个case class,因为,只有case class才能被Spark隐式地转换为DataFrame

    scala> import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    
    scala> import org.apache.spark.sql.Encoder
    import org.apache.spark.sql.Encoder
    
    scala> import spark.implicits._ //导入包,支持把一个RDD隐式转换为一个DataFrame
    import spark.implicits._
    
    scala> case class Person(name: String, age: Long) //定义一个case class
    defined class Person
    
    scala> val peopleDF = spark.sparkContext.textFile("file:///usr/local/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(attributes => Person(attributes(0), attributes(1).trim.toInt)).toDF()
    peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
    
    scala> peopleDF.createOrReplaceTempView("people") //必须注册为临时表才能供下面的查询使用
    
    scala> val personsRDD = spark.sql("select name,age from people where age > 20")
    //最终生成一个DataFrame,下面是系统执行返回的信息
    personsRDD: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
    
    scala> personsRDD.map(t => "Name: "+t(0)+ ","+"Age: "+t(1)).show() //DataFrame中的每个元素都是一行记录,包含name和age两个字段,分别用t(0)和t(1)来获取值
    
    //下面是系统执行返回的信息
    +------------------+ 
    | value|
    +------------------+
    |Name:Michael,Age:29|
    | Name:Andy,Age:30|
    +------------------+

    当无法提前定义case class时,就需要采用编程方式定义RDD模式。
    比如,现在需要通过编程方式把people.txt加载进来生成DataFrame,并完成SQL查询。

    scala> import org.apache.spark.sql.types._
    import org.apache.spark.sql.types._
    
    scala> import org.apache.spark.sql.Row
    import org.apache.spark.sql.Row
    //生成字段
    
    scala> val fields = Array(StructField("name",StringType,true), StructField("age",IntegerType,true))
    fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(name,StringType,true), StructField(age,IntegerType,true))
    
    scala> val schema = StructType(fields)
    schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age, IntegerType,true))
    //从上面信息可以看出,schema描述了模式信息,模式中包含name和age两个字段
    //shcema就是“表头”
    
    //下面加载文件生成RDD
    scala> val peopleRDD = spark.sparkContext.textFile("file:///usr/local/spark/examples/src/main/resources/people.txt")
    peopleRDD: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:26
    
    //对peopleRDD 这个RDD中的每一行元素都进行解析
    scala> val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim.toInt))
    rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:29
    //上面得到的rowRDD就是“表中的记录”
    
    //下面把“表头”和“表中的记录”拼装起来
    scala> val peopleDF = spark.createDataFrame(rowRDD, schema)
    peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: int]
    //必须注册为临时表才能供下面查询使用
    scala> peopleDF.createOrReplaceTempView("people")
    
    scala> val results = spark.sql("SELECT name,age FROM people")
    results: org.apache.spark.sql.DataFrame = [name: string, age: int] 
    
    scala> results.map(attributes => "name: " + attributes(0)+","+"age:"+attributes(1)).show()
    +--------------------+
    | value|
    +--------------------+
    |name: Michael,age:29|
    | name: Andy,age:30|
    | name: Justin,age:19|
    +--------------------+

    通过JDBC连接数据库
    在Linux中启动MySQL数据库
    $ service mysql start
    $ mysql -u root -p
    #屏幕会提示你输入密码
    输入下面SQL语句完成数据库和表的创建:

    mysql> create database spark;
    mysql> use spark;
    mysql> create table student (id int(4), name char(20), gender char(4), age int(4));
    mysql> insert into student values(1,'Xueqian','F',23);
    mysql> insert into student values(2,'Weiliang','M',24);
    mysql> select * from student;

    下载MySQL的JDBC驱动程序,比如mysql-connector-java-5.1.40.tar.gz
    把该驱动程序拷贝到spark的安装目录” /usr/local/spark/jars”下
    启动一个spark-shell,启动Spark Shell时,必须指定mysql连接驱动jar包
    $ cd /usr/local/spark
    $ ./bin/spark-shell
    --jars /usr/local/spark/jars/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar
    --driver-class-path /usr/local/spark/jars/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar
    读取MySQL数据库中的数据

    scala> val jdbcDF = spark.read.format("jdbc").
    | option("url","jdbc:mysql://localhost:3306/spark").
    | option("driver","com.mysql.jdbc.Driver").
    | option("dbtable", "student").
    | option("user", "root").
    | option("password", "hadoop").
    | load()
    scala> jdbcDF.show()
    +---+--------+------+---+
    | id| name|gender|age|
    +---+--------+------+---+
    | 1| Xueqian| F| 23|
    | 2|Weiliang| M| 24|
    +---+--------+------+---+

    向MySQL数据库写入数据

    import java.util.Properties
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.Row
     
    //下面我们设置两条数据表示两个学生信息
    val studentRDD = spark.sparkContext.parallelize(Array("3 Rongcheng M 26","4 Guanhua M 27")).map(_.split(" "))
     
    //下面要设置模式信息
    val schema = StructType(List(StructField("id", IntegerType, true),StructField("name", StringType, true),StructField("gender", StringType, true),StructField("age", IntegerType, true)))
    //下面创建Row对象,每个Row对象都是rowRDD中的一行
    val rowRDD = studentRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).trim, p(3).toInt))
     
    //建立起Row对象和模式之间的对应关系,也就是把数据和模式对应起来
    val studentDF = spark.createDataFrame(rowRDD, schema)
     
    //下面创建一个prop变量用来保存JDBC连接参数
    val prop = new Properties()
    prop.put("user", "root") //表示用户名是root
    prop.put("password", "hadoop") //表示密码是hadoop
    prop.put("driver","com.mysql.jdbc.Driver") //表示驱动程序是com.mysql.jdbc.Driver
     
    //下面就可以连接数据库,采用append模式,表示追加记录到数据库spark的student表中
    studentDF.write.mode("append").jdbc("jdbc:mysql://localhost:3306/spark", "spark.student", prop) 

    连接Hive读写数据
    2.在Hive中创建数据库和表
    进入Hive,新建一个数据库sparktest,并在这个数据库下面创建一个表student,并录入两条数据

    hive> create database if not exists sparktest;//创建数据库sparktest
    hive> show databases; //显示一下是否创建出了sparktest数据库
    //下面在数据库sparktest中创建一个表student
    hive> create table if not exists sparktest.student(
    > id int,
    > name string,
    > gender string,
    > age int);
    hive> use sparktest; //切换到sparktest
    hive> show tables; //显示sparktest数据库下面有哪些表
    hive> insert into student values(1,'Xueqian','F',23); //插入一条记录
    hive> insert into student values(2,'Weiliang','M',24); //再插入一条记录
    hive> select * from student; //显示student表中的记录

    3.连接Hive读写数据
    需要修改“/usr/local/sparkwithhive/conf/spark-env.sh”这个配置文件:

    export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export CLASSPATH=$CLASSPATH:/usr/local/hive/lib
    export SCALA_HOME=/usr/local/scala
    export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
    export HIVE_CONF_DIR=/usr/local/hive/conf
    export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/local/hive/lib/mysql-connector-java-5.1.40-bin.jar

    请在spark-shell(包含Hive支持)中执行以下命令从Hive中读取数据:

    Scala> import org.apache.spark.sql.Row
    Scala> import org.apache.spark.sql.SparkSession 
    Scala> case class Record(key: Int, value: String) 
    // warehouseLocation points to the default location for managed databases and tables
    Scala> val warehouseLocation = "spark-warehouse” 
    Scala> val spark = SparkSession.builder().appName("Spark Hive Example").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate() 
    Scala> import spark.implicits._
    Scala> import spark.sql
    //下面是运行结果
    scala> sql("SELECT * FROM sparktest.student").show()
    +---+--------+------+---+
    | id| name|gender|age|
    +---+--------+------+---+
    | 1| Xueqian| F| 23|
    | 2|Weiliang| M| 24|
    +---+--------+------+---+

    编写程序向Hive数据库的sparktest.student表中插入两条数据:

    scala> import java.util.Properties
    scala> import org.apache.spark.sql.types._
    scala> import org.apache.spark.sql.Row 
    //下面我们设置两条数据表示两个学生信息
    scala> val studentRDD = spark.sparkContext.parallelize(Array("3 Rongcheng M 26","4 Guanhua M 27")).map(_.split(" ")) 
    //下面要设置模式信息
    scala> val schema = StructType(List(StructField("id", IntegerType, true),StructField("name", StringType, true),StructField("gender", StringType, true),StructField("age", IntegerType, true)))
     //下面创建Row对象,每个Row对象都是rowRDD中的一行
    scala> val rowRDD = studentRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).trim, p(3).toInt)) 
    //建立起Row对象和模式之间的对应关系,也就是把数据和模式对应起来
    scala> val studentDF = spark.createDataFrame(rowRDD, schema)
    //查看studentDF
    scala> studentDF.show()
    +---+---------+------+---+
    | id| name|gender|age|
    +---+---------+------+---+
    | 3|Rongcheng| M| 26|
    | 4| Guanhua| M| 27|
    +---+---------+------+---+
    //下面注册临时表
    scala> studentDF.registerTempTable("tempTable")
     
    scala> sql("insert into sparktest.student select * from tempTable")
    

      

  • 相关阅读:
    正则表达式
    JS逻辑算法
    js函数
    JS运算符的应用
    JS课堂笔记
    json模块学习
    什么是面向对象,以及如何定义对象,实例化对象
    什么是面向过程,以及代码展示
    什么是元类,以及用type类来产生类.
    python下载安装
  • 原文地址:https://www.cnblogs.com/flw0322/p/12284701.html
Copyright © 2011-2022 走看看