Spark Streaming: 4) foreachRDD Explained

Reposted from: http://blog.csdn.net/jiangpeng59/article/details/53318761

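foreachRDD is typically used to write the results computed by Spark Streaming to an external system such as HDFS, MySQL, or Redis. Understanding the points below helps avoid several common pitfalls.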

Pitfall 1: creating the external connection object in the wrong place, as in the following code

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record) // executed at the worker
  }
}

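Here the connection object is instantiated on the driver and then has to be serialized and sent to each worker. In practice, connection objects can rarely be serialized correctly, so this code usually fails with a serialization error, or with an initialization error because the connection needs to be set up on the worker itself.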
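The serialization problem can be reproduced without Spark. The following is a minimal sketch, assuming plain Java serialization (which Spark uses by default for closures) and using an unconnected `java.net.Socket` as a stand-in for a real connection object. The `SerializationCheck` object and its `isJavaSerializable` helper are hypothetical names introduced only for this illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import java.net.Socket

object SerializationCheck {
  // Returns true if Java serialization of `obj` succeeds, false if the
  // object (or something it references) does not implement Serializable.
  def isJavaSerializable(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val socket = new Socket() // unconnected; stands in for a real connection
    try {
      println(isJavaSerializable("a plain record")) // String is Serializable
      println(isJavaSerializable(socket))           // Socket is not Serializable
    } finally {
      socket.close()
    }
  }
}
```

A closure that captures such an object runs into the same failure when Spark tries to ship it to the workers, which is why the connection has to be created on the worker side instead.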

Pitfall 2: creating a connection object for every record

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}

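This solves the problem from pitfall 1, but connections to external systems such as MySQL are usually expensive to create. Acquiring a new connection for every single record makes performance very poor.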

A better approach is to create one connection object per partition, as in the following code

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}
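To make the difference between the two strategies concrete, here is a small Spark-free sketch that counts how many connections each one opens. It assumes three in-memory "partitions" of 1000 records each; `FakeConnection`, `perRecord`, and `perPartition` are hypothetical names used only for this illustration and do not talk to any real system.

```scala
import java.util.concurrent.atomic.AtomicInteger

object ConnectionCount {
  // Counts how many FakeConnection instances have been opened.
  val opened = new AtomicInteger(0)

  // Hypothetical stand-in for a real connection; opening one bumps the counter.
  final class FakeConnection {
    opened.incrementAndGet()
    def send(record: String): Unit = () // no-op
    def close(): Unit = ()
  }

  // Pitfall 2: one connection per record.
  def perRecord(partitions: Seq[Seq[String]]): Int = {
    opened.set(0)
    partitions.foreach { part =>
      part.foreach { record =>
        val c = new FakeConnection
        c.send(record)
        c.close()
      }
    }
    opened.get()
  }

  // Better: one connection per partition, reused for every record in it.
  def perPartition(partitions: Seq[Seq[String]]): Int = {
    opened.set(0)
    partitions.foreach { part =>
      val c = new FakeConnection
      try part.foreach(c.send) finally c.close()
    }
    opened.get()
  }

  def main(args: Array[String]): Unit = {
    // Three "partitions" holding 1000 records each
    val partitions = Seq.fill(3)(Seq.tabulate(1000)(i => s"record-$i"))
    println(s"per record:    ${perRecord(partitions)} connections")
    println(s"per partition: ${perPartition(partitions)} connections")
  }
}
```

With this input the per-record strategy opens 3000 connections and the per-partition strategy opens 3, one for each partition.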

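Finally, an even better option is to use a connection pool to manage the connection objects, so that connections can be reused across batches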

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}

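As the comment in the code above indicates, the pool and its connections should be lazily initialized, for example with Scala's lazy keyword, so that they are instantiated only when first used.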
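The snippet above does not show what `ConnectionPool` looks like. The following is one possible minimal sketch, not a production pool: `DemoConnection` is a hypothetical placeholder type, and the pool has no size limit, idle timeout, or health check. It relies on two facts: a Scala `object` is initialized on first access in each JVM (so every executor builds its own pool rather than receiving a serialized one), and a `lazy val` is evaluated only on first use.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical connection type used only for illustration.
final class DemoConnection(val id: Int) {
  def send(record: String): Unit = () // no-op
}

object ConnectionPool {
  private val created = new AtomicInteger(0)

  // Built on first access, not when the enclosing object is loaded.
  private lazy val idle = new ConcurrentLinkedQueue[DemoConnection]()

  // Reuse an idle connection if one exists, otherwise create a new one.
  def getConnection(): DemoConnection = {
    val c = idle.poll() // returns null when the queue is empty
    if (c != null) c else new DemoConnection(created.incrementAndGet())
  }

  // Hand the connection back so a later partition or batch can reuse it.
  def returnConnection(c: DemoConnection): Unit = {
    idle.offer(c)
    ()
  }

  // Number of connections created so far in this JVM.
  def createdCount: Int = created.get()

  def main(args: Array[String]): Unit = {
    val first = getConnection()
    returnConnection(first)
    val second = getConnection() // same instance as `first`
    returnConnection(second)
    println(s"reused: ${first eq second}, created: $createdCount")
  }
}
```

A real deployment would more likely wrap an existing pooling library; the point of the sketch is only that creation happens on demand and that returned connections are reused.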

Below is an example, found online, that saves Spark Streaming results to MySQL

package spark.examples.streaming

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingForPartition {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetCatWordCount")
    conf.setMaster("local[3]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // A DStream is a sequence of RDDs, which is why foreachRDD is a natural output operation
    val dstream = ssc.socketTextStream("192.168.26.140", 9999)
    dstream.foreachRDD(rdd => {
      // Nested function; it runs on the workers, once per partition
      def func(records: Iterator[String]): Unit = {
        var conn: Connection = null
        var stmt: PreparedStatement = null
        try {
          val url = "jdbc:mysql://192.168.26.140:3306/person"
          val user = "root"
          val password = ""
          conn = DriverManager.getConnection(url, user, password)
          // Prepare the statement once per partition rather than once per word,
          // so that only one statement is opened and it is closed in finally
          val sql = "insert into TBL_WORDS(word) values (?)"
          stmt = conn.prepareStatement(sql)
          records.flatMap(_.split(" ")).foreach(word => {
            stmt.setString(1, word)
            stmt.executeUpdate()
          })
        } catch {
          case e: Exception => e.printStackTrace()
        } finally {
          if (stmt != null) {
            stmt.close()
          }
          if (conn != null) {
            conn.close()
          }
        }
      }
      val repartitionedRDD = rdd.repartition(3)
      // foreachPartition is an action, so it triggers the actual computation
      repartitionedRDD.foreachPartition(func)
    })
    ssc.start()
    ssc.awaitTermination()
  }
}

Details to note:

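Like RDDs, DStreams are evaluated lazily: computation happens only when an action is encountered. The RDD inside a DStream output operation must therefore have an action called on it for the received data to be processed. If the code calls foreachRDD but performs no RDD action inside it, Spark Streaming simply receives the data and never processes it.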
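The same lazy behavior can be observed with a plain Scala `Iterator`, which is used here only as an analogy and is not Spark code. In this sketch `flatMap` plays the role of a transformation and `size` plays the role of an action; the `LazyAnalogy` object and its `run` method are hypothetical names for this illustration.

```scala
object LazyAnalogy {
  // Returns (lines processed before the "action", word count,
  // lines processed after the "action").
  def run(): (Int, Int, Int) = {
    var processed = 0
    val received = Iterator("a b", "c d", "e f")

    // Like a transformation: this only describes the work, nothing runs yet.
    val words = received.flatMap { line =>
      processed += 1
      line.split(" ").toSeq
    }
    val processedBefore = processed

    // Like an action: consuming the iterator forces the work to happen.
    val wordCount = words.size

    (processedBefore, wordCount, processed)
  }

  def main(args: Array[String]): Unit = {
    val (before, count, after) = run()
    println(s"processed before action: $before")
    println(s"words: $count, processed after action: $after")
  }
}
```

Here `run()` returns `(0, 6, 3)`: no line is touched until the iterator is consumed. A foreachRDD body that only builds transformations and never calls an action such as `foreach`, `foreachPartition`, or `count` leaves the data unprocessed in the same way.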

Original URL: https://www.cnblogs.com/yangcx666/p/8723828.html