Spark Streaming
----------------
Stream computing: runs continuously. The Spark Streaming module is implemented
as batch computing: the stream is cut along time slices (batch intervals) into
static datasets.

//specify the batch interval when creating the context
val ssc = new StreamingContext(conf, Seconds(1))
//the batch interval is already fixed at this point
ssc.socketTextStream(...)
...

The socket text stream runs on the executor side, not on the driver side.

socketTextStream execution flow
-------------------------------
The driver creates the StreamingContext object. When the context is started it
creates the JobScheduler and the ReceiverTracker in turn and calls their
start() methods. In its start() method the ReceiverTracker sends a
start-receiver message to a remote executor; the message carries the
ServerSocket address information. On the executor side the message is received
by the ReceiverTrackerEndpoint, which extracts its contents, uses the
SparkContext together with those contents to create a receiver RDD, and
finally submits that RDD to the Spark cluster.

Windowed processing in stream computing
------------------------
Windowing extends the application on top of batches. Both the window length
and the slide interval (the computation frequency) must be whole multiples of
the batch interval. (A self-contained sketch appears after the foreachRDD
example at the end of this section.)

reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(5), Seconds(3))
window(...).reduceByKey(...)

DStream partitions
------------------------
Repartitioning a DStream means repartitioning every RDD inside it:

dstream.repartition(num)
    //implemented as this.transform(_.repartition(numPartitions))

updateStateByKey()
-----------------------
Computes, for example, the number of occurrences of each word since the
streaming application started. The state update can also be combined with
windowing; the update function below keeps only the (timestamp, count) entries
of the last 4000 ms:

//a: the counts for this key in the current batch; state: the previous state
def update(a: Seq[Int], state: Option[ArrayBuffer[(Long, Int)]])
        : Option[ArrayBuffer[(Long, Int)]] = {
    val count = a.sum
    val time = System.currentTimeMillis()
    if (state.isEmpty) {
        //first batch for this key: start a new buffer
        val buf = ArrayBuffer[(Long, Int)]()
        buf.append((time, count))
        Some(buf)
    } else {
        //keep only the entries younger than 4000 ms, then append this batch
        val buf2 = ArrayBuffer[(Long, Int)]()
        for (t <- state.get) {
            if ((time - t._1) <= 4000) {
                buf2 += t
            }
        }
        buf2.append((time, count))
        Some(buf2)
    }
}

val ds3 = ds2.window(Seconds(5), Seconds(3))
val ds4 = ds3.updateStateByKey(update _)

How partitions are determined in a Spark Streaming computation
-----------------------------------
DStream partitions are the partitions of the underlying RDDs. For
receiver-based streams they are controlled by
conf.set("spark.streaming.blockInterval", "200ms"): the receiver cuts the
incoming data into small blocks at the given interval, and each block becomes
one partition.

DStream.foreachRDD
----------------------
Operates on every RDD in the stream.

import java.sql.DriverManager
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by Administrator on 2018/3/8.
  */
object SparkStreamingForeachRDDScala {

    //open a new jdbc connection to mysql
    def createNewConnection() = {
        Class.forName("com.mysql.jdbc.Driver")
        val conn = DriverManager.getConnection("jdbc:mysql://192.168.231.1:3306/big9", "root", "root")
        conn
    }

    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        conf.setAppName("wordCount")
        conf.setMaster("local[4]")

        //batch interval of 2 seconds
        val ssc = new StreamingContext(conf, Seconds(2))
        ssc.checkpoint("file:///d:/java/chk")

        //create the socket text stream
        val ds1 = ssc.socketTextStream("s101", 8888)
        val ds2 = ds1.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

        ds2.foreachRDD(rdd => {
            rdd.foreachPartition(it => {
                //executed on the executor: one connection per partition
                val conn = createNewConnection()
                val ppst = conn.prepareStatement("insert into wc(word,cnt) values(?,?)")
                conn.setAutoCommit(false)
                for (e <- it) {
                    ppst.setString(1, e._1)
                    ppst.setInt(2, e._2)
                    ppst.executeUpdate()
                }
                conn.commit()
                ppst.close()
                conn.close()
            })
        })

        //start the stream
        ssc.start()
        ssc.awaitTermination()
    }
}
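The windowing section above shows only fragments. Below is a minimal,
self-contained sketch of reduceByKeyAndWindow; the host s101 and port 8888
simply follow the convention of the other examples in these notes, and the
1s/5s/3s durations are illustrative (both window values are whole multiples of
the batch interval):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCountScala {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        conf.setAppName("windowedWordCount")
        conf.setMaster("local[2]")

        //batch interval of 1 second
        val ssc = new StreamingContext(conf, Seconds(1))

        val lines = ssc.socketTextStream("s101", 8888)
        val pairs = lines.flatMap(_.split(" ")).map((_, 1))

        //count words over the last 5 seconds, recomputed every 3 seconds
        val counts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(5), Seconds(3))
        counts.print()

        ssc.start()
        ssc.awaitTermination()
    }
}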
Spark Streaming + Spark SQL combined
--------------------------------

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by Administrator on 2018/3/8.
  */
object SparkStreamingWordCountSparkSQLScala {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        conf.setAppName("wordCount")
        conf.setMaster("local[2]")

        //batch interval of 2 seconds
        val ssc = new StreamingContext(conf, Seconds(2))
        ssc.checkpoint("file:///d:/java/chk")

        //create the socket text stream
        val lines = ssc.socketTextStream("s101", 8888)

        //flatten into a stream of words
        val words = lines.flatMap(_.split(" "))

        words.foreachRDD(rdd => {
            //get (or create) a SparkSession on top of the streaming conf
            val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
            import spark.implicits._
            val df1 = rdd.toDF("word")
            df1.createOrReplaceTempView("_temp")
            spark.sql("select word, count(*) from _temp group by word").show()
        })

        //start the stream
        ssc.start()
        ssc.awaitTermination()
    }
}

Kafka
--------------------
A messaging system.

Spark Streaming integration with kafka
-------------------------
1. Note: spark-streaming-kafka-0-10_2.11 is not compatible with earlier
   versions; spark-streaming-kafka-0-8_2.11 is compatible with 0.9 and 0.10.
2. Start the kafka cluster and create a topic.
    xkafka.sh start
3. Verify that kafka is ok.
  3.1) start a consumer
    kafka-console-consumer.sh --zookeeper s102:2181 --topic t1
  3.2) start a producer
    kafka-console-producer.sh --broker-list s102:9092 --topic t1
  3.3) send messages
    ...
4. Add the maven dependency.
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>
5. Write the program.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

/**
  * Created by Administrator on 2018/3/8.
  */
object SparkStreamingKafkaScala {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        conf.setAppName("kafka")
        conf.setMaster("local[*]")

        val ssc = new StreamingContext(conf, Seconds(2))

        //kafka parameters
        val kafkaParams = Map[String, Object](
            "bootstrap.servers" -> "s102:9092,s103:9092",
            "key.deserializer" -> classOf[StringDeserializer],
            "value.deserializer" -> classOf[StringDeserializer],
            "group.id" -> "g1",
            "auto.offset.reset" -> "latest",
            "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        val topics = Array("topic1")
        val stream = KafkaUtils.createDirectStream[String, String](
            ssc,
            PreferConsistent,                               //location strategy
            Subscribe[String, String](topics, kafkaParams)  //consumer strategy
        )

        val ds2 = stream.map(record => (record.key, record.value))
        ds2.print()

        ssc.start()
        ssc.awaitTermination()
    }
}

6. Send messages from the console producer.

SparkKafka direct stream (createDirectStream) and kafka partitions
--------------------------------
Each kafka topic partition corresponds to one RDD partition. Spark can cap the
number of messages accepted per second for each partition through the
spark.streaming.kafka.maxRatePerPartition configuration, as sketched below.
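A minimal sketch of applying that cap; the value 1000 is illustrative. The
setting goes on the SparkConf before the StreamingContext is created:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
conf.setAppName("kafka")
conf.setMaster("local[*]")
//at most 1000 records per second per kafka partition
conf.set("spark.streaming.kafka.maxRatePerPartition", "1000")

//with 2-second batches, each kafka partition contributes at most 2000 records per batch
val ssc = new StreamingContext(conf, Seconds(2))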
LocationStrategies
----------------
Location strategies control on which executor a given topic partition is
consumed, i.e. how the consumers are scheduled onto executors for the topic
partitions. The placement is a relative preference; there are three options:

1. PreferBrokers
   Prefer the kafka brokers. Usable only when the kafka brokers and the
   executors run on the same hosts.
2. PreferConsistent
   Prefer consistency. This is the strategy used most of the time: all
   partitions of the kafka topics are distributed evenly across all available
   executors, making full use of the cluster's compute resources.
3. PreferFixed
   Prefer a fixed placement. When the load is unbalanced, this strategy pins
   the specified topic partitions to specific nodes; a manual-control option.
   Partitions without an explicit placement still fall back to option (2).

ConsumerStrategies
--------------------
Consumer strategies control how consumer objects are created and configured,
and delimit which kafka messages are consumed: for example only partitions 0
and 1 of topic t1, or a specific range of messages on specific partitions.
The class is extensible, so you can implement your own.

1. ConsumerStrategies.Assign
   Specifies a fixed collection of partitions, down to a very precise range:

    def Assign[K, V](
        topicPartitions: Iterable[TopicPartition],
        kafkaParams: collection.Map[String, Object],
        offsets: collection.Map[TopicPartition, Long])

2. ConsumerStrategies.Subscribe
   Subscribes the consumer to a fixed collection of topics.
3. ConsumerStrategies.SubscribePattern
   Specifies the topics of interest with a regular expression (see the sketch
   below).
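A small sketch of SubscribePattern, assuming the ssc and kafkaParams defined
in the surrounding examples; the pattern t[0-9]+ is illustrative:

import java.util.regex.Pattern
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

//consume every topic whose name matches t[0-9]+ (t1, t2, ...),
//picking up matching topics created after the stream starts as well
val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("t[0-9]+"), kafkaParams)
)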
Consumer strategies and the semantic model
-----------------------------

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

/**
  * Created by Administrator on 2018/3/8.
  */
object SparkStreamingKafkaScala {

    //send (ip, pid, thread, message, object) to a socket server, to trace
    //which host/process/thread consumed each record
    def sendInfo(msg: String, objStr: String) = {
        //ip
        val ip = java.net.InetAddress.getLocalHost.getHostAddress
        //pid
        val rr = java.lang.management.ManagementFactory.getRuntimeMXBean
        val pid = rr.getName.split("@")(0)
        //thread name
        val tname = Thread.currentThread().getName
        val sock = new java.net.Socket("s101", 8888)
        val out = sock.getOutputStream
        val m = ip + " :" + pid + " :" + tname + " :" + msg + " :" + objStr + " "
        out.write(m.getBytes)
        out.flush()
        out.close()
    }

    def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        conf.setAppName("kafka")
        //conf.setMaster("spark://s101:7077")
        conf.setMaster("local[8]")

        val ssc = new StreamingContext(conf, Seconds(5))

        //kafka parameters
        val kafkaParams = Map[String, Object](
            "bootstrap.servers" -> "s102:9092,s103:9092",
            "key.deserializer" -> classOf[StringDeserializer],
            "value.deserializer" -> classOf[StringDeserializer],
            "group.id" -> "g1",
            "auto.offset.reset" -> "latest",
            "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        //PreferFixed location strategy: pin every t1 partition to host s102
        val map = scala.collection.mutable.Map[TopicPartition, String]()
        map.put(new TopicPartition("t1", 0), "s102")
        map.put(new TopicPartition("t1", 1), "s102")
        map.put(new TopicPartition("t1", 2), "s102")
        map.put(new TopicPartition("t1", 3), "s102")
        val locStra = LocationStrategies.PreferFixed(map)

        //topic partition collection for Assign
        val tps = scala.collection.mutable.ArrayBuffer[TopicPartition]()
        tps += new TopicPartition("t1", 0)

        //starting offset per partition
        val offsets = scala.collection.mutable.Map[TopicPartition, Long]()
        offsets.put(new TopicPartition("t1", 0), 3)

        //create the kafka direct stream
        val stream = KafkaUtils.createDirectStream[String, String](
            ssc,
            locStra,
            ConsumerStrategies.Assign[String, String](tps, kafkaParams, offsets)
        )

        val ds2 = stream.map(record => {
            val t = Thread.currentThread().getName
            val key = record.key()
            val value = record.value()
            val offset = record.offset()
            val par = record.partition()
            val topic = record.topic()
            val tt = ("k:" + key, "v:" + value, "o:" + offset, "p:" + par, "t:" + topic, "T:" + t)
            //sendInfo(tt.toString(), this.toString)
            tt
        })
        ds2.print()

        ssc.start()
        ssc.awaitTermination()
    }
}

kafka consumption semantics
-------------------
1. at most once: each message is consumed at most once.
   Commit the offset first, then process the record; a crash between the two
   steps loses the record.
        commit(offset)
        process(record)
2. at least once: each message is consumed at least once.
   Process the record first, then commit the offset; a crash between the two
   steps makes the record be processed again.
        process(record)
        commit(offset)
3. exactly once: each message is consumed exactly once.
   Make the processing and the offset commit atomic, e.g. write both to mysql
   in a single transaction and restart from the stored offsets with Assign
   (see the sketch below).
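A hedged sketch of that mysql + Assign idea, not a complete program: each
partition writes its records and its ending offset in one MySQL transaction,
so data and offset either commit together or roll back together. It assumes
the direct stream `stream` from the example above, the wc table from the
foreachRDD example, and an assumed offsets(topic, part, off) table; on
restart, the stored offsets would be read back into the offsets map passed to
ConsumerStrategies.Assign.

import java.sql.DriverManager
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.HasOffsetRanges

stream.foreachRDD(rdd => {
    //offset ranges of this batch; index i corresponds to RDD partition i
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.foreachPartition(it => {
        val range = ranges(TaskContext.get.partitionId)
        val conn = DriverManager.getConnection("jdbc:mysql://192.168.231.1:3306/big9", "root", "root")
        conn.setAutoCommit(false)
        //1) write the records
        val ins = conn.prepareStatement("insert into wc(word,cnt) values(?,?)")
        for (e <- it) {
            ins.setString(1, e.value())
            ins.setInt(2, 1)
            ins.executeUpdate()
        }
        //2) store the ending offset in the SAME transaction
        val ups = conn.prepareStatement("replace into offsets(topic,part,off) values(?,?,?)")
        ups.setString(1, range.topic)
        ups.setInt(2, range.partition)
        ups.setLong(3, range.untilOffset)
        ups.executeUpdate()
        //3) data and offset become visible atomically
        conn.commit()
        ins.close()
        ups.close()
        conn.close()
    })
})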