  • Spark Streaming: How Jobs Are Generated and How Data Is Cleaned Up

    This write-up starts with a bug.

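    Scenario: the pipeline reads logs that are uploaded to a file system once a minute, processes them with Spark Streaming, and writes the results to InfluxDB, where the monitoring system displays and alerts on them. The Spark version was 2.2.1.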

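    Bug: after development, each batch took about 15~20 s to process. The job went into production and kept running; the data in the monitoring system looked normal, and in the Spark UI I only watched the batch processing time and paid no attention to anything else. After 2 days and 18 hours of running, the monitoring system raised a NO DATA alert. I checked the database first and there was indeed no data. In the Spark UI the application had not exited and its status was still RUNNING, but it was stuck and no longer processing batches. The logs showed an out-of-memory error: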

    Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
        at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
        at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
        at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
        at net.jpountz.lz4.LZ4BlockOutputStream.finish(LZ4BlockOutputStream.java:235)
        at net.jpountz.lz4.LZ4BlockOutputStream.close(LZ4BlockOutputStream.java:175)
        at java.io.ObjectOutputStream$BlockDataOutputStream.close(ObjectOutputStream.java:1828)
        at java.io.ObjectOutputStream.close(ObjectOutputStream.java:742)
        at org.apache.spark.serializer.JavaSerializationStream.close(JavaSerializer.scala:57)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$1.apply$mcV$sp(TorrentBroadcast.scala:278)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1346)
        at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:277)
        at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:126)
        at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
        at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
        at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
        at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1488)
        at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1006)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:776)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:775)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
        at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:775)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1278)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1729)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

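    At first I suspected a resource leak in my own code, but a careful review turned up nothing. Running the job locally, I then noticed in the Executors tab of the Spark UI that Storage Memory and RDD Blocks kept growing: both dropped after each batch, but each batch still ended higher than the previous one.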

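    Fix: since Storage Memory and RDD Blocks were the values that kept growing, I assumed a memory problem and tuned the memory parameters, which did not help. I then suspected the ContextCleaner and set it to run every 5 minutes, which did not help either. When I ran the job on another machine the problem disappeared; that machine had Spark 2.3.0, so this had to be a version issue. I went through the fixes listed for 2.2.2 on the official site looking for an out-of-memory bug, downloaded 2.2.2, and found that the job behaved exactly as it had on 2.2.1. At that point I lost heart and did not think to check the bug fixes in 2.3.0. I had posted the question in a Knowledge Planet (知识星球) group, and the group's owner later solved it for me: the bug was only fixed in 2.3.0, see https://issues.apache.org/jira/browse/SPARK-21357. The cause is that FileInputDStream overrides DStream's clearMetadata method, but the override only cleans up its file records and never cleans up generatedRDDs, which is what eventually leads to the out-of-memory error.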

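    Takeaway: once I knew it was a version issue, this bug should have been easy to resolve, but I took a detour of my own making and only got there with expert help. It also exposed a weakness of mine: I should not stop at the surface of a problem. Reading the code, understanding the principles first and the concrete implementation second, is what makes it possible to locate a problem quickly and find a fix. What follows is my analysis of the cause of this bug, written after going through the Spark Streaming source.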

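    Bug analysis: once started, a Spark Streaming application runs continuously: it pulls data from the source, processes it, and writes the output. In every batch it generates jobs from the application's logic, and Spark then runs the submitted jobs. Once you understand how jobs are generated and how the cached data is handled afterwards, the cause of this bug is easy to follow.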

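    After StreamingContext starts, it starts the JobScheduler, which in turn starts the JobGenerator and the ReceiverTracker. The JobGenerator handles everything related to jobs; the ReceiverTracker distributes receivers, communicates with the receivers on the workers, and handles the messages they send back.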

    Here is JobScheduler's start method:

     def start(): Unit = synchronized {
        if (eventLoop != null) return // scheduler has already been started
    
        logDebug("Starting JobScheduler")
        eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
          override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    
          override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
        }
        eventLoop.start()
    
        // attach rate controllers of input streams to receive batch completion updates
        for {
          inputDStream <- ssc.graph.getInputStreams
          rateController <- inputDStream.rateController
        } ssc.addStreamingListener(rateController)
    
        listenerBus.start()
        receiverTracker = new ReceiverTracker(ssc)
        inputInfoTracker = new InputInfoTracker(ssc)
    
        val executorAllocClient: ExecutorAllocationClient = ssc.sparkContext.schedulerBackend match {
          case b: ExecutorAllocationClient => b.asInstanceOf[ExecutorAllocationClient]
          case _ => null
        }
    
        executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
          executorAllocClient,
          receiverTracker,
          ssc.conf,
          ssc.graph.batchDuration.milliseconds,
          clock)
        executorAllocationManager.foreach(ssc.addStreamingListener)
        receiverTracker.start()
        jobGenerator.start()
        executorAllocationManager.foreach(_.start())
        logInfo("Started JobScheduler")
      }

    In its start method, JobScheduler first creates an EventLoop[JobSchedulerEvent], which handles job scheduling events. How each event is handled is defined in processEvent:

     private def processEvent(event: JobSchedulerEvent) {
        try {
          event match {
            case JobStarted(job, startTime) => handleJobStart(job, startTime)
            case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
            case ErrorReported(m, e) => handleError(m, e)
          }
        } catch {
          case e: Throwable =>
            reportError("Error in job scheduler", e)
        }
      }

    It then starts this event loop. Starting it launches a thread that consumes the events posted to eventQueue, which is a LinkedBlockingDeque.

    private val eventThread = new Thread(name) {
        setDaemon(true)
    
        override def run(): Unit = {
          try {
            while (!stopped.get) {
              val event = eventQueue.take()
              try {
                onReceive(event)
              } catch {
                case NonFatal(e) =>
                  try {
                    onError(e)
                  } catch {
                    case NonFatal(e) => logError("Unexpected error in " + name, e)
                  }
              }
            }
          } catch {
            case ie: InterruptedException => // exit even if eventQueue is not empty
            case NonFatal(e) => logError("Unexpected error in " + name, e)
          }
        }
    
      }
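
    The pattern in this thread is small enough to reproduce outside Spark. The following is a minimal sketch, not Spark's actual class: MiniEventLoop, DemoEvent, DemoJobStarted, DemoJobCompleted and MiniEventLoopDemo are hypothetical names invented for illustration. It shows a single daemon thread blocking on take() and dispatching each event through onReceive by pattern matching, in the same way processEvent does.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, LinkedBlockingDeque, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean
import scala.util.control.NonFatal

// Hypothetical, simplified stand-in for Spark's EventLoop.
abstract class MiniEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped.get()) {
          val event = eventQueue.take() // blocks until an event is posted
          try {
            onReceive(event)
          } catch {
            case NonFatal(e) => onError(e)
          }
        }
      } catch {
        case _: InterruptedException => // stop() interrupted the blocked take()
      }
    }
  }

  def start(): Unit = eventThread.start()

  def post(event: E): Unit = eventQueue.put(event)

  def stop(): Unit = {
    if (stopped.compareAndSet(false, true)) {
      eventThread.interrupt()
      eventThread.join()
    }
  }

  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit
}

// Hypothetical events, loosely modelled on JobSchedulerEvent.
sealed trait DemoEvent
case class DemoJobStarted(id: Int) extends DemoEvent
case class DemoJobCompleted(id: Int) extends DemoEvent

object MiniEventLoopDemo {
  // Posts the events and returns what the loop thread handled, in handling order.
  def run(events: Seq[DemoEvent]): List[String] = {
    val handled = new ConcurrentLinkedQueue[String]()
    val done = new CountDownLatch(events.size)
    val loop = new MiniEventLoop[DemoEvent]("demo-loop") {
      override protected def onReceive(event: DemoEvent): Unit = {
        event match {
          case DemoJobStarted(id)   => handled.add(s"started $id")
          case DemoJobCompleted(id) => handled.add(s"completed $id")
        }
        done.countDown()
      }
      override protected def onError(e: Throwable): Unit = done.countDown()
    }
    loop.start()
    events.foreach(e => loop.post(e))
    done.await(5, TimeUnit.SECONDS)
    loop.stop()
    handled.toArray(new Array[String](0)).toList
  }
}
```

    Because one thread drains the queue, MiniEventLoopDemo.run(Seq(DemoJobStarted(1), DemoJobCompleted(1))) handles the events strictly in posting order.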

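    Once event handling is up, JobScheduler starts receiverTracker and jobGenerator. receiverTracker distributes receivers, communicates with the receivers on the workers, and handles the messages they send back. The logic of jobGenerator.start is what matters here: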

    /** Start generation of jobs */
      def start(): Unit = synchronized {
        if (eventLoop != null) return // generator has already been started
    
        // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
        // See SPARK-10125
        checkpointWriter
    
        eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
          override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)
    
          override protected def onError(e: Throwable): Unit = {
            jobScheduler.reportError("Error in job generator", e)
          }
        }
        eventLoop.start()
    
        if (ssc.isCheckpointPresent) {
          restart()
        } else {
          startFirstTime()
        }
      }

    JobGenerator's start method creates an EventLoop[JobGeneratorEvent] that handles the job-related operations, which are defined in processEvent:

    /** Processes all events */
      private def processEvent(event: JobGeneratorEvent) {
        logDebug("Got event " + event)
        event match {
          case GenerateJobs(time) => generateJobs(time)
          case ClearMetadata(time) => clearMetadata(time)
          case DoCheckpoint(time, clearCheckpointDataLater) =>
            doCheckpoint(time, clearCheckpointDataLater)
          case ClearCheckpointData(time) => clearCheckpointData(time)
        }
      }

    After starting the event loop it checks for a checkpoint. On a first run, with no checkpoint present, it calls startFirstTime:

    /** Starts the generator for the first time */
      private def startFirstTime() {
        val startTime = new Time(timer.getStartTime())
        graph.start(startTime - graph.batchDuration)
        timer.start(startTime.milliseconds)
        logInfo("Started JobGenerator at " + startTime)
      }

    startFirstTime first computes a startTime, then starts the DStreamGraph, and then calls timer.start. The timer is created as follows:

    private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
        longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

    Tracing the code shows that what the timer thread ends up running is triggerActionForNextInterval:

    def start(startTime: Long): Long = synchronized {
        nextTime = startTime
        thread.start()
        logInfo("Started timer for " + name + " at time " + nextTime)
        nextTime
      }
    private val thread = new Thread("RecurringTimer - " + name) {
        setDaemon(true)
        override def run() { loop }
      }
    private def triggerActionForNextInterval(): Unit = {
        clock.waitTillTime(nextTime)
        callback(nextTime)
        prevTime = nextTime
        nextTime += period
        logDebug("Callback for " + name + " called at time " + prevTime)
      }
    
      /**
       * Repeatedly call the callback every interval.
       */
      private def loop() {
        try {
          while (!stopped) {
            triggerActionForNextInterval()
          }
          triggerActionForNextInterval()
        } catch {
          case e: InterruptedException =>
        }
      }
    }
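
    To see which batch times such a timer posts, the loop above can be modelled without a thread. This is only a sketch under stated assumptions: MiniRecurringTimer, nextMultiple, advanceTo and MiniRecurringTimerDemo are hypothetical names, a plain counter stands in for the Clock, and rounding the start time up to the next multiple of the period is an assumption about what getStartTime does.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, thread-free model of a recurring timer: the caller supplies "now".
class MiniRecurringTimer(period: Long, callback: Long => Unit) {
  private var nextTime = 0L

  // Assumed alignment rule: first multiple of `period` strictly after `now`.
  def nextMultiple(now: Long): Long = (now / period + 1) * period

  def start(startTime: Long): Unit = {
    nextTime = startTime
  }

  // Fires the callback once for every tick that is due at or before `now`.
  def advanceTo(now: Long): Unit = {
    while (nextTime <= now) {
      callback(nextTime)
      nextTime += period
    }
  }
}

object MiniRecurringTimerDemo {
  // Returns the batch times the callback would receive between startAt and until.
  def ticks(period: Long, startAt: Long, until: Long): List[Long] = {
    val posted = ArrayBuffer.empty[Long]
    val timer = new MiniRecurringTimer(period, t => { posted += t; () })
    timer.start(timer.nextMultiple(startAt))
    timer.advanceTo(until)
    posted.toList
  }
}
```

    In the real timer the callback is the function that posts GenerateJobs to the event loop, so each tick returned here corresponds to one batch time.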

    The callback invoked in triggerActionForNextInterval is the one passed in when the timer was created, eventLoop.post(GenerateJobs(new Time(longTime))). The eventLoop here is an EventLoop[JobGeneratorEvent], so the event ends up in generateJobs:

    /** Generate jobs and perform checkpointing for the given `time`.  */
      private def generateJobs(time: Time) {
        // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
        // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
        ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
        Try {
          jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
          graph.generateJobs(time) // generate jobs using allocated block
        } match {
          case Success(jobs) =>
            val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
            jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
          case Failure(e) =>
            jobScheduler.reportError("Error generating jobs for time " + time, e)
            PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
        }
        eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
      }

    This method first calls jobScheduler.receiverTracker.allocateBlocksToBatch(time) to assign the blocks received by the receivers to this batch, then calls graph.generateJobs(time) to generate the actual jobs from those blocks. Next, jobScheduler.submitJobSet:

    def submitJobSet(jobSet: JobSet) {
        if (jobSet.jobs.isEmpty) {
          logInfo("No jobs added for time " + jobSet.time)
        } else {
          listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
          jobSets.put(jobSet.time, jobSet)
          jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
          logInfo("Added jobs for time " + jobSet.time)
        }
      }

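    submitJobSet creates a JobHandler for each job in the JobSet and hands it to the job thread pool (jobExecutor), where it waits to be run. At this point job generation is logically complete.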
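
    The hand-off to the pool can be sketched with a plain java.util.concurrent executor. Everything named here (DemoJob, DemoJobSet, MiniJobScheduler, MiniJobSchedulerDemo) is hypothetical, and the pool is fixed at one thread so that the recorded order is deterministic; the real pool size comes from Spark's configuration, which this sketch does not model.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

// Hypothetical stand-ins for Job and JobSet.
case class DemoJob(time: Long, outputOpId: Int, body: () => Unit)
case class DemoJobSet(time: Long, jobs: Seq[DemoJob])

class MiniJobScheduler(numConcurrentJobs: Int) {
  private val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs)
  val log = new ConcurrentLinkedQueue[String]()

  // Plays the role of JobHandler: records start and completion around the job body.
  private class Handler(job: DemoJob) extends Runnable {
    override def run(): Unit = {
      log.add(s"started ${job.time}/${job.outputOpId}")
      job.body()
      log.add(s"completed ${job.time}/${job.outputOpId}")
    }
  }

  def submitJobSet(jobSet: DemoJobSet): Unit = {
    if (jobSet.jobs.isEmpty) {
      log.add(s"no jobs for ${jobSet.time}")
    } else {
      jobSet.jobs.foreach(job => jobExecutor.execute(new Handler(job)))
    }
  }

  // Stops accepting work and waits for the queued handlers to finish.
  def shutdownAndWait(): Boolean = {
    jobExecutor.shutdown()
    jobExecutor.awaitTermination(5, TimeUnit.SECONDS)
  }
}

object MiniJobSchedulerDemo {
  def run(): List[String] = {
    val scheduler = new MiniJobScheduler(numConcurrentJobs = 1)
    val noop = () => ()
    val jobs = Seq(DemoJob(1000L, 0, noop), DemoJob(1000L, 1, noop))
    scheduler.submitJobSet(DemoJobSet(1000L, jobs))
    scheduler.shutdownAndWait()
    scheduler.log.toArray(new Array[String](0)).toList
  }
}
```

    With a single worker thread the two output operations of the batch run one after the other, which is why submitting a job set returns immediately while the jobs themselves finish later.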

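    The original post included a sequence diagram here, drawn from the code, of how a job flows into the thread pool (image not preserved).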

    Next, JobHandler's run method:

    def run() {
          val oldProps = ssc.sparkContext.getLocalProperties
          try {
            ssc.sparkContext.setLocalProperties(SerializationUtils.clone(ssc.savedProperties.get()))
            val formattedTime = UIUtils.formatBatchTime(
              job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
            val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
            val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"
    
            ssc.sc.setJobDescription(
              s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
            ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
            ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)
            // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
            // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
            ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
    
            // We need to assign `eventLoop` to a temp variable. Otherwise, because
            // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
            // it's possible that when `post` is called, `eventLoop` happens to null.
            var _eventLoop = eventLoop
            if (_eventLoop != null) {
              _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
              // Disable checks for existing output directories in jobs launched by the streaming
              // scheduler, since we may need to write output to an existing directory during checkpoint
              // recovery; see SPARK-4835 for more details.
              SparkHadoopWriterUtils.disableOutputSpecValidation.withValue(true) {
                job.run()
              }
              _eventLoop = eventLoop
              if (_eventLoop != null) {
                _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
              }
            } else {
              // JobScheduler has been stopped.
            }
          } finally {
            ssc.sparkContext.setLocalProperties(oldProps)
          }
        }
      }

    After run has executed the job via job.run(), it posts JobCompleted(job, clock.getTimeMillis()) to the event loop. The event loop here is the EventLoop[JobSchedulerEvent], so the event is handled by handleJobCompletion:

    private def handleJobCompletion(job: Job, completedTime: Long) {
        val jobSet = jobSets.get(job.time)
        jobSet.handleJobCompletion(job)
        job.setEndTime(completedTime)
        listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo))
        logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
        if (jobSet.hasCompleted) {
          listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
        }
        job.result match {
          case Failure(e) =>
            reportError("Error running job " + job, e)
          case _ =>
            if (jobSet.hasCompleted) {
              jobSets.remove(jobSet.time)
              jobGenerator.onBatchCompletion(jobSet.time)
              logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
                jobSet.totalDelay / 1000.0, jobSet.time.toString,
                jobSet.processingDelay / 1000.0
              ))
            }
        }
      }

    Depending on job.result, if the job finished without error and the whole JobSet has completed, this method calls jobGenerator.onBatchCompletion(jobSet.time):

     def onBatchCompletion(time: Time) {
        eventLoop.post(ClearMetadata(time))
      }
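
    The condition under which cleanup is triggered is easy to miss in handleJobCompletion: the ClearMetadata event is only posted once the last job of a batch has finished successfully. A minimal, single-threaded sketch of that rule follows; MiniJobSet, MiniCompletionTracker and MiniCompletionDemo are hypothetical names, and the callback stands in for eventLoop.post(ClearMetadata(time)).

```scala
import scala.collection.mutable

// Hypothetical model of a JobSet: tracks which output operations are still running.
class MiniJobSet(val time: Long, numJobs: Int) {
  private val incomplete = mutable.Set.empty[Int]
  incomplete ++= (0 until numJobs)

  def handleJobCompletion(outputOpId: Int): Unit = {
    incomplete -= outputOpId
  }

  def hasCompleted: Boolean = incomplete.isEmpty
}

// Hypothetical model of the scheduler side: decides when a batch is done.
class MiniCompletionTracker(onBatchCompletion: Long => Unit) {
  private val jobSets = mutable.Map.empty[Long, MiniJobSet]

  def submit(jobSet: MiniJobSet): Unit = {
    jobSets(jobSet.time) = jobSet
  }

  def handleJobCompletion(time: Long, outputOpId: Int, failed: Boolean = false): Unit = {
    val jobSet = jobSets(time)
    jobSet.handleJobCompletion(outputOpId)
    // As in the code above: a failed job never triggers the cleanup path.
    if (!failed && jobSet.hasCompleted) {
      jobSets.remove(time)
      onBatchCompletion(time) // stands in for posting ClearMetadata(time)
    }
  }
}

object MiniCompletionDemo {
  // Returns (cleanups seen after the first job, all batch times cleaned at the end).
  def run(): (Int, List[Long]) = {
    val cleared = mutable.ListBuffer.empty[Long]
    val tracker = new MiniCompletionTracker(t => { cleared += t; () })
    tracker.submit(new MiniJobSet(1000L, 2))
    tracker.handleJobCompletion(1000L, 0) // one output operation still pending
    val afterFirst = cleared.size
    tracker.handleJobCompletion(1000L, 1) // batch complete, cleanup is requested
    (afterFirst, cleared.toList)
  }
}
```

    So cleanup is requested exactly once per completed batch, and whatever a DStream fails to release in that one call stays in memory.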

    Here the event loop posts a ClearMetadata event, the signal to clean up metadata. It is received by the EventLoop[JobGeneratorEvent], which calls clearMetadata:

    /** Clear DStream metadata for the given `time`. */
      private def clearMetadata(time: Time) {
        ssc.graph.clearMetadata(time)
    
        // If checkpointing is enabled, then checkpoint,
        // else mark batch to be fully processed
        if (shouldCheckpoint) {
          eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = true))
        } else {
          // If checkpointing is not enabled, then delete metadata information about
          // received blocks (block data not saved in any case). Otherwise, wait for
          // checkpointing of this batch to complete.
          val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
          jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
          jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
          markBatchFullyProcessed(time)
        }
      }

    This calls ssc.graph.clearMetadata(time):

    def clearMetadata(time: Time) {
        logDebug("Clearing metadata for time " + time)
        this.synchronized {
          outputStreams.foreach(_.clearMetadata(time))
        }
        logDebug("Cleared old metadata for time " + time)
      }

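    This calls clearMetadata on every output stream. outputStreams is defined in DStreamGraph; calling an output operator that triggers a job, such as foreachRDD, calls DStream.register, which adds a new output stream. The call finally reaches DStream's clearMetadata method: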

    private[streaming] def clearMetadata(time: Time) {
        val unpersistData = ssc.conf.getBoolean("spark.streaming.unpersist", true)
        val oldRDDs = generatedRDDs.filter(_._1 <= (time - rememberDuration))
        logDebug("Clearing references to old RDDs: [" +
          oldRDDs.map(x => s"${x._1} -> ${x._2.id}").mkString(", ") + "]")
        generatedRDDs --= oldRDDs.keys
        if (unpersistData) {
          logDebug(s"Unpersisting old RDDs: ${oldRDDs.values.map(_.id).mkString(", ")}")
          oldRDDs.values.foreach { rdd =>
            rdd.unpersist(false)
            // Explicitly remove blocks of BlockRDD
            rdd match {
              case b: BlockRDD[_] =>
                logInfo(s"Removing blocks of RDD $b of time $time")
                b.removeBlocks()
              case _ =>
            }
          }
        }
        logDebug(s"Cleared ${oldRDDs.size} RDDs that were older than " +
          s"${time - rememberDuration}: ${oldRDDs.keys.mkString(", ")}")
        dependencies.foreach(_.clearMetadata(time))
      }

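    This removes the old RDDs from generatedRDDs and finally calls dependencies.foreach(_.clearMetadata(time)) to clean up the upstream DStreams. dependencies is declared in DStream as def dependencies: List[DStream[_]] and is overridden in the subclasses. An input stream is the first link in the dependency chain, so its list is empty; other DStreams, for example MappedDStream, define it as List(parent), pointing at the parent DStream. The dependency relationships can therefore be expressed through dependencies. For an input stream: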

    override def dependencies: List[DStream[_]] = List()
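
    How the cleanup call travels from an output stream back to the input stream can be shown with a miniature hierarchy. This is a sketch, not Spark code: MiniDStream, MiniInputDStream, MiniMappedDStream and MiniDependencyDemo are hypothetical names, times are plain Long milliseconds, and strings stand in for RDDs.

```scala
import scala.collection.mutable

// Hypothetical miniature of the DStream hierarchy, keeping only what clearMetadata needs.
abstract class MiniDStream(val name: String, rememberDuration: Long) {
  val generated = mutable.HashMap.empty[Long, String] // batch time -> placeholder "RDD"

  def dependencies: List[MiniDStream]

  def clearMetadata(time: Long): Unit = {
    val oldKeys = generated.keys.filter(_ <= time - rememberDuration).toList
    generated --= oldKeys
    dependencies.foreach(_.clearMetadata(time)) // recurse into the parents
  }
}

// An input stream has no parents.
class MiniInputDStream(streamName: String, remember: Long)
    extends MiniDStream(streamName, remember) {
  override def dependencies: List[MiniDStream] = List()
}

// A derived stream points at the stream it was built from.
class MiniMappedDStream(streamName: String, remember: Long, parent: MiniDStream)
    extends MiniDStream(streamName, remember) {
  override def dependencies: List[MiniDStream] = List(parent)
}

object MiniDependencyDemo {
  // Fills five batches, clears at the last one, and returns the surviving batch times.
  def run(): Map[String, List[Long]] = {
    val input = new MiniInputDStream("input", 2000L)
    val mapped = new MiniMappedDStream("mapped", 2000L, input)
    val streams = List(input, mapped)
    for (t <- 1000L to 5000L by 1000L; s <- streams) {
      s.generated(t) = s"rdd-${s.name}-$t"
    }
    mapped.clearMetadata(5000L) // only the downstream stream is called directly
    streams.map(s => s.name -> s.generated.keys.toList.sorted).toMap
  }
}
```

    Only the downstream stream is called directly, yet the input stream is trimmed as well, because the base implementation recurses through dependencies.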

    The project receives its data with textFileStream(), which is implemented by FileInputDStream, and FileInputDStream overrides clearMetadata:

    protected[streaming] override def clearMetadata(time: Time) {
        batchTimeToSelectedFiles.synchronized {
          val oldFiles = batchTimeToSelectedFiles.filter(_._1 < (time - rememberDuration))
          batchTimeToSelectedFiles --= oldFiles.keys
          recentlySelectedFiles --= oldFiles.values.flatten
          logInfo("Cleared " + oldFiles.size + " old files that were older than " +
            (time - rememberDuration) + ": " + oldFiles.keys.mkString(", "))
          logDebug("Cleared files are:\n" +
            oldFiles.map(p => (p._1, p._2.mkString(", "))).mkString("\n"))
        }
        // Delete file mod times that weren't accessed in the last round of getting new files
        fileToModTime.clearOldValues(lastNewFileFindingTime - 1)
      }

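    This is FileInputDStream's override. It only cleans up the file records and does nothing about the RDDs in generatedRDDs, so the cleanup after each batch is incomplete, memory keeps growing, and the application eventually hits the OOM. The bug is fixed in 2.3.0.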
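
    The effect of an override that skips the base-class cleanup can be reproduced in a few lines. The model below is a sketch under stated assumptions: ModelDStream, ModelFileStream and LeakDemo are hypothetical names, strings stand in for RDDs and files, and the callSuper flag assumes that the fix amounts to also invoking the base-class clearMetadata (an assumption based on the description above, not a quote of the actual patch).

```scala
import scala.collection.mutable

// Hypothetical base class: remembers one "RDD" per batch and trims old ones.
class ModelDStream(rememberDuration: Long) {
  val generatedRDDs = mutable.HashMap.empty[Long, String]

  def compute(time: Long): Unit = {
    generatedRDDs(time) = s"rdd@$time"
  }

  def clearMetadata(time: Long): Unit = {
    val old = generatedRDDs.keys.filter(_ <= time - rememberDuration).toList
    generatedRDDs --= old
  }
}

// Hypothetical file stream: keeps its own file bookkeeping and overrides the cleanup.
class ModelFileStream(remember: Long, callSuper: Boolean) extends ModelDStream(remember) {
  val batchTimeToSelectedFiles = mutable.HashMap.empty[Long, String]

  override def compute(time: Long): Unit = {
    super.compute(time)
    batchTimeToSelectedFiles(time) = s"file-$time.log"
  }

  override def clearMetadata(time: Long): Unit = {
    // With callSuper = false this mirrors the behaviour described above:
    // the file records are trimmed, generatedRDDs is never touched.
    if (callSuper) super.clearMetadata(time)
    val oldFiles = batchTimeToSelectedFiles.keys.filter(_ < time - remember).toList
    batchTimeToSelectedFiles --= oldFiles
  }
}

object LeakDemo {
  // Runs one-second batches and returns (generatedRDDs.size, batchTimeToSelectedFiles.size).
  def run(callSuper: Boolean, batches: Int): (Int, Int) = {
    val stream = new ModelFileStream(2000L, callSuper)
    for (i <- 1 to batches) {
      val time = i * 1000L
      stream.compute(time)
      stream.clearMetadata(time)
    }
    (stream.generatedRDDs.size, stream.batchTimeToSelectedFiles.size)
  }
}
```

    With callSuper = false, generatedRDDs gains one entry per batch and never shrinks, while the file map stays bounded. That matches the symptom seen in the Spark UI, where usage dropped after each batch but still ended higher than the batch before. With callSuper = true both maps stay bounded by the remember duration.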

  • Original article: https://www.cnblogs.com/ldsggv/p/9416166.html