  • [Storm] java.io.FileNotFoundException: File '../stormconf.ser' does not exist

    This bug will kill supervisors

    Affects Version/s: 0.9.2-incubating, 0.9.3, 0.9.4 

    Fix Version/s: 0.10.0, 0.9.5

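    Background

    A Storm cluster I had just set up lost more than half of its Supervisors shortly after starting. The logs of the dead Supervisors showed the exception java.io.FileNotFoundException: File '../stormconf.ser' does not exist. Most of the answers online say:

        Empty the files under the { storm.local.dir } directory and restart.

    But that treats the symptom, not the cause: a restart gets things running again, yet it does not explain why the problem occurs in the first place.

    I then found that STORM-130 addresses this issue. The scenario that reproduces it: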

    1) Run a Storm cluster with at least 2 supervisors with 4 slots each
    2) Deploy a topology that uses 4 workers; the topology is distributed so that each supervisor runs two workers
    3) Kill one of the supervisors, say supervisor1
    4) Wait until the topology rebalances to occupy 4 workers on supervisor2
    5) Bring supervisor1 back up; it goes through the cycle of cleaning up old topology code
    6) Nimbus rebalances the topology, which triggers the supervisor's sync-processes method
    7) sync-processes tries to launch a worker for the topology whose code was deleted when the supervisor started, causing it to throw the exception above

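    Cause

    The sync-processes function mentioned in the scenario above is one of the functions the supervisor runs. The Supervisor runs these two functions in the background: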

    • synchronize-supervisor: This is called whenever assignments in Zookeeper change and also every 10 seconds. 
      • Downloads code from Nimbus for topologies assigned to this machine for which it doesn't have the code yet. 
      • Writes into local filesystem what this node is supposed to be running. It writes a map from port -> LocalAssignment. LocalAssignment contains a topology id as well as the list of task ids for that worker. 
    • sync-processes: Reads from the LFS what synchronize-supervisor wrote and compares that to what's actually running on the machine. It then starts/stops worker processes as necessary to synchronize. 
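
    The comparison that sync-processes performs can be sketched roughly as below. This is a minimal illustration, not Storm's actual code (the supervisor is written in Clojure): the `LocalAssignment` dataclass, the `plan_sync` helper, the port numbers, and the topology ids are all hypothetical names chosen for the example. It only shows the idea of diffing the port -> LocalAssignment map against what is running.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LocalAssignment:
    """Hypothetical stand-in: a topology id plus the task ids for one worker."""
    topology_id: str
    task_ids: Tuple[int, ...]


def plan_sync(desired: Dict[int, LocalAssignment],
              running: Dict[int, LocalAssignment]) -> Tuple[List[int], List[int]]:
    """Compare desired state with running state.

    Returns (ports to start, ports to stop). A port whose assignment
    changed appears in both lists: stop the old worker, start the new one.
    """
    to_stop = sorted(p for p, a in running.items() if desired.get(p) != a)
    to_start = sorted(p for p, a in desired.items() if running.get(p) != a)
    return to_start, to_stop


# Local state as synchronize-supervisor might have written it (illustrative values).
desired = {
    6700: LocalAssignment("topo-1", (1, 2)),
    6701: LocalAssignment("topo-1", (3, 4)),
}
# What is actually running on the machine (illustrative values).
running = {
    6701: LocalAssignment("topo-1", (3, 4)),
    6702: LocalAssignment("old-topo", (1,)),
}

to_start, to_stop = plan_sync(desired, running)
print("start:", to_start, "stop:", to_stop)
```

    Note that the plan is computed from a snapshot of local state; nothing in this sketch guarantees that the topology code on disk still exists by the time a worker on a "start" port is actually launched.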

    As the description shows, synchronize-supervisor and sync-processes coordinate through the local filesystem (LFS). The key reason for the bug is that the "synchronize-supervisor" thread, which is responsible for downloading and removing files, and the "sync-processes" thread, which is responsible for starting worker processes, run asynchronously.

    In synchronize-supervisor, the supervisor reads assignment information from ZooKeeper, downloads the necessary files from Nimbus, and writes local state. In another thread, sync-processes reads that local state to launch worker processes. If synchronize-supervisor is called again before the worker process has started, and the topology's assignment has changed in the meantime (caused by a rebalance, a worker timeout, etc.) so that the worker assigned to this supervisor has moved to another supervisor, synchronize-supervisor removes the now-unnecessary files (the jar file, the ser file, etc.). When sync-processes then launches the worker, the ser file no longer exists, and this issue occurs.
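
    The interleaving described above can be replayed deterministically in a few lines. This is only a simulation under stated assumptions: the three helper functions are hypothetical stand-ins for the supervisor's behaviour, the directory layout (one folder per topology id containing stormconf.ser) is simplified for illustration, and the steps run in sequence in one thread rather than in two real threads.

```python
import os
import shutil
import tempfile


def download_code(local_dir, topology_id):
    """Stand-in for synchronize-supervisor fetching topology files from Nimbus."""
    topo_dir = os.path.join(local_dir, topology_id)
    os.makedirs(topo_dir, exist_ok=True)
    with open(os.path.join(topo_dir, "stormconf.ser"), "wb") as f:
        f.write(b"serialized-conf")


def remove_unassigned_code(local_dir, assigned):
    """Stand-in for synchronize-supervisor deleting code it no longer needs."""
    for topology_id in os.listdir(local_dir):
        if topology_id not in assigned:
            shutil.rmtree(os.path.join(local_dir, topology_id))


def launch_worker(local_dir, topology_id):
    """Stand-in for sync-processes starting a worker, which reads stormconf.ser."""
    with open(os.path.join(local_dir, topology_id, "stormconf.ser"), "rb") as f:
        return f.read()


local_dir = tempfile.mkdtemp()
try:
    # 1. synchronize-supervisor sees the assignment and downloads the code.
    download_code(local_dir, "topo-1")
    # 2. sync-processes reads local state and decides to launch a worker.
    pending_launch = "topo-1"
    # 3. Before the launch happens, the assignment moves to another
    #    supervisor; synchronize-supervisor runs again and cleans up.
    remove_unassigned_code(local_dir, assigned=set())
    # 4. The launch proceeds on the stale decision and the file is gone.
    try:
        launch_worker(local_dir, pending_launch)
        outcome = "worker started"
    except FileNotFoundError:
        outcome = "FileNotFoundError"
finally:
    shutil.rmtree(local_dir, ignore_errors=True)

print(outcome)
```

    If step 3 is skipped, or happens only after step 4, the worker finds its files and starts normally, which is why the failure depends on timing.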

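    Possible fixes

    • Switch to a Storm version that includes the fix (0.9.5 or 0.10.0, per the Fix Version/s above)
    • Tune parameters
      • Change the "synchronize-supervisor" thread loop time to longer than 10 sec (the default), such as 30 sec.
      • supervisor.worker.timeout.secs: 30 -> 5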

    References:

    • https://issues.apache.org/jira/browse/STORM-130
    • http://storm.apache.org/documentation/Lifecycle-of-a-topology.html

     

  • Original article: https://www.cnblogs.com/qingwen/p/4997302.html