  • A production incident: the YARN RM failing over repeatedly

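    Early on a weekend morning an alert woke me up: the RM was failing over again and again.

    I rushed to investigate and found two different errors in the logs.

    Error 1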

    ervation <memory:0, vCores:0>
    2019-12-21 11:51:57,781 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
    java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode.unreserveResource(FSSchedulerNode.java:88)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.unreserve(FSAppAttempt.java:589)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainerInternal(FairScheduler.java:899)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:564)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:846)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1479)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:117)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:804)
        at java.lang.Thread.run(Thread.java:748)

    Error 2

    2019-12-21 07:37:45,533 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
    java.lang.NullPointerException
            at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainerInternal(FairScheduler.java:902)
            at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:564)
            at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:837)
            at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1475)
            at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:117)
            at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:804)
            at java.lang.Thread.run(Thread.java:748)
    2019-12-21 07:37:45,534 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..

    The relevant source in FairScheduler:

     @Override
      protected void completedContainerInternal(
          RMContainer rmContainer, ContainerStatus containerStatus,
          RMContainerEventType event) {
        try {
          writeLock.lock();
          Container container = rmContainer.getContainer();
    
          // Get the application for the finished container
          FSAppAttempt application =
            getCurrentAttemptForContainer(container.getId());
          ApplicationId appId =
            container.getId().getApplicationAttemptId().getApplicationId();
          if (application == null) {
            LOG.info("Container " + container + " of" +
              " finished application " + appId +
              " completed with event " + event);
            return;
          }
    
          // Get the node on which the container was allocated
          FSSchedulerNode node = getFSSchedulerNode(container.getNodeId());
    
          if (rmContainer.getState() == RMContainerState.RESERVED) {
            application.unreserve(rmContainer.getReservedPriority(), node); // releases this container's resources on the node
          } else {
            try {
              application.containerCompleted(rmContainer, containerStatus, event); 
              node.releaseContainer(rmContainer.getContainerId(), false);
              updateRootQueueMetrics();
              LOG.info("Application attempt " + application.getApplicationAttemptId()
                      + " released container " + container.getId() + " on node: " + node
                      + " with event: " + event);
            }catch (Exception e){
              LOG.error(e.getMessage(), e);
            }
          }
        } finally {
          writeLock.unlock();
        }
      }

    Stepping into that call:

      /**
       * Remove the reservation on {@code node} at the given {@link Priority}.
       * This dispatches SchedulerNode handlers as well.
       */
      public void unreserve(Priority priority, FSSchedulerNode node) {
        RMContainer rmContainer = node.getReservedContainer();
        unreserveInternal(priority, node);
        node.unreserveResource(this);
        clearReservation(node);
        getMetrics().unreserveResource(node.getPartition(),
            getUser(), rmContainer.getContainer().getResource());
      }

      // In FSSchedulerNode:
      @Override
      public synchronized void unreserveResource(
          SchedulerApplicationAttempt application) {
        // Cannot unreserve for wrong application...
        ApplicationAttemptId reservedApplication = 
            getReservedContainer().getContainer().getId().getApplicationAttemptId(); // the container's attempt id cannot be obtained here: NullPointerException
        if (!reservedApplication.equals(
            application.getApplicationAttemptId())) {
          throw new IllegalStateException("Trying to unreserve " +  
              " for application " + application.getApplicationId() + 
              " when currently reserved " + 
              " for application " + reservedApplication.getApplicationId() + 
              " on node " + this);
        }
        
        setReservedContainer(null);
        this.reservedAppSchedulable = null;
      }

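    The second error:

    rmContainer is null. If the call to removeApplicationAttempt and the processing of moveApplication for the same attempt
    happen very close together, the application attempt still holds a queue reference but has already been removed from
    the queue's application list.
    If two calls to removeApplicationAttempt arrive back to back, the application attempt likewise still holds a queue
    reference but has already been removed from the queue's application list.
    In both cases, the second call has to arrive before the removeApplication call is made.

    In short, the container is released twice: it has already been released on that node, so the state is inconsistent.
    A write lock is used here. Once one thread has read the containerId and another thread has released the container, releasing it again throws an exception.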
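
    To make the double release concrete, here is a minimal stand-alone sketch. It is not Hadoop code: ReservedContainer and SketchNode are hypothetical stand-ins for RMContainer and FSSchedulerNode, and they model only the reservation field. The sketch assumes, as in unreserveResource above, that the reserved container is dereferenced before it is cleared, so a second release on the same node hits a null reference.

```java
// Hypothetical, simplified stand-ins for RMContainer and FSSchedulerNode.
// This is not Hadoop code; it only models the reservation field.
final class ReservedContainer {
    private final String attemptId;

    ReservedContainer(String attemptId) {
        this.attemptId = attemptId;
    }

    String getAttemptId() {
        return attemptId;
    }
}

final class SketchNode {
    private ReservedContainer reserved;

    void reserve(ReservedContainer container) {
        reserved = container;
    }

    // Same shape as unreserveResource: dereference first, then clear.
    void unreserve(String attemptId) {
        String reservedAttempt = reserved.getAttemptId(); // NPE if already cleared
        if (!reservedAttempt.equals(attemptId)) {
            throw new IllegalStateException("reserved for " + reservedAttempt);
        }
        reserved = null;
    }
}

final class DoubleUnreserveSketch {
    // Returns true if the second unreserve on the same node throws an NPE.
    static boolean secondUnreserveThrowsNpe() {
        SketchNode node = new SketchNode();
        node.reserve(new ReservedContainer("attempt_1"));
        node.unreserve("attempt_1"); // first release succeeds and clears the field
        try {
            node.unreserve("attempt_1"); // second release sees a null reservation
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(secondUnreserveThrowsNpe());
    }
}
```

    In the sketch the two releases happen one after the other on a single thread; in the incident they presumably came from two nearly simultaneous events, but the effect on the reservation field is the same.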

    Fix 1
     /**
       * Clean up a completed container.
       */
      @Override
      protected synchronized void completedContainerInternal(
          RMContainer rmContainer, ContainerStatus containerStatus,
          RMContainerEventType event) {
        try {
         // writeLock.lock(); // write lock commented out; the method is synchronized instead
    
          Container container = rmContainer.getContainer();
    
          // Get the application for the finished container
          FSAppAttempt application =
            getCurrentAttemptForContainer(container.getId());
          ApplicationId appId =
            container.getId().getApplicationAttemptId().getApplicationId();
          if (application == null) {
            LOG.info("Container " + container + " of" +
              " finished application " + appId +
              " completed with event " + event);
            return;
          }

    Fix 2

          // Get the node on which the container was allocated
          FSSchedulerNode node = getFSSchedulerNode(container.getNodeId());
          try {
            if (rmContainer.getState() == RMContainerState.RESERVED) {
              application.unreserve(rmContainer.getReservedPriority(), node);
            } else {
              // try {  // the try was moved up so that it also covers the unreserve call
              application.containerCompleted(rmContainer, containerStatus, event);
              node.releaseContainer(rmContainer.getContainerId(), false);
              updateRootQueueMetrics();
              LOG.info("Application attempt " + application.getApplicationAttemptId()
                      + " released container " + container.getId() + " on node: " + node
                      + " with event: " + event);
            }
          } catch (Exception e) {
            LOG.error(e.getMessage(), e); // handle the exception here instead of letting it propagate
          }
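
    The effect of fix 2 can be illustrated with a small stand-alone sketch. It is not Hadoop code: EventLoopSketch, Handler, UNGUARDED and GUARDED are hypothetical names. The sketch assumes, as the FATAL line followed by "Exiting, bbye.." in the log above suggests, that an exception escaping the scheduler's event handler stops event processing, and it compares that with a handler that catches and logs the exception.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Hadoop code: a dispatcher that processes events
// in order and treats any exception escaping a handler as fatal.
final class EventLoopSketch {
    interface Handler {
        void handle(String event);
    }

    // Processes events until a handler call throws; returns the events handled.
    static List<String> run(List<String> events, Handler handler) {
        List<String> handled = new ArrayList<>();
        for (String event : events) {
            try {
                handler.handle(event);
                handled.add(event);
            } catch (RuntimeException ex) {
                break; // models the FATAL log line followed by the process exiting
            }
        }
        return handled;
    }

    // Handler without a guard: the exception from the "bad" event escapes.
    static final Handler UNGUARDED = event -> {
        if (event.equals("bad")) {
            throw new NullPointerException("reservation already cleared");
        }
    };

    // Handler in the style of fix 2: catch, log, and carry on.
    static final Handler GUARDED = event -> {
        try {
            UNGUARDED.handle(event);
        } catch (Exception ex) {
            System.err.println("ERROR " + ex.getMessage());
        }
    };

    public static void main(String[] args) {
        List<String> events = Arrays.asList("a", "bad", "b");
        System.out.println(run(events, UNGUARDED)); // stops at the bad event
        System.out.println(run(events, GUARDED));   // keeps going after logging
    }
}
```

    With the guard in place the loop survives the bad event and continues with the next one. The underlying state inconsistency is still there; it is only logged instead of bringing the process down.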
     
  • Original article: https://www.cnblogs.com/songchaolin/p/12076999.html