  • YARN query of /cluster/nodes returns localhost for every node

    Background:

      1. IPv6 is disabled.

      2. /etc/hosts is configured correctly on every node, and jobs are submitted from the ResourceManager.

      3. yarn-site.xml specifies

        yarn.resourcemanager.hostname=Master
        yarn.nodemanager.aux-services=mapreduce_shuffle
        and the corresponding yarn.nodemanager.hostname is set on each NodeManager

      4. mapred-site.xml specifies mapreduce.framework.name=yarn

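    Symptom:

      Submitting an MR job fails with a connection-refused stack trace. The container address it tries to connect to is localhost, which is not the address actually required.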

    User: root
    Name: Bigdata-Hadoop-1.0-SNAPSHOT.jar
    Application Type: MAPREDUCE
    Application Tags:  
    YarnApplicationState: FAILED
    Queue: default
    FinalStatus Reported by AM: FAILED
    Started: Thu Nov 22 21:59:31 +0800 2018
    Elapsed: 6mins, 1sec
    Tracking URL: History
    Diagnostics:
    Application application_1542889591013_0006 failed 2 times due to Error launching appattempt_1542889591013_0006_000002. Got exception: java.net.ConnectException: Call From localhost/127.0.0.1 to localhost:33070 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.GeneratedConstructorAccessor59.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
    at org.apache.hadoop.ipc.Client.call(Client.java:1480)
    at org.apache.hadoop.ipc.Client.call(Client.java:1413)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy83.startContainers(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
    at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy84.startContainers(Unknown Source)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:250)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:615)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:713)
    at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
    at org.apache.hadoop.ipc.Client.call(Client.java:1452)
    ... 15 more
    . Failing the application.

     

    In addition, for the two attempts listed at the bottom of the page, the driver address is also localhost.

     

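    Querying YARN shows that in the cluster node information it returns, every NodeManager address is localhost.

    All of this confirms that the NodeManager addresses reported by YARN are wrong. The ResourceManager therefore cannot make the remote call to a NodeManager to start the Container, which directly causes the MR job to fail.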
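      For a quick check of what the ResourceManager reports, the same node list is available as JSON from the REST endpoint /ws/v1/cluster/nodes, and the nodeHostName field can be pulled out of the response. The following is only a sketch: the class name NodeHostNames, the sample JSON and the second port number are made up for illustration, and a regex stands in for a proper JSON parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: collect every nodeHostName from a /ws/v1/cluster/nodes JSON body. */
final class NodeHostNames {
    // Matches "nodeHostName":"<value>". Good enough for a quick check;
    // a real tool should use a JSON parser instead.
    private static final Pattern HOST =
        Pattern.compile("\"nodeHostName\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns the nodeHostName values in the order they appear. */
    static List<String> extract(String json) {
        List<String> hosts = new ArrayList<>();
        Matcher m = HOST.matcher(json);
        while (m.find()) {
            hosts.add(m.group(1));
        }
        return hosts;
    }

    /** True when at least one node is listed and all of them are "localhost". */
    static boolean allLocalhost(List<String> hosts) {
        if (hosts.isEmpty()) {
            return false;
        }
        for (String h : hosts) {
            if (!"localhost".equals(h)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hand-written sample in the shape of the REST response, not captured output.
        String sample = "{\"nodes\":{\"node\":["
            + "{\"id\":\"localhost:33070\",\"nodeHostName\":\"localhost\",\"state\":\"RUNNING\"},"
            + "{\"id\":\"localhost:41022\",\"nodeHostName\":\"localhost\",\"state\":\"RUNNING\"}]}}";
        List<String> hosts = extract(sample);
        System.out.println(hosts + " allLocalhost=" + allLocalhost(hosts));
    }
}
```

      On a real cluster the input would be the body returned by http://<rm-host>:8088/ws/v1/cluster/nodes (8088 being the default RM web port). If allLocalhost comes back true, the cluster is in the state described above.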

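    Approach:

      1. Searched blogs all over the web: nothing.

      2. Asked around in various chat groups: nothing.

      3. Fell back on my own effort and read the source code. To be continued ...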

    Source code:

      Without an entry point there is no use reading the source, so start from the one below.

      org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java:252

      @GET
      @Path("/nodes")
      @Produces({ MediaType.APPLICATION_JSON, MediaType.APPLICATION_XML })
      public NodesInfo getNodes(@QueryParam("states") String states) {
        init();
        ResourceScheduler sched = this.rm.getResourceScheduler();
        if (sched == null) {
          throw new NotFoundException("Null ResourceScheduler instance");
        }
        
        EnumSet<NodeState> acceptedStates;
        if (states == null) {
          acceptedStates = EnumSet.allOf(NodeState.class);
        } else {
          acceptedStates = EnumSet.noneOf(NodeState.class);
          for (String stateStr : states.split(",")) {
            acceptedStates.add(
                NodeState.valueOf(StringUtils.toUpperCase(stateStr)));
          }
        }
        
        Collection<RMNode> rmNodes = RMServerUtils.queryRMNodes(this.rm.getRMContext(),
            acceptedStates);
        NodesInfo nodesInfo = new NodesInfo();
        for (RMNode rmNode : rmNodes) {
          NodeInfo nodeInfo = new NodeInfo(rmNode, sched);
          if (EnumSet.of(NodeState.LOST, NodeState.DECOMMISSIONED, NodeState.REBOOTED)
              .contains(rmNode.getState())) {
            nodeInfo.setNodeHTTPAddress(EMPTY);
          }
          nodesInfo.add(nodeInfo);
        }
        
        return nodesInfo;
      }

     

      This is where the node information is generated.
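      The handling of the states query parameter in getNodes can be tried out in isolation. This is a sketch only: the nested NodeState enum is a stand-in for Hadoop's own enum so that no Hadoop jar is needed, the class name StatesFilterSketch is made up, and Locale.ROOT is my choice for the upper-casing step.

```java
import java.util.EnumSet;
import java.util.Locale;

final class StatesFilterSketch {
    // Stand-in for org.apache.hadoop.yarn.api.records.NodeState,
    // limited to the states that appear in the excerpts of this post.
    enum NodeState { NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, REBOOTED }

    /** Mirrors the branch in getNodes: null means "all states". */
    static EnumSet<NodeState> parse(String states) {
        if (states == null) {
            return EnumSet.allOf(NodeState.class);
        }
        EnumSet<NodeState> accepted = EnumSet.noneOf(NodeState.class);
        for (String s : states.split(",")) {
            // valueOf throws IllegalArgumentException for an unknown name;
            // like the original loop, this does not trim whitespace.
            accepted.add(NodeState.valueOf(s.toUpperCase(Locale.ROOT)));
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(parse(null));
        System.out.println(parse("running,unhealthy"));
    }
}
```

      So a request such as /nodes?states=running,unhealthy narrows the list, while leaving the parameter out returns nodes in every state.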

      org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeInfo.java:57

    public NodeInfo(RMNode ni, ResourceScheduler sched) {
        NodeId id = ni.getNodeID();
        SchedulerNodeReport report = sched.getNodeReport(id);
        this.numContainers = 0;
        this.usedMemoryMB = 0;
        this.availMemoryMB = 0;
        if (report != null) {
          this.numContainers = report.getNumContainers();
          this.usedMemoryMB = report.getUsedResource().getMemory();
          this.availMemoryMB = report.getAvailableResource().getMemory();
          this.usedVirtualCores = report.getUsedResource().getVirtualCores();
          this.availableVirtualCores = report.getAvailableResource().getVirtualCores();
        }
        this.id = id.toString();
        this.rack = ni.getRackName();
        this.nodeHostName = ni.getHostName();
        this.state = ni.getState();
        this.nodeHTTPAddress = ni.getHttpAddress();
        this.lastHealthUpdate = ni.getLastHealthReportTime();
        this.healthReport = String.valueOf(ni.getHealthReport());
        // (the remainder of the constructor is not shown in this excerpt)
    }

      All three key fields come from this ni object, so the next step is to see where ni itself comes from.

      org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java:63

     public static List<RMNode> queryRMNodes(RMContext context,
          EnumSet<NodeState> acceptedStates) {
        // nodes contains nodes that are NEW, RUNNING OR UNHEALTHY
        ArrayList<RMNode> results = new ArrayList<RMNode>();
        if (acceptedStates.contains(NodeState.NEW) ||
            acceptedStates.contains(NodeState.RUNNING) ||
            acceptedStates.contains(NodeState.UNHEALTHY)) {
          for (RMNode rmNode : context.getRMNodes().values()) {
            if (acceptedStates.contains(rmNode.getState())) {
              results.add(rmNode);
            }
          }
        }
        // (handling of the remaining node states is not shown in this excerpt)
        return results;
     }

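      So this context holds something interesting. How the context is initialized is a topic for next time; first look at how RMNodes is used inside it.

      The rest of the time was spent wrestling with YARN, but in the end I could not find out how the hostname became localhost instead of the expected worker node hostname. The code base is large and intricate and needs more time to untangle, so the source reading continues next time. Still, once you know the basic principles, a pass through the source is quite an effective way to deepen that understanding.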

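      Although reading the source did not give the answer, it prompted a bold guess: resolving a hostname from an IP takes the first hostname in the hosts file that matches that IP (to be confirmed). So I moved the worker node's IP and hostname to the first line of the hosts file and restarted the YARN cluster, and the MR job immediately ran through.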
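      The guess can be written down as a tiny model. This sketch only encodes the unconfirmed hypothesis ("first matching entry wins"); it is not how the system resolver or Hadoop is actually implemented. The class name HostsFirstMatch, the address 192.168.1.11 and the name Slave1 are made up, since the real hosts file is not shown in this post.

```java
import java.util.Optional;

final class HostsFirstMatch {
    /**
     * Returns the first hostname on the first hosts-file line whose
     * address field equals ip, or empty if no line matches.
     */
    static Optional<String> firstHostnameFor(String hostsContent, String ip) {
        for (String rawLine : hostsContent.split("\\R")) {
            // Drop a trailing comment, then split the rest on whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\\s+");
            if (fields.length >= 2 && fields[0].equals(ip)) {
                return Optional.of(fields[1]);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical hosts files before and after moving the worker entry up.
        String before = "192.168.1.11 localhost\n192.168.1.11 Slave1\n";
        String after  = "192.168.1.11 Slave1\n192.168.1.11 localhost\n";
        System.out.println(firstHostnameFor(before, "192.168.1.11"));
        System.out.println(firstHostnameFor(after, "192.168.1.11"));
    }
}
```

      Under this model, the order of entries for the same IP decides which name is reported, which would be consistent with the fix working. To see what a node's JVM really resolves, printing InetAddress.getLocalHost().getCanonicalHostName() on each worker is a simple cross-check.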

  • Original article: https://www.cnblogs.com/tyxuanCX/p/10004673.html