  • YARN /cluster/nodes query returns localhost for every node

    Background:

      1. IPv6 is disabled.

      2. /etc/hosts is configured correctly on every node, and jobs are submitted on the ResourceManager host.

      3. yarn-site.xml sets

        yarn.resourcemanager.hostname=Master
        yarn.nodemanager.aux-services=mapreduce_shuffle
        and each NodeManager has its own yarn.nodemanager.hostname configured.

      4. mapred-site.xml sets mapreduce.framework.name=yarn.

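    Symptoms:

      Submitting an MR job fails with a connection-refused stack trace. The container address it tries to connect to is localhost, which is not the address actually needed.

    User: root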
    Name: Bigdata-Hadoop-1.0-SNAPSHOT.jar
    Application Type: MAPREDUCE
    Application Tags:  
    YarnApplicationState: FAILED
    Queue: default
    FinalStatus Reported by AM: FAILED
    Started: Thu Nov 22 21:59:31 +0800 2018
    Elapsed: 6mins, 1sec
    Tracking URL: History
    Diagnostics:
    Application application_1542889591013_0006 failed 2 times due to Error launching appattempt_1542889591013_0006_000002. Got exception: java.net.ConnectException: Call From localhost/127.0.0.1 to localhost:33070 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.GeneratedConstructorAccessor59.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
    at org.apache.hadoop.ipc.Client.call(Client.java:1480)
    at org.apache.hadoop.ipc.Client.call(Client.java:1413)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy83.startContainers(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
    at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy84.startContainers(Unknown Source)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:250)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:615)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:713)
    at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
    at org.apache.hadoop.ipc.Client.call(Client.java:1452)
    ... 15 more
    . Failing the application.

     

    In addition, both attempts listed at the bottom of the page also show localhost as the driver address.

     

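    Querying YARN shows that, in the cluster node information it returns, the address of every NodeManager is localhost.

    All of this confirms that the NodeManager addresses reported by YARN are wrong. The ResourceManager cannot make the remote call to a NodeManager to start a Container, which directly causes the MR job to fail.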
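      Before digging into YARN itself, it can help to see what a JVM on a worker node thinks its own name is. The following is a minimal sketch, not taken from the original post: LocalHostCheck and looksLikeLoopback are hypothetical names, and the check only uses the standard java.net.InetAddress API. Whether the NodeManager derives its registered address in exactly this way is an assumption, not something the post confirms.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical diagnostic helper; it is not part of Hadoop.
class LocalHostCheck {

    // Returns true when a resolved name points back at the loopback
    // interface instead of a name other nodes could connect to.
    static boolean looksLikeLoopback(String hostName) {
        if (hostName == null) {
            return true;
        }
        String h = hostName.trim().toLowerCase(Locale.ROOT);
        return h.isEmpty()
            || h.equals("localhost")
            || h.startsWith("localhost.")
            || h.startsWith("127.")
            || h.equals("::1");
    }

    public static void main(String[] args) throws UnknownHostException {
        // Ask the JVM how it resolves the local machine.
        InetAddress local = InetAddress.getLocalHost();
        String canonical = local.getCanonicalHostName();

        System.out.println("address        = " + local.getHostAddress());
        System.out.println("host name      = " + local.getHostName());
        System.out.println("canonical name = " + canonical);

        if (looksLikeLoopback(canonical)) {
            System.out.println("WARNING: this node resolves itself to a loopback name");
        }
    }
}
```

      Running this on each worker node and comparing the output with /etc/hosts gives a quick hint about whether name resolution, rather than the YARN configuration, is the likely culprit.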

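    Approach:

      1. Searched blogs all over the web. No luck.

      2. Asked around in various chat groups. No luck.

      3. Fell back on reading the source code myself. To be continued ...

    Source code:

      Without an entry point there is no use reading the source, so start from the web service that serves the node list.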

      org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java:252

      @GET
      @Path("/nodes")
      @Produces({ MediaType.APPLICATION_JSON, MediaType.APPLICATION_XML })
      public NodesInfo getNodes(@QueryParam("states") String states) {
        init();
        ResourceScheduler sched = this.rm.getResourceScheduler();
        if (sched == null) {
          throw new NotFoundException("Null ResourceScheduler instance");
        }
        
        EnumSet<NodeState> acceptedStates;
        if (states == null) {
          acceptedStates = EnumSet.allOf(NodeState.class);
        } else {
          acceptedStates = EnumSet.noneOf(NodeState.class);
          for (String stateStr : states.split(",")) {
            acceptedStates.add(
                NodeState.valueOf(StringUtils.toUpperCase(stateStr)));
          }
        }
        
        Collection<RMNode> rmNodes = RMServerUtils.queryRMNodes(this.rm.getRMContext(),
            acceptedStates);
        NodesInfo nodesInfo = new NodesInfo();
        for (RMNode rmNode : rmNodes) {
          NodeInfo nodeInfo = new NodeInfo(rmNode, sched);
          if (EnumSet.of(NodeState.LOST, NodeState.DECOMMISSIONED, NodeState.REBOOTED)
              .contains(rmNode.getState())) {
            nodeInfo.setNodeHTTPAddress(EMPTY);
          }
          nodesInfo.add(nodeInfo);
        }
        
        return nodesInfo;
      }

     

      This is where the node information is generated.
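      The node list built by getNodes can also be fetched directly from the ResourceManager's REST API, which makes it easy to see what host names the RM has on record. The sketch below is not from the original post. It assumes the RM web UI listens on the default port 8088 on host Master, so the endpoint would be http://Master:8088/ws/v1/cluster/nodes; ClusterNodesProbe is a hypothetical class name, and the regular expression is a shortcut for illustration, not a real JSON parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical probe; it is not part of Hadoop.
class ClusterNodesProbe {

    // Matches "nodeHostName":"value" pairs in the JSON response.
    private static final Pattern HOST_FIELD =
        Pattern.compile("\"nodeHostName\"\\s*:\\s*\"([^\"]*)\"");

    // Collects every nodeHostName value, in order of appearance.
    static List<String> extractNodeHostNames(String json) {
        List<String> hosts = new ArrayList<>();
        Matcher m = HOST_FIELD.matcher(json);
        while (m.find()) {
            hosts.add(m.group(1));
        }
        return hosts;
    }

    // Performs a plain HTTP GET and returns the response body as a string.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("Accept", "application/json");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumed default endpoint; pass another URL as the first argument.
        String url = args.length > 0
            ? args[0]
            : "http://Master:8088/ws/v1/cluster/nodes";
        for (String host : extractNodeHostNames(fetch(url))) {
            System.out.println(host);
        }
    }
}
```

      If every printed line is localhost, the problem is already present in what the NodeManagers registered with the RM, not in how the web page renders it.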

      org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeInfo.java:57

    public NodeInfo(RMNode ni, ResourceScheduler sched) {
        NodeId id = ni.getNodeID();
        SchedulerNodeReport report = sched.getNodeReport(id);
        this.numContainers = 0;
        this.usedMemoryMB = 0;
        this.availMemoryMB = 0;
        if (report != null) {
          this.numContainers = report.getNumContainers();
          this.usedMemoryMB = report.getUsedResource().getMemory();
          this.availMemoryMB = report.getAvailableResource().getMemory();
          this.usedVirtualCores = report.getUsedResource().getVirtualCores();
          this.availableVirtualCores = report.getAvailableResource().getVirtualCores();
        }
        this.id = id.toString();
        this.rack = ni.getRackName();
        this.nodeHostName = ni.getHostName();
        this.state = ni.getState();
        this.nodeHTTPAddress = ni.getHttpAddress();
        this.lastHealthUpdate = ni.getLastHealthReportTime();
        this.healthReport = String.valueOf(ni.getHealthReport());
        // ... (the rest of the constructor is not shown in this excerpt)
    }

      The three key fields all come from this odd `ni` object, so the next step is to see where it comes from.

      org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java:63

     public static List<RMNode> queryRMNodes(RMContext context,
          EnumSet<NodeState> acceptedStates) {
        // nodes contains nodes that are NEW, RUNNING OR UNHEALTHY
        ArrayList<RMNode> results = new ArrayList<RMNode>();
        if (acceptedStates.contains(NodeState.NEW) ||
            acceptedStates.contains(NodeState.RUNNING) ||
            acceptedStates.contains(NodeState.UNHEALTHY)) {
          for (RMNode rmNode : context.getRMNodes().values()) {
            if (acceptedStates.contains(rmNode.getState())) {
              results.add(rmNode);
            }
          }
        }
        // ... (the rest of the method is not shown in this excerpt)
     }

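      So the interesting part lives inside this context. How the context is initialized is a topic for next time; for now, look at how it handles RMNodes.

      The time after that was spent wrestling with YARN, but in the end I could not find how this hostname ended up as localhost instead of the expected worker node hostname. There is a lot of code and it is heavily intertwined, so it will take more time to untangle; I will continue reading the source next time. That said, once you understand the principles to some degree, going through the source is quite an effective way to deepen that understanding.

      Reading the source did not produce the answer I wanted, but it led to a bold guess: resolving a hostname from an IP returns the first hostname in the hosts file whose IP matches (to be confirmed). So I moved the worker node's IP and hostname entry to the first line of the hosts file and restarted the YARN cluster, and the MR job immediately ran without problems.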
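      The guess above can be made concrete with a small model. The sketch below is not from the original post and does not call the real resolver: it only models the unconfirmed "first matching line wins" hypothesis against hosts-file text. HostsFirstMatch, the IP 192.168.1.11 and the name Slave1 are all made up for illustration, and the "before" file layout is an assumption about what a problematic hosts file might look like.

```java
import java.util.Optional;

// Hypothetical model of the "first match wins" guess; not the OS resolver.
class HostsFirstMatch {

    // Returns the first hostname listed for the given IP, scanning the
    // hosts-file text from top to bottom and ignoring comments.
    static Optional<String> firstHostnameFor(String hostsContent, String ip) {
        for (String rawLine : hostsContent.split("\\R")) {
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\\s+");
            if (fields.length >= 2 && fields[0].equals(ip)) {
                return Optional.of(fields[1]);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed layout before the fix: the worker IP is first mapped to localhost.
        String before = "192.168.1.11 localhost\n"
                      + "192.168.1.11 Slave1\n";
        // Layout after moving the worker's IP/hostname entry to the first line.
        String after  = "192.168.1.11 Slave1\n"
                      + "192.168.1.11 localhost\n";

        // Prints "localhost".
        System.out.println(firstHostnameFor(before, "192.168.1.11").orElse("<none>"));
        // Prints "Slave1".
        System.out.println(firstHostnameFor(after, "192.168.1.11").orElse("<none>"));
    }
}
```

      Under this model, reordering the lines is enough to change the name that comes back, which is consistent with what was observed after the restart. It does not prove that the real lookup path behaves this way.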

  • Original post: https://www.cnblogs.com/tyxuanCX/p/10004673.html