Hadoop HA on Yarn: Cluster Startup

This is split into two parts: the first covers NameNode HA, the second covers ResourceManager HA.

(ResourceManager HA was added after hadoop-2.4.1.)

     

    NameNode HA 

1. Start ZooKeeper

    zkServer.sh start
You can check the status with zkServer.sh status (to see whether this node is the leader or a follower).
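If you want to start the whole ensemble from one machine, a loop like the following works; this is just a sketch, assuming passwordless ssh between the nodes and zkServer.sh on each node's PATH:

    # start ZooKeeper on every node of the ensemble
    for h in hadoop001 hadoop002 hadoop003; do
        ssh "$h" "zkServer.sh start"
    done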

2. On hadoop001, format the ZooKeeper cluster; the purpose is to create the znodes that HA needs in ZooKeeper

    hdfs zkfc -formatZK

    ...
    15/07/17 14:50:08 INFO ha.ActiveStandbyElector: Successfully deleted /hadoop-ha/appcluster from ZK.
    15/07/17 14:50:08 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/appcluster in ZK.

Verify with zkCli.sh:

    ...
    Welcome to ZooKeeper!
    2015-07-17 14:51:32,531 [myid:] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
    2015-07-17 14:51:32,544 [myid:] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@852] - Socket connection established to localhost/127.0.0.1:2181, initiating session
    JLine support is enabled
    2015-07-17 14:51:32,561 [myid:] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1235] - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14e9ac4b6a60001, negotiated timeout = 30000
    
    WATCHER::
    
    WatchedEvent state:SyncConnected type:None path:null
    [zk: localhost:2181(CONNECTED) 0]

    ls /

    [rmstore, yarn-leader-election, hadoop-ha, zookeeper]

    ls /hadoop-ha

    [appcluster]
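The same check can also be scripted rather than run interactively; a minimal sketch, assuming the server is reachable on localhost:2181:

    zkCli.sh -server localhost:2181 ls /hadoop-ha    # should print [appcluster]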

3. Start the JournalNode daemon on hadoop001, hadoop002, and hadoop003

     hadoop-daemon.sh start journalnode

    starting journalnode, logging to /data/hadoop-2.6.0/logs/hadoop-root-journalnode-hadoop001.out

    jps

    14183 QuorumPeerMain
    14680 Jps
    14459 JournalNode
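As with ZooKeeper, the three JournalNodes can be started from a single node instead of logging into each one; again a sketch assuming passwordless ssh and the Hadoop scripts on each node's PATH:

    # start a JournalNode on each of the three journal hosts
    for h in hadoop001 hadoop002 hadoop003; do
        ssh "$h" "hadoop-daemon.sh start journalnode"
    done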

4. Format the NameNode (the JournalNode processes must be running)

    hdfs namenode -format

If this is not the first format, first delete the data under the NameNode and DataNode storage directories by hand, otherwise the NameNode ID and DataNode ID will not match:

    rm -rf /data/hadoop/storage/hdfs/name/* && rm -rf /data/hadoop/storage/hdfs/data/*

(For HDFS federation, that is, multiple HDFS clusters working at the same time, use hdfs namenode -format -clusterId [clusterID].)

5. Start the NameNode (on hadoop001)

    hadoop-daemon.sh start namenode

6. Sync the NameNode data from hadoop001 to hadoop002

Note: run this on hadoop002 (the standby NameNode):

    hdfs namenode -bootstrapStandby

    ...
    =====================================================
    About to bootstrap Standby ID nn2 from:
               Nameservice ID: appcluster
            Other Namenode ID: nn1
      Other NN's HTTP address: http://hadoop001:50070
      Other NN's IPC  address: hadoop001/**.**.**.**:8020
                 Namespace ID: 1358416288
                Block pool ID: BP-503387195-**.**.**.**-1437119166865 
                   Cluster ID: CID-51e580f5-f003-463d-ae45-e109a7ec31d4
               Layout version: -60
    =====================================================
    ...
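After the bootstrap completes, bring up the NameNode on hadoop002 as well, with the same command as in step 5:

    hadoop-daemon.sh start namenode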

7. Start all the DataNodes

    hadoop-daemons.sh start datanode
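To confirm that the DataNodes registered with the NameNode, you can pull a cluster report:

    hdfs dfsadmin -report    # lists each live DataNode and its capacity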

8. Start Yarn

    start-yarn.sh
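Once Yarn is up, the registered NodeManagers can be listed as a quick check:

    yarn node -list    # shows the NodeManagers known to the ResourceManager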

9. Start the ZKFailoverController on hadoop001 and hadoop002 (there is no need to start it on hadoop003, because hadoop003 is purely a DataNode)

    hadoop-daemon.sh start zkfc
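The failover controllers now elect one NameNode as active. Which one won the election can be checked with haadmin, using the nn1/nn2 IDs that appeared in the bootstrap output above:

    hdfs haadmin -getServiceState nn1    # prints active or standby
    hdfs haadmin -getServiceState nn2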

10. Verify that HA automatic failover works

Because I am working on the company's remote servers, I cannot check whether a NameNode is Standby or Active through the web UI; I can only look at the update times of the edits files under the storage directory configured for the NameNode namespace.

The NameNode namespace was set in the cluster configuration from the previous post as follows:

    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///data/hadoop/storage/hdfs/name</value>
    </property>

Under this path on each of the two NameNodes there are fsimage files; fsimage is the file that stores the metadata. The Active NameNode also has an edit log, and every HDFS operation updates it, which you can see from the file's modification time; the Standby NameNode's edit log does not update. As soon as the Active NameNode is killed, the latest edit-log updates appear under the Standby NameNode's name directory. All of this is thanks to the JournalNodes: under the JournalNode directory you can see a complete backup of the edit log.
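A concrete way to run this failover test, sketched with a placeholder pid (look up the real one with jps on the active node):

    jps                                   # note the NameNode pid on the active node
    kill -9 <namenode-pid>                # simulate a crash of the active NameNode
    hdfs haadmin -getServiceState nn2     # the former standby should now report active
    ls -l /data/hadoop/storage/hdfs/name/current    # edits files keep updating here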

Summary:

Be especially careful when starting the cluster; it is very easy to break failover by running the steps in the wrong order.

Earlier, killing hadoop001's NameNode also took down hadoop002's NameNode, and HDFS operations then failed with connection refused. I kept looking for connection problems, such as ports and /etc/hosts, but after restarting everything in the order above it worked again. I never figured out what the original problem was, which was baffling and exhausting and wasted a lot of time.

So checking after every step is important: look at the running processes and at the edit-log updates under the name directory.

     

    ResourceManager HA 

After the NameNode HA steps are done, you will notice that only one node's ResourceManager (here, hadoop001's) has started; the ResourceManager on the other node (hadoop002) has to be started manually.

    yarn-daemon.sh start resourcemanager

Then check the ResourceManager state with:

    yarn rmadmin -getServiceState rm1

The result shows active, while rm2 is standby.
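The standby can be checked the same way:

    yarn rmadmin -getServiceState rm2    # should print standby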

Verifying HA works the same way as for the NameNode: kill the active ResourceManager, and the standby ResourceManager transitions to active.

There is also a command that can force a transition:

    yarn rmadmin -transitionToStandby rm1
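Note that with automatic failover enabled, rmadmin may refuse a manual transition; in that case the --forcemanual flag overrides the refusal (use it carefully, since it bypasses the failover controller):

    yarn rmadmin -transitionToStandby --forcemanual rm1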

References

[1] hdfs-site.xml: http://www.21ops.com/front-tech/10744.html
[2] yarn-site.xml: http://www.aboutyun.com/thread-10572-1-1.html (the comments there are also worth reading)
[3] http://www.cnblogs.com/meiyuanbao/p/3545929.html (does not get as far as Yarn HA)
