zoukankan      html  css  js  c++  java
  • HBase client访问ZooKeeper获取rootregionserver DeadLock问题(zookeeper.ClientCnxn Unable to get data of znode /hbase/rootregionserver)

    2012年11月28日 出现故障," Unable to get data of znode /hbase/root-region-server"

    问题比较诡异,两个机房,只有一个机房故障,5台服务器相续故障,错误日志相同。使用的HBase客户端版本为0.94.0

    1)分析步骤:

    1 jstack jmap 查看是否有死锁、block或内存溢出

    jmap 看内存回收状况没有什么异常,内存和CPU占用都不多

     jstack pid > test.log
    pid: Unable to open socket file: target process not responding or HotSpot VM not loaded
    The -F option can be used when the target process is not responding
    出现这种错误,使用-F参数
    jstack -l -F pid >jstack.log
    jstack -l pid >jstack.log
    这时jstack.log并不会有有用信息,要去catalina.out里看,注意在服务下线后使用,容易造成tomcat僵死
    catalina.out里发现deadlock死锁
    Found one Java-level deadlock:
    =============================
    "catalina-exec-800":
      waiting to lock monitor 0x000000005f1f6530 (object 0x0000000731902200, a java.lang.Object),
      which is held by "catalina-exec-710"
    "catalina-exec-710":
      waiting to lock monitor 0x00002aaab9a05bd0 (object 0x00000007321f8708, a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation),
      which is held by "catalina-exec-29-EventThread"
    "catalina-exec-29-EventThread":
      waiting to lock monitor 0x000000005f9f0af0 (object 0x0000000732a9c7e0, a org.apache.hadoop.hbase.zookeeper.RootRegionTracker),
      which is held by "catalina-exec-710"
    Java stack information for the threads listed above:
    ===================================================
    "catalina-exec-800":
            at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:943)
            - waiting to lock <0x0000000731902200> (a java.lang.Object)
            at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
            at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
            at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
            at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
            at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
            at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
            at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
            at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:171)
            at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:160)
            at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:150)

    "catalina-exec-710": at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:599) - waiting to lock <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660) at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:158) - locked <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker) at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933) ...... at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123) at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948) - locked <0x0000000731902200> (a java.lang.Object)
    "catalina-exec-29-EventThread": at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.stop(ZooKeeperNodeTracker.java:98) - waiting to lock <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:604) - locked <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:374) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:271) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497) Found 1 deadlock.

     warn.log中报告Interrupted异常 是由上述死锁引起的

    2012-11-28 20:06:17 [WARN] hconnection-0x4384d0a47f41c63 Unable to get data of znode /hbase/root-region-server
    java.lang.InterruptedException
            at java.lang.Object.wait(Native Method)
            at java.lang.Object.wait(Object.java:485)
            at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1253)
            at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1129)
            at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:264)
            at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:522)
            at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:498)
            at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:156)
            at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
            at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
            ......
            at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
            at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1042)
            at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
            at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
            at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
            at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
            at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
            at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
            at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
          
    2012-11-28 20:06:17 [ERROR] [hbase_error]
    java.io.IOException: Giving up after tries=1
            at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:192)
            at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
            at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
     
    Caused by: java.lang.InterruptedException: sleep interrupted
            at java.lang.Thread.sleep(Native Method)
            at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:189)
            ... 13 more

    https://issues.apache.org/jira/browse/HBASE-5060  HBase client is blocked forever ,跟这个问题有点相似,但没有解决这问题

    2 同进程另启动线程访问root-region-server

                        // check if zk is ok
                        ZooKeeper zk = null;
                        Watcher watch = new Watcher() {
                            public void process(WatchedEvent event) {
                            }
                        };
                        String zookeeperQuorum = wbobjectHBaseConfMap.get("hbase.zookeeper.quorum");
                        
                        if (StringUtils.isNotBlank(zookeeperQuorum)) {
                            try {
                                zk = new ZooKeeper(zookeeperQuorum, 30000, watch);
                                byte[] data = zk.getData("/hbase/root-region-server", watch, null);
                                ApiLogger.info(" get root-region-server success! ip:"
                                        + Util.toStr(data));
                            } catch (Exception e) {
                                ApiLogger.error(" get root-region-server error!" + e.getMessage());
                            } finally {
                                try {
                                    zk.close();
                                } catch (InterruptedException e) {
                                    ApiLogger.error("close zk error!");
                                }
                            }
                        }
                        

    发现独立线程在整个进程死锁时还能正常工作,HBase的zookeeper实例异常死锁后就不能恢复,导致scan操作超时到30s+,并且不能恢复,正常应该在ms级别。

    因此认为是HBase客户端连接ZooKeeper时出问题,流程:

    出现网络抖动或者root表迁移,缓存表未命中,重新去获取root-region-server,结果获取失败,进行ZooKeeper重置操作

    经过研究死锁代码

    HConnectionImplementation 发现 get root-region-server异常,锁定RootRegionTracker,试图锁定自身,执行resetZooKeeperTrackers并调用RootRegionTracker.stop 去重置 ZooKeeper
    而与此同时RootRegionTracker在执行stop方法时,已经锁定了HConnectionImplementation,还需要锁定它本身(synchronized方法),但其已经在HConnectionImplementation中被锁定。
    HConnectionImplementation代码中:
     this.rootRegionTracker = new RootRegionTracker(this.zooKeeper, this);
    调用时也没有超时cancel代码 这样就导致了两个资源循环等待的死锁
    2)仿真模拟 
    以下是个小程序,简单模拟这种死锁问题:
    import java.util.concurrent.TimeUnit;
    
    import org.apache.hadoop.hbase.ServerName;
    
    class ZooKeeperNodeTracker {
        private boolean stopped = false;
        private AbortAble hConnectionImplementation;
        public ZooKeeperNodeTracker(AbortAble hConnectionImplementation) {
            this.hConnectionImplementation = hConnectionImplementation;
        }
    
        public synchronized void stop()  throws InterruptedException {
            this.stopped = true;
            System.out.println(Thread.currentThread()+"|"+Thread.currentThread().getId()+"stop zknode");
            TimeUnit.MICROSECONDS.sleep(100);
            notifyAll();
        }
    
        public boolean condition() {
            return stopped;
        }
    
        public boolean start() {
             stopped=false;
             return true;
        }
        
        public synchronized boolean getData(int i) throws InterruptedException {
            //error in get root region server
            if (i %100 ==0){
                hConnectionImplementation.resetZooKeeperTrackers();
                throw new InterruptedException("interrupted");            
            }
            return true;
        }
    
    }
    
    public class HConnectionManagerTest {
    
        static class HConnectionImplementation implements AbortAble{
            public HConnectionImplementation() {
                rootRegionTracker = new ZooKeeperNodeTracker(this);
            }
    
            private volatile ZooKeeperNodeTracker rootRegionTracker;
            @Override
            public synchronized void resetZooKeeperTrackers() {
                try{
                    if (rootRegionTracker != null) {
                        rootRegionTracker.stop();
                        rootRegionTracker = null;
                        System.out.println(Thread.currentThread()+"|"+Thread.currentThread().getId()+"resetZooKeeperTrackers");
                    }
                }catch(InterruptedException e){
                    System.out.println(Thread.currentThread()+"----------resetZooKeeperTrackers Interrupted-----------");
                    
                }
            }
    
    
            public void testGetData(String name) {
                int i = 1;
                while (i >0) {
                    i++;
                    try {
                        rootRegionTracker.getData(i);
                    } catch (Exception e) {
                        resetZooKeeperTrackers();
                    }
                    if(i %100 ==0){
                        rootRegionTracker = new ZooKeeperNodeTracker(this);
                        System.out.println(name+" restart test");
                    }
                }
            }
        }
    
        public static void main(String[] args)  {
    
            final HConnectionImplementation hcon = new HConnectionImplementation();
            Thread scan1 = new Thread(new Runnable() {
                public void run() {
                    hcon.testGetData("[scan1]");
                }
            });
            
            Thread scan2 = new Thread(new Runnable() {
                public void run() {
                    hcon.testGetData("[scan2]");
                }
            });
            try{
                scan1.start();
                scan2.start();
                TimeUnit.SECONDS.sleep(2);
            }catch (InterruptedException e){
                System.out.println("----------testgetdata -------interrupt");
            }
        }
    
    }

    类名构造函数都模拟HBase Client,并且放大getData error的情形,当同时并发两个scan操作时,前一个scan过程中,获取不到root-region-server,在ZooKeeperNodeTracker中做stop() 时,后一个scan也开始在HConnectionImplementation中执行resetZooKeeperTrackers(),两个资源ZooKeeperNodeTracker和HConnectionImplementation被各自分别占用等待,导致死锁。模拟程序的死锁解除可以更改

    public synchronized boolean getData
    方法为
    public boolean getData

    public synchronized void resetZooKeeperTrackers() {
                try{
                    if (rootRegionTracker != null) {
                        rootRegionTracker.stop();

    public void resetZooKeeperTrackers() {
                try{
                    if (rootRegionTracker != null) {
                synchronized(rootRegionTracker){
                  rootRegionTracker.stop();
    解除互斥条件解决问题

     3)最终解决方案

    通过Hadoop大会现场跟HBase开发者Ted Yu咨询,称0.94.0有很多bug不稳定,建议升级到0.94.2,通过查看relase note, 官方的两个patch地址已在0.94.1中修复 (通过对hbase源码进行分析找对问题对应点,再看对应源码svn详细修改记录)
    1 通过避免嵌套重试循环来解决rpc线程卡死: https://issues.apache.org/jira/browse/HBASE-6326  .
    2 通过等待-root-的region地址设置到root region tracker 来避免deadlock问题:https://issues.apache.org/jira/browse/HBASE-6115  

  • 相关阅读:
    CSS的应用
    关于新手html的认识 以及对table的基本用法
    javascript的使用方法
    CSS的使用方式和选择器的用法
    html基础知识点
    前端课堂第四课
    前端课堂第三课
    前端实训第二课
    前端实训随笔
    JS02
  • 原文地址:https://www.cnblogs.com/shenguanpu/p/2798217.html
Copyright © 2011-2022 走看看