2012年11月28日 出现故障," Unable to get data of znode /hbase/root-region-server"
问题比较诡异,两个机房,只有一个机房故障,5台服务器相续故障,错误日志相同。使用的HBase客户端版本为0.94.0
1)分析步骤:
1 jstack jmap 查看是否有死锁、block或内存溢出
jmap 看内存回收状况没有什么异常,内存和CPU占用都不多
pid: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding
jstack -l -F pid >jstack.log
catalina.out里发现deadlock死锁
Found one Java-level deadlock: ============================= "catalina-exec-800": waiting to lock monitor 0x000000005f1f6530 (object 0x0000000731902200, a java.lang.Object), which is held by "catalina-exec-710" "catalina-exec-710": waiting to lock monitor 0x00002aaab9a05bd0 (object 0x00000007321f8708, a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation), which is held by "catalina-exec-29-EventThread" "catalina-exec-29-EventThread": waiting to lock monitor 0x000000005f9f0af0 (object 0x0000000732a9c7e0, a org.apache.hadoop.hbase.zookeeper.RootRegionTracker), which is held by "catalina-exec-710" Java stack information for the threads listed above: =================================================== "catalina-exec-800": at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:943) - waiting to lock <0x0000000731902200> (a java.lang.Object) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725) at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82) at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162) at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685) at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366) at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:171) at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:160) at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:150)
"catalina-exec-710": at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:599) - waiting to lock <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660) at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:158) - locked <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker) at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933) ...... at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123) at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948) - locked <0x0000000731902200> (a java.lang.Object)
"catalina-exec-29-EventThread": at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.stop(ZooKeeperNodeTracker.java:98) - waiting to lock <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:604) - locked <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:374) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:271) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497) Found 1 deadlock.
warn.log中报告Interrupted异常 是由上述死锁引起的
https://issues.apache.org/jira/browse/HBASE-5060 HBase client is blocked forever ,跟这个问题有点相似,但没有解决这问题
2 同进程另启动线程访问root-region-server
// check if zk is ok ZooKeeper zk = null; Watcher watch = new Watcher() { public void process(WatchedEvent event) { } }; String zookeeperQuorum = wbobjectHBaseConfMap.get("hbase.zookeeper.quorum"); if (StringUtils.isNotBlank(zookeeperQuorum)) { try { zk = new ZooKeeper(zookeeperQuorum, 30000, watch); byte[] data = zk.getData("/hbase/root-region-server", watch, null); ApiLogger.info(" get root-region-server success! ip:" + Util.toStr(data)); } catch (Exception e) { ApiLogger.error(" get root-region-server error!" + e.getMessage()); } finally { try { zk.close(); } catch (InterruptedException e) { ApiLogger.error("close zk error!"); } } }
发现独立线程在整个进程死锁时还能正常工作,HBase的zookeeper实例异常死锁后就不能恢复,导致scan操作超时到30s+,并且不能恢复,正常应该在ms级别。
因此认为是HBase客户端连接ZooKeeper时出问题,流程:
出现网络抖动或者root表迁移,缓存表未命中,重新去获取root-region-server,结果获取失败,进行ZooKeeper重置操作
经过研究死锁代码
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hbase.ServerName;
class ZooKeeperNodeTracker {
private boolean stopped = false;
private AbortAble hConnectionImplementation;
public ZooKeeperNodeTracker(AbortAble hConnectionImplementation) {
this.hConnectionImplementation = hConnectionImplementation;
}
public synchronized void stop() throws InterruptedException {
this.stopped = true;
System.out.println(Thread.currentThread()+"|"+Thread.currentThread().getId()+"stop zknode");
TimeUnit.MICROSECONDS.sleep(100);
notifyAll();
}
public boolean condition() {
return stopped;
}
public boolean start() {
stopped=false;
return true;
}
public synchronized boolean getData(int i) throws InterruptedException {
//error in get root region server
if (i %100 ==0){
hConnectionImplementation.resetZooKeeperTrackers();
throw new InterruptedException("interrupted");
}
return true;
}
}
public class HConnectionManagerTest {
static class HConnectionImplementation implements AbortAble{
public HConnectionImplementation() {
rootRegionTracker = new ZooKeeperNodeTracker(this);
}
private volatile ZooKeeperNodeTracker rootRegionTracker;
@Override
public synchronized void resetZooKeeperTrackers() {
try{
if (rootRegionTracker != null) {
rootRegionTracker.stop();
rootRegionTracker = null;
System.out.println(Thread.currentThread()+"|"+Thread.currentThread().getId()+"resetZooKeeperTrackers");
}
}catch(InterruptedException e){
System.out.println(Thread.currentThread()+"----------resetZooKeeperTrackers Interrupted-----------");
}
}
public void testGetData(String name) {
int i = 1;
while (i >0) {
i++;
try {
rootRegionTracker.getData(i);
} catch (Exception e) {
resetZooKeeperTrackers();
}
if(i %100 ==0){
rootRegionTracker = new ZooKeeperNodeTracker(this);
System.out.println(name+" restart test");
}
}
}
}
public static void main(String[] args) {
final HConnectionImplementation hcon = new HConnectionImplementation();
Thread scan1 = new Thread(new Runnable() {
public void run() {
hcon.testGetData("[scan1]");
}
});
Thread scan2 = new Thread(new Runnable() {
public void run() {
hcon.testGetData("[scan2]");
}
});
try{
scan1.start();
scan2.start();
TimeUnit.SECONDS.sleep(2);
}catch (InterruptedException e){
System.out.println("----------testgetdata -------interrupt");
}
}
}
类名构造函数都模拟HBase Client,并且放大getData error的情形,当同时并发两个scan操作时,前一个scan过程中,获取不到root-region-server,在ZooKeeperNodeTracker中做stop() 时,后一个scan也开始在HConnectionImplementation中执行resetZooKeeperTrackers(),两个资源ZooKeeperNodeTracker和HConnectionImplementation被各自分别占用等待,导致死锁。模拟程序的死锁解除可以更改
public synchronized boolean getData
方法为
public boolean getData
或
public synchronized void resetZooKeeperTrackers() { try{ if (rootRegionTracker != null) { rootRegionTracker.stop();
为
public void resetZooKeeperTrackers() { try{ if (rootRegionTracker != null) { synchronized(rootRegionTracker){ rootRegionTracker.stop();
解除互斥条件解决问题
3)最终解决方案
通过Hadoop大会现场跟HBase开发者Ted Yu咨询,称0.94.0有很多bug不稳定,建议升级到0.94.2,通过查看relase note, 官方的两个patch地址已在0.94.1中修复 (通过对hbase源码进行分析找对问题对应点,再看对应源码svn详细修改记录)
1 通过避免嵌套重试循环来解决rpc线程卡死: https://issues.apache.org/jira/browse/HBASE-6326 .
2 通过等待-root-的region地址设置到root
region tracker 来避免deadlock问题:https://issues.apache.org/jira/browse/HBASE-6115