zoukankan      html  css  js  c++  java
  • 转: regionserver挂掉分析

    转载自:https://www.cnblogs.com/barneywill/p/9811971.html

    最近hbase集群很多region server挂掉,查看其中一个RegionServer1日志发现,17:17:14挂的时候服务器压力很大,有大量的responseTooSlow,也有不少gc,但是当时内存还有很多剩余,不是因为oom被kill

    2018-03-13T17:17:13.372+0800: [GC (Allocation Failure) 2018-03-13T17:17:13.372+0800: [ParNew: 3280066K->256481K(3762880K), 0.0429718 secs] 23536952K->20513367K(32801920K), 0.0432932 secs] [Times: user=1.92 sys=0.01, real=0.04 secs]

    Heap

    par new generation   total 3762880K, used 1416770K [0x00007f2f4c000000, 0x00007f305f990000, 0x00007f305f990000)

      eden space 3010368K,  38% used [0x00007f2f4c000000, 0x00007f2f92d182c0, 0x00007f3003bd0000)

      from space 752512K,  34% used [0x00007f3003bd0000, 0x00007f30136487c0, 0x00007f3031ab0000)

      to   space 752512K,   0% used [0x00007f3031ab0000, 0x00007f3031ab0000, 0x00007f305f990000)

    concurrent mark-sweep generation total 29039040K, used 19905973K [0x00007f305f990000, 0x00007f374c000000, 0x00007f374c000000)

    Metaspace       used 49530K, capacity 50072K, committed 55492K, reserved 57344K   

    region server挂掉的原因是tryRegionServerReport时发现被标记为dead server,抛出YouAreDeadException,然后HRegionServer.run方法会退出,意味着region server进程退出

    2018-03-13 17:17:08,159 FATAL [regionserver/RegionServer1/RegionServer1:16020] regionserver.HRegionServer: ABORTING region server RegionServer1,16020,1519805863844: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing RegionServer1,16020,1519805863844 as dead server

    org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing RegionServer1,16020,1519805863844 as dead server

             at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:434)

             at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:339)

             at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:339)

             at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(Reg

    ionServerStatusProtos.java:8617)

             at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2196)

             at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)

             at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)

             at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)

             at java.lang.Thread.run(Thread.java:745)

             at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

             at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

             at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

             at java.lang.reflect.Constructor.newInstance(Constructor.java:422)

             at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)

             at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)

             at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:330)

             at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1153)

             at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:962)

    被标记为dead server是因为zk session expired

    2018-03-13 17:17:08,128 INFO  [regionserver/RegionServer1/RegionServer1:16020-SendThread(ZKServer1:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 41499ms for sessionid 0x161d1433ba86c84, closing socket connection and attempting reconnect

    2018-03-13 17:17:08,445 FATAL [main-EventThread] regionserver.HRegionServer: ABORTING region server RegionServer1,16020,1519805863844: regionserver:16020-0x35fe18af217e685, quorum=ZKServer1:2181, baseZNode=/hbase regionserver:16020-0x35fe18af217e685 received expired from ZooKeeper, aborting

    在17:17:08超时41499ms是因为17:16:38的时候有一个full gc耗时29s

    2018-03-13T17:16:38.268+0800: [GC (Allocation Failure)

    2018-03-13T17:16:38.268+0800: [ParNew (promotion failed): 3762880K->3762880K(3762880K), 3.2398750 secs]

    2018-03-13T17:16:41.508+0800: [CMS2018-03-13T17:16:47.078+0800: [CMS-concurrent-sweep: 10.829/17.135 secs] [Times: user=161.97 sys=2.65, real=17.13 secs]

     (concurrent mode failure): 28187725K->21036514K(29039040K), 26.5884371 secs] 31874835K->21036514K(32801920K), [Metaspace: 49335K->49335K(57344K)], 29.8297421 secs] [Times: user=40.45 sys=0.14, real=29.82 secs]

    但是hbase-site.xml中配置session超时是180s,为什么41s就超时

      <property>

        <name>zookeeper.session.timeout</name>

        <value>180000</value>

      </property>

    建立zk session时的negotiated timeout = 40000,也就是40s,为什么配置的是180s,实际是40s?

    2018-03-13 17:15:11,522 INFO  [main-SendThread(ZKServer1:2181)] zookeeper.ClientCnxn: Session establishment complet

    e on server ZKServer1/ZKServer1:2181, sessionid = 0x35fe18af217e685, negotiated timeout = 40000

    2018-03-13 17:15:11,526 INFO  [regionserver/RegionServer1/RegionServer1:16020-SendThread(ZKServer1:2181)]

    zookeeper.ClientCnxn: Session establishment complete on server RegionServer1/RegionServer1:2181, sessionid = 0x161d1433b

    a86c84, negotiated timeout = 40000

    zk中建立连接后输出

    复制代码
                LOG.info("Session establishment complete on server "
    
                        + clientCnxnSocket.getRemoteSocketAddress()
    
                        + ", sessionid = 0x" + Long.toHexString(sessionId)
    
                        + ", negotiated timeout = " + negotiatedSessionTimeout
    
                        + (isRO ? " (READ-ONLY mode)" : ""));
    复制代码

    negotiatedSessionTimeout是在onConnected中被回调

            void onConnected(int _negotiatedSessionTimeout, long _sessionId,
    
                    byte[] _sessionPasswd, boolean isRO) throws IOException {

    回调是在ClientCnxnSocket中,取得是ConnectResponse.getTimeout()

    复制代码
            ConnectResponse conRsp = new ConnectResponse();
    
            conRsp.deserialize(bbia, "connect");
    
                      sendThread.onConnected(conRsp.getTimeOut(), this.sessionId,
    
                    conRsp.getPasswd(), isRO);
    复制代码

    这里取的是ServerCnxn中的sessionTimeout,这个值在ZooKeeperServer中做初始化

    复制代码
            int minSessionTimeout = getMinSessionTimeout();
    
            if (sessionTimeout < minSessionTimeout) {
    
                sessionTimeout = minSessionTimeout;
    
            }
    
            int maxSessionTimeout = getMaxSessionTimeout();
    
            if (sessionTimeout > maxSessionTimeout) {
    
                sessionTimeout = maxSessionTimeout;
    
            }
    复制代码

    可以看到会根据minSessionTimeout和maxSessionTimeout限制

        public int getMaxSessionTimeout() {
    
            return maxSessionTimeout == -1 ? tickTime * 20 : maxSessionTimeout;
    
        }

    由于zookeeper服务器tickTime设置的是2000ms,所以maxSessionTimeout默认会被设置为40000ms,所以解决这个问题需要修改zk的maxSessionTimeout;

  • 相关阅读:
    Linux中rar解压软件
    Linux中rpm和yum安装软件
    查看nova日志
    po编译为mo;django翻译多义性问题解决
    某服務器開端口
    linux環境查找某文件夾下含有某字符串的所有文件
    gerrit +git使用
    ubuntu 暂时使用root权限
    mysql-求中位数方法
    phonecat-angular.js route.js加载启动失败
  • 原文地址:https://www.cnblogs.com/wanxqing/p/10863632.html
Copyright © 2011-2022 走看看