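  • HBase RegionServer crashes unexpectedly

    Root cause analysis:

    On a production HBase cluster, at around 1 a.m., one RegionServer was found to have restarted (the RegionServer is supervised by a watchdog daemon that restarts it).

    1. Check the master log: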

    2020-02-27 01:04:57,001 ERROR [RpcServer.FifoRWQ.default.read.handler=26,queue=10,port=16000] master.MasterRpcServices: Region server a3ster,16020,1582342923163 reported a fatal error:
    ABORTING region server a3ser,16020,1582342923163: Replay of WAL required. Forcing server shutdown
    Cause:
    org.apache.hadoop.hbase.DroppedSnapshotException: region: T_BL,x0Ax00x00x00x00x00x00x00x00x00x00x00x00,1572576275632.069e4d877a4ff46f9964ac8bcddb09ef.
            at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2509)
            at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2186)
            at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2148)
            at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2039)
            at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1965)
            at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:505)
            at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:475)
            at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
            at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:263)
            at java.lang.Thread.run(Thread.java:748)
    Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for ringBufferSequence=101793126, WAL system stuck?
            at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:174)
            at org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1406)
            at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:1400)
            at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1512)
            at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:126)
            at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:75)
            at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2486)
            ... 9 more
    
    2020-02-27 01:04:57,032 ERROR [RpcServer.FifoRWQ.default.read.handler=29,queue=8,port=16000] master.MasterRpcServices: Region server a3ser,16020,1582342923163 reported a fatal error:
    ABORTING region server a3serz,16020,1582342923163: Replay of WAL required. Forcing server shutdown
    Cause:

    2. Check the RegionServer log:

    2020-02-27 01:04:56,813 WARN  [ResponseProcessor for block BP-1884348122-10.62.2.1-1545175191847:blk_1489206371_467735337] hdfs.DFSClient: Slow ReadProcessor read fields took 327586ms (threshold=30000ms); ack: seqno: 1 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 965211 4: "0000", targets: [11.23.3.3:9866, 11.23.3.5:9866]
    2020-02-27 01:04:56,816 FATAL [MemStoreFlusher.6] regionserver.HRegionServer: ABORTING region server a3serz,16020,1582342923163: Replay of WAL required. Forcing server shutdown
    org.apache.hadoop.hbase.DroppedSnapshotException: region: T_BL,x0Ax00x00x00x00x00x00x00x00x00x00x00x00,1572576275632.069e4d877a4ff46f9964ac8bcddb09ef.
            at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2509)
            at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2186)
            at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2148)
            at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2039)
            at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1965)
            at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:505)
            at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:475)
            at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
            at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:263)
            at java.lang.Thread.run(Thread.java:748)
    Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for ringBufferSequence=101793126, WAL system stuck?
            at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:174)
            at org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1406)
            at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:1400)
            at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1512)
            at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:126)
            at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:75)
            at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2486)

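    Analysis:

    A DroppedSnapshotException generally means that something went wrong while flushing a memstore. The timeout lines in the logs above (highlighted in the original post) show that, while flushing the memstore of one region, the write to HDFS took too long, which caused the RegionServer to abort.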
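To make the timing relationship concrete, here is a minimal Python sketch that pulls the two durations out of the log lines quoted above and compares them. The regular expressions and the function name `extract_timings` are illustrative assumptions, not part of HBase or HDFS, and the first log line is shortened with "..." for readability.

```python
import re

# Two lines taken from the RegionServer log above; the thread name in the
# first one is shortened with "..." (the numbers are unchanged).
LOG_LINES = [
    "2020-02-27 01:04:56,813 WARN  [ResponseProcessor ...] hdfs.DFSClient: "
    "Slow ReadProcessor read fields took 327586ms (threshold=30000ms); ack: seqno: 1",
    "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: "
    "Failed to get sync result after 300000 ms for ringBufferSequence=101793126, "
    "WAL system stuck?",
]

# Hypothetical patterns written for this sketch only.
SLOW_ACK = re.compile(r"Slow ReadProcessor read fields took (\d+)ms \(threshold=(\d+)ms\)")
SYNC_TIMEOUT = re.compile(r"Failed to get sync result after (\d+) ms")


def extract_timings(lines):
    """Return (slow_ack_ms, ack_threshold_ms, sync_timeout_ms); None if not found."""
    slow_ack_ms = threshold_ms = sync_timeout_ms = None
    for line in lines:
        m = SLOW_ACK.search(line)
        if m:
            slow_ack_ms, threshold_ms = int(m.group(1)), int(m.group(2))
        m = SYNC_TIMEOUT.search(line)
        if m:
            sync_timeout_ms = int(m.group(1))
    return slow_ack_ms, threshold_ms, sync_timeout_ms


slow_ack_ms, threshold_ms, sync_timeout_ms = extract_timings(LOG_LINES)
print(f"HDFS pipeline ack took {slow_ack_ms} ms (warning threshold {threshold_ms} ms)")
print(f"WAL sync gave up after {sync_timeout_ms} ms")
print("ack slower than the WAL sync timeout:", slow_ack_ms > sync_timeout_ms)
```

The HDFS acknowledgement took longer than the WAL sync timeout, which is consistent with the flush failing on the HDFS side rather than inside HBase alone.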

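    The conditions that trigger an HBase memstore flush are as follows:

    HBase triggers a flush in the situations listed below. Note that the smallest unit of a flush is the HRegion, not an individual MemStore. It follows that if an HRegion contains many MemStores, every flush is expensive, which is why it is recommended to keep the number of ColumnFamilies as small as possible when designing a table.
    
    Memstore-level limit: when any single MemStore in a Region reaches the upper limit (hbase.hregion.memstore.flush.size, default 128 MB), a memstore flush is triggered.
    
    Region-level limit: when the combined size of all MemStores in a Region reaches the upper limit (hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size; with a multiplier of 2 this is 2 * 128 MB = 256 MB, while newer HBase releases ship a default multiplier of 4), a memstore flush is triggered.
    
    RegionServer-level limit: when the combined size of all MemStores on a RegionServer reaches the upper limit (hbase.regionserver.global.memstore.upperLimit * hbase_heapsize, default 40% of the JVM heap), some of the MemStores are flushed. Flushing proceeds from the largest to the smallest: the Region with the largest MemStore is flushed first, then the next largest, until total MemStore usage drops below the lower threshold (hbase.regionserver.global.memstore.lowerLimit * hbase_heapsize, default 38% of the JVM heap).
    
    HLog count limit: when the number of HLogs on a RegionServer reaches the upper limit (configurable via hbase.regionserver.maxlogs), the system picks the oldest HLog and flushes the one or more Regions that it covers.
    
    Periodic flush: HBase flushes MemStores periodically, by default every hour, so that no MemStore stays unpersisted for a long time. To avoid the problems caused by all MemStores flushing at the same moment, the periodic flush is given a random delay of up to roughly 20000 ms.
    
    Manual flush: a user can flush a table or a single Region from the shell with flush 'tablename' or flush 'region name' respectively.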
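The size-based thresholds above are simple arithmetic, which the following Python sketch works through. The function name `flush_thresholds` and the 16 GiB heap are hypothetical; the default values are the ones quoted in the text above and may differ in your HBase version, so treat this as a back-of-the-envelope aid rather than a statement of how HBase computes them internally.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def flush_thresholds(heap_bytes,
                     flush_size=128 * MB,     # hbase.hregion.memstore.flush.size
                     block_multiplier=2,      # hbase.hregion.memstore.block.multiplier (value from the text)
                     global_upper=0.40,       # hbase.regionserver.global.memstore.upperLimit
                     global_lower=0.38):      # hbase.regionserver.global.memstore.lowerLimit
    """Compute the size-based flush thresholds described above, in bytes."""
    return {
        # A single MemStore reaching this size triggers a flush.
        "memstore_flush_bytes": flush_size,
        # All MemStores of one Region together reaching this size triggers a flush.
        "region_limit_bytes": block_multiplier * flush_size,
        # All MemStores on the RegionServer reaching this size triggers flushing...
        "rs_upper_bytes": int(global_upper * heap_bytes),
        # ...which continues until total usage falls below this size.
        "rs_lower_bytes": int(global_lower * heap_bytes),
    }


# Hypothetical RegionServer with a 16 GiB heap.
for name, value in flush_thresholds(16 * GB).items():
    print(f"{name}: {value / MB:.0f} MB")
```

Plugging in the real heap size and the values from your own hbase-site.xml gives a quick estimate of how much memstore data can accumulate before each limit forces a flush.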

    The problem above is not purely an HBase problem; HDFS is involved as well.

    3. Check the "Slow" messages logged by the DataNode while HBase was writing data

    $ egrep -o "Slow.*?(took|cost)" hadoop-hduser-datanode-a3ser.log.1 |sort |uniq -c
         36 Slow BlockReceiver write data to disk cost
       2743 Slow BlockReceiver write packet to mirror took
          2 Slow flushOrSync took
         35 Slow manageWriterOsCache took
         21 Slow PacketResponder send ack to upstream took

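    Notes:

    Slow BlockReceiver write data to disk cost: there is a delay when writing the block to the OS cache or to disk

    Slow BlockReceiver write packet to mirror took: there is a delay when writing the block over the network

    Slow manageWriterOsCache took: there is a delay when writing the block to the OS cache or to disk

    Slow PacketResponder send ack to upstream took: cause not determined here; the message itself reports a delay when sending the acknowledgement back to the upstream node

    Slow flushOrSync took: there is a delay when writing the block to the OS cache or to disk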
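The counting and the disk-versus-network reading above can also be sketched in Python. The `CATEGORY` mapping simply restates the notes above, the function names are hypothetical, and the counts are the ones reported by the egrep command; `PacketResponder` is left unclassified because the notes do not attribute it to disk or network.

```python
import re
from collections import Counter

# Same idea as the egrep pattern above: keep the message up to "took" or "cost".
SLOW = re.compile(r"Slow .*?(?:took|cost)")

# Disk vs. network attribution, restating the notes above.
CATEGORY = {
    "Slow BlockReceiver write data to disk cost": "disk",
    "Slow BlockReceiver write packet to mirror took": "network",
    "Slow manageWriterOsCache took": "disk",
    "Slow flushOrSync took": "disk",
}


def count_slow(lines):
    """Count each distinct 'Slow ...' message found in DataNode log lines."""
    counts = Counter()
    for line in lines:
        m = SLOW.search(line)
        if m:
            counts[m.group(0)] += 1
    return counts


def by_category(counts):
    """Sum message counts per category; unknown messages go to 'unclassified'."""
    totals = Counter()
    for message, n in counts.items():
        totals[CATEGORY.get(message, "unclassified")] += n
    return totals


# Counts reported by the egrep command above.
reported = Counter({
    "Slow BlockReceiver write data to disk cost": 36,
    "Slow BlockReceiver write packet to mirror took": 2743,
    "Slow flushOrSync took": 2,
    "Slow manageWriterOsCache took": 35,
    "Slow PacketResponder send ack to upstream took": 21,
})

for category, total in by_category(reported).most_common():
    print(f"{category}: {total}")
```

With these counts the network-side messages far outnumber the disk-side ones, which suggests looking at the network path between DataNodes first.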

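    4. Solutions

    1. Tune the memstore size and the HLog count settings.
    2. Check HDFS and repair any problems found.
    3. Check the DataNode cluster load and the network conditions.
    4. Restart the server.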

    Reference: http://ddrv.cn/a/258124

  • Original article: https://www.cnblogs.com/yjt1993/p/12370837.html