今天查看生产环境的sentinel控制台,发现某dubbo应用一共5个节点,有3个失联了。
查看失联节点的应用日志,服务没有挂,各dubbo接口的日志正常在打印。
在应用节点ping/telnet sentinel控制台节点,ip和端口能够连通。
查看应用节点的sentinel-record日志,grep "Heartbeat" sentinel-record.log.2019-01-1*
,
sentinel-record.log.2019-01-14.0:2019-01-14 16:50:43 [Sentinel InitExecutor] Found init func: com.alibaba.csp.sentinel.transport.init.HeartbeatSenderInitFunc
sentinel-record.log.2019-01-14.0:2019-01-14 16:50:43 [SimpleHttpHeartbeatSender] Default console address list retrieved: [/xxx:xxx]
sentinel-record.log.2019-01-14.0:2019-01-14 16:50:43 [HeartbeatSenderInit] HeartbeatSender started: com.alibaba.csp.sentinel.transport.heartbeat.SimpleHttpHeartbeatSender
sentinel-record.log.2019-01-14.0:2019-01-14 16:50:43 [Sentinel InitExecutor] Initialized: com.alibaba.csp.sentinel.transport.init.HeartbeatSenderInitFunc with order 2147483647
发现1月14号有日志输出,当时应用程序有改动,构建发布后节点重新启动,日志显示心跳初始化正常。
用jmc查看各节点sentinel定时发送心跳的线程情况,
失联的节点:
正常的节点:
注意到失联节点线程状态全部变成了WAITING,而正常节点有一个线程是TIMED_WAITING。
HeartbeatSenderInitFunc
类的发送心跳代码:
private void scheduleHeartbeatTask(/*@NonNull*/ final HeartbeatSender sender, /*@Valid*/ long interval) {
pool.scheduleAtFixedRate(new Runnable() {
@Override
public void run() {
try {
sender.sendHeartbeat();
} catch (Throwable e) {
RecordLog.warn("[HeartbeatSender] Send heartbeat error", e);
}
}
}, 5000, interval, TimeUnit.MILLISECONDS);
RecordLog.info("[HeartbeatSenderInit] HeartbeatSender started: "
+ sender.getClass().getCanonicalName());
}
线程中sender.sendHeartbeat();
是捕获了Throwable
并记录了异常日志的;
而在日志中并没有搜到异常信息;
由此推断,定时任务的线程已经失效了。
查询资料可能是应用中有内存溢出,会导致线程挂掉。
在最近两周的应用日志里搜索OutOfMemoryError
[ ERROR] [2019-01-14 10:25:29] [6beacd73653e7f50/6beacd73653e7f50] [DubboServerHandler-xxx:xxx-thread-397] com.alibaba.dubbo.rpc.filter.ExceptionFilter [91] - [DUBBO] Got unchecked and undeclared exception which called by xxx. service: com.winxuan.services.shopps.service.ShopItemService, method: getShopItemInfoId, exception: java.lang.OutOfMemoryError: GC overhead limit exceeded, dubbo version: 2.6.0, current host: xxx
总结:
JAVA应用如果出现OutOfMemoryError,可能导致ScheduledExecutorService
失效。
参考:
ScheduledExecutorService is broken https://community.oracle.com/thread/1144316