1.
Symptom: during load testing, API responses were slow and some requests timed out; later one of the containers restarted on its own, and a dump directory was generated after the restart.
Analysis: the suspicion was that an OOM had occurred and that the container was configured to produce dump files automatically on OOM. The dumps obtained included a thread dump, gcutil output and netstat output.
Troubleshooting:
(1) Starting from the jstack log and searching for the keywords deadlock and blocked in turn, the following thread stacks in the BLOCKED state were found:
"asyncExecutor-1093" #9490 prio=5 os_prio=0 tid=0x00007f12b400f800 nid=0x25c7 waiting for monitor entry [0x00007f0d1aa03000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.checkRollover(RollingFileManager.java:217)
- waiting to lock <0x000000065804dcc8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
at org.apache.logging.log4j.core.appender.RollingRandomAccessFileAppender.append(RollingRandomAccessFileAppender.java:207)
at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:448)
at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:433)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:403)
at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2091)
at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2011)
at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1884)
at com.lala.sgs.lalalog.log4j2.Log4j2Logger.error(Log4j2Logger.java:520)
"http-nio-8888-exec-4" #363 daemon prio=5 os_prio=0 tid=0x00007f0f48003000 nid=0x222 waiting for monitor entry [0x00007f11abeb8000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.logging.log4j.core.layout.TextEncoderHelper.copyDataToDestination(TextEncoderHelper.java:61)
- waiting to lock <0x000000065804dcc8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
at org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeTextWithCopy(TextEncoderHelper.java:57)
at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encodeWithThreadLocals(StringBuilderEncoder.java:70)
at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:63)
at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:32)
at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:219)
at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:57)
The blocking shown above no longer occurred after introducing the disruptor framework and configuring asyncLogger. (Reference: https://www.cnblogs.com/waterystone/p/11170540.html )
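A minimal sketch of what enabling Log4j2's all-async loggers amounts to, assuming log4j-core 2.x and the LMAX disruptor (com.lmax:disruptor) are on the classpath; in practice the context selector is usually set as a JVM flag rather than in code, and the class name below is only for illustration.

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class AsyncLoggerBootstrap {
    public static void main(String[] args) {
        // Equivalent to the JVM flag
        // -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector
        // and must run before Log4j2 initializes (i.e. before the first LogManager call).
        System.setProperty("Log4jContextSelector",
                "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");

        Logger logger = LogManager.getLogger(AsyncLoggerBootstrap.class);
        // With async loggers this call only publishes the event to the disruptor
        // ring buffer; file I/O and rollover checks run on a background thread,
        // so business threads no longer queue on the RollingRandomAccessFileManager
        // monitor as in the stacks above.
        logger.info("async logging enabled");
    }
}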
(2) After the logging performance problem was fixed, the load test was run again, and similar performance problems showed up around Kafka sends and the Redis connection pool.
"http-nio-8888-exec-295" #1570 daemon prio=5 os_prio=0 tid=0x00007f0fa40f6000 nid=0x6d7 waiting for monitor entry [0x00007f0d839e1000]
java.lang.Thread.State: BLOCKED (on object monitor)
at kafka.producer.Producer.send(Producer.scala:72)
- waiting to lock <0x0000000656965bf8> (a java.lang.Object)
at kafka.javaapi.producer.Producer.send(Producer.scala:33)
at com.lala.kafka.api.produce.KafkaProducer.send(KafkaProducer.java:65)
at com.lala.kafka.api.produce.KafkaProducer.sendBytes(KafkaProducer.java:54)
at com.lala.kafka.api.produce.ProducerPool.sendBytes(ProducerPool.java:49)
at com.lala.kafka.api.produce.BaseProducer.sendString(BaseProducer.java:14)
This was resolved by increasing poolSize, the number of producer instances used for sending to Kafka; a sketch of the idea follows.
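The internal com.lala.kafka.api ProducerPool is not shown in the source, so the following is only a minimal sketch of the usual pattern it implies: round-robining over several instances of the old (Kafka 0.8) kafka.javaapi.producer.Producer, whose send() is serialized internally. The class and method names here are hypothetical, not the real ones; raising poolSize simply spreads concurrent sends over more producer instances.

import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

// Hypothetical stand-in for the internal ProducerPool, not the real com.lala.kafka.api class.
public class RoundRobinProducerPool {
    private final Producer<String, byte[]>[] producers;
    private final AtomicInteger next = new AtomicInteger();

    @SuppressWarnings("unchecked")
    public RoundRobinProducerPool(Properties producerProps, int poolSize) {
        producers = new Producer[poolSize];
        for (int i = 0; i < poolSize; i++) {
            producers[i] = new Producer<String, byte[]>(new ProducerConfig(producerProps));
        }
    }

    public void sendBytes(String topic, byte[] payload) {
        // The old producer holds a lock for the whole send, so spreading calls over
        // poolSize instances is what relieves the blocking seen in the stack above.
        int idx = Math.floorMod(next.getAndIncrement(), producers.length);
        producers[idx].send(new KeyedMessage<String, byte[]>(topic, payload));
    }
}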
"asyncExecutor-253" #1470 prio=5 os_prio=0 tid=0x00007f151000b000 nid=0x672 waiting on condition [0x00007f0d857da000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000659daa5c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:524)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:438)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:361)
at redis.clients.util.Pool.getResource(Pool.java:49)
at redis.clients.jedis.JedisSentinelPool.getResource(JedisSentinelPool.java:209)
at redis.clients.jedis.JedisSentinelPool.getResource(JedisSentinelPool.java:17)
at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.fetchJedisConnector(JedisConnectionFactory.java:194)
at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.getConnection(JedisConnectionFactory.java:348)
at org.springframework.data.redis.core.RedisConnectionUtils.doGetConnection(RedisConnectionUtils.java:129)
at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:92)
at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:79)
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:194)
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:169)
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:157)
The Redis pool was running with the default settings; the default values can be looked up in GenericObjectPoolConfig.
This was resolved by increasing maxTotal, maxIdle and minIdle, as in the sketch below.
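A minimal sketch of raising those pool limits, assuming spring-data-redis 1.x with the Jedis sentinel setup visible in the stack above; the concrete numbers, master name and sentinel address are illustrative placeholders, not the values actually used.

import org.springframework.data.redis.connection.RedisSentinelConfiguration;
import org.springframework.data.redis.connection.jedis.JedisConnectionFactory;

import redis.clients.jedis.JedisPoolConfig;

public class RedisPoolTuning {
    public JedisConnectionFactory jedisConnectionFactory() {
        // GenericObjectPoolConfig defaults are maxTotal=8, maxIdle=8, minIdle=0,
        // which is exhausted quickly under load; the numbers below are illustrative.
        JedisPoolConfig poolConfig = new JedisPoolConfig();
        poolConfig.setMaxTotal(200);
        poolConfig.setMaxIdle(50);
        poolConfig.setMinIdle(20);

        // Placeholder sentinel settings; replace with the real master name and hosts.
        RedisSentinelConfiguration sentinelConfig = new RedisSentinelConfiguration()
                .master("mymaster")
                .sentinel("127.0.0.1", 26379);
        return new JedisConnectionFactory(sentinelConfig, poolConfig);
    }
}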
(3) The load test is still in progress.
(4) Checking GC with gcutil showed that young GCs were quite frequent:
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
12.42 0.00 96.21 50.59 90.24 86.46 1770 201.800 8 0.950 202.750
The JVM settings were:
-server -Xmx6g -Xms6g -Xmn256m -XX:PermSize=128m -Xss256k -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:LargePageSizeInBytes=128m -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70
The young generation turned out to be far too small, so that setting was enlarged. The settings were changed to:
-server -Xmx6g -Xms6g -Xmn3072m -Xss512k -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:LargePageSizeInBytes=128m -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintClassHistogram -Xloggc:../logs/gc.log
After re-running the load test, the GC figures improved:
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
15.07 0.00 77.15 32.25 90.37 86.32 116 24.551 8 6.216 30.767
How to obtain the dumps (reposted):
(1) Obtaining a thread dump
Using jstack
jstack is a tool bundled with the JDK that dumps the thread stacks of the JVM process with the given process ID (PID).
# Print the thread stacks to standard output
jstack PID
# Print the thread stacks to standard output, including information about locks
jstack -l PID
# Force-print the thread stacks to standard output; if jstack PID does not respond (the JVM process may be hung), add the -F flag
jstack -F PID
Using jcmd
jcmd is a tool bundled with the JDK that sends commands to a JVM process; depending on the command it can replace, or partly replace, jstack, jmap and others. Sending the Thread.print command prints the JVM's thread stacks.
# The command below is equivalent to jstack PID
jcmd PID Thread.print
# Equivalent to jstack -l PID
jcmd PID Thread.print -l
Using kill -3
kill sends a signal (SIGNAL) to a given process; by default it sends the terminate (TERM) signal, so kill PID is equivalent to kill -15 PID or kill -TERM PID. The JVM listens for the QUIT signal (whose value is 3) and, on receiving it, prints the current thread stacks plus a summary of heap usage, so compared with jstack you additionally get the heap summary. jstack, on the other hand, can take the -l flag to print lock information.
kill -3 PID
# or
kill -QUIT PID
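Besides the external tools above, a thread dump can also be taken from inside the process through the standard java.lang.management API; a minimal sketch (the class name is only for illustration):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class InProcessThreadDump {
    public static void dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // dumpAllThreads(lockedMonitors, lockedSynchronizers): passing true for both
        // gives lock details roughly comparable to the extra output of jstack -l.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
    }
}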
(2) Obtaining a heap dump
# Dump the JVM heap to the given file; if the heap holds many objects this takes a while. The format sub-option only supports b, i.e. binary format
jmap -dump:format=b,file=FILE_WITH_PATH PID
# If the JVM process does not respond to the command, try adding the -F flag
jmap -F -dump:format=b,file=FILE_WITH_PATH PID
# To dump only the live objects in the heap, add the live sub-option; note that live is not supported together with -F
jmap -dump:live,format=b,file=FILE_WITH_PATH PID
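A heap dump can also be triggered from inside the process via the HotSpot-specific diagnostic MXBean (com.sun.management); a minimal sketch, with the output path left as a parameter:

import java.io.IOException;
import java.lang.management.ManagementFactory;

import com.sun.management.HotSpotDiagnosticMXBean;

public class InProcessHeapDump {
    public static void dump(String filePath) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // The second argument true dumps only live objects, like jmap -dump:live,...
        bean.dumpHeap(filePath, true);
    }
}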
(3) Obtaining a heap summary
# The -heap option shows information about the heap of the given JVM process, including the heap parameters and the size and utilization of the young and old generations
jmap -heap PID
# Again, if the JVM process does not respond, try adding -F
jmap -F -heap PID
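For a rough in-process equivalent of the overall numbers, heap usage can also be read through the MemoryMXBean, although unlike jmap -heap it is not broken down by generation; a minimal sketch:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class InProcessHeapSummary {
    public static void print() {
        // init/used/committed/max of the whole heap.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println(heap);
    }
}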
(4) gcutil and gccapacity
# Show GC statistics, with the utilization of each space as a percentage of its capacity
jstat -gcutil PID
# Show the capacities of the generations and their spaces
jstat -gccapacity PID