zoukankan      html  css  js  c++  java
  • 记一次 JVM 周期性 FGC 排查

    问题

    观察生产环境 JMX 监控, 发现应用在内存压力不大的情况下每小时会发生一次 full gc, 每次耗时约 300 毫秒. 这显然不是正常情况, 需要进行排查.

    排查过程

    首先查看 GC 日志, 发现 FGC 是由 System.gc() 触发的, 日志内容大致如下(日志内容来源于周期性Full GC的异常排查):

    [GC (System.gc())
    Desired survivor size 5767168 bytes, new threshold 1 (max 15)
     [PSYoungGen: 518231K->2816K(693760K)] 1188617K->673417K(2092032K), 0.0274540 secs] [Times: user=0.06 sys=0.00, real=0.03 secs] 
    Heap after GC invocations=194 (full 1):
     PSYoungGen      total 693760K, used 2816K [0x00000007d5500000, 0x0000000800000000, 0x0000000800000000)
      eden space 688128K, 0% used [0x00000007d5500000,0x00000007d5500000,0x00000007ff500000)
      from space 5632K, 50% used [0x00000007ff500000,0x00000007ff7c0010,0x00000007ffa80000)
      to   space 5632K, 0% used [0x00000007ffa80000,0x00000007ffa80000,0x0000000800000000)
     ParOldGen       total 1398272K, used 670601K [0x000000077ff80000, 0x00000007d5500000, 0x00000007d5500000)
      object space 1398272K, 47% used [0x000000077ff80000,0x00000007a8e627e8,0x00000007d5500000)
     PSPermGen       total 262144K, used 125738K [0x000000076ff80000, 0x000000077ff80000, 0x000000077ff80000)
      object space 262144K, 47% used [0x000000076ff80000,0x0000000777a4a938,0x000000077ff80000)
    }
    

    对项目代码进行搜索后, 排除了代码中调用 System.gc() 的可能. 结合每小时触发一次 FGC 的周期性, 怀疑是有线程在循环调用, 使用了 jstack 找出了两个可疑线程:

    "RMI RenewClean-[xxx.xxx.xxx.xxx:35539]" #323 daemon prio=5 os_prio=0 tid=0x00007f82a4022800 nid=0x144 in Object.wait() [0x00007f81d7ebf000]                                                                           
    java.lang.Thread.State: TIMED_WAITING (on object monitor)                                                                                                                                                       
    at java.lang.Object.wait(Native Method)                                                                                                                                                                    
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)                                                                                                                                            
    - locked <0x00000007035c6ba0> (a java.lang.ref.ReferenceQueue$Lock)                                                                                                                                        
    at sun.rmi.transport.DGCClient$EndpointEntry$RenewCleanThread.run(DGCClient.java:563)
    at java.lang.Thread.run(Thread.java:748)                                                     
    
    
    "GC Daemon" #244 daemon prio=2 os_prio=0 tid=0x00007f8405f69800 nid=0xf5 in Object.wait() [0x00007f81ddb0d000]
    java.lang.Thread.State: TIMED_WAITING (on object monitor)                                                                     
    at java.lang.Object.wait(Native Method)                                                                                                                                                                    
    at sun.misc.GC$Daemon.run(GC.java:117)
    - locked <0x00000007035bddc8> (a sun.misc.GC$LatencyL
    

    检查 sun.misc.GC 和 sun.rmi.transport.DGCClient 两个类:

    DGCClient:

    private static final long gcInterval = (Long)AccessController.doPrivileged(new GetLongAction("sun.rmi.dgc.client.gcInterval", 3600000L));
    
    private static class EndpointEntry {
        public static DGCClient.EndpointEntry lookup(Endpoint var0) {
            synchronized(endpointTable) {
                DGCClient.EndpointEntry var2 = (DGCClient.EndpointEntry)endpointTable.get(var0);
                if (var2 == null) {
                    var2 = new DGCClient.EndpointEntry(var0);
                    endpointTable.put(var0, var2);
                    if (gcLatencyRequest == null) {
                        gcLatencyRequest = GC.requestLatency(DGCClient.gcInterval);
                    }
                }
    
                return var2;
            }
        }
    }
    

    GC:

    private static Thread daemon = null;
    
    public static GC.LatencyRequest requestLatency(long var0) {
        return new GC.LatencyRequest(var0);
    } 
    
    
    

    由上面的代码可知开启 RMI 会通过 DGCClient 调用 GC#requestLatency(long) 方法, 创建一个守护线程, 周期性执行 System.gc(). 这个周期为 DGCClient.gcInterval, 默认为一小时.

    另外, 项目还依赖了 cxf, cxf 默认会每 10 小时执行一次 System.gc():

    if (!skipHack("org.apache.cxf.JDKBugHacks.gcRequestLatency")) {
        Class<?> clazz = Class.forName("sun.misc.GC");
        Method method = clazz.getDeclaredMethod("currentLatencyTarget");
        Long l = (Long)method.invoke(null);
        if (l != null && l.longValue() == 0) {
            //something already set it, move on
            method = clazz.getDeclaredMethod("requestLatency",
                new Class[] {Long.TYPE});
            method.invoke(null, Long.valueOf(36000000));
        }
    }
    

    解决方法

    1. 将 sun.rmi.dgc.server.gcInterval 和 sun.rmi.dgc.client.gcInterval 参数设置为 9223372036854775807L (对这个值做了特殊处理 GC 守护线程会马上结束), 还需要一个参数 org.apache.cxf.JDKBugHacks.gcRequestLatency 设置为 false
    -Dsun.rmi.dgc.server.gcInterval=9223372036854775807L -Dsun.rmi.dgc.client.gcInterval=9223372036854775807L -Dorg.apache.cxf.JDKBugHacks.gcRequestLatency=false
    
    1. 将 sun.rmi.dgc.server.gcInterval 和 sun.rmi.dgc.client.gcInterval 设置为 9223372036854775806L, 此时守护线程会一直阻塞, 不会再触发 GC.
    2. 使用 CMS 回收器, 并开启并行回收, 以减少停顿时间
    -XX:+ExplicitGCInvokesConcurrent -XX:+UseConcMarkSweepGC
    

    参考

    gc-tuning-in-practice
    周期性Full GC的异常排查

  • 相关阅读:
    pytorch 之cuda语义
    pytorch 的自动求导机制-----requiers_grad 和volatile
    pytorch 之 Variable
    pytorch的安装
    【路径规划】OSQP曲线平滑 公式及代码
    【Ubuntu 1】ubuntu的软件包及便携系列 记录
    【路径规划】 The Dynamic Window Approach to Collision Avoidance (附python代码实例)
    【论文阅读】Learning to drive from a world on rails
    Motion Planning 是什么
    【路径规划】 Optimal Trajectory Generation for Dynamic Street Scenarios in a Frenet Frame (附python代码实例)
  • 原文地址:https://www.cnblogs.com/FJH1994/p/12317544.html
Copyright © 2011-2022 走看看