A record of a concurrent mode failure: troubleshooting process and solution

Background: a backend scheduled-task script runs at 5:30 AM every day, scanning the database in batches and executing business logic on the results.

GC error log:

    2017-07-05T05:30:54.408+0800: 518534.458: [CMS-concurrent-mark-start]
    2017-07-05T05:30:55.279+0800: 518535.329: [GC 518535.329: [ParNew: 838848K->838848K(1118464K), 0.0000270 secs]
    [CMS-concurrent-mark: 1.564/1.576 secs] [Times: user=10.88 sys=0.31, real=1.57 secs]
     (concurrent mode failure): 2720535K->2719116K(2796224K), 13.3742340 secs] 
     3559383K->2719116K(3914688K), 
     [CMS Perm : 38833K->38824K(524288K)], 13.3748020 secs] [Times: user=16.19 sys=0.00, real=13.37 secs]
    2017-07-05T05:31:08.659+0800: 518548.710: [GC [1 CMS-initial-mark: 2719116K(2796224K)] 2733442K(3914688K), 0.0065150 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
    2017-07-05T05:31:08.666+0800: 518548.716: [CMS-concurrent-mark-start]
    2017-07-05T05:31:09.528+0800: 518549.578: 
    [GC 518549.578: [ParNew: 838848K->19737K(1118464K), 0.0055800 secs] 
    3557964K->2738853K(3914688K), 0.0060390 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
    [CMS-concurrent-mark: 1.644/1.659 secs] [Times: user=14.15 sys=0.84, real=1.66 secs]
    2017-07-05T05:31:10.326+0800: 518550.376: [CMS-concurrent-preclean-start]
    2017-07-05T05:31:10.341+0800: 518550.391: [CMS-concurrent-preclean: 0.015/0.015 secs] [Times: user=0.05 sys=0.02, real=0.02 secs]
    2017-07-05T05:31:10.341+0800: 518550.391: [CMS-concurrent-abortable-preclean-start]

Reference: understanding-cms-gc-logs

From that post, one cause of a concurrent mode failure is: "there was not enough space in the CMS generation to promote the worst case surviving young generation objects. We name this failure as 'full promotion guarantee failure'".

The suggested remedies are: "The concurrent mode failure can either be avoided [by] increasing the tenured generation size or [by] initiating the CMS collection at a lesser heap occupancy by setting CMSInitiatingOccupancyFraction to a lower value and setting UseCMSInitiatingOccupancyOnly to true."

The second option needs some overall consideration: if CMSInitiatingOccupancyFraction is set too low, CMS cycles may run too frequently and degrade performance. (See also the recommendation against configuring CMS for heaps under 3 GB: why no cms under 3G.)
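
If option two were taken, the change would amount to a couple of JVM flags along the following lines; the 60% threshold here is only an illustrative value, not one measured against this workload:

    -XX:+UseConcMarkSweepGC
    -XX:CMSInitiatingOccupancyFraction=60
    -XX:+UseCMSInitiatingOccupancyOnly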

Troubleshooting:

1. JVM configuration: -Xmx4096m -Xms2048m -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSCompactAtFullCollection -XX:MaxTenuringThreshold=10 -XX:-UseAdaptiveSizePolicy -XX:PermSize=512M -XX:MaxPermSize=1024M -XX:SurvivorRatio=3 -XX:NewRatio=2 -XX:+PrintGCDateStamps -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails. Nothing obviously wrong here.

2. The alert fires once a day, at 5:30 AM, which points to the scheduled task.

This is easy to track down: the service is a script service with almost no online request traffic, so by matching the alert time against the scheduled task's business logic, the problem can be located.

Business code:

    int batchNumber = 1;
    int realCount = 0;
    int offset = 0;
    int limit = 999;
    int totalCount = 0;
    // a fixed-size thread pool of 20 threads
    ExecutorService service = Executors.newFixedThreadPool(20);
    while (true) {
        LogUtils.info(logger, "{0},{1}->{2}", batchNumber, offset, (offset + limit));
        try {
            // paginated query
            Set<String> result = query(offset, limit);
            realCount = result.size();
            // hand the query result to the thread pool for processing
            service.execute(new AAAAAAA(result, batchNumber));
        } catch (Exception e) {
            LogUtils.error(logger, e, "exception,batch:{0},offset:{1},count:{2}", batchNumber, offset, limit);
            break;
        }
        totalCount += realCount;
        if (realCount < limit) {
            break;
        }
        batchNumber++;
        offset += limit;
    }
    service.shutdown();

It uses a fixed pool of 20 threads and loops, fetching 999 rows from the database on each iteration and submitting them to the pool for processing.

Analysis

newFixedThreadPool is backed by an unbounded LinkedBlockingQueue, and the table holds 20 million+ rows. Looping like this and pushing every batch into the queue, it is a wonder the heap was not blown out entirely.
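
For reference, Executors.newFixedThreadPool in the JDK is essentially a ThreadPoolExecutor wired to a LinkedBlockingQueue created with its default capacity of Integer.MAX_VALUE, i.e. effectively unbounded:

    public static ExecutorService newFixedThreadPool(int nThreads) {
        return new ThreadPoolExecutor(nThreads, nThreads,
                                      0L, TimeUnit.MILLISECONDS,
                                      new LinkedBlockingQueue<Runnable>());
    }

So every batch submitted by the loop above simply piles up in that queue, each task holding its Set<String> of up to 999 rows, until the scan finishes or the heap gives out.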

In the end I changed it to:
    BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(20);
    ThreadPoolExecutor service = new ThreadPoolExecutor(20, 20, 1, TimeUnit.HOURS, queue, new ThreadPoolExecutor.CallerRunsPolicy());

This uses a bounded queue, and the rejection policy is CallerRunsPolicy: when a task can neither be executed nor added to the waiting queue, the submitting (main) thread runs the task's run method itself. That can effectively turn the multithreaded pool into single-threaded execution and reduce throughput.
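
A small standalone sketch (not from the original job) makes that fallback visible; the pool size of 1 and queue size of 2 are chosen only so the saturation is easy to trigger:

    import java.util.concurrent.*;

    public class CallerRunsDemo {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(2);
            ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 1, TimeUnit.HOURS, queue,
                    new ThreadPoolExecutor.CallerRunsPolicy());
            for (int i = 0; i < 6; i++) {
                final int id = i;
                pool.execute(new Runnable() {
                    public void run() {
                        // Once the single worker is busy and the queue of 2 is full,
                        // CallerRunsPolicy makes the submitting thread run the task,
                        // so some lines print "main" instead of "pool-1-thread-1".
                        System.out.println("task " + id + " on " + Thread.currentThread().getName());
                        try { Thread.sleep(200); } catch (InterruptedException ignored) { }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }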

Will see how it performs tomorrow.

Postscript:

A better approach to making the thread pool block is this: implement a custom rejection policy that blocks the submitting thread when the queue is full and lets it continue once the queue has been drained.

    new RejectedExecutionHandler() {
    	@Override
    	public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
    		if (!executor.isShutdown()) {
    			try {
    				executor.getQueue().put(r);
    			} catch (InterruptedException e) {
    				// should not be interrupted
    			}
    		}
    	}
    };
    

put is used instead of offer: put blocks when the queue is full, while offer simply returns false. The thread pool's design is quite interesting.
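
Putting it together, a sketch of the batch job's pool using this blocking handler could look like the snippet below (assuming the java.util.concurrent imports); the sizes of 20 threads and a queue of 20 simply mirror the earlier snippet and are not tuned values, and the interrupt status is restored instead of being swallowed:

    BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(20);
    ThreadPoolExecutor service = new ThreadPoolExecutor(20, 20, 1, TimeUnit.HOURS, queue,
            new RejectedExecutionHandler() {
                @Override
                public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
                    if (!executor.isShutdown()) {
                        try {
                            // put() blocks until the queue has room, so the submitting
                            // thread waits instead of dropping the task or running it itself.
                            executor.getQueue().put(r);
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                }
            });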

For details, see: 并发编程网 - 支持生产阻塞的线程池 (a thread pool that blocks producers).

     
Original post: https://www.cnblogs.com/Jaxlinda/p/7145912.html