  • Monitoring Java Programs with Metrics

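    Concepts

    Metrics is a library that provides measurement tools for the various indicators of a Java service. By embedding Metrics calls in Java code, you can conveniently monitor the indicators of your business code.

    The most popular metrics library today is dropwizard/metrics by Coda Hale, which is widely used in well-known open source projects such as Hadoop, Kafka, Spark, and JStorm.

    Its advantages include:

    • Integrations with Ehcache, Apache HttpClient, JDBI, Jersey, Jetty, Log4J, Logback, the JVM, and more
    • Support for several metric types: Gauges, Counters, Meters, Histograms, and Timers
    • Support for several Reporters that publish the metrics
      • JMX, Console, CSV files, and SLF4J loggers
      • Ganglia and Graphite, for graphical display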

    MetricRegistry

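    The MetricRegistry class is the core of Metrics: it is the container for all of the metrics in an application, and the starting point for using the library. The Maven dependencies are listed at the end of this article.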

    // Shared registry that holds every metric in the application.
    static final MetricRegistry metrics = new MetricRegistry();

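    Reporter

    Once metrics have been collected they need to be published somewhere, and that is the job of a Reporter.

    Console

    Print the metrics directly to the console.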

    private static void startReportConsole() {
        // Print every metric in the registry to stdout once per second.
        ConsoleReporter reporter = ConsoleReporter.forRegistry(metrics)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build();
        reporter.start(1, TimeUnit.SECONDS);
    }

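    JMX

    Report the metrics to JMX; other open source tools can later forward them to Graphite or similar systems for graphical display. The metrics are visible under the MBeans tab in JConsole.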

    private static void startReportJmx() {
        // Expose every metric in the registry as a JMX MBean.
        JmxReporter reporterJmx = JmxReporter.forRegistry(metrics).build();
        reporterJmx.start();
    }

    Graphite

    Upload the metrics to Graphite, where they can be viewed in Graphite-web.

    private static void startReportGraphite() {
        // Connect to the Graphite server and push every metric once per minute.
        Graphite graphite = new Graphite(new InetSocketAddress("graphite.xxx.com", 2003));
        GraphiteReporter reporter = GraphiteReporter.forRegistry(metrics)
                .prefixedWith("test.metrics")
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .filter(MetricFilter.ALL)
                .build(graphite);
        reporter.start(1, TimeUnit.MINUTES);
    }

    Wrapping the Reporters

    Usage: MetricCommon.getMetricAndStartReport();

    public class MetricCommon {
        private static final MetricRegistry metricRegistry = new MetricRegistry();

        // Starts all three reporters and returns the shared registry.
        public static MetricRegistry getMetricAndStartReport() {
            startReportConsole();
            startReportJmx();
            startReportGraphite();
            return metricRegistry;
        }

        // Bodies as in the sections above, reading from metricRegistry.
        private static void startReportConsole() {...}
        private static void startReportJmx() {...}
        private static void startReportGraphite() {...}
    }

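    Metric types

    Metrics provides the following metric types:

    • Gauges: record an instantaneous value, for example the length of a pending queue.
    • Histograms: measure the distribution of a stream of values: maximum, minimum, mean, median, and percentiles (75%, 90%, 95%, 98%, 99%, and 99.9%).
    • Meters: measure the rate of calls (TPS): the total number of requests, the mean requests per second, and the average TPS over the last 1, 5, and 15 minutes.
    • Timers: for when you need both the TPS and the distribution of durations; a Timer is built on a Histogram and a Meter.
    • Counters: a counter with built-in inc() and dec() methods, starting at 0.
    • Health Checks: check whether the application, its submodules, or related modules are running normally.

    Gauges

    The simplest metric: it just returns a single value. For example, we may want to measure the number of tasks in a pending queue.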

    public class GaugeTest {
        private static final MetricRegistry registry = MetricCommon.getMetricAndStartReport();
        private static final Random random = new Random();

        @Test
        public void testOneGauge() throws InterruptedException {
            Queue<String> queue = new LinkedList<>();
            // The gauge reads the current queue size each time a reporter polls it.
            registry.register(MetricRegistry.name(GaugeTest.class, "testGauges-queue-size", "size"),
                    (Gauge<Integer>) () -> queue.size());
            while (true) {
                Thread.sleep(1000);
                queue.add("Job-xxx");
            }
        }

        @Test
        public void testMultiGauge() throws InterruptedException {
            Map<Integer, Integer> map = new ConcurrentHashMap<>();
            while (true) {
                int i = random.nextInt(100);
                int j = i % 10;
                // Register one gauge per last digit, the first time that digit appears.
                if (!map.containsKey(j)) {
                    map.put(j, i);
                    registry.register(MetricRegistry.name(GaugeTest.class, "testGauges-number", String.valueOf(j)),
                            (Gauge<Integer>) () -> map.get(j));
                } else {
                    map.put(j, i);
                }
                Thread.sleep(1000);
            }
        }
    }

    The first test case uses a gauge to record the length of the queue:

    -- Gauges ----------------------------------------------------------------------
    GaugeTest.testGauges-queue-size.size
    value = 4
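
    A gauge holds no state of its own: the registry stores a callback, and a reporter invokes it on each reporting cycle, which is why the value tracks the queue as it grows. The following stdlib-only sketch illustrates that idea with java.util.function.Supplier standing in for Gauge<Integer>. The names GaugeSketch and queueSizeGauge are hypothetical and are not part of the Metrics library.

```java
import java.util.LinkedList;
import java.util.Queue;
import java.util.function.Supplier;

final class GaugeSketch {
    // Returns a callback that reads the queue size each time it is invoked.
    static Supplier<Integer> queueSizeGauge(Queue<?> queue) {
        return queue::size;
    }

    public static void main(String[] args) {
        Queue<String> queue = new LinkedList<>();
        Supplier<Integer> gauge = queueSizeGauge(queue);
        System.out.println("before: " + gauge.get()); // the queue is still empty here
        queue.add("Job-1");
        queue.add("Job-2");
        // The same gauge object now sees the two jobs added after it was created.
        System.out.println("after: " + gauge.get());
    }
}
```

    Because the value is computed on demand, a gauge costs nothing between reports, but the callback should be cheap and safe to call from the reporter's thread.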

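    The second test case generates a random number below 100 on each iteration, groups the numbers by their last digit, and uses one gauge per group to record the most recent number in that group.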

    -- Gauges ----------------------------------------------------------------------
    GaugeTest.testGauges-number.0
    value = 60
    GaugeTest.testGauges-number.1
    value = 1
    GaugeTest.testGauges-number.2
    value = 82
    GaugeTest.testGauges-number.3
    value = 23
    GaugeTest.testGauges-number.4
    value = 74
    GaugeTest.testGauges-number.5
    value = 25
    GaugeTest.testGauges-number.7
    value = 17
    GaugeTest.testGauges-number.8
    value = 78
    GaugeTest.testGauges-number.9
    value = 69

    Histogram

    A Histogram measures the distribution of a stream of values: minimum, maximum, mean, median, and the 75th, 90th, 95th, 98th, 99th, and 99.9th percentiles.

    public class HistogramTest {
        private static final MetricRegistry registry = MetricCommon.getMetricAndStartReport();
        public static Random random = new Random();

        @Test
        public void test() throws InterruptedException {
            // The reservoir keeps a sample that is biased towards recent values.
            Histogram histogram = new Histogram(new ExponentiallyDecayingReservoir());
            registry.register(MetricRegistry.name(HistogramTest.class, "request", "histogram"), histogram);
            while (true) {
                Thread.sleep(1000);
                // Record one value drawn uniformly from [0, 100000) per second.
                histogram.update(random.nextInt(100000));
            }
        }
    }

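    After running for a long time, the sample statistics converge on those of the underlying uniform distribution over [0, 100000): the 75th percentile settles near 75,000 and the 99.9th percentile near 99,900. Individual snapshots still fluctuate around these values.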

    -- Histograms ------------------------------------------------------------------
    HistogramTest.request.histogram
    count = 1336
    min = 97
    max = 99930
    mean = 49816.49
    stddev = 29435.27
    median = 49368.00
    75% <= 75803.00
    95% <= 95340.00
    98% <= 98096.00
    99% <= 98724.00
    99.9% <= 99930.00
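
    The convergence of the percentiles can be checked without the library. The sketch below draws values the same way as the test, random.nextInt(100000), and computes exact nearest-rank percentiles over all samples. This is an assumption made for illustration: it is not how ExponentiallyDecayingReservoir samples or interpolates, and PercentileSketch is a hypothetical class.

```java
import java.util.Arrays;
import java.util.Random;

final class PercentileSketch {
    // Nearest-rank percentile over a sorted copy of the samples; q is in (0, 1].
    static long percentile(long[] samples, double q) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    // Draws n values uniformly from [0, 100000), like random.nextInt(100000).
    static long[] uniformSamples(int n, long seed) {
        Random random = new Random(seed);
        long[] samples = new long[n];
        for (int i = 0; i < n; i++) {
            samples[i] = random.nextInt(100000);
        }
        return samples;
    }

    public static void main(String[] args) {
        long[] samples = uniformSamples(200_000, 42L);
        System.out.println("75%   <= " + percentile(samples, 0.75));
        System.out.println("99.9% <= " + percentile(samples, 0.999));
    }
}
```

    With a few hundred thousand samples the 75th percentile lands close to 75,000 and the 99.9th close to 99,900, matching the limits described for the histogram test.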

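    Meters

    A Meter measures the rate at which a series of events occurs, for example TPS. It tracks the rate over the last 1, 5, and 15 minutes, as well as the mean rate over the whole lifetime.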

    public class MetersTest {
        // getMetricAndStartAllReport(host, prefix) is a variant of the MetricCommon helper; its body is not shown in this article.
        MetricRegistry registry = MetricCommon.getMetricAndStartAllReport("nc110x.corp.youdao.com", "test.metrics");
        public static Random random = new Random();

        @Test
        public void testOne() throws InterruptedException {
            Meter meterTps = registry.meter(MetricRegistry.name(MetersTest.class, "request", "tps"));
            while (true) {
                // Mark one event, then sleep for a random time below one second.
                meterTps.mark();
                Thread.sleep(random.nextInt(1000));
            }
        }

        @Test
        public void testMulti() throws InterruptedException {
            while (true) {
                int i = random.nextInt(100);
                int j = i % 10;
                // meter() returns the existing meter for this name, or creates it on first use.
                Meter meterTps = registry.meter(MetricRegistry.name(MetersTest.class, "request", "tps", String.valueOf(j)));
                meterTps.mark();
                Thread.sleep(10);
            }
        }
    }

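    Note that registering multiple meters differs from registering multiple gauges or histograms: registry.meter(name) is a get-or-add operation, so it returns the existing meter when one with that name is already registered, whereas calling register() twice with the same name throws an IllegalArgumentException.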

    public Meter meter(String name) {
    return (Meter)this.getOrAdd(name, MetricRegistry.MetricBuilder.METERS);
    }
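
    The difference between the two registration styles can be sketched with a plain ConcurrentHashMap. This is a simplified stand-in, not the library's implementation: RegistrySketch is a hypothetical class, and AtomicLong stands in for a real metric type.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

final class RegistrySketch {
    private final ConcurrentMap<String, AtomicLong> metrics = new ConcurrentHashMap<>();

    // Strict registration: a second registration under the same name is rejected.
    void register(String name, AtomicLong metric) {
        if (metrics.putIfAbsent(name, metric) != null) {
            throw new IllegalArgumentException("A metric named " + name + " already exists");
        }
    }

    // Get-or-add: returns the existing metric, or creates one on first use.
    AtomicLong getOrAdd(String name) {
        return metrics.computeIfAbsent(name, key -> new AtomicLong());
    }

    public static void main(String[] args) {
        RegistrySketch registry = new RegistrySketch();
        AtomicLong first = registry.getOrAdd("request.tps.6");
        AtomicLong second = registry.getOrAdd("request.tps.6");
        // Both calls hand back the same object, so it is safe to call inside a loop.
        System.out.println("same instance: " + (first == second));
    }
}
```

    This is why the multi-meter test can call registry.meter(...) on every iteration, while the multi-gauge test has to guard register(...) with a containsKey check.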

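    The result of the single-meter test case is shown below. As the number of events grows, every rate converges on about 2 events per second, because the loop sleeps for about 500 ms on average between marks.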

    -- Meters ----------------------------------------------------------------------
    MetersTest.request.tps
    count = 452
    mean rate = 1.99 events/second
    1-minute rate = 2.03 events/second
    5-minute rate = 2.00 events/second
    15-minute rate = 2.00 events/second
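
    The 1-, 5-, and 15-minute rates are exponentially weighted moving averages, which is why they settle smoothly towards the long-run rate instead of jumping. The sketch below shows the general technique under stated assumptions: a fixed 5-second tick and a smoothing factor alpha = 1 - exp(-tick / window). EwmaSketch is a hypothetical class for illustration, not the library's own code.

```java
final class EwmaSketch {
    static final double TICK_SECONDS = 5.0; // assumed tick interval

    private final double alpha;
    private double rate;          // events per second
    private boolean initialized;

    EwmaSketch(double windowMinutes) {
        // Smoothing factor for one tick relative to the averaging window.
        this.alpha = 1.0 - Math.exp(-TICK_SECONDS / 60.0 / windowMinutes);
    }

    // Folds the events counted during one tick into the moving average.
    void tick(long eventsInTick) {
        double instantRate = eventsInTick / TICK_SECONDS;
        if (initialized) {
            rate += alpha * (instantRate - rate);
        } else {
            // The first tick seeds the average with the observed rate.
            rate = instantRate;
            initialized = true;
        }
    }

    double rate() {
        return rate;
    }

    public static void main(String[] args) {
        EwmaSketch oneMinute = new EwmaSketch(1.0);
        // About 2 events per second, as in the single-meter test: 10 events per 5-second tick.
        for (int i = 0; i < 12; i++) {
            oneMinute.tick(10);
        }
        System.out.println("steady rate: " + oneMinute.rate());
        oneMinute.tick(0); // one idle tick pulls the average down only slightly
        System.out.println("after idle tick: " + oneMinute.rate());
    }
}
```

    A longer window gives a smaller alpha, so the 15-minute rate reacts more slowly to bursts than the 1-minute rate.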

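    For the multi-meter test case, the results for last digits 6, 7, and 8 are shown below. Each rate converges on roughly 10 events per second: with a 10 ms sleep there are about 100 marks per second, spread evenly across the 10 last digits, so each meter receives about 10.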

    MetersTest.request.tps.6
    count = 905
    mean rate = 9.74 events/second
    1-minute rate = 9.76 events/second
    5-minute rate = 9.94 events/second
    15-minute rate = 9.98 events/second
    MetersTest.request.tps.7
    count = 935
    mean rate = 10.07 events/second
    1-minute rate = 10.62 events/second
    5-minute rate = 11.82 events/second
    15-minute rate = 12.19 events/second
    MetersTest.request.tps.8
    count = 937
    mean rate = 10.09 events/second
    1-minute rate = 10.09 events/second
    5-minute rate = 10.31 events/second
    15-minute rate = 10.37 events/second

    Timer

    A Timer is a combination of a Histogram and a Meter: the histogram measures the duration of a piece of code or a call, and the meter measures its TPS.

    // The class is named TestTimer so that it matches the TestTimer.class used in the metric names.
    public class TestTimer {
        public static Random random = new Random();
        private static final MetricRegistry registry = MetricCommon.getMetricAndStartAllReport("nc110x.corp.youdao.com", "test.metrics");
        private static final Map<Integer, Timer> timerMap = new ConcurrentHashMap<>();

        @Test
        public void testOneTimer() throws InterruptedException {
            Timer timer = registry.timer(MetricRegistry.name(TestTimer.class, "get-latency"));
            Timer.Context ctx;
            while (true) {
                // Everything between time() and stop() is measured.
                ctx = timer.time();
                Thread.sleep(random.nextInt(1000));
                ctx.stop();
            }
        }

        @Test
        public void testMultiTimer() throws InterruptedException {
            while (true) {
                int i = random.nextInt(100);
                int j = i % 10;
                // One timer per last digit; timer() is get-or-add, like meter().
                Timer timer = registry.timer(MetricRegistry.name(TestTimer.class, "get-latency", String.valueOf(j)));
                Timer.Context ctx;
                ctx = timer.time();
                Thread.sleep(random.nextInt(1000));
                ctx.stop();
                Thread.sleep(1000);
            }
        }
    }

    Test case 1 uses a single timer, with the result below. The durations converge on the statistics of the underlying distribution.

    -- Timers ----------------------------------------------------------------------
    com.testmetrics.TestTimer.get-latency
    count = 657
    mean rate = 2.05 calls/second
    1-minute rate = 1.98 calls/second
    5-minute rate = 2.02 calls/second
    15-minute rate = 2.01 calls/second
    min = 4.98 milliseconds
    max = 998.93 milliseconds
    mean = 496.79 milliseconds
    stddev = 297.46 milliseconds
    median = 501.02 milliseconds
    75% <= 765.09 milliseconds
    95% <= 952.03 milliseconds
    98% <= 974.12 milliseconds
    99% <= 989.02 milliseconds
    99.9% <= 998.93 milliseconds
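
    The timer output combines a call count and rates (the meter side) with duration statistics (the histogram side). A minimal stdlib sketch of that combination follows. TimerSketch and its methods are hypothetical, and it stores every duration in a list rather than using a sampling reservoir, which is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

final class TimerSketch {
    private final List<Long> durationsNanos = new ArrayList<>(); // histogram side
    private long count;                                          // meter side

    // Records one completed call and its duration.
    synchronized void update(long nanos) {
        durationsNanos.add(nanos);
        count++;
    }

    // Times a block of work, recording the duration even if the work throws.
    <T> T time(Callable<T> work) throws Exception {
        long start = System.nanoTime();
        try {
            return work.call();
        } finally {
            update(System.nanoTime() - start);
        }
    }

    synchronized long count() {
        return count;
    }

    synchronized long maxNanos() {
        long max = 0;
        for (long d : durationsNanos) {
            max = Math.max(max, d);
        }
        return max;
    }

    synchronized double meanNanos() {
        if (durationsNanos.isEmpty()) {
            return 0.0;
        }
        long sum = 0;
        for (long d : durationsNanos) {
            sum += d;
        }
        return (double) sum / durationsNanos.size();
    }

    public static void main(String[] args) throws Exception {
        TimerSketch timer = new TimerSketch();
        timer.time(() -> {
            Thread.sleep(20); // stands in for the measured call
            return null;
        });
        System.out.println("count = " + timer.count());
        System.out.println("mean  = " + timer.meanNanos() / 1_000_000.0 + " milliseconds");
    }
}
```

    Recording the duration in a finally block mirrors the usual advice for ctx.stop(): a failed call should still be counted and timed.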

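    Counters

    A Counter is simply a counter: in essence a gauge wrapped around an AtomicLong. The approach below tracks the size of a queue more efficiently than querying the queue itself.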

    public class CounterTest {
        private static final MetricRegistry registry = MetricCommon.getMetricAndStartReport();
        public static Queue<String> q = new LinkedBlockingQueue<String>();
        public static Counter pendingJobs;
        public static Random random = new Random();

        public static void addJob(String job) {
            pendingJobs.inc();
            q.offer(job);
        }

        public static String takeJob() {
            String job = q.poll();
            // Only decrement when a job was actually removed, so the count never goes negative.
            if (job != null) {
                pendingJobs.dec();
            }
            return job;
        }

        @Test
        public void test() throws InterruptedException {
            pendingJobs = registry.counter(MetricRegistry.name(Queue.class, "pending-jobs", "size"));
            int num = 1;
            while (true) {
                Thread.sleep(200);
                // Take a job about 30% of the time, add one about 70% of the time.
                if (random.nextDouble() > 0.7) {
                    String job = takeJob();
                    System.out.println("take job : " + job);
                } else {
                    String job = "Job-" + num;
                    addJob(job);
                    System.out.println("add job : " + job);
                }
                num++;
            }
        }
    }

    The number of pending jobs keeps growing: each iteration either adds one job or takes one job, and adds outnumber takes by about 7:3.

    -- Counters --------------------------------------------------------------------
    java.util.Queue.pending-jobs.size
    count = 36
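
    The growth of the count follows from the 7:3 ratio between adds and takes in the test loop. The sketch below simulates that loop without the library, assuming that a take on an empty queue leaves the count unchanged. PendingJobsSketch and simulate are hypothetical names.

```java
import java.util.Random;

final class PendingJobsSketch {
    // Simulates the test loop: take with probability 0.3, add with probability 0.7.
    static long simulate(int iterations, long seed) {
        Random random = new Random(seed);
        long pending = 0;
        for (int i = 0; i < iterations; i++) {
            if (random.nextDouble() > 0.7) {
                if (pending > 0) {
                    pending--; // nothing to take from an empty queue
                }
            } else {
                pending++;
            }
        }
        return pending;
    }

    public static void main(String[] args) {
        int iterations = 10_000;
        long pending = simulate(iterations, 1L);
        // On average each iteration nets 0.7 - 0.3 = 0.4 jobs.
        System.out.println("pending after " + iterations + " iterations: " + pending);
    }
}
```

    With one iteration every 200 ms, that net gain of about 0.4 jobs per iteration works out to roughly two additional pending jobs per second.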

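    Health Checks

    Metrics provides a separate module, Health Checks, for checking whether the application, its submodules, or related modules are running normally. It is independent of the metrics-core module; to use it, import the metrics-healthchecks package.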

    public class HealthChecksTest extends HealthCheck {
        private static final Random random = new Random();

        @Override
        protected Result check() throws Exception {
            // Report unhealthy roughly one time in ten.
            if (random.nextInt(10) != 9) {
                return Result.healthy();
            } else {
                return Result.unhealthy("oh,unhealthy");
            }
        }

        @Test
        public void test() throws InterruptedException {
            HealthCheckRegistry registry = new HealthCheckRegistry();
            registry.register("check1", new HealthChecksTest());
            registry.register("check2", new HealthChecksTest());
            while (true) {
                // Run every registered check once per second and print each result.
                for (Map.Entry<String, Result> entry : registry.runHealthChecks().entrySet()) {
                    if (entry.getValue().isHealthy()) {
                        System.out.println(entry.getKey() + ": OK, message:" + entry.getValue());
                    } else {
                        System.err.println(entry.getKey() + ": FAIL, error message: " + entry.getValue());
                    }
                }
                Thread.sleep(1000);
            }
        }
    }

    The test registers two health checks whose check() method draws a random number and reports healthy unless the number is 9. The output is as follows:

    check1: OK, message:Result{isHealthy=true}
    check2: FAIL, error message: Result{isHealthy=false, message=oh,unhealthy}
    check1: OK, message:Result{isHealthy=true}
    check2: OK, message:Result{isHealthy=true}
    check1: OK, message:Result{isHealthy=true}
    check2: OK, message:Result{isHealthy=true}
    check1: OK, message:Result{isHealthy=true}
    check2: OK, message:Result{isHealthy=true}
    check1: OK, message:Result{isHealthy=true}
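
    The pattern of named checks that are run together and reported one by one can be sketched with the standard library alone. In this illustration BooleanSupplier stands in for a HealthCheck and a check that throws is treated as unhealthy. HealthCheckSketch is a hypothetical class, not the metrics-healthchecks API.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BooleanSupplier;

final class HealthCheckSketch {
    private final Map<String, BooleanSupplier> checks = new TreeMap<>();

    void register(String name, BooleanSupplier check) {
        checks.put(name, check);
    }

    // Runs every check in name order; a check that throws is reported as unhealthy.
    Map<String, Boolean> runAll() {
        Map<String, Boolean> results = new TreeMap<>();
        for (Map.Entry<String, BooleanSupplier> entry : checks.entrySet()) {
            boolean healthy;
            try {
                healthy = entry.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                healthy = false;
            }
            results.put(entry.getKey(), healthy);
        }
        return results;
    }

    public static void main(String[] args) {
        HealthCheckSketch registry = new HealthCheckSketch();
        registry.register("check1", () -> true);
        registry.register("check2", () -> {
            throw new IllegalStateException("dependency unavailable");
        });
        for (Map.Entry<String, Boolean> entry : registry.runAll().entrySet()) {
            System.out.println(entry.getKey() + ": " + (entry.getValue() ? "OK" : "FAIL"));
        }
    }
}
```

    Catching the exception inside runAll keeps one failing check from hiding the results of the others.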

    Maven dependencies

    • metrics-core: required
    • metrics-healthchecks: add when using health checks
    • metrics-graphite: add when using Graphite
    • org.slf4j: without it, error logs from the metrics-graphite package are not visible

      <properties>
        <metrics.version>3.1.0</metrics.version>
        <slf4j.version>1.7.22</slf4j.version>
      </properties>
      <dependencies>
        <dependency>
          <groupId>io.dropwizard.metrics</groupId>
          <artifactId>metrics-core</artifactId>
          <version>${metrics.version}</version>
        </dependency>
        <dependency>
          <groupId>io.dropwizard.metrics</groupId>
          <artifactId>metrics-healthchecks</artifactId>
          <version>${metrics.version}</version>
        </dependency>
        <dependency>
          <groupId>io.dropwizard.metrics</groupId>
          <artifactId>metrics-graphite</artifactId>
          <version>${metrics.version}</version>
        </dependency>
        <dependency>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-api</artifactId>
          <version>${slf4j.version}</version>
        </dependency>
        <dependency>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-simple</artifactId>
          <version>${slf4j.version}</version>
        </dependency>
      </dependencies>

    References

    http://metrics.dropwizard.io/3.1.0/getting-started/
    http://www.cnblogs.com/nexiyi/p/metrics_sample_1.html
    http://wuchong.me/blog/2015/08/01/getting-started-with-metrics/

  • Original article: https://www.cnblogs.com/lijianming180/p/12259003.html