我为什么不建议使用框架默认的 DefaultMeterObservationHandler

我为什么不建议使用框架默认的 DefaultMeterObservationHandler

个人创作公约:本人声明创作的所有文章皆为自己原创,如果有参考任何文章的地方,会标注出来,如果有疏漏,欢迎大家批判。如果大家发现网上有抄袭本文章的,欢迎举报,并且积极向这个 github 仓库 提交 issue,谢谢支持~
另外,本文为了避免抄袭,会在不影响阅读的情况下,在文章的随机位置放入对于抄袭和洗稿的人的“亲切”的问候。如果是正常读者看到,笔者在这里说声对不起,。如果被抄袭狗或者洗稿狗看到了,希望你能够好好反思,不要再抄袭了,谢谢。

背景知识

最近,我们升级了 SpringBoot 3.x,并且,升级后,我们全面改造了原来的 Sleuth 以及 Micrometer 监控,最新的 io.micrometer.observation.Observation 抽象将链路追踪以及指标监控(opentracing 和 opentelemetry 两个标准)结合,这样,我们就可以在链路追踪中,同时记录指标监控数据了。

并且,在此基础上,我们还加入了全局的 io.micrometer.observation.ObservationHandler,用于在 Observation start 的时候,生成 JFR 事件,在 stop 的时候 commit JFR 事件。这样我们就实现了通过一个 Observation

    1. 暴露指标监控数据到 /actuator/prometheus,供 Prometheus 采集,Grafana 展示:img.png
    1. 上报链路追踪数据到 Jaeger:img.png
    1. 生成 JFR 事件,供 JDK Mission Control 分析:img.png

为何要这么做呢?

  • 指标数据是统计数据,是聚合数据,是一段时间内的数据,而链路追踪数据是实时数据,是每个请求的数据
  • 但是像是链路追踪上报在生产上肯定不能 100% 上报(上报性能,还有成本考虑,以及查询的存储性能有限,成本不能太高),而是采样上报,这样就会导致链路追数据不全
  • JFR 是保留详细数据的好选择,保留在本地,只有发生问题需要排查的时候,事后按需获取就行。

这样,既节约了成本,又能及时报警,又能容易通过 Jaeger 通过链路追踪数据定位出问题的实例,然后通过获取对应实例的 JFR 事件定位详细问题。

全面使用 Observation 遇到了内存溢出以及 CPU 飙高(非不断 GC 导致)

但是,我们在全面使用 Observation 的时候,发现了一个问题,就是内存溢出以及 CPU 飙高(非不断 GC 导致),刚开始我们因为 CPU 飙高是内存溢出引起的,但是后来发现,并不只这个原因。

首先为何会出现内存溢出,我们先做个测试,添加依赖:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>3.2.1</version>
    </parent>

    <groupId>com.github.hashjang</groupId>
    <artifactId>wwsmbjysymrdo</artifactId>

    <properties>
        <!--所有项目 Java 基线版本为 17,如果需要升级,修改这里,注意不可降级-->
        <!--Baseline Java Version 17, if you need to upgrade, modify here, note that it cannot be downgraded-->
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

        <!--这里放我们自定义的依赖版本属性-->
        <disruptor.version>3.4.4</disruptor.version>
    </properties>

    <dependencies>
        <!--lombok简化代码-->
        <!--lombok simplifies code-->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
        </dependency>
        <!--日志需要用log4j2-->
        <!--choose log4j2 for logging-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter</artifactId>
            <exclusions>
                <exclusion>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-starter-logging</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-log4j2</artifactId>
        </dependency>
        <!--log4j2异步日志需要的依赖,所有项目都必须用log4j2和异步日志配置-->
        <!--Dependencies required for log4j2 asynchronous logging, all projects must use log4j2 and asynchronous logging configuration-->
        <dependency>
            <groupId>com.lmax</groupId>
            <artifactId>disruptor</artifactId>
            <version>${disruptor.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
    </dependencies>
</project>

之后编写代码测试:

package com.github.hashjang.wwsmbjysymrdo;

import io.micrometer.common.KeyValue;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tag;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.observation.MeterObservationHandler;
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import lombok.extern.log4j.Log4j2;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.List;

@Log4j2
@SpringBootApplication
public class Main {
    public static void main(String[] args) {
        SpringApplication.run(Main.class);
    }

    @Component
    public static class Runner implements CommandLineRunner {
        @Autowired
        private ObservationRegistry observationRegistry;

        @Override
        public void run(String... args) throws Exception {
            //不断创建新的 Observation,然后 start
            while(true) {
                try {
                    Observation test = Observation.createNotStarted("test", observationRegistry);
                    test.start();
                } catch (Throwable r) {
                    log.error("error", r);
                    break;
                }
            }
            log.info("complete");
        }
    }
}

之后,限制 JVM 最大堆内存为 32m 之后启动,可以看到,不断创建新的 Observation,然后 start,最终导致内存溢出:

  .   ____          _            __ _ _
 /\ / ___'_ __ _ _(_)_ __  __ _    
( ( )\___ | '_ | '_| | '_ / _` |    
 \/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::                (v3.2.1)

2024-01-19T12:50:37.474+08:00  INFO 21401 --- [           main] c.g.h.w.Main                             : Starting Main using Java 17.0.8 with PID 21401 (/Users/hash/Desktop/Project/Java/spring-cloud-native/article/Problem/wwsmbjysymrdo/target/classes started by hash in /Users/hash/Desktop/Project/Java/spring-cloud-native)
2024-01-19T12:50:37.475+08:00  INFO 21401 --- [           main] c.g.h.w.Main                             : No active profile set, falling back to 1 default profile: "default"
2024-01-19T12:50:37.853+08:00  INFO 21401 --- [           main] c.g.h.w.Main                             : Started Main in 0.529 seconds (process running for 0.762)
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
WARN StatusConsoleListener org.apache.logging.log4j.spi.AbstractLogger caught java.lang.OutOfMemoryError logging ReusableSimpleMessage: error
 java.lang.OutOfMemoryError: Java heap space
2024-01-19T12:50:40.437+08:00  INFO 21401 --- [           main] c.g.h.w.Main                             : complete
2024-01-19T12:50:40.913+08:00  WARN 21401 --- [ionShutdownHook] o.s.b.f.s.DisposableBeanAdapter          : Invocation of close method failed on bean with name 'log4j2Metrics': java.lang.OutOfMemoryError: Java heap space

这里发生了内存溢出,但是仅仅因为 Observation 的创建和 start,就导致内存溢出,这是不合理的,我们分析一下,为何会出现这个问题。

为何会出现内存溢出

我们通过增加如下启动参数启动并且在退出的时候 dump JFR:

-XX:StartFlightRecording=disk=true,dumponexit=true

或者使用下面的参数在内存溢出的时候 dump 整个堆:

-XX:+HeapDumpOnOutOfMemoryError

都可以看到,内存溢出的时候,堆中有大量的 io.micrometer.core.instrument.internal.DefaultLongTaskTimer$SampleImpl 对象:
img.png

查看源码,发现在 org.springframework.boot.actuate.autoconfigure.observation.ObservationAutoConfiguration 中,会自动配置一些全局的 ObservationHandler

https://github.com/spring-projects/spring-boot/blob/main/spring-boot-project/spring-boot-actuator-autoconfigure/src/main/java/org/springframework/boot/actuate/autoconfigure/observation/ObservationAutoConfiguration.java

	@Configuration(proxyBeanMethods = false)
	@ConditionalOnClass(MeterRegistry.class)
	@ConditionalOnMissingClass("io.micrometer.tracing.Tracer")
	static class OnlyMetricsConfiguration {

		@Bean
		ObservationHandlerGrouping metricsObservationHandlerGrouping() {
			return new ObservationHandlerGrouping(MeterObservationHandler.class);
		}

	}

	@Configuration(proxyBeanMethods = false)
	@ConditionalOnClass(Tracer.class)
	@ConditionalOnMissingClass("io.micrometer.core.instrument.MeterRegistry")
	static class OnlyTracingConfiguration {

		@Bean
		ObservationHandlerGrouping tracingObservationHandlerGrouping() {
			return new ObservationHandlerGrouping(TracingObservationHandler.class);
		}

	}

	@Configuration(proxyBeanMethods = false)
	@ConditionalOnClass({ MeterRegistry.class, Tracer.class })
	static class MetricsWithTracingConfiguration {

		@Bean
		ObservationHandlerGrouping metricsAndTracingObservationHandlerGrouping() {
			return new ObservationHandlerGrouping(
					List.of(TracingObservationHandler.class, MeterObservationHandler.class));
		}

	}

	@Configuration(proxyBeanMethods = false)
	@ConditionalOnBean(MeterRegistry.class)
	@ConditionalOnMissingBean(MeterObservationHandler.class)
	static class MeterObservationHandlerConfiguration {

		@ConditionalOnMissingBean(type = "io.micrometer.tracing.Tracer")
		@Configuration(proxyBeanMethods = false)
		static class OnlyMetricsMeterObservationHandlerConfiguration {

			@Bean
			DefaultMeterObservationHandler defaultMeterObservationHandler(MeterRegistry meterRegistry) {
				return new DefaultMeterObservationHandler(meterRegistry);
			}

		}

		@ConditionalOnBean(Tracer.class)
		@Configuration(proxyBeanMethods = false)
		static class TracingAndMetricsObservationHandlerConfiguration {

			@Bean
			TracingAwareMeterObservationHandler<Observation.Context> tracingAwareMeterObservationHandler(
					MeterRegistry meterRegistry, Tracer tracer) {
				DefaultMeterObservationHandler delegate = new DefaultMeterObservationHandler(meterRegistry);
				return new TracingAwareMeterObservationHandler<>(delegate, tracer);
			}

		}

	}

	@Configuration(proxyBeanMethods = false)
	@ConditionalOnClass(Advice.class)
	static class ObservedAspectConfiguration {

		@Bean
		@ConditionalOnMissingBean
		ObservedAspect observedAspect(ObservationRegistry observationRegistry) {
			return new ObservedAspect(observationRegistry);
		}

	}

以上代码的意思是,根据你的项目中是否添加了链路追踪,或者指标监控的依赖,来初始化不同的 ObservationHandler,如果你的项目中只有指标监控,那么就会初始化 DefaultMeterObservationHandler,这个 DefaultMeterObservationHandler 会在每个 Observationstart 的时候,创建一个 io.micrometer.core.instrument.internal.DefaultLongTaskTimer$SampleImpl 对象,然后将这个 io.micrometer.core.instrument.internal.DefaultLongTaskTimer$SampleImpl 对象放入 DefaultLongTaskTimeractiveTasks 中,这是一个 ConcurrentLinkedDeque。在调用 Observationstop 的时候,会从 DefaultLongTaskTimeractiveTasks 中移除这个 io.micrometer.core.instrument.internal.DefaultLongTaskTimer$SampleImpl 对象。如果没有 stop,那么这个 ConcurrentLinkedDeque 就会越来越大,最终导致内存溢出。

为何解决内存溢出之后,还会出现 CPU 飙高

知道问题之后,我们给遗漏 stop 的地方加上了 try finally stop。但是,我们发现,即使加上了 try finally stop,也会出现 CPU 飙高的问题,我们通过 JFR 看一下,CPU 究竟消耗在哪里:

img.png

我们惊奇的发现,还是和 DefaultLongTaskTimer 有关,看来,即使我们加上了 try finally stop,但是 DefaultLongTaskTimeractiveTasks 里面还是有很多 io.micrometer.core.instrument.internal.DefaultLongTaskTimer$SampleImpl 对象,导致多线程 stop 的时候 CPU 飙高。我们做个实验:

package com.github.hashjang.wwsmbjysymrdo;

import io.micrometer.common.KeyValue;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tag;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.observation.MeterObservationHandler;
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import lombok.extern.log4j.Log4j2;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;

@Log4j2
@SpringBootApplication
public class Main {
    public static void main(String[] args) {
        SpringApplication.run(Main.class);
    }

    @Component
    public static class Runner implements CommandLineRunner {
        @Autowired
        private ObservationRegistry observationRegistry;

        @Override
        public void run(String... args) throws Exception {
            ExecutorService executor = Executors.newFixedThreadPool(100);
            //预热
            for (int j = 0; j < 100000; j++) {
                executor.execute(() -> {
                    Observation observation = Observation.createNotStarted("test", observationRegistry);
                    observation.start();
                    observation.stop();
                });
            }
            //测试
            long start = System.currentTimeMillis();
            Future<?> test[] = new Future[100];
            for (int i = 0; i < 100; i++) {
                test[i] = executor.submit(() -> {
                    for (int j = 0; j < 100000; j++) {
                        Observation observation = Observation.createNotStarted("test", observationRegistry);
                        observation.start();
                        observation.stop();
                    }
                });
            }
            for (int i = 0; i < 100; i++) {
                test[i].get();
            }
            log.info("cost {} ms", System.currentTimeMillis() - start);
        }
    }
}

在我的电脑上,这个测试最后输出显示大概需要 5300ms 左右。我们将全局的 ObservationHandler 改为什么都不做的,对比下:

package com.github.hashjang.wwsmbjysymrdo;

import io.micrometer.common.KeyValue;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tag;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.observation.MeterObservationHandler;
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import lombok.extern.log4j.Log4j2;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;

@Log4j2
@SpringBootApplication
public class Main {
    public static void main(String[] args) {
        SpringApplication.run(Main.class);
    }

    @Bean
    MeterObservationHandler<Observation.Context> defaultMeterObservationHandler(MeterRegistry meterRegistry) {
        return new MeterObservationHandler<>() {
        };
    }

    @Component
    public static class Runner implements CommandLineRunner {
        @Autowired
        private ObservationRegistry observationRegistry;

        @Override
        public void run(String... args) throws Exception {
            ExecutorService executor = Executors.newFixedThreadPool(100);
            //预热
            for (int j = 0; j < 100000; j++) {
                executor.execute(() -> {
                    Observation observation = Observation.createNotStarted("test", observationRegistry);
                    observation.start();
                    observation.stop();
                });
            }
            //测试
            long start = System.currentTimeMillis();
            Future<?> test[] = new Future[100];
            for (int i = 0; i < 100; i++) {
                test[i] = executor.submit(() -> {
                    for (int j = 0; j < 100000; j++) {
                        Observation observation = Observation.createNotStarted("test", observationRegistry);
                        observation.start();
                        observation.stop();
                    }
                });
            }
            for (int i = 0; i < 100; i++) {
                test[i].get();
            }
            log.info("cost {} ms", System.currentTimeMillis() - start);
        }
    }
}

最后大概需要 1400 ms 左右。这个差距还是很明显的。

默认的 DefaultMeterObservationHandler

我们看一下 DefaultMeterObservationHandler 的源码:

hhttps://github.com/micrometer-metrics/micrometer/blob/main/micrometer-core/src/main/java/io/micrometer/core/instrument/observation/DefaultMeterObservationHandler.java

    @Override
    public void onStart(Observation.Context context) {
        LongTaskTimer.Sample longTaskSample = LongTaskTimer.builder(context.getName() + ".active")
                .tags(createTags(context)).register(meterRegistry).start();
        context.put(LongTaskTimer.Sample.class, longTaskSample);

        Timer.Sample sample = Timer.start(meterRegistry);
        context.put(Timer.Sample.class, sample);
    }

    @Override
    public void onStop(Observation.Context context) {
        Timer.Sample sample = context.getRequired(Timer.Sample.class);
        sample.stop(Timer.builder(context.getName()).tags(createErrorTags(context)).tags(createTags(context))
                .register(this.meterRegistry));

        LongTaskTimer.Sample longTaskSample = context.getRequired(LongTaskTimer.Sample.class);
        longTaskSample.stop();
    }

可以看出,默认情况下,DefaultMeterObservationHandler 会在 Observationstart 的时候,创建一个 LongTaskTimer.Sample 对象,然后将这个 LongTaskTimer.Sample 对象放入 LongTaskTimeractiveTasks 中,然后还有一个 Timer.Sample 对象,这个 Timer.Sample 对象是用于记录 Observation 的耗时的。

其中,LongTaskTimer.Sample 对象会引起如果 Observation 只 start 没有 stop 的话,会导致内存溢出,而 Timer.Sample 对象没有这个问题。并且,LongTaskTimer.Sample 对象的在 stop 的时候,因为多线程的原因可能 ConcurrentLinkedQueue 很大导致 CPU 飙高。

解决方案

我们可以替换掉 DefaultMeterObservationHandler,自己实现一个 MeterObservationHandler,在 start 的时候,不创建 LongTaskTimer.Sample 对象,只保留 Timer.Sample 对象,这样就不会出现内存溢出以及 CPU 飙高的问题了。

package com.github.hashjang.wwsmbjysymrdo;

import io.micrometer.common.KeyValue;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tag;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.observation.MeterObservationHandler;
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import lombok.extern.log4j.Log4j2;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

@Log4j2
@SpringBootApplication
public class Main {
    public static void main(String[] args) {
        SpringApplication.run(Main.class);
    }

    @Bean
    MeterObservationHandler<Observation.Context> defaultMeterObservationHandler(MeterRegistry meterRegistry) {
        return new MeterObservationHandler<>() {
            @Override
            public void onStart(Observation.Context context) {
                Timer.Sample sample = Timer.start(meterRegistry);
                context.put(Timer.Sample.class, sample);
            }

            @Override
            public void onStop(Observation.Context context) {
                List<Tag> tags = createTags(context);
                tags.add(Tag.of("error", getErrorValue(context)));
                Timer.Sample sample = context.getRequired(Timer.Sample.class);
                sample.stop(Timer.builder(context.getName()).tags(tags).register(meterRegistry));
            }

            @Override
            public void onEvent(Observation.Event event, Observation.Context context) {
                Counter.builder(context.getName() + "." + event.getName())
                        .tags(createTags(context))
                        .register(meterRegistry)
                        .increment();
            }

            private String getErrorValue(Observation.Context context) {
                Throwable error = context.getError();
                return error != null ? error.getClass().getSimpleName() : "none";
            }

            private List<Tag> createTags(Observation.Context context) {
                List<Tag> tags = new ArrayList<>();
                for (KeyValue keyValue : context.getLowCardinalityKeyValues()) {
                    tags.add(Tag.of(keyValue.getKey(), keyValue.getValue()));
                }
                return tags;
            }
        };
    }

    @Component
    public static class Runner implements CommandLineRunner {
        @Autowired
        private ObservationRegistry observationRegistry;

        @Override
        public void run(String... args) throws Exception {
            ExecutorService executor = Executors.newFixedThreadPool(100);
            //预热
            for (int j = 0; j < 100000; j++) {
                executor.execute(() -> {
                    Observation observation = Observation.createNotStarted("test", observationRegistry);
                    observation.start();
                    observation.stop();
                });
            }
            //测试
            long start = System.currentTimeMillis();
            Future<?> test[] = new Future[100];
            for (int i = 0; i < 100; i++) {
                test[i] = executor.submit(() -> {
                    for (int j = 0; j < 100000; j++) {
                        Observation observation = Observation.createNotStarted("test", observationRegistry);
                        observation.start();
                        observation.stop();
                    }
                });
            }
            for (int i = 0; i < 100; i++) {
                test[i].get();
            }
            log.info("cost {} ms", System.currentTimeMillis() - start);
        }
    }
}

并且,针对这个问题,我们已经提交了 Issue,希望能够尽快采纳建议到 Micrometer 的主干分支中:Remove LongtaskTimer.Sample in DefaultMeterObservationHandler (for the purpose of prevent memory leak and lower CPU consuming)