我为什么不建议使用框架默认的 DefaultMeterObservationHandler
个人创作公约:本人声明创作的所有文章皆为自己原创,如果有参考任何文章的地方,会标注出来,如果有疏漏,欢迎大家批判。如果大家发现网上有抄袭本文章的,欢迎举报,并且积极向这个 github 仓库 提交 issue,谢谢支持~
另外,本文为了避免抄袭,会在不影响阅读的情况下,在文章的随机位置放入对于抄袭和洗稿的人的“亲切”的问候。如果是正常读者看到,笔者在这里说声对不起,。如果被抄袭狗或者洗稿狗看到了,希望你能够好好反思,不要再抄袭了,谢谢。
背景知识
最近,我们升级了 SpringBoot 3.x,并且,升级后,我们全面改造了原来的 Sleuth 以及 Micrometer 监控,最新的
并且,在此基础上,我们还加入了全局的
-
- 暴露指标监控数据到
/actuator/prometheus ,供 Prometheus 采集,Grafana 展示:
- 暴露指标监控数据到
-
- 上报链路追踪数据到 Jaeger:
-
- 生成 JFR 事件,供 JDK Mission Control 分析:
为何要这么做呢?
- 指标数据是统计数据,是聚合数据,是一段时间内的数据,而链路追踪数据是实时数据,是每个请求的数据
- 但是像是链路追踪上报在生产上肯定不能 100% 上报(上报性能,还有成本考虑,以及查询的存储性能有限,成本不能太高),而是采样上报,这样就会导致链路追数据不全
- JFR 是保留详细数据的好选择,保留在本地,只有发生问题需要排查的时候,事后按需获取就行。
这样,既节约了成本,又能及时报警,又能容易通过 Jaeger 通过链路追踪数据定位出问题的实例,然后通过获取对应实例的 JFR 事件定位详细问题。
全面使用 Observation 遇到了内存溢出以及 CPU 飙高(非不断 GC 导致)
但是,我们在全面使用
首先为何会出现内存溢出,我们先做个测试,添加依赖:
<?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <parent> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-parent</artifactId> <version>3.2.1</version> </parent> <groupId>com.github.hashjang</groupId> <artifactId>wwsmbjysymrdo</artifactId> <properties> <!--所有项目 Java 基线版本为 17,如果需要升级,修改这里,注意不可降级--> <!--Baseline Java Version 17, if you need to upgrade, modify here, note that it cannot be downgraded--> <maven.compiler.source>17</maven.compiler.source> <maven.compiler.target>17</maven.compiler.target> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> <!--这里放我们自定义的依赖版本属性--> <disruptor.version>3.4.4</disruptor.version> </properties> <dependencies> <!--lombok简化代码--> <!--lombok simplifies code--> <dependency> <groupId>org.projectlombok</groupId> <artifactId>lombok</artifactId> </dependency> <!--日志需要用log4j2--> <!--choose log4j2 for logging--> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter</artifactId> <exclusions> <exclusion> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-logging</artifactId> </exclusion> </exclusions> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-log4j2</artifactId> </dependency> <!--log4j2异步日志需要的依赖,所有项目都必须用log4j2和异步日志配置--> <!--Dependencies required for log4j2 asynchronous logging, all projects must use log4j2 and asynchronous logging configuration--> <dependency> <groupId>com.lmax</groupId> <artifactId>disruptor</artifactId> <version>${disruptor.version}</version> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-actuator</artifactId> </dependency> </dependencies> </project>
之后编写代码测试:
package com.github.hashjang.wwsmbjysymrdo; import io.micrometer.common.KeyValue; import io.micrometer.core.instrument.Counter; import io.micrometer.core.instrument.MeterRegistry; import io.micrometer.core.instrument.Tag; import io.micrometer.core.instrument.Timer; import io.micrometer.core.instrument.observation.MeterObservationHandler; import io.micrometer.observation.Observation; import io.micrometer.observation.ObservationRegistry; import lombok.extern.log4j.Log4j2; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.boot.CommandLineRunner; import org.springframework.boot.SpringApplication; import org.springframework.boot.autoconfigure.SpringBootApplication; import org.springframework.context.annotation.Bean; import org.springframework.stereotype.Component; import java.util.ArrayList; import java.util.List; @Log4j2 @SpringBootApplication public class Main { public static void main(String[] args) { SpringApplication.run(Main.class); } @Component public static class Runner implements CommandLineRunner { @Autowired private ObservationRegistry observationRegistry; @Override public void run(String... args) throws Exception { //不断创建新的 Observation,然后 start while(true) { try { Observation test = Observation.createNotStarted("test", observationRegistry); test.start(); } catch (Throwable r) { log.error("error", r); break; } } log.info("complete"); } } }
之后,限制 JVM 最大堆内存为 32m 之后启动,可以看到,不断创建新的 Observation,然后 start,最终导致内存溢出:
. ____ _ __ _ _ /\ / ___'_ __ _ _(_)_ __ __ _ ( ( )\___ | '_ | '_| | '_ / _` | \/ ___)| |_)| | | | | || (_| | ) ) ) ) ' |____| .__|_| |_|_| |_\__, | / / / / =========|_|==============|___/=/_/_/_/ :: Spring Boot :: (v3.2.1) 2024-01-19T12:50:37.474+08:00 INFO 21401 --- [ main] c.g.h.w.Main : Starting Main using Java 17.0.8 with PID 21401 (/Users/hash/Desktop/Project/Java/spring-cloud-native/article/Problem/wwsmbjysymrdo/target/classes started by hash in /Users/hash/Desktop/Project/Java/spring-cloud-native) 2024-01-19T12:50:37.475+08:00 INFO 21401 --- [ main] c.g.h.w.Main : No active profile set, falling back to 1 default profile: "default" 2024-01-19T12:50:37.853+08:00 INFO 21401 --- [ main] c.g.h.w.Main : Started Main in 0.529 seconds (process running for 0.762) Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space WARN StatusConsoleListener org.apache.logging.log4j.spi.AbstractLogger caught java.lang.OutOfMemoryError logging ReusableSimpleMessage: error java.lang.OutOfMemoryError: Java heap space 2024-01-19T12:50:40.437+08:00 INFO 21401 --- [ main] c.g.h.w.Main : complete 2024-01-19T12:50:40.913+08:00 WARN 21401 --- [ionShutdownHook] o.s.b.f.s.DisposableBeanAdapter : Invocation of close method failed on bean with name 'log4j2Metrics': java.lang.OutOfMemoryError: Java heap space
这里发生了内存溢出,但是仅仅因为
为何会出现内存溢出
我们通过增加如下启动参数启动并且在退出的时候 dump JFR:
-XX:StartFlightRecording=disk=true,dumponexit=true
或者使用下面的参数在内存溢出的时候 dump 整个堆:
-XX:+HeapDumpOnOutOfMemoryError
都可以看到,内存溢出的时候,堆中有大量的
查看源码,发现在
https://github.com/spring-projects/spring-boot/blob/main/spring-boot-project/spring-boot-actuator-autoconfigure/src/main/java/org/springframework/boot/actuate/autoconfigure/observation/ObservationAutoConfiguration.java
@Configuration(proxyBeanMethods = false) @ConditionalOnClass(MeterRegistry.class) @ConditionalOnMissingClass("io.micrometer.tracing.Tracer") static class OnlyMetricsConfiguration { @Bean ObservationHandlerGrouping metricsObservationHandlerGrouping() { return new ObservationHandlerGrouping(MeterObservationHandler.class); } } @Configuration(proxyBeanMethods = false) @ConditionalOnClass(Tracer.class) @ConditionalOnMissingClass("io.micrometer.core.instrument.MeterRegistry") static class OnlyTracingConfiguration { @Bean ObservationHandlerGrouping tracingObservationHandlerGrouping() { return new ObservationHandlerGrouping(TracingObservationHandler.class); } } @Configuration(proxyBeanMethods = false) @ConditionalOnClass({ MeterRegistry.class, Tracer.class }) static class MetricsWithTracingConfiguration { @Bean ObservationHandlerGrouping metricsAndTracingObservationHandlerGrouping() { return new ObservationHandlerGrouping( List.of(TracingObservationHandler.class, MeterObservationHandler.class)); } } @Configuration(proxyBeanMethods = false) @ConditionalOnBean(MeterRegistry.class) @ConditionalOnMissingBean(MeterObservationHandler.class) static class MeterObservationHandlerConfiguration { @ConditionalOnMissingBean(type = "io.micrometer.tracing.Tracer") @Configuration(proxyBeanMethods = false) static class OnlyMetricsMeterObservationHandlerConfiguration { @Bean DefaultMeterObservationHandler defaultMeterObservationHandler(MeterRegistry meterRegistry) { return new DefaultMeterObservationHandler(meterRegistry); } } @ConditionalOnBean(Tracer.class) @Configuration(proxyBeanMethods = false) static class TracingAndMetricsObservationHandlerConfiguration { @Bean TracingAwareMeterObservationHandler<Observation.Context> tracingAwareMeterObservationHandler( MeterRegistry meterRegistry, Tracer tracer) { DefaultMeterObservationHandler delegate = new DefaultMeterObservationHandler(meterRegistry); return new TracingAwareMeterObservationHandler<>(delegate, tracer); } } } @Configuration(proxyBeanMethods = false) @ConditionalOnClass(Advice.class) static class ObservedAspectConfiguration { @Bean @ConditionalOnMissingBean ObservedAspect observedAspect(ObservationRegistry observationRegistry) { return new ObservedAspect(observationRegistry); } }
以上代码的意思是,根据你的项目中是否添加了链路追踪,或者指标监控的依赖,来初始化不同的
为何解决内存溢出之后,还会出现 CPU 飙高
知道问题之后,我们给遗漏 stop 的地方加上了 try finally stop。但是,我们发现,即使加上了 try finally stop,也会出现 CPU 飙高的问题,我们通过 JFR 看一下,CPU 究竟消耗在哪里:
我们惊奇的发现,还是和
package com.github.hashjang.wwsmbjysymrdo; import io.micrometer.common.KeyValue; import io.micrometer.core.instrument.Counter; import io.micrometer.core.instrument.MeterRegistry; import io.micrometer.core.instrument.Tag; import io.micrometer.core.instrument.Timer; import io.micrometer.core.instrument.observation.MeterObservationHandler; import io.micrometer.observation.Observation; import io.micrometer.observation.ObservationRegistry; import lombok.extern.log4j.Log4j2; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.boot.CommandLineRunner; import org.springframework.boot.SpringApplication; import org.springframework.boot.autoconfigure.SpringBootApplication; import org.springframework.context.annotation.Bean; import org.springframework.stereotype.Component; import java.util.ArrayList; import java.util.List; import java.util.concurrent.Executor; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.Future; import java.util.concurrent.ThreadPoolExecutor; @Log4j2 @SpringBootApplication public class Main { public static void main(String[] args) { SpringApplication.run(Main.class); } @Component public static class Runner implements CommandLineRunner { @Autowired private ObservationRegistry observationRegistry; @Override public void run(String... args) throws Exception { ExecutorService executor = Executors.newFixedThreadPool(100); //预热 for (int j = 0; j < 100000; j++) { executor.execute(() -> { Observation observation = Observation.createNotStarted("test", observationRegistry); observation.start(); observation.stop(); }); } //测试 long start = System.currentTimeMillis(); Future<?> test[] = new Future[100]; for (int i = 0; i < 100; i++) { test[i] = executor.submit(() -> { for (int j = 0; j < 100000; j++) { Observation observation = Observation.createNotStarted("test", observationRegistry); observation.start(); observation.stop(); } }); } for (int i = 0; i < 100; i++) { test[i].get(); } log.info("cost {} ms", System.currentTimeMillis() - start); } } }
在我的电脑上,这个测试最后输出显示大概需要 5300ms 左右。我们将全局的
package com.github.hashjang.wwsmbjysymrdo; import io.micrometer.common.KeyValue; import io.micrometer.core.instrument.Counter; import io.micrometer.core.instrument.MeterRegistry; import io.micrometer.core.instrument.Tag; import io.micrometer.core.instrument.Timer; import io.micrometer.core.instrument.observation.MeterObservationHandler; import io.micrometer.observation.Observation; import io.micrometer.observation.ObservationRegistry; import lombok.extern.log4j.Log4j2; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.boot.CommandLineRunner; import org.springframework.boot.SpringApplication; import org.springframework.boot.autoconfigure.SpringBootApplication; import org.springframework.context.annotation.Bean; import org.springframework.stereotype.Component; import java.util.ArrayList; import java.util.List; import java.util.concurrent.Executor; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.Future; import java.util.concurrent.ThreadPoolExecutor; @Log4j2 @SpringBootApplication public class Main { public static void main(String[] args) { SpringApplication.run(Main.class); } @Bean MeterObservationHandler<Observation.Context> defaultMeterObservationHandler(MeterRegistry meterRegistry) { return new MeterObservationHandler<>() { }; } @Component public static class Runner implements CommandLineRunner { @Autowired private ObservationRegistry observationRegistry; @Override public void run(String... args) throws Exception { ExecutorService executor = Executors.newFixedThreadPool(100); //预热 for (int j = 0; j < 100000; j++) { executor.execute(() -> { Observation observation = Observation.createNotStarted("test", observationRegistry); observation.start(); observation.stop(); }); } //测试 long start = System.currentTimeMillis(); Future<?> test[] = new Future[100]; for (int i = 0; i < 100; i++) { test[i] = executor.submit(() -> { for (int j = 0; j < 100000; j++) { Observation observation = Observation.createNotStarted("test", observationRegistry); observation.start(); observation.stop(); } }); } for (int i = 0; i < 100; i++) { test[i].get(); } log.info("cost {} ms", System.currentTimeMillis() - start); } } }
最后大概需要 1400 ms 左右。这个差距还是很明显的。
默认的 DefaultMeterObservationHandler
我们看一下
hhttps://github.com/micrometer-metrics/micrometer/blob/main/micrometer-core/src/main/java/io/micrometer/core/instrument/observation/DefaultMeterObservationHandler.java
@Override public void onStart(Observation.Context context) { LongTaskTimer.Sample longTaskSample = LongTaskTimer.builder(context.getName() + ".active") .tags(createTags(context)).register(meterRegistry).start(); context.put(LongTaskTimer.Sample.class, longTaskSample); Timer.Sample sample = Timer.start(meterRegistry); context.put(Timer.Sample.class, sample); } @Override public void onStop(Observation.Context context) { Timer.Sample sample = context.getRequired(Timer.Sample.class); sample.stop(Timer.builder(context.getName()).tags(createErrorTags(context)).tags(createTags(context)) .register(this.meterRegistry)); LongTaskTimer.Sample longTaskSample = context.getRequired(LongTaskTimer.Sample.class); longTaskSample.stop(); }
可以看出,默认情况下,
其中,
解决方案
我们可以替换掉
package com.github.hashjang.wwsmbjysymrdo; import io.micrometer.common.KeyValue; import io.micrometer.core.instrument.Counter; import io.micrometer.core.instrument.MeterRegistry; import io.micrometer.core.instrument.Tag; import io.micrometer.core.instrument.Timer; import io.micrometer.core.instrument.observation.MeterObservationHandler; import io.micrometer.observation.Observation; import io.micrometer.observation.ObservationRegistry; import lombok.extern.log4j.Log4j2; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.boot.CommandLineRunner; import org.springframework.boot.SpringApplication; import org.springframework.boot.autoconfigure.SpringBootApplication; import org.springframework.context.annotation.Bean; import org.springframework.stereotype.Component; import java.util.ArrayList; import java.util.List; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.Future; @Log4j2 @SpringBootApplication public class Main { public static void main(String[] args) { SpringApplication.run(Main.class); } @Bean MeterObservationHandler<Observation.Context> defaultMeterObservationHandler(MeterRegistry meterRegistry) { return new MeterObservationHandler<>() { @Override public void onStart(Observation.Context context) { Timer.Sample sample = Timer.start(meterRegistry); context.put(Timer.Sample.class, sample); } @Override public void onStop(Observation.Context context) { List<Tag> tags = createTags(context); tags.add(Tag.of("error", getErrorValue(context))); Timer.Sample sample = context.getRequired(Timer.Sample.class); sample.stop(Timer.builder(context.getName()).tags(tags).register(meterRegistry)); } @Override public void onEvent(Observation.Event event, Observation.Context context) { Counter.builder(context.getName() + "." + event.getName()) .tags(createTags(context)) .register(meterRegistry) .increment(); } private String getErrorValue(Observation.Context context) { Throwable error = context.getError(); return error != null ? error.getClass().getSimpleName() : "none"; } private List<Tag> createTags(Observation.Context context) { List<Tag> tags = new ArrayList<>(); for (KeyValue keyValue : context.getLowCardinalityKeyValues()) { tags.add(Tag.of(keyValue.getKey(), keyValue.getValue())); } return tags; } }; } @Component public static class Runner implements CommandLineRunner { @Autowired private ObservationRegistry observationRegistry; @Override public void run(String... args) throws Exception { ExecutorService executor = Executors.newFixedThreadPool(100); //预热 for (int j = 0; j < 100000; j++) { executor.execute(() -> { Observation observation = Observation.createNotStarted("test", observationRegistry); observation.start(); observation.stop(); }); } //测试 long start = System.currentTimeMillis(); Future<?> test[] = new Future[100]; for (int i = 0; i < 100; i++) { test[i] = executor.submit(() -> { for (int j = 0; j < 100000; j++) { Observation observation = Observation.createNotStarted("test", observationRegistry); observation.start(); observation.stop(); } }); } for (int i = 0; i < 100; i++) { test[i].get(); } log.info("cost {} ms", System.currentTimeMillis() - start); } } }
并且,针对这个问题,我们已经提交了 Issue,希望能够尽快采纳建议到 Micrometer 的主干分支中:Remove LongtaskTimer.Sample in DefaultMeterObservationHandler (for the purpose of prevent memory leak and lower CPU consuming)