Periodically remove inactive connection pool metrics #6024
Conversation
Motivation:

We observed that threads were blocked when multiple connections were closed simultaneously and the endpoint had a small number of event loops. https://github.com/line/armeria/blob/fa76e99fa6132545df3a8d05eeb81c5681ec8953/core/src/main/java/com/linecorp/armeria/client/ConnectionPoolMetrics.java#L79-L85

We have no exact evidence, but I guess Micrometer's `remove()` operation may take a long time. The other logic is a simple HashMap operation, so it does not block for long.

Modifications:

- Add a dedicated GC thread that removes inactive meters whose active connection count is 0.
- Add jitter so that the GC tasks of different instances do not execute at the same time.
- Unused meters are removed every hour + jitter.

Result:

- Fix the bug where an `EventLoop` is blocked for a long time by `ConnectionPoolListener.metricCollecting()` when a connection is closed.
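To make the description concrete, here is a minimal sketch of the scheduling it describes: one shared daemon thread that sweeps each `ConnectionPoolMetrics` instance every hour plus a random jitter. The class shape, field names, thread name, and the one-minute jitter bound are assumptions for illustration, not the exact code in this PR:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

final class CleanupSchedulingSketch {

    // One shared daemon thread is enough: cleanup runs only about once an hour
    // per ConnectionPoolMetrics instance.
    private static final ScheduledExecutorService CLEANUP_EXECUTOR =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                final Thread thread = new Thread(runnable, "connection-pool-metrics-cleanup");
                thread.setDaemon(true);
                return thread;
            });

    private final ScheduledFuture<?> cleanupJob;

    CleanupSchedulingSketch() {
        // A random jitter (hypothetical bound: up to one minute) spreads out the
        // sweeps of different instances so they do not all run at the same time.
        final long jitterMillis = ThreadLocalRandom.current().nextLong(TimeUnit.MINUTES.toMillis(1));
        final long periodMillis = TimeUnit.HOURS.toMillis(1) + jitterMillis;
        cleanupJob = CLEANUP_EXECUTOR.scheduleAtFixedRate(
                this::cleanupInactiveMeters, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    void cleanupInactiveMeters() {
        // Remove meters whose active connection count is 0; see the review thread below.
    }
}
```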
Looks good all in all. 👍
```diff
-final class ConnectionPoolMetrics {
+final class ConnectionPoolMetrics implements SafeCloseable {
+
+    private static final ScheduledExecutorService CLEANUP_EXECUTOR =
```
Can we use the blocking task executor, similar to what you did in `k8sEndpointGroup`?
I prefer a dedicated thread. We don't know how long the `remove()` operation takes, and as the number of `ConnectionPoolMetrics` instances increases, the number of jobs will also increase. An isolated environment may be better than a blocking task executor that is also used for handling requests.
That is true. But I also don't want to waste resources creating a thread that is used once an hour. 🤔
It makes sense. I am convinced.
> We don't know how long the `remove()` operation takes, and as the number of `ConnectionPoolMetrics` instances increases,

Yeah, but I believe it's only really a problem when multiple threads try to remove each meter from the meter registry. Now it's done by just one thread, so I think there won't be much contention.
> It makes sense. I am convinced.

Thank you for your understanding. 🙏
If you want, I can push a commit for you. Or you can do it yourself after you get back from your day off. I think this is not urgent, so we don't have to hurry. Please get some rest on your day off. 😆
If we use a blocking task executor, multiple tasks may be scheduled on different threads depending on the number of `ConnectionPoolMetrics` instances, although the chance of contention would be extremely low.
> If you want, I can push a commit for you.

Please push the commit. I think it would be a trivial change.
> If we use a blocking task executor, multiple tasks may be scheduled on different threads depending on the number of `ConnectionPoolMetrics` instances, although the chance of contention would be extremely low.

That is true. It can be a problem when a lot of client factories are used. But as you mentioned, the chance of contention would be extremely low. 😉
> Please push the commit. I think it would be a trivial change.

I've actually left another suggestion that might bring a breaking change. 🤣
#6024 (comment)
I don't want to add additional breaking changes in this PR, as we are going to release a patch version with it.
```java
            }
        }
    } finally {
        lock.unlock();
    }

    for (Meters meters : unusedMetersList) {
        meters.remove(meterRegistry);
```
I believe we should do this in the `lock` block; otherwise a newly added meter might be removed. Because `cleanupInactiveMeters` is accessed by only one thread, I think we can move this logic into the lock block above.
I understood what you mentioned. However, I don't think moving `meters.remove(meterRegistry)` into the lock block is good, because the lock could block event loops when `increaseConn{Opened,Closed}()` and `cleanupInactiveMeters()` are invoked together.

Would it be better to asynchronously perform `increaseConnOpened()` and `increaseConnClosed()` in a blocking executor and revert this PR?
> the lock could block event loops when `increaseConn{Opened,Closed}()` and `cleanupInactiveMeters()` are invoked together.

I thought it's okay, since `cleanupInactiveMeters` is only invoked by the executor.
If you worry about that situation, how about adding an additional tag then? e.g. `creation.index`:
`connections{protocol="...", remote.ip="...", local.ip="...", creation.index=x}`
If we use the tag, we can distinguish a new meter from the previous one. Also, we can then replace the lock and the HashMap with a ConcurrentHashMap.
Sorry, it may not work. Let me do a brief PoC to see whether it works or not.
I realized that we still need the lock.

```java
void cleanupInactiveMeters() {
    final List<Meters> unusedMetersList = new ArrayList<>();
    lock.lock();
    try {
        for (final Iterator<Entry<List<Tag>, Meters>> it = metersMap.entrySet().iterator();
             it.hasNext();) {
            final Entry<List<Tag>, Meters> entry = it.next();
            final Meters meters = entry.getValue();
            if (meters.activeConnections() == 0) {
                // Remove via the iterator to avoid a ConcurrentModificationException.
                it.remove();
                unusedMetersList.add(meters);
            }
        }
    } finally {
        lock.unlock();
    }
    unusedMetersList.forEach(meters -> meters.remove(meterRegistry));
}

private static final class Meters {

    private static final AtomicLong COUNTER = new AtomicLong();

    private final Counter opened;
    private final Counter closed;
    private final Gauge active;
    private int activeConnections;

    Meters(MeterIdPrefix idPrefix, List<Tag> commonTags, MeterRegistry registry) {
        // A unique index distinguishes newly created meters from previously removed ones.
        final String index = String.valueOf(COUNTER.incrementAndGet());
        opened = Counter.builder(idPrefix.name("connections"))
                        .tags(commonTags)
                        .tag(STATE, "opened")
                        .tag("creation.index", index)
                        .register(registry);
        closed = Counter.builder(idPrefix.name("connections"))
                        .tags(commonTags)
                        .tag(STATE, "closed")
                        .tag("creation.index", index)
                        .register(registry);
        active = Gauge.builder(idPrefix.name("active.connections"), this, Meters::activeConnections)
                      .tags(commonTags)
                      .tag("creation.index", index)
                      .register(registry);
    }
```
I agree with you.
> Would it be better to asynchronously perform `increaseConnOpened()` and `increaseConnClosed()` in a blocking executor and revert this PR?

I prefer this approach, where a single dedicated thread is responsible for incrementing/decrementing/cleaning up metrics asynchronously.

The currently proposed approach increases the number of tags per endpoint. While this may be fine for the server recording metrics, it may not bode well for backends.
i.e. If a connection is closed every minute, that would mean 1440 time series are created each day per endpoint.

I'm not exactly sure how long Prometheus retains time series, but it seems like 15 days is the expiry for samples.
ref: https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects
Another SO answer seems to indicate 2 hours is the expiry.

In any case, our internal monitoring system also stores all time series (with a long expiry, from my memory), which makes me concerned whether this approach is a good idea.
> i.e. If a connection is closed every minute, that would mean 1440 time series are created each day per endpoint.
> I'm not exactly sure how long Prometheus retains time series, but it seems like 15 days is the expiry for samples.

Because the executor clears the metrics once an hour, 24 time series are created per day at worst. Let me investigate whether that's acceptable or not.
I believe we have three potential approaches:

- Using a blocking task executor
- Implementing a garbage collection (GC)-like mechanism with a counter
- Implementing a GC-like mechanism with a striped lock

Using a blocking task executor might be the simplest solution. However, delegating tasks to the blocking executor solely to increment a metric doesn't seem ideal from a performance standpoint.
The second option, as mentioned by @jrhee17, has its drawbacks. While increasing the interval might mitigate the issue, it's uncertain whether that would provide a robust solution.
The third option appears to be the most promising. With this approach, when a meter is removed from the meter registry, only the threads accessing the same stripe of the lock would be impacted.

The third option is not a good idea, because there would still be event loops waiting for the lock.
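For illustration, the counter-based option (the second one above) could look roughly like the sketch below. The threshold, field names, and executor are all made up for the sketch; it only shows the idea of deferring the expensive `MeterRegistry` removals to a single background thread once enough meters have become inactive:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class CounterGcSketch {

    // Hypothetical threshold: sweep once this many meters have become inactive.
    private static final int GC_THRESHOLD = 64;

    // Single shared daemon thread so that expensive removals never run on an event loop.
    private static final ScheduledExecutorService CLEANUP_EXECUTOR =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                final Thread thread = new Thread(runnable, "metrics-gc");
                thread.setDaemon(true);
                return thread;
            });

    private final AtomicInteger inactiveMeters = new AtomicInteger();
    private final AtomicBoolean garbageCollecting = new AtomicBoolean();

    // Called when a connection closes and its meter reaches zero active connections.
    void onMeterInactive() {
        if (inactiveMeters.incrementAndGet() >= GC_THRESHOLD &&
            garbageCollecting.compareAndSet(false, true)) {
            CLEANUP_EXECUTOR.execute(this::cleanupInactiveMeters);
        }
    }

    private void cleanupInactiveMeters() {
        try {
            // ... collect meters with activeConnections == 0 and remove them
            // from the MeterRegistry here ...
            inactiveMeters.set(0);
        } finally {
            garbageCollecting.set(false);
        }
    }
}
```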
Refactored the code to create `Meters` without the `lock`. PTAL.

How about also changing …
core/src/main/java/com/linecorp/armeria/client/ClientFactoryBuilder.java
The port information seems useful. Would you also want to include weight?
```java
            }
        }
    } finally {
        lock.unlock();
    }

    for (Meters meters : unusedMetersList) {
        meters.remove(meterRegistry);
```
> Would it be better to asynchronously perform `increaseConnOpened()` and `increaseConnClosed()` in a blocking executor and revert this PR?
Oops, what I meant was using …
This approach looks nice. Thanks!
```java
        }
    } finally {
```
Small suggestion:
```diff
             }
+            if (unusedMetersList.isEmpty()) {
+                garbageCollecting = false;
+                return;
+            }
         } finally {
```
Motivation:

We observed that threads were blocked when multiple connections were closed simultaneously and the endpoint had a small number of event loops. https://github.com/line/armeria/blob/fa76e99fa6132545df3a8d05eeb81c5681ec8953/core/src/main/java/com/linecorp/armeria/client/ConnectionPoolMetrics.java#L79-L85

We have no exact evidence, but I guess Micrometer's `remove()` operation may take a long time. The other logic is a simple HashMap operation that does not block for a long time.

Modifications:

- Add a dedicated GC thread to remove inactive meters whose active connection count is 0.
- Add jitter to prevent GC tasks from executing simultaneously.
- Unused meters are removed every hour + jitter.
- `ConnectionPoolListener` now implements `SafeCloseable`, so users should close it when it is unused.

Result:

- Fix the bug where an `EventLoop` is blocked for a long time by `ConnectionPoolListener.metricCollecting()` when a connection is closed.
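Based on the updated description, usage would look roughly like the sketch below. The `close()` call on the listener is an assumption drawn from the `SafeCloseable` note above; the released API may differ:

```java
import com.linecorp.armeria.client.ClientFactory;
import com.linecorp.armeria.client.ConnectionPoolListener;

import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

final class ConnectionPoolMetricsUsage {
    public static void main(String[] args) {
        final SimpleMeterRegistry registry = new SimpleMeterRegistry();
        final ConnectionPoolListener listener = ConnectionPoolListener.metricCollecting(registry);
        final ClientFactory factory = ClientFactory.builder()
                                                   .connectionPoolListener(listener)
                                                   .build();
        try {
            // ... issue requests with clients built from `factory` ...
        } finally {
            factory.close();
            listener.close(); // stops the periodic meter cleanup task (per this PR)
        }
    }
}
```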