
Very long NodeJS event loop lag #543

Open
shappir opened this issue Mar 1, 2023 · 30 comments

@shappir
Contributor

shappir commented Mar 1, 2023

We are using prom-client with NestJS (together with @willsoto/nestjs-prometheus). We have several custom metrics - in particular histograms. As a result, the overall metrics size grows up to 8MB. This results in the NodeJS event loop lag getting to be as large as 1 second!

What we see:

  • Reducing the amount of data collected, e.g. by removing some metrics, improves the situation (however we would like to keep the metrics we have)
  • Explicitly invoking register.resetMetrics() causes the lag to go down almost to 0. It then starts increasing again
  • CPU usage is constant at around 15%
  • CPU profile shows a long time spent in Node-internal network functions, such as getPeerCertificate and destroySSL

Has anyone encountered this sort of behavior? Any suggestions?
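
For readers trying to reproduce this, a minimal standalone sketch along these lines can show how metrics serialization time grows with histogram cardinality. The metric name, labels, and counts below are made up for illustration and are not taken from the reporter's setup; it assumes a recent prom-client where register.metrics() returns a Promise.

```js
// Hypothetical repro sketch — metric name, labels, and counts are illustrative.
const client = require('prom-client');

const histogram = new client.Histogram({
  name: 'demo_request_duration_seconds',
  help: 'Demo request duration',
  labelNames: ['route', 'method', 'status'],
});

// Populate many label combinations to simulate high cardinality.
for (let i = 0; i < 5000; i++) {
  histogram.observe(
    {
      route: `/route/${i % 500}`,
      method: i % 2 ? 'GET' : 'POST',
      status: String(200 + (i % 5)),
    },
    Math.random()
  );
}

(async () => {
  const start = process.hrtime.bigint();
  const output = await client.register.metrics(); // what a /metrics endpoint calls
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`serialized ${(output.length / 1024).toFixed(1)} KiB in ${ms.toFixed(1)} ms`);
})();
```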

@ngavalas
Contributor

ngavalas commented Mar 8, 2023

I've noticed this as well. There are long, blocking calls when converting the metrics to the string that gets output to the client, though my CPU traces squarely place the blame on the userland code and not the Node-internal network stack. Can you share your trace?

I saw you landed a PR that improved this a bit (thanks); I'm benchmarking another fix on top of yours to see if it helps. Would love an update if you figure this out.

@zbjornson
Collaborator

Is something inflating the number of metrics, e.g. using request paths as label values? What types of metrics are you using?

14.2.0 has #540 that might help if you're using summaries.

#480 is another known perf issue with summaries, and I started a fix in #486.

@ngavalas
Contributor

ngavalas commented Mar 8, 2023

@zbjornson From my traces, the other major offender when we have a slow call is rendering histograms. In particular, this function seems quite slow: https://github.com/siimon/prom-client/blob/master/lib/histogram.js#L210-L217

I believe we could improve the performance of this by internally having a concept of "shared labels" across an entire set of buckets so that we don't need to copy all the labels N times. In particular, histogram.get() would return this set of shared labels and we'd merge each entry's specific labels in on top of that.

For histograms, this means we only build { le: upperBound } per bucket and everything else is shared. I would need to go poke around in the code a bit and see what else this would affect, but it seems like a relatively straightforward win.
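
Roughly, the proposed shape could look something like this. This is illustrative only; it is not prom-client's actual internal structure, and the function name is made up:

```js
// Illustrative only — not prom-client's real internals.
// Today each bucket entry carries a full copy of its labels, e.g.
//   { labels: { le: 0.1, route: '/a', method: 'GET' }, value: 3 }
//   { labels: { le: 0.5, route: '/a', method: 'GET' }, value: 7 }
// With shared labels, the common part is stored once per series:
function bucketsWithSharedLabels(sharedLabels, bucketValues) {
  // bucketValues: array of [upperBound, cumulativeCount] pairs in ascending le order
  return {
    sharedLabels, // e.g. { route: '/a', method: 'GET' } — built once, not copied per bucket
    buckets: bucketValues.map(([le, value]) => ({
      labels: { le }, // only the per-bucket part differs
      value,
    })),
  };
}

// Example: three buckets share one label object instead of carrying three copies.
console.log(
  bucketsWithSharedLabels({ route: '/a', method: 'GET' }, [[0.1, 3], [0.5, 7], [1, 9]])
);
```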

@ngavalas
Contributor

ngavalas commented Mar 8, 2023

Okay so from what I can tell, <metric>.get() is only called in the Registry, so this is a pretty self-contained change. I'll benchmark it but I expect this to be a win in practice for us.

@nflaig
Contributor

nflaig commented Mar 8, 2023

@ngavalas <metric>.get() is an external API and can be consumed by anyone who wants to get a single metric value

@ngavalas
Contributor

ngavalas commented Mar 8, 2023

Ah. Thoughts on exposing an internal version for getMetricAsPrometheusString and having the existing get flatten out the labels so external callers are undisturbed? Happy to put up the PR this afternoon to start the discussion.
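
One way to keep external callers undisturbed might be something along these lines. The method and field names here are hypothetical, sketching the idea rather than any eventual PR:

```js
// Hypothetical sketch of the back-compat idea — not the merged change.
class HistogramLike {
  // Lean form for the registry/serializer: shared labels kept separate.
  getForSerialization() {
    return { sharedLabels: this.sharedLabels, values: this.bucketEntries };
  }

  // Public get() keeps its existing shape: labels flattened into every entry.
  async get() {
    const { sharedLabels, values } = this.getForSerialization();
    return {
      values: values.map((v) => ({
        ...v,
        labels: { ...sharedLabels, ...v.labels },
      })),
    };
  }
}
```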

@zbjornson
Collaborator

@ngavalas can you provide a reproduction case (either a full example that can be profiled, or a benchmark in the benchmarks suite) please?

@ngavalas
Contributor

ngavalas commented Mar 8, 2023

Yeah, I'll work on a repro; I've only observed this in production so I'll need to try to figure out which labels are causing issues for histograms. I can share a snippet of a v8 CPU profile as well in a few.

@ngavalas
Contributor

ngavalas commented Mar 8, 2023

@zbjornson FWIW, I think the existing registry benchmarks capture the problem fairly well. I have my proposed change working locally, passing all tests, and...

### Results:

┌────────────────────────────────────────┬────────────────────┬────────────────────┐
│ registry                               │ current            │ prom-client@latest │
├────────────────────────────────────────┼────────────────────┼────────────────────┤
│ getMetricsAsJSON#1 with 64             │ 3747.0556369320916 │ 2000.6075672702098 │
├────────────────────────────────────────┼────────────────────┼────────────────────┤
│ getMetricsAsJSON#2 with 8              │ 3458.670353309119  │ 1674.8824925227625 │
├────────────────────────────────────────┼────────────────────┼────────────────────┤
│ getMetricsAsJSON#2 with 4 and 2 with 2 │ 3660.534565081909  │ 1608.8334436915613 │
├────────────────────────────────────────┼────────────────────┼────────────────────┤
│ getMetricsAsJSON#2 with 2 and 2 with 4 │ 3330.496324702882  │ 452.3735183797067  │
├────────────────────────────────────────┼────────────────────┼────────────────────┤
│ getMetricsAsJSON#6 with 2              │ 3081.6146752509944 │ 1982.9467290709053 │
└────────────────────────────────────────┴────────────────────┴────────────────────┘

### Summary:

✓ registry ➭ getMetricsAsJSON#1 with 64 is 87.30% faster.
✓ registry ➭ getMetricsAsJSON#2 with 8 is 106.5% faster.
✓ registry ➭ getMetricsAsJSON#2 with 4 and 2 with 2 is 127.5% faster.
✓ registry ➭ getMetricsAsJSON#2 with 2 and 2 with 4 is 636.2% faster.
✓ registry ➭ getMetricsAsJSON#6 with 2 is 55.41% faster.

@ngavalas
Contributor

ngavalas commented Mar 8, 2023

I can put up the PR if you want to discuss?

@zbjornson
Collaborator

Awesome. A PR would be great!

@ngavalas
Contributor

ngavalas commented Mar 8, 2023

@zbjornson One more possible improvement idea: can we skip labels with empty values at serialization time? As far as I can tell, Prometheus doesn't differentiate between {foo=""} and {}; empty labels are the same as missing.

Apparently, we tried to keep the label set consistent across all writes to a given metric specifically because of this code: https://github.com/siimon/prom-client/blob/master/lib/validation.js#L17-L27. It was complaining if the labels didn't match up.

This means that, in practice, we end up serializing a large number of totally empty labels on most metrics, which is a huge waste of serialization time and prom<->node transfer bytes. If we can just skip serializing them on the library side, all would be well. I can move this to a separate issue if you want, but it's part of our histogram performance problem and I'm happy to put up the PR for that as well.

@SuperQ

SuperQ commented Mar 9, 2023

@ngavalas Prometheus doesn't differentiate at query time, but it does require label consistency at scrape time.

OK:

foo{bar=""} 1
foo{bar="baz"} 1

Invalid:

foo{} 1
foo{bar="baz"} 1

@shappir
Contributor Author

shappir commented Mar 9, 2023

I'm working on another PR that dramatically improves the serialization of histograms when only a few entries have changed since the previous serialization. I think that would be a common scenario when there are more than a few labels / label values. This is accomplished by retaining the serialized representation (string) per entry from the previous invocation of getMetricAsPrometheusString inside the histogram. An entry's cached representation is discarded when a new value is observed for it.
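
A minimal sketch of that caching idea is below. The class, field, and method names are made up for illustration; the actual implementation is in the PR linked later in this thread.

```js
// Illustrative per-entry string cache — not the real PR code.
class CachedEntry {
  constructor(labels) {
    this.labels = labels;
    this.value = 0;
    this.cachedLine = null; // serialized form from the previous scrape
  }

  observe(value) {
    this.value += value;
    this.cachedLine = null; // invalidate the cache when a new value is observed
  }

  toPrometheusLine(metricName) {
    if (this.cachedLine === null) {
      // Only entries that changed since the previous scrape are re-serialized.
      const labelStr = Object.entries(this.labels)
        .map(([k, v]) => `${k}="${v}"`)
        .join(',');
      this.cachedLine = `${metricName}{${labelStr}} ${this.value}`;
    }
    return this.cachedLine;
  }
}
```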

I had to change the registry benchmark to modify some values; otherwise it ran so fast that the benchmarking utility blew up :-P

Should I create the PR on top of master or next?

@ngavalas
Contributor

ngavalas commented Mar 9, 2023

@SuperQ Facepalm. Do you happen to know where it says that in the Prometheus docs? The reason I ask is that I need to know whether the label ORDER also matters, since that's a detail I wasn't sure about in my PR.

@shappir That also sounds like a good idea. I wonder if both our changes are needed or if one or the other is fast enough. I’ll try out your approach once the PR is up!

@shappir
Contributor Author

shappir commented Mar 9, 2023

My approach is only beneficial if most of the entries in a histogram hash haven't been changed. Specifically, if none have been changed then the time to generate the Prometheus string for the histogram goes down to almost zero. But if most are changed then there is little or no benefit.

Asking again: on top of master or next?

@ngavalas
Contributor

ngavalas commented Mar 9, 2023

@shappir it looks like next merged into master yesterday (for example, this is in master now: a19a1ad), so probably master?

@SuperQ

SuperQ commented Mar 10, 2023

@ngavalas I don't think label sets are called out in the docs. Ordering within a metric family doesn't matter, though.

I think consistent label name sets were implied in Prometheus, but we formalized the requirement in the OpenMetrics spec.

@shappir
Contributor Author

shappir commented Mar 12, 2023

I implemented the caching mechanism: #555

I changed the registry metrics benchmark to update an eighth of the histogram on each iteration, otherwise it was too fast :-) With this change, the performance improvement is in the range of 100%–200%.

@ngavalas
Contributor

FYI, I deployed the new pre-release version that has my fix in it and I'm seeing the metric render time drop from 500ms+ to ~40ms. We can probably squeeze some more juice out of this, but we've made huge progress for the histogram-heavy workload we have in real life.

@shappir
Contributor Author

shappir commented Mar 15, 2023

I really like what you did - it's a great idea. I hope to see it officially released soon so that I can check the impact on our production servers. (And see if metrics response generation is indeed the bottleneck - I know the issue is caused by the Prometheus integration, but I'm not yet sure of the exact cause, as I indicated in my original comment.)

I am still checking if I can implement my optimization on top of yours as I think it can provide noticeable added benefit.

@shappir
Contributor Author

shappir commented Mar 15, 2023

I did create a PR which is mostly code cleanup after the shared labels modification, and also some minor optimizations: #559

@shappir
Contributor Author

shappir commented Mar 27, 2023

It turns out that the value reported by nodejs_eventloop_lag_seconds is not the average Node event loop lag. Rather, it's the time required to compute and send the metrics response. This is why it only goes up and never comes down as more and more data is collected. This should be fixed. See this issue for details.
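
For anyone who wants to cross-check the actual event loop delay independently of the prom-client gauge, Node's built-in perf_hooks histogram can be sampled directly. This is separate from anything in this thread's PRs, and the sampling interval and resolution below are arbitrary; histogram values are reported in nanoseconds.

```js
// Independent event-loop-delay measurement via Node's perf_hooks,
// useful for comparing against nodejs_eventloop_lag_seconds.
const { monitorEventLoopDelay } = require('perf_hooks');

const delay = monitorEventLoopDelay({ resolution: 20 }); // sample every 20 ms
delay.enable();

setInterval(() => {
  console.log({
    mean_ms: delay.mean / 1e6, // convert nanoseconds to milliseconds
    p99_ms: delay.percentile(99) / 1e6,
    max_ms: delay.max / 1e6,
  });
  delay.reset();
}, 10000);
```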

@SimenB
Collaborator

SimenB commented Oct 13, 2023

v15 is out with both of your improvements now (thanks a bunch!). Are we ok now, or should we keep this issue open?

@shappir
Contributor Author

shappir commented Oct 15, 2023

> v15 is out with both of your improvements now

This is great news! We have lots of graphs monitoring the performance of services using prom-client. We will upgrade and I will share the results. I think that will provide good feedback on where we are.

Currently the only major performance improvement I can think of that hasn't been done yet is to cache the string representation of each entry, in addition to the actual values. This way each entry that hasn't changed (which would be most when cardinality is high) could reuse it instead of computing it again.

The obvious downside is that memory consumption would remain high after metrics have been reported.

@SimenB
Collaborator

SimenB commented Oct 16, 2023

Great stuff, looking forward to it!

Caching the strings seems like something we can probably do behind an option or some runtime toggle? Or possibly only if a metric has a cardinality larger than some n.

@shappir
Contributor Author

shappir commented Oct 19, 2023

Here is the graph. As you can see, the time to generate the prom-client response is still lengthy (it grows up to ~400ms), but it has been reduced by 60% relative to the previous version. No other changes were made to the system.
Screenshot 2023-10-19 at 13 19 41

Adding an option may be a good idea. Using cardinality to decide this behavior doesn't really work, in my opinion:

  1. Low cardinality would benefit less, but would also consume little memory
  2. High cardinality would benefit the most, but would also consume a lot of memory

@SimenB
Collaborator

SimenB commented Oct 27, 2023

#594 might help as well

@shappir
Contributor Author

shappir commented Nov 13, 2023

Here are the results several weeks later. Since upgrading to v15 at the end of September, the maximum response time to the Prometheus service has been reduced by 30%.
Screenshot 2023-11-13 at 13 30 12

@SimenB
Collaborator

SimenB commented Jan 3, 2024
