Very long NodeJS event loop lag #543
Comments
I've noticed this as well. There are long, blocking calls when converting the metrics to the string that gets output to the client, though my CPU traces squarely place the blame on the userland code and not the node internal network stack. Can you share your trace? I saw you landed a PR that improved this a bit (thanks); I'm benchmarking another fix on top of yours to see if it helps. Would love an update if you figure this out.
@zbjornson From my traces, the other major offender when we have a slow call is rendering histograms. In particular, this function seems quite slow: https://github.com/siimon/prom-client/blob/master/lib/histogram.js#L210-L217 I believe we could improve the performance of this by internally having a concept of "shared labels" across an entire set of buckets so that we don't need to copy all the labels N times. In particular, for histograms, this means we only build …
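A rough sketch of what that "shared labels" idea could look like (illustration only, not prom-client's actual internals; the `renderBucketLines` helper and its arguments are invented for the example):

```js
// Hypothetical helper: serialize the shared label portion once and reuse it for
// every bucket line, instead of copying the label object per bucket.
function renderBucketLines(name, labels, buckets, sum, count) {
  const shared = Object.entries(labels)
    .map(([k, v]) => `${k}="${v}"`)
    .join(',');
  const sep = shared ? ',' : '';

  const lines = [];
  for (const [upperBound, value] of Object.entries(buckets)) {
    // Only the `le` label differs between bucket lines.
    lines.push(`${name}_bucket{${shared}${sep}le="${upperBound}"} ${value}`);
  }
  lines.push(`${name}_sum{${shared}} ${sum}`);
  lines.push(`${name}_count{${shared}} ${count}`);
  return lines.join('\n');
}

// Example:
// renderBucketLines('http_ms', { method: 'GET' }, { '100': 3, '+Inf': 5 }, 420, 5);
```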
Okay so from what I can tell, …
@ngavalas …
Ah. Thoughts on exposing an internal version for …
@ngavalas can you provide a reproduction case (either a full example that can be profiled, or a benchmark in the benchmarks suite) please?
Yeah, I'll work on a repro; I've only observed this in production so I'll need to try to figure out which labels are causing issues for histograms. I can share a snippet of a v8 CPU profile as well in a few.
@zbjornson FWIW, I think the existing registry benchmarks capture the problem fairly well. I have my proposed change working locally, passing all tests, and…

I can put up the PR if you want to discuss?
Awesome. A PR would be great!
@zbjornson One more possible improvement idea: can we skip labels with empty values at serialization time? As far as I can tell, Prometheus doesn't differentiate between a label with an empty value and that label being absent. Apparently, we tried to keep the label set consistent across all writes to a given metric specifically because of this code: https://github.com/siimon/prom-client/blob/master/lib/validation.js#L17-L27. It was complaining if the labels didn't match up. This means we, in practice, end up serializing a large number of totally empty labels on most metrics, which is a huge waste of serialization time and prom<->node transfer bytes. If we can just skip serializing them on the library side, all would be well. I can move this to a separate issue if you want, but this is part of our histogram performance problem and I'm happy to put up the PR for that as well.
@ngavalas Prometheus doesn't differentiate at query time, but it does require label consistency at scrape time. OK vs. invalid:
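Presumably something along these lines (a reconstruction for illustration, not the original snippets):

```
# Consistent label names on every series of the family (OK)
http_requests_total{method="GET",status=""} 10
http_requests_total{method="POST",status="500"} 2

# Label names differing between series of the same family (invalid per OpenMetrics)
http_requests_total{method="GET"} 10
http_requests_total{method="POST",status="500"} 2
```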
I'm working on another PR that dramatically improves the serialization of histograms when only a few entries changed since the previous serialization. I think that would be a common scenario when there are more than a few labels / label values. This is accomplished by retaining the serialized representation (string) per entry from the previous invocation of …

I had to change the benchmark for registry to modify some values, otherwise it ran so fast that the benchmarking utility blew up :-P

Should I create the PR on top of … ?
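A minimal sketch of that caching idea as described (not the code from the actual PR; the `dirty` and `cachedLine` fields are assumptions made for the example):

```js
// Hypothetical per-entry cache: reuse the previously serialized line unless the
// entry's value changed since the last scrape.
function serializeEntry(name, entry) {
  if (!entry.dirty && entry.cachedLine !== undefined) {
    return entry.cachedLine; // unchanged entry: skip re-formatting entirely
  }
  const labels = Object.entries(entry.labels)
    .map(([k, v]) => `${k}="${v}"`)
    .join(',');
  entry.cachedLine = `${name}{${labels}} ${entry.value}`;
  entry.dirty = false;
  return entry.cachedLine;
}

// The write path (observe/inc) would set entry.dirty = true whenever it updates entry.value.
```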
@SuperQ Facepalm. Do you happen to know where it says that in the Prometheus docs? The reason I ask is because I need to know if the label ORDER also matters since that’s a detail I wasn’t sure about in my PR.

@shappir That also sounds like a good idea. I wonder if both our changes are needed or if one or the other is fast enough. I’ll try out your approach once the PR is up!
My approach is only beneficial if most of the entries in a histogram hash haven't been changed. Specifically, if none have been changed then the time to generate the Prometheus string for the histogram goes down to almost zero. But if most are changed then there is little or no benefit.

Asking again: should I create the PR on top of … ?
@ngavalas I don't think label sets are called out in the docs. Ordering within a metric family doesn't matter, though. I think consistent label name sets were implied in Prometheus, but we formalized it in the OpenMetrics spec.
I implemented the caching mechanism: #555

I changed the registry metrics benchmark to update an eighth of the histogram for each iteration, otherwise it was too fast :-) With this update, the performance improvement is in the range of 100% - 200%.
FYI, I deployed the new pre-release version that has my fix in it and I'm seeing the metric render time drop from 500ms+ to ~40ms. We can probably squeeze some more juice out of this, but we've made huge progress for the histogram-heavy workload we have in real life.
I really like what you did - it's a great idea. Hope to see it officially released soon so that I can check the impact on our production servers. (And see if metrics response generation is indeed the bottleneck - I know the issue is caused by Prometheus integration but I'm not yet sure as to the cause, as I indicated in my original comment.) I am still checking if I can implement my optimization on top of yours as I think it can provide noticeable added benefit.
I did create a PR which is mostly code cleanup after the shared labels modification, and also some minor optimizations: #559
Turns out that the value reported by …
v15 is out with both of your improvements now (thanks a bunch!). Are we ok now, or should we keep this issue open?
This is great news! We have lots of graphs monitoring the performance of services using prom-client. We will upgrade and I will share the results. I think that will provide good feedback on where we are. Currently the only major performance improvement I can think of that hasn't been done yet is to cache the string representation of each entry, in addition to the actual values. This way each entry that hasn't changed (which would be most when cardinality is high) could reuse it instead of computing it again. The obvious downside is that memory consumption would remain high after metrics have been reported.
Great stuff, looking forward to it! Caching the strings seems like something we can probably do behind an option or some runtime toggle? Or possibly only if a metric has a cardinality larger than some threshold?
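For illustration, a toggle like that might boil down to a check of this shape (the option names and default threshold are invented, not real prom-client options):

```js
// Hypothetical gate: only pay the memory cost of cached strings for metrics
// whose label-set cardinality exceeds a configurable threshold.
function shouldCacheStrings(labelSetCount, opts = {}) {
  const { cacheStrings = false, cardinalityThreshold = 1000 } = opts;
  return cacheStrings && labelSetCount >= cardinalityThreshold;
}

// shouldCacheStrings(5000, { cacheStrings: true }) -> true
// shouldCacheStrings(50,   { cacheStrings: true }) -> false
```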
#594 might help as well
@shappir more improvements in https://github.com/siimon/prom-client/releases/tag/v15.1.0
Original issue description:

We are using `prom-client` with NestJS (together with `@willsoto/nestjs-prometheus`). We have several custom metrics - in particular histograms. As a result, the overall metrics size grows up to 8MB. This results in the NodeJS event loop lag getting to be as large as 1 second!

What we see:

- `register.resetMetrics()` causes the lag to go down almost to 0. It then starts increasing again.
- `getPeerCertificate` and `destroySSL` …

Has anyone encountered this sort of behavior? Any suggestions?
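Not part of the original report, but one way to confirm the lag and correlate it with scrapes is Node's built-in event-loop-delay monitor from `perf_hooks` (sketch below; the 10-second reporting window is an arbitrary choice):

```js
// Report the worst event-loop delay seen in each window so spikes can be
// correlated with /metrics scrapes.
const { monitorEventLoopDelay } = require('perf_hooks');

const h = monitorEventLoopDelay({ resolution: 20 });
h.enable();

setInterval(() => {
  // Histogram values are in nanoseconds.
  console.log(
    `event loop delay, last 10s: max=${(h.max / 1e6).toFixed(1)}ms ` +
      `p99=${(h.percentile(99) / 1e6).toFixed(1)}ms`
  );
  h.reset();
}, 10_000).unref();
```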