Potential memory leak in prometheus receiver #9278

Closed
codeboten opened this issue Apr 14, 2022 · 11 comments · Fixed by #9718
Labels
bug (Something isn't working), comp:prometheus (Prometheus related issues)

Comments

@codeboten (Contributor) commented Apr 14, 2022

Investigate whether the prometheus receiver is impacted by the memory leak addressed by the following fix: prometheus/prometheus#10590

Note: the 0.49.0 release is blocked pending the investigation

@codeboten (Contributor, Author)

@gouthamve is investigating

@dmitryax added the release:blocker label (The issue must be resolved before cutting the next release) on Apr 14, 2022
@Aneurysm9 (Member)

We have already tagged v0.49.0. If we need to include the upstream reversion then we need to retract v0.49.0 and ship v0.49.1.

@dmitryax (Member)

@Aneurysm9 (Member)

Yes, that dependency update was to pull in prometheus/prometheus#10450 and prometheus/prometheus#10473 to allow access to target info and metric metadata even after job and instance have been relabeled.

@mx-psi (Member) commented Apr 15, 2022

> We have already tagged v0.49.0. If we need to include the upstream reversion then we need to retract v0.49.0 and ship v0.49.1.

We should retract both the main module and (at least) the modules that depend on prometheus.
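For reference, module retraction in Go is expressed with a retract directive in a later release of the affected module. A minimal go.mod sketch, with the module path, go version, and rationale comment shown only for illustration (the thread does not show whether or how a retraction was actually published):

```
// go.mod of a later release of the module being retracted (path illustrative)
module github.com/open-telemetry/opentelemetry-collector-contrib

go 1.17

// A retract directive published in a newer version tells `go get` and
// `go list -m -u` to skip the listed version; the comment is surfaced
// to users as the rationale.
retract v0.49.0 // Prometheus receiver can leak memory under high target churn.
```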

@dmitryax (Member)

@gouthamve do you have any update on this?

@dmitryax (Member)

Cross-posting an update from @gouthamve in CNCF Slack:

> The memory increase is more like 30% when there is a lot of target churn. This is because of the change where we started passing the scrape cache inside the Context, and the scrape cache is not being garbage collected quickly enough. I was thinking it was because the context was being held inside TSDB, but I think it's fully inside the scraping code, hence it does affect the collector as well. I spent all of Thursday on it, and plan to spend today on it as well, but I can't make enough headway because it could not be replicated in any local tests (only in prombench, Prometheus' k8s-based large-scale benchmarking setup).
> I fully agree that we shouldn't block the release on this; I think we should take the hit, given it only affects things when there is high churn in targets. I can also prepare a PR to roll things back if required.
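To make the failure mode described above concrete, here is a minimal Go sketch. The type and function names (scrapeCache, withCache, cacheFrom) are hypothetical, not the actual Prometheus scrape code; the point is only that a value stashed in a context.Context stays reachable for as long as any holder of that context lives:

```go
package main

import (
	"context"
	"fmt"
)

// scrapeCache stands in for the per-target scrape cache discussed above;
// the real Prometheus type is different, this is only an illustration.
type scrapeCache struct {
	series map[string][]byte // grows with the number of series seen for a target
}

type cacheKey struct{}

// withCache stores the cache in the context. Anything that retains a
// reference to the returned context (or a context derived from it) also
// retains the cache, so the garbage collector cannot reclaim it until
// every holder lets go.
func withCache(ctx context.Context, c *scrapeCache) context.Context {
	return context.WithValue(ctx, cacheKey{}, c)
}

func cacheFrom(ctx context.Context) (*scrapeCache, bool) {
	c, ok := ctx.Value(cacheKey{}).(*scrapeCache)
	return c, ok
}

func main() {
	cache := &scrapeCache{series: map[string][]byte{"up": nil}}
	ctx := withCache(context.Background(), cache)

	// If this context outlives the scrape loop of a churned target, the
	// cache it carries stays reachable too. With many short-lived targets,
	// that retained memory adds up before it can be freed.
	if c, ok := cacheFrom(ctx); ok {
		fmt.Println("cache still reachable via context; series:", len(c.series))
	}
}
```

Under that reading, each churned target's cache can remain reachable until the last reference to its context is dropped, which is consistent with the increase only showing up under high target churn.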

@dmitryax added the bug (Something isn't working) and comp:prometheus (Prometheus related issues) labels and removed the release:blocker (The issue must be resolved before cutting the next release) label on Apr 20, 2022
@dmitryax (Member)

A decision was made during the SIG meeting not to block the release on this issue. A warning about it was added to the 0.49.0 release notes.

gouthamve added a commit to gouthamve/opentelemetry-collector-contrib that referenced this issue May 3, 2022
Prometheus started doing proper go.mod publishing and started off with
v0.35, but this means the previous version was "winning" as it was v1.8.2
so I had to update the version everywhere.

This brings in prometheus/prometheus#10636
which fixes open-telemetry#9278

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
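The commit message above refers to Go's minimal version selection: pseudo-versions derived from the old v1.8.2 tag sort above the newly published v0.35.x tags, so the old line keeps winning until every requiring go.mod is updated. A hedged go.mod sketch of the situation, with the module path and exact versions shown only for illustration:

```
// go.mod of a module that imports Prometheus (path and versions illustrative)
module example.com/mycollector

go 1.17

require (
	// Old line: a pseudo-version derived from the v1.8.2 tag, of the form
	// v1.8.2-0.<timestamp>-<commit>. Minimal version selection picks the
	// highest required version, and v1.8.2-0.* sorts above v0.35.x, so it
	// keeps winning as long as any go.mod in the build still requires it.
	// github.com/prometheus/prometheus v1.8.2-0.<timestamp>-<commit>

	// New line: the properly published v0.35.x module version (the actual
	// PR may pin a later version on this line that carries the fix from
	// prometheus/prometheus#10636).
	github.com/prometheus/prometheus v0.35.0
)
```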
codeboten pushed a commit that referenced this issue May 3, 2022
* Update Prometheus to not have memory leak.

Prometheus started doing proper go.mod publishing and started off with
v0.35, but this means the previous version was "winning" as it was v1.8.2
so I had to update the version everywhere.

This brings in prometheus/prometheus#10636
which fixes #9278

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
djaglowski pushed a commit to djaglowski/opentelemetry-collector-contrib that referenced this issue May 10, 2022
@ItsLastDay (Contributor)

Since the issue is closed, does it mean the warning "The issue is currently being investigated and will be fixed in one of the new releases. More details: #9278." in https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver is now obsolete?

@dashpole (Contributor)

Yes, we can remove that status header

@dashpole (Contributor)

I opened #11050
