
Memory leak in provider #87

Closed
davidgiga1993 opened this issue Feb 15, 2024 · 7 comments

Comments

@davidgiga1993

When running the provider with a moderate number of resources (100+ users, 20+ orgs), memory consumption keeps growing until the pod is OOM-killed at its resource limit. The provider process is the one consuming all the memory in that case.

Memory usage in general is also very high for what this provider does, especially compared to other providers.
Additionally, we're facing the CPU issue where the Crossplane provider constantly consumes the CPU of an entire node.

(screenshot: provider memory and CPU usage)

As far as I understand, most of this probably comes from upjet.
Wouldn't it make more sense to build a "proper" provider instead of relying on Terraform internally, since that seems to be the root cause of some of these issues?

@Duologic
Member

The Terraform memory leakage is a nuisance. Thanks for opening an issue.

Looking around upstream, I found suggestions to set requests and limits on the ControllerConfig as a stop-gap solution: crossplane-contrib/provider-upjet-aws#325 (comment)
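
A minimal sketch of such a ControllerConfig (the name and the numbers are illustrative, not values from that issue; pick limits that fit your workload):

    apiVersion: pkg.crossplane.io/v1alpha1
    kind: ControllerConfig
    metadata:
      name: provider-grafana-config   # illustrative name
    spec:
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: "1"
          memory: 2Gi

The Provider object would then reference it via spec.controllerConfigRef.name. (Newer Crossplane releases deprecate ControllerConfig in favor of DeploymentRuntimeConfig.)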

Linked from that same issue, there is another solution called the ProviderScheduler: crossplane/upjet#178. I don't know whether we already implement that, but it's definitely worth investigating.

@Duologic
Member

Example implementation of the ProviderScheduler solution: https://github.com/upbound/provider-aws/pull/627/files

@patst
Contributor

patst commented Feb 15, 2024

We have a few hundred resources and have observed this as well.
You should check the reconcile queue; reconciles probably pile up because requests aren't completed fast enough.

What helped us was configuring the provider with

    - --poll=12h
    - --sync=12h

to reduce the load.
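
As a sketch of where these flags go, assuming the args are passed through a ControllerConfig that the Provider references (name illustrative, flag meanings as in upjet-based providers):

    apiVersion: pkg.crossplane.io/v1alpha1
    kind: ControllerConfig
    metadata:
      name: provider-grafana-config   # illustrative name
    spec:
      args:
        - --poll=12h   # how often each external resource is re-checked for drift
        - --sync=12h   # controller-runtime cache sync period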

Every change to a resource triggers a reconcile anyway. The poll and sync intervals only really help if somebody made manual changes to a resource, which then get reverted on the next reconcile.

But the whole setup with the Crossplane provider seems very fragile; we often have to do manual cleanups. :-/

@julienduchesne
Member

julienduchesne commented Feb 15, 2024

To me, it looks like the poll interval doesn't even work 🤔. I've got dashboards being refreshed every minute anyway.

@Argannor

Over the course of the last week, I reimplemented parts of this provider using the Grafana Go client instead of Terraform, as a proof of concept.

Please note that I don't want to advertise my implementation as a replacement, since only a few of the resources are implemented and everything is quite young. Instead, I want to show it to you so you can have a look and decide for yourselves whether this could be an option to replace the current Terraform/upjet-based implementation.

For @davidgiga1993 and me, the new implementation solved the memory leak and CPU usage issues (see screenshots).

Before 15:00, the provider from this repository was running; after that, my implementation was.
(screenshots: memory and CPU usage before/after the switch)
(If wanted, I can post an update after a longer observation period.)

Here you can find the source code used: https://github.com/Argannor/provider-grafana

@julienduchesne
Member

You can definitely advertise your implementation. The Terraform implementation is sub-optimal, but I also do not have enough time to maintain a manually written provider. So, unfortunately, I can tell you that we will keep using upjet regardless of the performance issues.

@julienduchesne
Member

Fixed in v0.13.0
(screenshot: resource usage after upgrading to v0.13.0)

See #107 for more info!
