Limit memory usage with many provider aliases #32744

Open
dmikalova opened this issue Feb 25, 2023 · 8 comments

@dmikalova

dmikalova commented Feb 25, 2023

Terraform Version

terraform version
Terraform v1.3.9
on darwin_arm64
+ provider registry.terraform.io/hashicorp/aws v4.45.0
+ provider registry.terraform.io/okta/okta v3.42.0

Terraform Configuration Files

https://gist.github.com/dmikalova/7f4f2c0905146f5a4713cf65744ef764

Debug Output

I don't think these are needed

Expected Behavior

When Terraform runs, it should only run as many provider processes at a time as the limit set by -parallelism.

Actual Behavior

When Terraform starts up and there are many provider aliases (20+), it launches a provider process for each provider alias, all at the same time, even if -parallelism=2 is set. You can validate this with a process monitor: you will see each provider process come and go as it does work.

This means the amount of memory consumed is a bit more than the number of aliases you have multiplied by the size of the provider. The AWS provider is roughly 100MB, so 20 aliases will consume over 2GB. We set up compliance infrastructure for each AWS region, so that is 23 aliases. We have another situation where we use 30 aliases, 1 per developer sandbox account. Both of these situations have crashed our Terraform runners with 4GB of memory, and this can start happening unexpectedly, such as when updating to a newer version of the AWS provider that consumes more memory.
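
For illustration, the shape of the configuration is roughly the following (the region names and the example resource are made up here, not the exact contents of the gist linked above). Each aliased provider block is served by its own plugin process during the run, so memory scales with the number of aliases times the size of the provider:

provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "us_west_2"
  region = "us-west-2"
}

provider "aws" {
  alias  = "eu_west_1"
  region = "eu-west-1"
}

# ...one aliased block per remaining region, roughly 23 in total.

# Resources pick an alias explicitly via the provider meta-argument.
resource "aws_s3_bucket" "compliance_eu_west_1" {
  provider = aws.eu_west_1
  bucket   = "example-compliance-eu-west-1"
}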

After what I'm assuming is the initialization of the provider aliases, the number of running provider processes then roughly follows -parallelism=2. The peak memory consumption seems to happen early on, when all of the providers are initialized at once.

Steps to Reproduce

terraform init
terraform apply

Additional Context

Everything is relatively default. This issue only happens on memory-constrained systems such as CI; it does not occur on dev laptops. Increasing the CI worker size alleviates the issue.

References

none

@dmikalova dmikalova added the bug and new (new issue not yet triaged) labels on Feb 25, 2023
@jbardin
Member

jbardin commented Feb 27, 2023

Hi @dmikalova,

Thanks for filing the issue. The behavior here is working as designed, so I'm going to relabel this as an enhancement. The fact that the AWS provider requires separate instances for each region, and is itself a very large process, is outside the control of Terraform. There may be a way to limit the number of plugin processes, but currently the architecture cannot guarantee that the run can proceed without deadlocks if all providers are not available.

@jbardin jbardin added the enhancement, core, and cli labels and removed the bug and new (new issue not yet triaged) labels on Feb 27, 2023
@apparentlymart
Contributor

It seems like there's an implied question here about why "parallelism" doesn't limit this, so just a little note about that:

That value controls how many graph nodes can be actively evaluating at a time, but each provider block generates two graph nodes: one to start the provider instance, and one to stop it. The provider process is therefore running for the entire time between the start node and the stop node, which is important because other nodes in the graph (representing resource and data blocks) will make use of it.

The concurrency limit therefore only controls how the starting and stopping of the provider instance can interleave with other nodes, and doesn't affect the provider's lifetime. Provider configurations whose arguments are entirely constant depend on nothing else and therefore become eligible to start immediately at the start of the run, so -parallelism=2 means that only two of them can be in the process of starting at a time, but once two have started another two can immediately start while the first two are still running.
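
As a contrived illustration (the alias names and the data source reference below are just for the example), a provider configuration built only from constant values has no dependencies and becomes eligible to start as soon as the walk begins, whereas one that references another object has to wait for that object first:

# Constant configuration: depends on nothing, so its "start" node is
# eligible as soon as the graph walk begins.
provider "aws" {
  alias  = "static"
  region = "us-east-1"
}

# This configuration references a data source, so its "start" node cannot
# run until that data source has been read.
data "aws_region" "current" {}

provider "aws" {
  alias  = "derived"
  region = data.aws_region.current.name
}

With -parallelism=2, at most two such configurations can be in the middle of starting at any moment, but the ones that have already started keep running and no longer count against the limit.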

@apparentlymart
Contributor

It might be interesting to investigate whether the OS is able to share the memory pages containing the provider code between processes, since that should be identical and fixed across all processes.

I don't know if it's possible to achieve this portably, but it seems like it would help if all of the instances of a particular provider shared the same memory pages for the executable code and only had separate memory pages for their dynamic data.

(Of course that wouldn't help if Linux's accounting of resource limits would still count each process's usage as separate memory usage despite sharing physical memory pages. More investigation required to see if this is a productive direction.)

@dmikalova
Author

Thanks for the context, I appreciate y'all investigating this idea.

@non7top

non7top commented Jun 21, 2023

I also faced this issue. For me, Terraform consumed 12GB+ of RAM and probably a few more gigabytes of swap, and it still crashed without completing successfully. I don't have more RAM on my system to verify how much it would need to complete successfully.

@aruandre

We're having a similar issue with 8 AWS provider aliases, and the container gets killed at 4GB of memory usage.

@farhad-taran

I am having the same issue. Is there a recommended workaround?

@apparentlymart
Contributor

For now I think the workarounds would be either:

  • Run Terraform in an environment with a larger peak memory limit.
  • Refactor your configurations so each one uses fewer provider configurations and can therefore fit within your system's existing memory limit (see the sketch below).
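
As a rough sketch of that second option (the paths, module name, and region below are hypothetical), the per-region work can live in a reusable module that is applied from several small root configurations, so each run only starts the one or two provider processes it actually needs:

# environments/eu-west-1/main.tf -- one small root configuration per region,
# each with its own state and applied separately.
provider "aws" {
  region = "eu-west-1"
}

module "compliance" {
  # The child module declares no provider blocks of its own, so it inherits
  # this root module's single default aws provider configuration.
  source = "../../modules/compliance"
}

The trade-off is more root configurations and state files to manage, but at any given time only the providers declared in the configuration being applied are running.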

For those whose environments are constrained enough that even one instance of the hashicorp/aws provider is too large, I think increasing the memory limit will be the only viable option for the moment.

Since that provider is the one currently suffering the most from large memory usage, the AWS provider team has their own issue where they are tracking the general problem of that provider's memory usage growing with each new AWS service supported: hashicorp/terraform-provider-aws#31722 . For those of you for whom the AWS provider is the primary offender, I'd suggest following that issue too so you can see updates from the provider development team. Optimizations inside the provider itself are more likely to be feasible/profitable in the short term, since the overall (ever-growing) size of that provider's schema seems to be the root cause.
