Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terraform downloads module from git source for every usage #11435

Closed
mattlqx opened this issue Jan 26, 2017 · 17 comments
Closed

Terraform downloads module from git source for every usage #11435

mattlqx opened this issue Jan 26, 2017 · 17 comments

Comments

@mattlqx
Copy link

mattlqx commented Jan 26, 2017

Terraform Version

0.8.4

Affected Resource(s)

  • core

Terraform Configuration Files

module "ui" {
  source = "git::ssh://git@bitbucket.foo.io:7999/repo/project//aws_instance"
  ...
}

module "db" {
  source = "git::ssh://git@bitbucket.foo.io:7999/repo/project//aws_instance"
  ...
}

module "api" {
  source = "git::ssh://git@bitbucket.foo.io:7999/repo/project//aws_instance"
  ...
}

Expected Behavior

I would think that TF would be able to identify that the module source is the same on all these resources and be able to download the module source once and apply it to each usage rather than downloading it N times.

Actual Behavior

Output appears to show the module is downloaded N times instead of once.

Get: git::ssh://git@bitbucket.foo.io:7999/repo/project (update)
Get: git::ssh://git@bitbucket.foo.io:7999/repo/project (update)
Get: git::ssh://git@bitbucket.foo.io:7999/repo/project (update)

Steps to Reproduce

Create several modules that use the same source and run a TF plan or apply operation.

@rowleyaj
Copy link
Contributor

rowleyaj commented Feb 1, 2017

@mattlqx This has been a pain point for a while, especially when using modules a large number of times. I've looked into it in the past but I'm not sure of the best way to change it, so haven't proposed anything. The code behind it is these lines: https://github.com/hashicorp/terraform/blob/master/config/module/tree.go#L156-L177

When the tree is built it uses the path to the module in the tree, and the src of the module to create a key for each module. This is then passed to go-getter which creates the hash for the modules location on disk. The addition of the path was added in #1418

A work around has been proposed using an external file: http://bensnape.com/2016/01/14/terraform-design-patterns-the-terrafile/

I'd like to work out a nicer way, module downloads are the slowest part of running terraform for us at the moment.

@rowleyaj
Copy link
Contributor

rowleyaj commented Feb 1, 2017

@mitchellh @phinze are there any further thoughts on the how the module lock idea mentioned in #1418 could work?

@johnparfitt
Copy link

This takes about 2 mins to fetch 30 identical modules as opposed to 2 seconds when source is local. Unfortunately this is a showstopper for us.

@salimane
Copy link

any update on this ?

@clumsysnake
Copy link

also hitting this issue

@zerolaser
Copy link

any update on this ?

@apparentlymart apparentlymart added cli and removed core labels Mar 24, 2018
@horalstvo
Copy link

Also an issue for us. And downloading in parallel would be great as well.

@rifelpet
Copy link

rifelpet commented May 3, 2018

As a step towards a better solution, what if the local module directory name was a hash of the module source and version? Downloads would still be serial but subsequent downloads for identical sources and versions would be effectively no-ops for some module protocols. For git sources you could ignore the path inside the repo too (everything after the double slash), allowing for more more modules to share the same repo clone.

Thoughts?

Due to module invocations not supporting count, we have 20 module invocations using the same git repo and subdirectory, with another 20 or so using a different subdirectory in the same repo. This could reduce the number of repo clones from 40 to 1

@apparentlymart
Copy link
Contributor

Hi all!

In the next major release, due to switching to a new implementation of the configuration language, we also have a new module installer mechanism.

Since the new configuration language implementation is able to produce better source location context in error messages, we've switched to using human-readable names for the module installation directories instead of the current hashes, so that paths shown in the error messages are easier to understand.

As a consequence of this, the directories are named after the module names given in configuration (because these are short, filename-friendly, and easily recognizable), and so relying on naming the directories after the source address isn't an option any longer.

However, we do still want to solve this inefficiency. Our favored idea right now is to make the installer internally keep a mapping table from a (source address, version) tuple to the first module installed from that source, and then if we find a second request to install the same thing we would either copy or symlink the initial directory locally, without downloading again. As @rifelpet suggested, we may also be able to use a "trimmed" version of the source address in this lookup table so that different paths under the same repository/package would be able to share a directory.

Since the configuration language improvements are already a big change we decided to leave this optimization out of scope for the first pass. The new implementation was designed with this thought in mind and so it hopefully won't be too hard to retrofit it once the dust has settled on the other changes.

@rismoney
Copy link

are there any workarounds? this is a real pain point that's been going on for years. it's gotten a lot worse as the scaling of module reuse grows.

@leftathome
Copy link

I guess the trimmed version of the source address works fine, as long as it somehow includes the refspec. For users who are pulling modules from Git repos, ref=v1.0.1 of a module will be different from ref=v1.1.0. :)

If it doesn't include the refspec, some of us are going to have to hold off on upgrading until this optimization can take place.

(Or I guess we could build deploy-tarballs-to-S3 pipelines...)

@rismoney
Copy link

What do you mean trimmed version of the source address works fine with a refspec?

@apparentlymart
Copy link
Contributor

The "trimming" function I was alluding to would remove the portion of the address that represents a path within whatever package is being retrieved, which for git repositories would be the path into a sub-directory within the repository.

As you point out, @rismoney, that would need to be done with care to preserve all of the query-string-like arguments that many of the source address schemes support, like ref on the git scheme.

For example:

Full Trimmed
git::ssh://git@.../repo//aws_instance git::ssh://git@.../repo
git::ssh://git@.../repo//aws_instance?ref=v0.1.0 git::ssh://git@.../repo?ref=v0.1.0
http://example.com/example.tar//foo http://example.com/example.tar

The handling of these sub-paths is pretty complex today due to how the different parts of the installer interact. I believe the new installer I mentioned is implemented in a simpler way where the source addresses are handled in a single place, so it'll hopefully be easier to support this trimming-and-caching idea once we've finalized that. The new installer will be in the next major release, so hopefully we can do something with this after that.

@rismoney
Copy link

@apparentlymart thank you for the explanation, its now very clear.

I had been reusing the same module ~150x within the same repo (using pathing). I was attempting a refactor to move all these modules to an external, and this problem got ugly fast. It became a millions of files (git) thing, due to cloning depth, sheer quantity, and obvious slowness.

The question I have with the source being trimmed, is then, is the module name then pushed to another parameter? How do you then specify the proper module.

@rismoney
Copy link

nm. it's trimmed from a code perspective, not an enduser perspective. got it.

@apparentlymart
Copy link
Contributor

Hi all,

A while back over in #18309 we merged a change into the v0.12 development branch to re-use already-downloaded modules where possible. Although we had initially planned to leave that out of the v0.12.0 scope, once the new installer was implemented it turned out to be relatively simple to include an initial pass of this optimization behavior due to the new installer being designed with it in mind.

That change is now in master ready to be included in the v0.12.0 final release, so I just verified it against the v0.12.0-alpha2 prerelease build.

I used this configuration:

module "consul-a" {
  source = "hashicorp/consul/aws"
}

module "consul-b" {
  source = "hashicorp/consul/aws"
}

I'm using Terraform module registry sources here first of all just because it's a more convenient thing to quickly test. The caching mechanism works for modules downloaded from any source.

$ terraform init
Initializing modules...
Downloading hashicorp/consul/aws 0.4.1 for consul-a...
- consul-a in .terraform/modules/consul-a/hashicorp-terraform-aws-consul-52bd5df
- consul-a.consul_clients in .terraform/modules/consul-a/hashicorp-terraform-aws-consul-52bd5df/modules/consul-cluster
- consul-a.consul_clients.iam_policies in .terraform/modules/consul-a/hashicorp-terraform-aws-consul-52bd5df/modules/consul-iam-policies
- consul-a.consul_clients.security_group_rules in .terraform/modules/consul-a/hashicorp-terraform-aws-consul-52bd5df/modules/consul-security-group-rules
- consul-a.consul_clients.security_group_rules.client_security_group_rules in .terraform/modules/consul-a/hashicorp-terraform-aws-consul-52bd5df/modules/consul-client-security-group-rules
- consul-a.consul_servers in .terraform/modules/consul-a/hashicorp-terraform-aws-consul-52bd5df/modules/consul-cluster
- consul-a.consul_servers.iam_policies in .terraform/modules/consul-a/hashicorp-terraform-aws-consul-52bd5df/modules/consul-iam-policies
- consul-a.consul_servers.security_group_rules in .terraform/modules/consul-a/hashicorp-terraform-aws-consul-52bd5df/modules/consul-security-group-rules
- consul-a.consul_servers.security_group_rules.client_security_group_rules in .terraform/modules/consul-a/hashicorp-terraform-aws-consul-52bd5df/modules/consul-client-security-group-rules
Downloading hashicorp/consul/aws 0.4.1 for consul-b...
- consul-b in .terraform/modules/consul-b/hashicorp-terraform-aws-consul-52bd5df
- consul-b.consul_clients in .terraform/modules/consul-b/hashicorp-terraform-aws-consul-52bd5df/modules/consul-cluster
- consul-b.consul_clients.iam_policies in .terraform/modules/consul-b/hashicorp-terraform-aws-consul-52bd5df/modules/consul-iam-policies
- consul-b.consul_clients.security_group_rules in .terraform/modules/consul-b/hashicorp-terraform-aws-consul-52bd5df/modules/consul-security-group-rules
- consul-b.consul_clients.security_group_rules.client_security_group_rules in .terraform/modules/consul-b/hashicorp-terraform-aws-consul-52bd5df/modules/consul-client-security-group-rules
- consul-b.consul_servers in .terraform/modules/consul-b/hashicorp-terraform-aws-consul-52bd5df/modules/consul-cluster
- consul-b.consul_servers.iam_policies in .terraform/modules/consul-b/hashicorp-terraform-aws-consul-52bd5df/modules/consul-iam-policies
- consul-b.consul_servers.security_group_rules in .terraform/modules/consul-b/hashicorp-terraform-aws-consul-52bd5df/modules/consul-security-group-rules
- consul-b.consul_servers.security_group_rules.client_security_group_rules in .terraform/modules/consul-b/hashicorp-terraform-aws-consul-52bd5df/modules/consul-client-security-group-rules

The download completed successfully as before, with very similar output as before (the slight changes in layout here are a result of the other module installer changes in this release). The primary difference here is that while there was a noticeable pause at Downloading hashicorp/consul/aws 0.4.1 for consul-a... (as Terraform retrieved the source code), the similar line for consul-b completed almost instantaneously.

The trace log explains why this is:

[TRACE] copying previous install .terraform/modules/consul-a to .terraform/modules/consul-b

Terraform noticed that both modules had the same finally resolved module source string (in this case it was https://api.github.com/repos/hashicorp/terraform-aws-consul/tarball/v0.4.1?archive=tar.gz) and thus skipped downloading it a second time and just re-used the copy it already downloaded.

I also verified this using git:: sources -- since that's what this issue was originally about -- by directly accessing the git repository for this same module, at git::https://github.com/hashicorp/terraform-aws-consul. The behavior is the same, as long as all of the git sources use the same protocol, remote branch/tag, etc.

Initializing modules...
Downloading git::https://github.com/hashicorp/terraform-aws-consul for consul-a...
- consul-a in .terraform/modules/consul-a
- consul-a.consul_clients in .terraform/modules/consul-a/modules/consul-cluster
- consul-a.consul_clients.iam_policies in .terraform/modules/consul-a/modules/consul-iam-policies
- consul-a.consul_clients.security_group_rules in .terraform/modules/consul-a/modules/consul-security-group-rules
- consul-a.consul_clients.security_group_rules.client_security_group_rules in .terraform/modules/consul-a/modules/consul-client-security-group-rules
- consul-a.consul_servers in .terraform/modules/consul-a/modules/consul-cluster
- consul-a.consul_servers.iam_policies in .terraform/modules/consul-a/modules/consul-iam-policies
- consul-a.consul_servers.security_group_rules in .terraform/modules/consul-a/modules/consul-security-group-rules
- consul-a.consul_servers.security_group_rules.client_security_group_rules in .terraform/modules/consul-a/modules/consul-client-security-group-rules
Downloading git::https://github.com/hashicorp/terraform-aws-consul for consul-b...
- consul-b in .terraform/modules/consul-b
- consul-b.consul_clients in .terraform/modules/consul-b/modules/consul-cluster
- consul-b.consul_clients.iam_policies in .terraform/modules/consul-b/modules/consul-iam-policies
- consul-b.consul_clients.security_group_rules in .terraform/modules/consul-b/modules/consul-security-group-rules
- consul-b.consul_clients.security_group_rules.client_security_group_rules in .terraform/modules/consul-b/modules/consul-client-security-group-rules
- consul-b.consul_servers in .terraform/modules/consul-b/modules/consul-cluster
- consul-b.consul_servers.iam_policies in .terraform/modules/consul-b/modules/consul-iam-policies
- consul-b.consul_servers.security_group_rules in .terraform/modules/consul-b/modules/consul-security-group-rules
- consul-b.consul_servers.security_group_rules.client_security_group_rules in .terraform/modules/consul-b/modules/consul-client-security-group-rules

For this initial pass we're only optimizing away the redundant downloads. Terraform will still create a local copy of the code for each module, since this allowed this to be implemented in a reasonably low-risk way that doesn't require changes anywhere else in the codebase. The fact that it is implemented as a detail of the installer for now is also why the terraform init output doesn't actually acknowledge that the caching behavior is active, though the much decreased runtime of terraform init should make it very obvious.

Although not directly related to the issue at hand, you can also see in the output I shared above the other change I mentioned where the modules are now installed into directories named after the module names given in configuration, rather than the opaque hashes used before. The primary motivation for this is to avoid including inscrutable paths in Terraform's new, richer error messages, which will include references to particular source files where possible.

I expect we will make further refinements to this behavior in later releases, such as having Terraform produce symlinks instead of copies where possible, and perhaps making the caching behavior more visible in the UI, but this first pass here meets the main request of this issue and so I'm going to close this out. Once we have more experience with this new implementation after release, we can discuss further improvements to it in new GitHub issues.

Thanks for reporting this, and thanks for your patience while we laid the groundwork to make this possible.

@ghost
Copy link

ghost commented Mar 31, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Mar 31, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests