Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

configload: Don't download the same module source multiple times #18309

Merged
merged 1 commit into from
Jun 22, 2018

Conversation

apparentlymart
Copy link
Contributor

It is common for the same module source package to be referenced multiple times in the same configuration, either because there are literally multiple instances of the same module source or because a single package (or repository) contains multiple modules in sub-directories and many of them are referenced.

To optimize this, here we introduce a simple caching behavior where the module installer will detect if it's asked to install multiple times from the same source and produce the second and subsequent directories by copying the first, rather than by downloading again over the network.

This optimization is applied once all of the go-getter detection has completed and sub-directory portions have been trimmed, so it is also able to normalize differently-specified source addresses that all
ultimately detect to the same resolved address. When installing, we always extract the entire specified package (or repository) and then reference the specified sub-directory, so we can safely re-use existing directories when the base package is the same, even if the sub-directory is different.

However, as a result we do not yet address the fact that the same package will be stored multiple times on disk, which may still be problematic when referencing large repositories multiple times in disk-storage-constrained environments. We could address this in a subsequent change by investigating the use of symlinks where possible.

Since the Registry installer is implemented just as an extra resolution step in front of go-getter, this optimization applies to registry modules too. However, this does not apply to relative local references, which will continue to just resolve into the already-prepared directory of their parent module with no filesystem modification whatsoever.

The cache of previously installed paths lives only for the duration of one call to InstallModules, so we will never re-use directories that were created by previous runs of terraform init and there is no risk
that older versions will pollute the cache when attempting an upgrade from a source address that doesn't explicitly specify a version.

No additional tests are added here because the existing module installer tests (when TF_ACC=1) already cover the case of installing multiple modules from the same source, exercising this new codepath.

This is in response to #11435.

It is common for the same module source package to be referenced multiple
times in the same configuration, either because there are literally
multiple instances of the same module source or because a single package
(or repository) contains multiple modules in sub-directories and many
of them are referenced.

To optimize this, here we introduce a simple caching behavior where the
module installer will detect if it's asked to install multiple times from
the same source and produce the second and subsequent directories by
copying the first, rather than by downloading again over the network.

This optimization is applied once all of the go-getter detection has
completed and sub-directory portions have been trimmed, so it is also
able to normalize differently-specified source addresses that all
ultimately detect to the same resolved address. When installing, we
always extract the entire specified package (or repository) and then
reference the specified sub-directory, so we can safely re-use existing
directories when the base package is the same, even if the sub-directory
is different.

However, as a result we do not yet address the fact that the same package
will be stored multiple times _on disk_, which may still be problematic
when referencing large repositories multiple times in
disk-storage-constrained environments. We could address this in a
subsequent change by investigating the use of symlinks where possible.

Since the Registry installer is implemented just as an extra resolution
step in front of go-getter, this optimization applies to registry
modules too. This does not apply to local relative references, which will
continue to just resolve into the already-prepared directory of their
parent module.

The cache of previously installed paths lives only for the duration of
one call to InstallModules, so we will never re-use directories that
were created by previous runs of "terraform init" and there is no risk
that older versions will pollute the cache when attempting an upgrade
from a source address that doesn't explicitly specify a version.

No additional tests are added here because the existing module installer
tests (when TF_ACC=1) already cover the case of installing multiple
modules from the same source.
@rismoney
Copy link

Trying to clarify this- if I have 2 module invocations that both the same exact reference:
source="git::https://myco.example.com/foo/foo?ref=master"
will I end up with 2 copies of this, but the 2nd (and subsequent) will be copied" instead of cloned or will it reuse the directory it trimmed?

@apparentlymart
Copy link
Contributor Author

@rismoney as of this commit, it will create a local copy for each reference as before.

This was intended as a safe and straightforward way to address the reasonable complaint that re-downloading over the network many times is slow and wasteful of possibly-metered data service.

I believe we can also, in a later change, optimize the local disk usage when the filesystem and host OS is able to support features like symlinks or hardlinks. That'd require a lot more testing and and care for different scenarios (e.g. host platforms that don't support links at all, host platforms that do with filesystems that don't, etc) and would therefore cause this work to not get done at all during the current release cycle, which is already full of work. This was only possible to squeeze in because it was a relatively easy extension of work already done, and so I decided to deal with this portion of the problem in isolation as it addresses the most common complaint.

@apparentlymart apparentlymart merged commit 1be8167 into v0.12-dev Jun 22, 2018
@apparentlymart apparentlymart deleted the f-module-get-cache branch June 22, 2018 16:43
@rismoney
Copy link

Could it just check for existence in the cache? Not sure why you need unique paths even simulated with symlinks. Why can't each module be downloaded 1x, and check for existence (should be safe as per InstallModules 1x op)

If anything I see an issue with symlinks/single-fied storage, where parallelism/checkouts to refs are problematic across similar modules usage.

@apparentlymart
Copy link
Contributor Author

Let's take this discussion back into #11435, since this PR is now closed and will not see any further changes.

@ghost
Copy link

ghost commented Apr 3, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 3, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants