-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
configload: Don't download the same module source multiple times #18309
Conversation
It is common for the same module source package to be referenced multiple times in the same configuration, either because there are literally multiple instances of the same module source or because a single package (or repository) contains multiple modules in sub-directories and many of them are referenced. To optimize this, here we introduce a simple caching behavior where the module installer will detect if it's asked to install multiple times from the same source and produce the second and subsequent directories by copying the first, rather than by downloading again over the network. This optimization is applied once all of the go-getter detection has completed and sub-directory portions have been trimmed, so it is also able to normalize differently-specified source addresses that all ultimately detect to the same resolved address. When installing, we always extract the entire specified package (or repository) and then reference the specified sub-directory, so we can safely re-use existing directories when the base package is the same, even if the sub-directory is different. However, as a result we do not yet address the fact that the same package will be stored multiple times _on disk_, which may still be problematic when referencing large repositories multiple times in disk-storage-constrained environments. We could address this in a subsequent change by investigating the use of symlinks where possible. Since the Registry installer is implemented just as an extra resolution step in front of go-getter, this optimization applies to registry modules too. This does not apply to local relative references, which will continue to just resolve into the already-prepared directory of their parent module. The cache of previously installed paths lives only for the duration of one call to InstallModules, so we will never re-use directories that were created by previous runs of "terraform init" and there is no risk that older versions will pollute the cache when attempting an upgrade from a source address that doesn't explicitly specify a version. No additional tests are added here because the existing module installer tests (when TF_ACC=1) already cover the case of installing multiple modules from the same source.
Trying to clarify this- if I have 2 module invocations that both the same exact reference: |
@rismoney as of this commit, it will create a local copy for each reference as before. This was intended as a safe and straightforward way to address the reasonable complaint that re-downloading over the network many times is slow and wasteful of possibly-metered data service. I believe we can also, in a later change, optimize the local disk usage when the filesystem and host OS is able to support features like symlinks or hardlinks. That'd require a lot more testing and and care for different scenarios (e.g. host platforms that don't support links at all, host platforms that do with filesystems that don't, etc) and would therefore cause this work to not get done at all during the current release cycle, which is already full of work. This was only possible to squeeze in because it was a relatively easy extension of work already done, and so I decided to deal with this portion of the problem in isolation as it addresses the most common complaint. |
Could it just check for existence in the cache? Not sure why you need unique paths even simulated with symlinks. Why can't each module be downloaded 1x, and check for existence (should be safe as per InstallModules 1x op) If anything I see an issue with symlinks/single-fied storage, where parallelism/checkouts to refs are problematic across similar modules usage. |
Let's take this discussion back into #11435, since this PR is now closed and will not see any further changes. |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further. |
It is common for the same module source package to be referenced multiple times in the same configuration, either because there are literally multiple instances of the same module source or because a single package (or repository) contains multiple modules in sub-directories and many of them are referenced.
To optimize this, here we introduce a simple caching behavior where the module installer will detect if it's asked to install multiple times from the same source and produce the second and subsequent directories by copying the first, rather than by downloading again over the network.
This optimization is applied once all of the
go-getter
detection has completed and sub-directory portions have been trimmed, so it is also able to normalize differently-specified source addresses that allultimately detect to the same resolved address. When installing, we always extract the entire specified package (or repository) and then reference the specified sub-directory, so we can safely re-use existing directories when the base package is the same, even if the sub-directory is different.
However, as a result we do not yet address the fact that the same package will be stored multiple times on disk, which may still be problematic when referencing large repositories multiple times in disk-storage-constrained environments. We could address this in a subsequent change by investigating the use of symlinks where possible.
Since the Registry installer is implemented just as an extra resolution step in front of
go-getter
, this optimization applies to registry modules too. However, this does not apply to relative local references, which will continue to just resolve into the already-prepared directory of their parent module with no filesystem modification whatsoever.The cache of previously installed paths lives only for the duration of one call to
InstallModules
, so we will never re-use directories that were created by previous runs ofterraform init
and there is no riskthat older versions will pollute the cache when attempting an upgrade from a source address that doesn't explicitly specify a version.
No additional tests are added here because the existing module installer tests (when
TF_ACC=1
) already cover the case of installing multiple modules from the same source, exercising this new codepath.This is in response to #11435.