Coalescing of identical datasources #19942

Open
rifelpet opened this issue Jan 8, 2019 · 7 comments

Comments

@rifelpet

rifelpet commented Jan 8, 2019

Current Terraform Version

Terraform v0.11.11
+ provider.aws v1.33.0

Terraform v0.12.0-alpha4 (2c36829d3265661d8edbd5014de8090ea7e2a076)
+ provider.aws v1.40.0-6-gb23683732-dev

Use-cases

We have many modules that abstract away some usage requirements through the use of datasources. For example, rather than requiring the module user to know the correct value for a variable, the module can use a datasource to look that value up automatically.

This results in many effectively identical datasource lookups. If a single Terraform state contains many invocations of the same module, the same datasource is queried many times. Some of these datasources may have different inputs, for example if they depend on one of the module's variables, but others can be entirely identical. Making identical lookups is unnecessary and slows down plans.
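
As a sketch of the pattern (the module layout, variable, and resource names here are invented for illustration), a module might hide an AMI lookup from its caller like this, so every instance of the module performs its own lookup:

# modules/app/main.tf (hypothetical module): the caller never needs to know
# the AMI ID; the module resolves it itself with a datasource.
variable "instance_type" {
  type    = string
  default = "t2.micro"
}

data "aws_ami" "ubuntu" {
  owners      = ["099720109477"]
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/ubuntu-xenial-16.04-amd64-server-*"]
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type
}

Each module instance evaluates its own copy of data.aws_ami.ubuntu, even when the arguments resolve identically.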

Attempted Solutions

N/A

Proposal

For any given plan, Terraform should cache the results of datasources based on their inputs and provider configuration. If an identical datasource is found elsewhere in the plan, Terraform should use the cached result rather than performing an additional lookup.

Here is a simple example:

provider "aws" {}

data "aws_ami_ids" "ubuntu1" {
  owners = ["099720109477"]

  filter {
    name   = "name"
    values = ["ubuntu/images/ubuntu-xenial-16.04-amd64-server-20181223"]
  }
}
data "aws_ami_ids" "ubuntu2" {
  owners = ["099720109477"]

  filter {
    name   = "name"
    values = ["ubuntu/images/ubuntu-xenial-16.04-amd64-server-20181223"]
  }
}
$ TF_LOG=trace terraform plan 2>&1 | grep "ec2/DescribeImages"
2019-01-08T11:26:07.768-0800 [DEBUG] plugin.terraform-provider-aws_v1.33.0_x4: 2019/01/08 11:26:07 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/DescribeImages Details:
2019-01-08T11:26:07.768-0800 [DEBUG] plugin.terraform-provider-aws_v1.33.0_x4: 2019/01/08 11:26:07 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/DescribeImages Details:
2019-01-08T11:26:07.909-0800 [DEBUG] plugin.terraform-provider-aws_v1.33.0_x4: 2019/01/08 11:26:07 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DescribeImages Details:
2019-01-08T11:26:07.941-0800 [DEBUG] plugin.terraform-provider-aws_v1.33.0_x4: 2019/01/08 11:26:07 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DescribeImages Details:

The ec2/DescribeImages AWS API call is made twice even though the inputs are identical and it would be reasonable to assume the responses would also be identical. Therefore we shouldn't need to make the second API call and could instead use the cached result of the first.

Some of the more generic providers may not want this behavior (the http provider, for example), but for others, like the cloud providers, I can't think of any reason it would be undesirable. Perhaps datasource caching could be enabled via a provider configuration setting?
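
For illustration only, such a setting might look like this; the coalesce_datasources argument is hypothetical and does not exist in any provider:

provider "aws" {
  # Hypothetical opt-in argument (not a real provider setting): allow
  # Terraform to reuse the result of one datasource read for any other
  # read in the same plan with identical configuration.
  coalesce_datasources = true
}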

References

N/A

@apparentlymart
Contributor

Thanks for sharing this idea, @rifelpet!

I like this idea and it seems like a specialized case of batch requests (#7388). It would require a similar graph optimization step to what we were thinking about for batch requests -- in order to find and merge graph nodes -- but the merging rule could in principle be hard-coded into Terraform rather than informed by the provider as we assume for other cases.

With that said, it may be better to keep these assumptions out of Terraform Core and instead just implement the general batch request mechanism, letting the provider itself notice when it's possible to merge two read requests. The provider has better knowledge of what makes two requests equivalent, so it's likely to be able to merge requests in more cases than Terraform Core could by requiring the entire data resource configuration to be identical.

I'm going to leave this open as a separate issue for now since it may be possible to implement it separately from the broader batch request idea. If we decide to merge these two problems together after all then we might come back later and close this one as a duplicate.

Thanks again!

@apparentlymart changed the title from "Caching of identical datasources" to "Coalescing of identical datasources" Jan 8, 2019
@rifelpet
Author

rifelpet commented Jan 8, 2019

Awesome, thanks for the pointer to batch requests! I'll follow that issue. I'm definitely looking forward to Terraform and the providers supporting both cases; it would really speed up plans on some of our (unfortunately) larger states.

@funny-falcon-at-joomcode

funny-falcon-at-joomcode commented Jan 11, 2019

But why couldn't this be batched at the provider level? Just wait a hundred microseconds or so to collect more requests. Or am I being too naive?

@Manbeardo

Since there are valid reasons to re-fetch the same data source, maybe it'd make sense to add an opt-in meta-argument for data sources that allows the use of cached results?
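
For illustration, such a meta-argument might look like the following; use_cached_result is hypothetical and not part of Terraform:

data "aws_ami_ids" "ubuntu" {
  # Hypothetical opt-in meta-argument (not implemented in Terraform): when
  # true, Terraform may reuse the result of an earlier read whose resolved
  # configuration is identical instead of calling the provider again.
  use_cached_result = true

  owners = ["099720109477"]

  filter {
    name   = "name"
    values = ["ubuntu/images/ubuntu-xenial-16.04-amd64-server-20181223"]
  }
}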

@apparentlymart
Contributor

apparentlymart commented Jul 11, 2024

It's been a while since I last thought about this specific issue, but my thinking about batch requests has evolved in the meantime, and I believe the same ideas still apply to this request coalescing idea.

What I've been thinking about is actually along similar lines to what @Manbeardo suggested. Instead of introducing an artificial delay on the provider side, we can exploit the fact that Terraform's request concurrency limit (controlled by the unfortunately-named "parallelism" option) creates delays naturally. With some adjustments to how exactly Terraform honors that limit, we could allow requests to begin far enough to know what we intend to send to the provider, and then, each time a concurrency slot becomes available, take more than one request from the queue if our batching rules (whatever they turn out to be) suggest that more than one pending request could be handled together.

The way that might apply to this feature request is to allow a provider to declare some rule for which data requests are potentially coalescable -- a simple rule to start with could be to announce that all requests for a particular data resource type can potentially be coalesced -- and then Terraform Core would make a best effort to send as many such requests as possible in a single provider call. The provider can then use its own logic to decide what is and is not actually coalescable, falling back on handling all of the requests separately as normal in the worst case, but hopefully sometimes noticing that two requests are similar enough to be served from the same API call.

This makes the Core handling of coalescing a dynamic concern that unfolds based only on what happens to become runnable before the next concurrency slot becomes available, rather than trying to define an algorithm for deciding the coalescing opportunities statically as a dependency graph reduction.

This particular request has not seen a lot of interest since it was opened, so it seems that opportunities for coalescing data lookups are somewhat marginal in practice. I doubt this capability has sufficient priority to implement on its own, but it seems to me like it could come along with the general idea of batching if it were taken into account while designing the provider protocol changes for batching.

(You can find some recent prototyping I did about this in #35393. In that PR I was only researching whether it would be feasible to enforce the concurrency limit at a finer grain than entire graph nodes -- it is -- but that prototype didn't actually make any real use of that capability for batching, and I made the concurrency limits too fine-grained, so a real implementation would need to find a middle ground where the UI can still report requests only once they are actually starting. But I don't think that's something that needs its feasibility proven, just some design thought about where the best places are to acquire and release the semaphore to get the desired effect.)

@Manbeardo

In my particular application, I have some commonly-imported modules that provide configuration that's reused across several modules. Sometimes those modules contain data sources, which results in the data being re-fetched every time the module is included. It isn't causing any problems now, but it has the potential to become problematic as the number of consumers increases.

In my case, I'm not looking for coalescing of requests that are "similar enough to be served from the same API call". I'm looking for cached results for identical data sources with exactly the same resolved parameters.
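
A minimal sketch of the situation, with invented module names: each include of the shared module repeats the same read.

module "config_a" {
  source = "./modules/shared-config"
}

module "config_b" {
  source = "./modules/shared-config"
}

# If ./modules/shared-config contains a data source, Terraform reads it once
# per module instance, even though the resolved arguments are identical.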

@apparentlymart
Contributor

Hi @Manbeardo,

I expect that a provider would only decide to coalesce two requests if it knows that they are equivalent as far as the remote API is concerned. A provider is the correct place in Terraform's architecture to make that decision because it has the target-platform-specific knowledge to know which rule makes sense.

If having exactly-identical request arguments is the rule for a particular case, then that should be trivial for a provider to implement, because the plugin framework already needs to compare values for equality for other reasons, so the mechanisms for that are already there.

In the meantime, I suspect this issue hasn't attracted more interest because many folks are following something like the recommendations from Module Composition, which suggests (among other things) having a shared module receive the data it needs as an input variable rather than fetching it with a data resource itself. That way the calling module can decide how best to obtain the data, and can often avoid making any data resource requests at all if the object in question is already managed elsewhere in the same configuration.
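
As a sketch of that recommendation (names invented for illustration), the shared module accepts the value as an input variable and the root module resolves it once:

# modules/app/variables.tf: accept the AMI ID instead of looking it up.
variable "ami_id" {
  type = string
}

# Root module: perform the lookup once and pass the result to every caller.
data "aws_ami" "ubuntu" {
  owners      = ["099720109477"]
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/ubuntu-xenial-16.04-amd64-server-*"]
  }
}

module "app_a" {
  source = "./modules/app"
  ami_id = data.aws_ami.ubuntu.id
}

module "app_b" {
  source = "./modules/app"
  ami_id = data.aws_ami.ubuntu.id
}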

(When we consider this later, it might be interesting to ponder whether it makes sense to give providers the option of coalescing a managed resource "refresh" with a data source "read" if they were going to end up making the same API call anyway; in that case, the choice to use the module composition approach would be less consequential. But that seems considerably more complex for providers to implement, so we'd need to see whether the plugin framework can make it ergonomic enough that provider developers would actually be willing to do it -- there's no point in offering a provider protocol feature that nobody actually uses because it's too hard to make use of.)
