-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache the endpoints on the controller loop #1001
Conversation
@fraenkel thanks for your PR 😄. |
@njuettner Sure, a valid question. We continue to have issues with AWS and rate limiting. With my previous AWS improvements, we are now at the point where we perform the same Route53 API calls exactly twice. Once at the beginning of the controller loop to retrieve the records and then again during ApplyChanges. In small environments, this is not an issue. For us, we have over 3000 DNS entries (C and TXT), and a large number of clusters, i.e, many external-dns. If you do the math, Records(), will be a minimum of 30 API calls, there are more for Zones, etc... Now you double it, and we are sitting at 60 API calls. Multiply this by some number of clusters, and you can see how this can start to be an issue. This change passes the endpoints retrieved in the controller loop back to the provider whereby we cut half the API calls. There will be some extra calls for Zones but I would be a bigger change to attempt to cache that. Providers can opt into using the cache when available. This PR only enabled AWS to leverage the cache in the context. Any other provider can choose to do the same. So there might be some confusion over what cache is disabled. In the case that caching within the Registry is enabled, I chose to disable providing the cached endpoints to the Provider. There are two reasons for doing so. The first is probably more obvious. I did not want to change the current behavior when caching is enabled. I mentioned this in my previous PR. If the provider is always seeing a stale version of the world, once you have any error (inconsistent view), you must wait until the next interval before you are corrected. Which bleeds into the second issue, the cache is updated regardless of the success or more importantly failure of ApplyChanges. Throttling and inconsistent views will immediately cause errors which means the cache does not reflect what happened but what should have happened. Any subsequent loops are making plans on incorrect data. In my previous PR, I argued that providing the same cached data or retrieving new ones didn't matter since a broken cache is going to lead to questionable outcomes. So for now, I am focusing on reducing API calls for single controller loop execution. I can go either way on passing cached data to the Provider. My take is that you only pass relatively current data. My other idea was to cache this data within the Provider. I did not like this approach because it made no guarantees on when data was refreshed for each pass of the controller. If possible, data should be current at the beginning rather than at the end after the Plan has been made. |
Sorry for my late response. Since you mentioned it can be applied for other providers, why not adding the same feature for all other providers? If I understood it correctly it will be only beneficial and do no harm. WDYT? If you're not comfortable with it, it is also fine and we can merge it. Thanks again 👍. |
I looked at a few providers to see if they could benefit. While alicloud could use something like this, their internal representation is different so they cannot. If we allowed providers to fill in cached content, then it could support additional use cases. However, I was trying to be as light as possible. |
One last request from my side. Do you mind rebase from master and fix it a TransIP (new provider)? Then I think we can merge it :). |
The controller will retrieve all the endpoints at the beginning of its loop. When changes need to be applied, the provider may need to query the endpoints again. Allow the provider to skip the queries if its data was cached.
|
@njuettner That isn't an error I introduced. I didn't touch any code in that path. I don't even know how that test can fail other than dryRun being set. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: njuettner The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This is one attempt to reduce the API load during ApplyChanges as seen by #484
I started with one path from the Controller -> Registry (TXT) -> Provider (AWS) as a proof of concept that it could be done relatively clean.
This model doesn't cache hosted zones for AWS since that information is not returned.
As shown by the test in aws_test.go, we now go from 3 ListRecordResourceSets to 0.
I also disabled the caching via context if one enables caching in the Registry the prevent the provider seeing potentially incorrect data.
Thoughts?