
Cache the endpoints on the controller loop #1001

Merged 1 commit into kubernetes-sigs:master on May 10, 2019

Conversation

fraenkel
Contributor

This is one attempt to reduce the API load during ApplyChanges, as seen in #484.

I started with one path from the Controller -> Registry (TXT) -> Provider (AWS) as a proof of concept that it could be done relatively cleanly.
This model doesn't cache hosted zones for AWS since that information is not returned.

As shown by the test in aws_test.go, we now go from 3 ListResourceRecordSets calls to 0.
I also disabled the caching via context when caching in the Registry is enabled, to prevent the provider from seeing potentially incorrect data.
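
For concreteness, the mechanism amounts to something like the following sketch. This is illustrative Go, not the exact code in the diff: the key type, the helper names, and the local Endpoint placeholder (standing in for external-dns's endpoint.Endpoint type) are all assumptions.

```go
package cache

import "context"

// Endpoint is a local stand-in for external-dns's endpoint.Endpoint type,
// defined here only so the sketch is self-contained.
type Endpoint struct {
	DNSName    string
	RecordType string
	Targets    []string
}

// recordsKey is an illustrative unexported context key type; the key added
// by this PR lives in the provider package and may be named differently.
type recordsKey struct{}

// WithCachedRecords stashes the endpoints the controller already fetched at
// the top of its loop onto the context it passes to ApplyChanges, so an
// opted-in provider does not have to list the zone contents a second time.
func WithCachedRecords(ctx context.Context, records []*Endpoint) context.Context {
	return context.WithValue(ctx, recordsKey{}, records)
}

// CachedRecords is the provider-side lookup: if the controller put records
// on the context, reuse them; otherwise fall back to querying the DNS API.
func CachedRecords(ctx context.Context) ([]*Endpoint, bool) {
	records, ok := ctx.Value(recordsKey{}).([]*Endpoint)
	return records, ok
}
```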

Thoughts?

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 25, 2019
@k8s-ci-robot k8s-ci-robot requested review from linki and njuettner April 25, 2019 12:15
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 25, 2019
@njuettner
Member

@fraenkel thanks for your PR 😄.
May I ask what you are trying to solve? What impact does this change have for all providers? And what's the benefit of disabling caching again?

@fraenkel
Contributor Author

@njuettner Sure, a valid question.

We continue to have issues with AWS and rate limiting. With my previous AWS improvements, we are now at the point where we perform the same Route53 API calls exactly twice: once at the beginning of the controller loop to retrieve the records, and then again during ApplyChanges. In small environments this is not an issue. For us, we have over 3000 DNS entries (C and TXT) and a large number of clusters, i.e., many external-dns instances. If you do the math, Records() alone is a minimum of 30 API calls, and there are more for Zones, etc. Now double it and we are sitting at 60 API calls. Multiply this by some number of clusters, and you can see how this can start to be an issue.

This change passes the endpoints retrieved in the controller loop back to the provider, which cuts the API calls in half. There will still be some extra calls for Zones, but it would be a bigger change to attempt to cache those. Providers can opt into using the cache when available. This PR only enables AWS to leverage the cache in the context; any other provider can choose to do the same.
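
As a hedged illustration of what that opt-in looks like from the provider's side, building on the WithCachedRecords/CachedRecords helpers sketched in the description above (again, illustrative names rather than the PR's exact code; the real change is in the AWS provider's record lookup):

```go
// sketchProvider is an illustrative provider; listFromAPI stands in for the
// expensive listing path (ListResourceRecordSets in the AWS case).
type sketchProvider struct {
	listFromAPI func(ctx context.Context) ([]*Endpoint, error)
}

// records is the read path of an opted-in provider.
func (p *sketchProvider) records(ctx context.Context) ([]*Endpoint, error) {
	if cached, ok := CachedRecords(ctx); ok {
		// The controller already fetched these at the top of its loop;
		// skip the second round of list calls during ApplyChanges.
		return cached, nil
	}
	// No cache on the context (for example, because Registry caching is
	// enabled): query the DNS API exactly as before.
	return p.listFromAPI(ctx)
}
```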

So there might be some confusion over which cache is disabled. When caching within the Registry is enabled, I chose not to provide the cached endpoints to the Provider. There are two reasons for doing so. The first is probably the more obvious one: I did not want to change the current behavior when Registry caching is enabled. I mentioned this in my previous PR. If the provider is always seeing a stale version of the world, then once you hit any error (an inconsistent view), you must wait until the next interval before you are corrected. Which bleeds into the second issue: the cache is updated regardless of the success or, more importantly, the failure of ApplyChanges. Throttling and inconsistent views will immediately cause errors, which means the cache does not reflect what happened but what should have happened. Any subsequent loops are making plans on incorrect data. In my previous PR, I argued that providing the same cached data or retrieving new ones didn't matter, since a broken cache is going to lead to questionable outcomes either way.
There are a few ways to fix this, but given the separation of the Registry cache and the Provider, there is no easy fix. One might think a quick fix is to update the cache only after a successful ApplyChanges, but then the cache is still inconsistent with the world when only partial changes were made.
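
A rough sketch of that decision at the controller level, reusing the illustrative helpers from the description (the function and parameter names here are assumptions, not the PR's identifiers):

```go
// contextForApply attaches the freshly fetched records only when the
// Registry's own cache is disabled, so the provider never plans against
// endpoints that may already be stale.
func contextForApply(ctx context.Context, registryCacheEnabled bool, records []*Endpoint) context.Context {
	if registryCacheEnabled {
		// Registry caching is on: keep the old behavior and let the
		// provider query the DNS API itself during ApplyChanges.
		return ctx
	}
	return WithCachedRecords(ctx, records)
}
```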

So for now, I am focusing on reducing API calls for a single controller loop execution. I can go either way on passing cached data to the Provider. My take is that you only pass relatively current data.

My other idea was to cache this data within the Provider. I did not like this approach because it made no guarantees on when data was refreshed for each pass of the controller. If possible, data should be current at the beginning rather than at the end after the Plan has been made.

@njuettner
Member

Sorry for my late response.

Since you mentioned it can be applied to other providers, why not add the same feature for all other providers? If I understood it correctly, it would only be beneficial and do no harm.

WDYT?

If you're not comfortable with that, it is also fine and we can merge it as is.
Did you have the chance to test your change in your clusters?

Thanks again 👍.

@njuettner njuettner self-assigned this May 7, 2019
@fraenkel
Contributor Author

fraenkel commented May 7, 2019

I looked at a few providers to see if they could benefit. While alicloud could use something like this, their internal representation is different, so they cannot use it as-is. If we allowed providers to fill in cached content, then it could support additional use cases. However, I was trying to be as light as possible.

@njuettner
Member

One last request from my side: do you mind rebasing from master and fixing TransIP (a new provider)? Then I think we can merge it :).

The controller will retrieve all the endpoints at the beginning of its
loop. When changes need to be applied, the provider may need to query
the endpoints again. Allow the provider to skip the queries if its data was
cached.
@njuettner
Member

--- FAIL: TestAWSSDProvider_CreateService (0.00s)
    require.go:738: 
        	Error Trace:	aws_sd_test.go:590
        	            				aws_sd_test.go:586
        	Error:      	"map[଎:{
        	            	  CreateDate: 2019-05-07 23:58:05.681471777 +0000 UTC m=+0.077964183,
        	            	  Description: "",
        	            	  DnsConfig: {
        	            	    DnsRecords: [{
        	            	        TTL: 100,
        	            	        Type: "A"
        	            	      }],
        	            	    NamespaceId: "private",
        	            	    RoutingPolicy: "WEIGHTED"
        	            	  },
        	            	  Id: "\u0b0e",
        	            	  Name: "ALIAS-srv"
        	            	} ᭌ:{
        	            	  CreateDate: 2019-05-07 23:58:05.681364288 +0000 UTC m=+0.077856706,
        	            	  Description: "",
        	            	  DnsConfig: {
        	            	    DnsRecords: [{
        	            	        TTL: 80,
        	            	        Type: "CNAME"
        	            	      }],
        	            	    NamespaceId: "private",
        	            	    RoutingPolicy: "WEIGHTED"
        	            	  },
        	            	  Id: "\u1b4c",
        	            	  Name: "CNAME-srv"
        	            	}]" should have 3 item(s), but has 2
        	Test:       	TestAWSSDProvider_CreateService

@fraenkel
Contributor Author

fraenkel commented May 8, 2019

@njuettner That isn't an error I introduced. I didn't touch any code in that path. I don't even know how that test can fail other than dryRun being set.
I have run the test 100s of times and it always passes.

@njuettner
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: njuettner

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 10, 2019
@k8s-ci-robot k8s-ci-robot merged commit 3b44e39 into kubernetes-sigs:master May 10, 2019