
Cache the endpoints on the controller loop #1001

Merged 1 commit into kubernetes-sigs:master on May 10, 2019

Conversation

fraenkel
Contributor

This is one attempt to reduce the API load during ApplyChanges, as seen in #484.

I started with one path from the Controller -> Registry (TXT) -> Provider (AWS) as a proof of concept that it could be done relatively cleanly.
This model doesn't cache hosted zones for AWS since that information is not returned.

As shown by the test in aws_test.go, we now go from 3 ListResourceRecordSets calls to 0.
I also disabled the caching via context when caching in the Registry is enabled, to prevent the provider from seeing potentially incorrect data.
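
For concreteness, the mechanism amounts to something like the following sketch. This is illustrative Go, not the exact code in the diff: the key type, the helper names, and the local Endpoint placeholder (standing in for external-dns's endpoint.Endpoint type) are all assumptions.

```go
package cache

import "context"

// Endpoint is a local stand-in for external-dns's endpoint.Endpoint type,
// defined here only so the sketch is self-contained.
type Endpoint struct {
	DNSName    string
	RecordType string
	Targets    []string
}

// recordsKey is an illustrative unexported context key type; the key added
// by this PR lives in the provider package and may be named differently.
type recordsKey struct{}

// WithCachedRecords stashes the endpoints the controller already fetched at
// the top of its loop onto the context it passes to ApplyChanges, so an
// opted-in provider does not have to list the zone contents a second time.
func WithCachedRecords(ctx context.Context, records []*Endpoint) context.Context {
	return context.WithValue(ctx, recordsKey{}, records)
}

// CachedRecords is the provider-side lookup: if the controller put records
// on the context, reuse them; otherwise fall back to querying the DNS API.
func CachedRecords(ctx context.Context) ([]*Endpoint, bool) {
	records, ok := ctx.Value(recordsKey{}).([]*Endpoint)
	return records, ok
}
```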

Thoughts?

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 25, 2019
@k8s-ci-robot k8s-ci-robot requested review from linki and njuettner April 25, 2019 12:15
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 25, 2019
@njuettner
Member

@fraenkel thanks for your PR 😄.
May I ask what you are trying to solve? What impact does this change have for all providers? And what's the benefit of disabling caching again?

@fraenkel
Contributor Author

@njuettner Sure, a valid question.

We continue to have issues with AWS and rate limiting. With my previous AWS improvements, we are now at the point where we perform the same Route53 API calls exactly twice: once at the beginning of the controller loop to retrieve the records, and then again during ApplyChanges. In small environments this is not an issue. For us, we have over 3000 DNS entries (C and TXT) and a large number of clusters, i.e., many external-dns instances. If you do the math, Records() alone is a minimum of 30 API calls, and there are more for Zones, etc. Now double it and we are sitting at 60 API calls. Multiply this by some number of clusters, and you can see how this can start to be an issue.

This change passes the endpoints retrieved in the controller loop back to the provider, which cuts the API calls in half. There will still be some extra calls for Zones, but it would be a bigger change to attempt to cache those. Providers can opt into using the cache when available. This PR only enables AWS to leverage the cache in the context; any other provider can choose to do the same.
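
As a hedged illustration of what that opt-in looks like from the provider's side, building on the WithCachedRecords/CachedRecords helpers sketched in the description above (again, illustrative names rather than the PR's exact code; the real change is in the AWS provider's record lookup):

```go
// sketchProvider is an illustrative provider; listFromAPI stands in for the
// expensive listing path (ListResourceRecordSets in the AWS case).
type sketchProvider struct {
	listFromAPI func(ctx context.Context) ([]*Endpoint, error)
}

// records is the read path of an opted-in provider.
func (p *sketchProvider) records(ctx context.Context) ([]*Endpoint, error) {
	if cached, ok := CachedRecords(ctx); ok {
		// The controller already fetched these at the top of its loop;
		// skip the second round of list calls during ApplyChanges.
		return cached, nil
	}
	// No cache on the context (for example, because Registry caching is
	// enabled): query the DNS API exactly as before.
	return p.listFromAPI(ctx)
}
```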

So there might be some confusion over which cache is disabled. When caching within the Registry is enabled, I chose not to provide the cached endpoints to the Provider. There are two reasons for doing so. The first is probably the more obvious one: I did not want to change the current behavior when Registry caching is enabled. I mentioned this in my previous PR. If the provider is always seeing a stale version of the world, then once you hit any error (an inconsistent view), you must wait until the next interval before you are corrected. Which bleeds into the second issue: the cache is updated regardless of the success or, more importantly, the failure of ApplyChanges. Throttling and inconsistent views will immediately cause errors, which means the cache does not reflect what happened but what should have happened. Any subsequent loops are making plans on incorrect data. In my previous PR, I argued that providing the same cached data or retrieving new ones didn't matter, since a broken cache is going to lead to questionable outcomes either way.
There are a few ways to fix this, but given the separation of the Registry cache and the Provider, there is no easy fix. One might think a quick fix is to update the cache only after a successful ApplyChanges, but then the cache is still inconsistent with the world when only partial changes were made.
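
A rough sketch of that decision at the controller level, reusing the illustrative helpers from the description (the function and parameter names here are assumptions, not the PR's identifiers):

```go
// contextForApply attaches the freshly fetched records only when the
// Registry's own cache is disabled, so the provider never plans against
// endpoints that may already be stale.
func contextForApply(ctx context.Context, registryCacheEnabled bool, records []*Endpoint) context.Context {
	if registryCacheEnabled {
		// Registry caching is on: keep the old behavior and let the
		// provider query the DNS API itself during ApplyChanges.
		return ctx
	}
	return WithCachedRecords(ctx, records)
}
```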

So for now, I am focusing on reducing API calls for a single controller loop execution. I can go either way on passing cached data to the Provider. My take is that you only pass relatively current data.

My other idea was to cache this data within the Provider. I did not like this approach because it made no guarantees on when data was refreshed for each pass of the controller. If possible, data should be current at the beginning rather than at the end after the Plan has been made.

@njuettner
Member

Sorry for my late response.

Since you mentioned it can be applied to other providers, why not add the same feature for all other providers? If I understood it correctly, it would only be beneficial and do no harm.

WDYT?

If you're not comfortable with that, it is also fine and we can merge it as is.
Did you have the chance to test your change in your clusters?

Thanks again 👍.

@njuettner njuettner self-assigned this May 7, 2019
@fraenkel
Contributor Author

fraenkel commented May 7, 2019

I looked at a few providers to see if they could benefit. While alicloud could use something like this, their internal representation is different, so they cannot use it as-is. If we allowed providers to fill in cached content, then it could support additional use cases. However, I was trying to be as light as possible.

@njuettner
Member

One last request from my side: do you mind rebasing from master and fixing TransIP (a new provider)? Then I think we can merge it :).

The controller will retrieve all the endpoints at the beginning of its
loop. When changes need to be applied, the provider may need to query
the endpoints again. Allow the provider to skip the queries if its data was
cached.
@njuettner
Member

--- FAIL: TestAWSSDProvider_CreateService (0.00s)
    require.go:738: 
        	Error Trace:	aws_sd_test.go:590
        	            				aws_sd_test.go:586
        	Error:      	"map[଎:{
        	            	  CreateDate: 2019-05-07 23:58:05.681471777 +0000 UTC m=+0.077964183,
        	            	  Description: "",
        	            	  DnsConfig: {
        	            	    DnsRecords: [{
        	            	        TTL: 100,
        	            	        Type: "A"
        	            	      }],
        	            	    NamespaceId: "private",
        	            	    RoutingPolicy: "WEIGHTED"
        	            	  },
        	            	  Id: "\u0b0e",
        	            	  Name: "ALIAS-srv"
        	            	} ᭌ:{
        	            	  CreateDate: 2019-05-07 23:58:05.681364288 +0000 UTC m=+0.077856706,
        	            	  Description: "",
        	            	  DnsConfig: {
        	            	    DnsRecords: [{
        	            	        TTL: 80,
        	            	        Type: "CNAME"
        	            	      }],
        	            	    NamespaceId: "private",
        	            	    RoutingPolicy: "WEIGHTED"
        	            	  },
        	            	  Id: "\u1b4c",
        	            	  Name: "CNAME-srv"
        	            	}]" should have 3 item(s), but has 2
        	Test:       	TestAWSSDProvider_CreateService

@fraenkel
Contributor Author

fraenkel commented May 8, 2019

@njuettner That isn't an error I introduced. I didn't touch any code in that path. I don't even know how that test can fail other than dryRun being set.
I have run the test 100s of times and it always passes.

@njuettner
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: njuettner

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 10, 2019
@k8s-ci-robot k8s-ci-robot merged commit 3b44e39 into kubernetes-sigs:master May 10, 2019