
✨ Add remote cluster cache manager #2880

Merged

Conversation

JoelSpeed
Contributor

What this PR does / why we need it:
Add a remote cluster cache manager that owns the full lifecycle of
caches against workload clusters. Controllers that need to watch
resources in workload clusters should use this functionality.

This is the foundation for #2414 and #2577.

It does not currently check for connectivity failures to workload clusters. It only stops the workload cluster caches if there is an error retrieving the Cluster from the management cluster. If desired, I could add something that does a minimal "ping" to each workload cluster periodically, removing any caches that have connectivity issues. Or we could do that in a follow-up.

I have not updated MachineHealthCheck to use this. I could do that in this PR, or a follow-up.

I have not added anything for KubeadmControlPlane to use this. I could do that in this PR, or a follow-up (cc @detiber).

/priority important-soon

cc @vincepri @ncdc

--
This replaces #2835. I've taken the code that @ncdc wrote, addressed some feedback from the previous PR, and added a test suite for this code. I have not done anything regarding this comment https://github.com/kubernetes-sigs/cluster-api/pull/2835/files#r401783111 as I was not sure how to interpret the conversation.

With the code as is, it is not actually being used anywhere. To use it, we need to create a ClusterCacheTracker and call NewClusterCacheReconciler within main.go, pass the ClusterCacheTracker into the controllers that need access to remote caches (e.g. MHC), and then, from within each controller, call ClusterCacheTracker.Watch when reconciling its object. Tracking which clusters have already been watched, and deduplicating watches, is handled within the ClusterCacheTracker, so no tracking is needed within individual controllers.

However, the deduplication will not work properly if each Watch call passes a different set of predicates. I'm not sure whether that's a major problem; I'm open to suggestions for a rewrite if there's a good way to fix it and we think it's important.
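
For illustration, the wiring described above might look roughly like the following. The constructor, the reconciler's fields, and the Watch signature here are assumptions made for the sake of the sketch (as is the nodeToMachineHealthCheck mapper), not necessarily the exact API in this PR:

```go
// main.go (sketch): create a single ClusterCacheTracker and register the
// reconciler that evicts workload cluster caches when Clusters are deleted.
tracker, err := remote.NewClusterCacheTracker(ctrl.Log.WithName("remote").WithName("ClusterCacheTracker"), mgr)
if err != nil {
	setupLog.Error(err, "unable to create cluster cache tracker")
	os.Exit(1)
}
if err := (&remote.ClusterCacheReconciler{
	Client:  mgr.GetClient(),
	Log:     ctrl.Log.WithName("remote").WithName("ClusterCacheReconciler"),
	Tracker: tracker,
}).SetupWithManager(mgr); err != nil {
	setupLog.Error(err, "unable to create cluster cache reconciler")
	os.Exit(1)
}

// In a controller's Reconcile (e.g. MachineHealthCheck): register a watch on the
// workload cluster. The tracker deduplicates repeated registrations, so the
// controller doesn't need to remember which clusters it has already watched.
if err := r.Tracker.Watch(ctx, remote.WatchInput{
	Cluster: util.ObjectKey(cluster),
	Watcher: r.controller,
	Kind:    &corev1.Node{},
	EventHandler: &handler.EnqueueRequestsFromMapFunc{
		ToRequests: handler.ToRequestsFunc(r.nodeToMachineHealthCheck), // hypothetical mapper
	},
}); err != nil {
	return ctrl.Result{}, err
}
```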

@k8s-ci-robot added the priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) labels Apr 8, 2020
@k8s-ci-robot added the size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.) label Apr 8, 2020
@detiber
Member

detiber commented Apr 8, 2020

It does not currently check for connectivity failures to workload clusters.

This definitely brings up some interesting questions...

  • Should it be the responsibility of the cache tracker to ensure that users won't be retrieving "stale" resources from the cache?
  • Or should it be the responsibility of the consumer (potentially with help from a helper utility) to determine if they care about the Node info being "stale"?

@vincepri
Member

vincepri commented Apr 8, 2020

/milestone v0.3.4

@k8s-ci-robot added this to the v0.3.4 milestone Apr 8, 2020
@ncdc
Contributor

ncdc commented Apr 8, 2020

I could see us adding goroutine(s) that periodically attempt to do a live query against each workload cluster's apiserver, maybe?

@vincepri
Member

vincepri commented Apr 8, 2020

I'd leave it as a client responsibility to decide what to do when connectivity cannot be established to a workload cluster. In some cases we might want a controller not to error if we can't establish contact; in other cases we might want to return an error and go into a backoff loop.

Given that all use cases aren't clear yet, I'd leave it to clients to decide and bring up requirements in the future.

@ncdc
Contributor

ncdc commented Apr 8, 2020

I think it should be this cluster cache manager's responsibility to evict the workload cache if connectivity is down. Otherwise, clients (controllers, really) will be unaware they're not getting updated data.

@vincepri
Member

vincepri commented Apr 8, 2020

What if the loss of connectivity is temporary (e.g. during an upgrade)? Would we have the ability to recover?

@ncdc
Contributor

ncdc commented Apr 8, 2020

We could code it to retry up to $conditions (n times, some timeout, etc.) before evicting. After that, it would require either an event (create/update/delete) or a full (management cluster cache) resync to get the cache recreated.
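
Something like this, purely as a sketch; the interval, failure threshold, and the probe/evict hooks are all made up for illustration:

```go
// Sketch only: probe a workload cluster on a fixed interval and evict its cache
// after too many consecutive failures. probe and evict would be supplied by the
// tracker (e.g. a lightweight apiserver request and the cache teardown).
func healthCheckLoop(ctx context.Context, probe func(context.Context) error, evict func()) {
	const (
		interval    = 10 * time.Second // how often to probe
		maxFailures = 6                // ~1 minute of consecutive failures before evicting
	)
	failures := 0
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := probe(ctx); err != nil {
				failures++
				if failures >= maxFailures {
					// Give up and evict. Recreating the cache then requires an event
					// (create/update/delete) or a full management cluster resync to
					// trigger a reconcile that calls Watch again.
					evict()
					return
				}
				continue
			}
			failures = 0 // healthy again, reset the counter
		}
	}
}
```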

@JoelSpeed
Contributor Author

JoelSpeed commented Apr 8, 2020

A benefit(?) of @ncdc's suggestion above is that the Watch would then fail, and the client (controller) would be aware that there are connectivity issues, which it can handle in its own manner.

@vincepri
Member

vincepri commented Apr 8, 2020

That sounds good. I do think we need to think about recovering from connectivity errors.

@ncdc
Contributor

ncdc commented Apr 8, 2020

@JoelSpeed how would the watch fail?

@JoelSpeed
Contributor Author

@ncdc When calling newClusterCache (assuming the cache has been deleted because of the connectivity failure), cache.New attempts discovery of resources from the target cluster's API while creating a NewDiscoveryRESTMapper, because we provide no Mapper through the options. If discovery fails because the API server is still unavailable, that error should be propagated and returned by Watch.
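
Roughly this path, as a minimal sketch assuming controller-runtime's cache package (the error handling around it is illustrative):

```go
// No Mapper is set in the options, so cache.New builds a REST mapper via API
// discovery against the workload cluster's apiserver. If that apiserver is
// unreachable, the error surfaces here and is returned to the caller of Watch.
remoteCache, err := cache.New(restConfig, cache.Options{
	Scheme: scheme,
})
if err != nil {
	return errors.Wrap(err, "error creating cache for remote cluster")
}
// ... start remoteCache and register the requested watch against it ...
```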

@ncdc
Contributor

ncdc commented Apr 8, 2020

@JoelSpeed ah, that makes sense. Thanks for clarifying! My suggestion was about a different section of the code, though: a goroutine owned by the cluster cache manager/tracker/reconciler. The retries would be hidden from the other controllers. So a controller would only notice once it was reconciling something and called Watch after the cache had already been evicted, i.e. after the retry period had timed out.

@JoelSpeed
Contributor Author

@ncdc Ack, yeah, sorry; I follow, and I think we are on the same page, I just didn't explain it very well. I meant that after an eviction, as you've said, the controller would need some event to trigger a reconcile, and then Watch would fail. So the client controller wouldn't know that the remote cluster is broken until it next reconciles and calls Watch.

@JoelSpeed
Contributor Author

Do we want to try to tackle the connection loss in this PR, or should that be considered a follow-up?

@ncdc
Contributor

ncdc commented Apr 14, 2020

I like the idea of having PRs be as complete as possible, but will defer to @vincepri if he feels differently.

@vincepri
Member

Connection loss seems like an integral part of this package; +1 to what Andy said.

@JoelSpeed
Contributor Author

JoelSpeed commented Apr 14, 2020

Ack, I can have a go at @ncdc's suggestion. Does anyone have opinions on how long the timeout should be (1m, 5m?), how long the polling interval should be (10s, 30s?), whether these should be user configurable (flags to the manager?), and what a "safe" but lightweight way to check connectivity to the cluster would be (poll /healthz maybe)?
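
For the last point, I was imagining something as simple as this; the function, timeout, and client here are entirely illustrative:

```go
// Lightweight connectivity probe: hit the cluster's /healthz endpoint with a
// short timeout; any error counts as a failed check.
// Note: DoRaw takes a context in client-go >= v0.18; older versions take no arguments.
func healthz(ctx context.Context, restClient rest.Interface) error {
	_, err := restClient.Get().
		AbsPath("/healthz").
		Timeout(5 * time.Second).
		DoRaw(ctx)
	return err
}
```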

@vincepri
Member

@JoelSpeed This is what CAPA is currently using to health check: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/9913ac8dbb96ed9bcffa6eba75b886cd4ef8c20b/pkg/cloud/services/elb/loadbalancer.go#L263-L269. We can work off of that.

@vincepri
Member

@JoelSpeed any updates on this PR?

@JoelSpeed
Contributor Author

@vincepri Sorry, I got stuck on the health checking; it was more challenging to work out what to do and how to test it than I thought it would be. I've pushed some health-checking code, though I'm not 100% sure about it. PTAL.

@ncdc
Contributor

ncdc commented May 4, 2020

LGTM, thanks!

@JoelSpeed
Contributor Author

That was quick, thanks @ncdc! Squashed

@ncdc
Contributor

ncdc commented May 4, 2020

/lgtm

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label May 4, 2020
@vincepri
Member

vincepri commented May 4, 2020

/test pull-cluster-api-test

@ncdc
Contributor

ncdc commented May 4, 2020

/hold

@k8s-ci-robot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label May 4, 2020
@ncdc
Contributor

ncdc commented May 4, 2020

Something is flaky in the new tests:

• Failure [0.452 seconds]
ClusterCache Reconciler suite
/home/prow/go/src/sigs.k8s.io/cluster-api/controllers/remote/cluster_cache_reconciler_test.go:42
  When running the ClusterCacheReconciler
  /home/prow/go/src/sigs.k8s.io/cluster-api/controllers/remote/cluster_cache_reconciler_test.go:43
    when clusters are deleted
    /home/prow/go/pkg/mod/github.com/onsi/ginkgo@v1.12.0/extensions/table/table.go:92
      when no clusters deleted [It]
      /home/prow/go/pkg/mod/github.com/onsi/ginkgo@v1.12.0/extensions/table/table_entry.go:43
      Failed after 0.000s.
      
      Expected
          <bool>: true
      to be false
      /home/prow/go/src/sigs.k8s.io/cluster-api/controllers/remote/cluster_cache_reconciler_test.go:199

@JoelSpeed
Contributor Author

Ack, will investigate. I didn't run the tests much after fixing the races, so I may have introduced a problem.

@ncdc
Contributor

ncdc commented May 4, 2020

/lgtm cancel

@k8s-ci-robot removed the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label May 4, 2020
@JoelSpeed
Contributor Author

I've spent the last 90 mins investigating this to work out where this could have flaked.

For this test to fail, something had to call Stop on the clusterCache, and there are only two places where that happens: the health checker and the reconciler.

If it was the health checker, that would imply the health check failed, but it's health checking the testenv API, so that shouldn't have happened, and I would expect it to have manifested before this point.

If it was the reconciler, that would imply the cluster didn't exist when we called reconcile. Again, I would have thought that would have come up before, but I've added an extra check to make sure the cluster exists in the cache before we call that.
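
The extra check is roughly along these lines; the names and timeout are illustrative:

```go
// Wait until the Cluster is actually visible through the cached management
// client before triggering a reconcile, so a not-yet-synced Cluster isn't
// mistaken for a deleted one (which would stop its cache).
Eventually(func() error {
	return k8sClient.Get(ctx, client.ObjectKey{Namespace: testNamespace, Name: testClusterName}, &clusterv1.Cluster{})
}, 5*time.Second).Should(Succeed())
```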

I played around with timings for this and added a big sleep at the beginning and was unable to reproduce the flake, so I think this should fix it.

I'd suggest we run the tests a few times on CI and see if it flakes again before squash/merge

@benmoss

benmoss commented May 4, 2020

Investigating this too, this might be of help: https://gist.github.com/benmoss/a83b50cbc38d7b1c99a1a6387b43d0ad

edit: nevermind that was based on an older checkout of this PR

@benmoss

benmoss commented May 4, 2020

I ran the tests 20 times and they didn't fail. There's a compilation error making it fail right now, though:

cluster-api/controllers/remote/cluster_cache_reconciler_test.go:78:19: no new variables on left side of :=
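
That error just means every variable on the left of := already exists in scope; a generic illustration (the function name is made up):

```go
cluster, err := createTestCluster() // first use of := declares cluster and err
// Re-using := on the same variables is what triggers "no new variables on left side of :=":
//   cluster, err := createTestCluster()
cluster, err = createTestCluster() // fix: plain assignment reuses the existing variables
```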

@vincepri
Member

vincepri commented May 4, 2020

Have you rebased everything on master? CI does that for us; I just want to make sure your last commit is also rebased.

@JoelSpeed
Contributor Author

Oops, I messed up my commit staging. I'll fix it up, rebase on the latest master, and re-push.

@JoelSpeed
Contributor Author

/test pull-cluster-api-test
/test pull-cluster-api-capd-e2e

@JoelSpeed
Contributor Author

/test pull-cluster-api-test

Just want to run this a few times to see if it flakes again in CI

@vincepri
Member

vincepri commented May 5, 2020

LGTM

@ncdc @benmoss ?

@JoelSpeed need to squash

@ncdc
Contributor

ncdc commented May 5, 2020

Squash away!

Add a remote cluster cache manager that owns the full lifecycle of
caches against workload clusters. Controllers that need to watch
resources in workload clusters should use this functionality.

Signed-off-by: Andy Goldstein <goldsteina@vmware.com>
@ncdc
Contributor

ncdc commented May 5, 2020

/lgtm
/hold cancel

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label and removed the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label May 5, 2020
@k8s-ci-robot merged commit 5d7c145 into kubernetes-sigs:master May 5, 2020
@JoelSpeed deleted the workload-cluster-cache branch May 5, 2020 16:17