
✨ Add remote cluster cache manager #2880

Merged

Conversation

JoelSpeed
Contributor

What this PR does / why we need it:
Add a remote cluster cache manager that owns the full lifecycle of
caches against workload clusters. Controllers that need to watch
resources in workload clusters should use this functionality.

This is the foundation for #2414 and #2577.

It does not currently check for connectivity failures to workload clusters. It only stops the workload cluster caches if there is an error retrieving the Cluster from the management cluster. If desired, I could add something that does a minimal "ping" to each workload cluster periodically, removing any caches that have connectivity issues. Or we could do that in a follow-up.

I have not updated MachineHealthCheck to use this. I could do that in this PR, or a follow-up.

I have not added anything for KubeadmControlPlane to use this. I could do that in this PR, or a follow-up (cc @detiber).

/priority important-soon

cc @vincepri @ncdc

--
This replaces #2835. I've taken the code that @ncdc wrote, addressed some feedback from the previous PR, and added a test suite for this code. I have not done anything regarding this comment https://github.com/kubernetes-sigs/cluster-api/pull/2835/files#r401783111 as I was not sure how to interpret the conversation.

With the code as is, it is not actually being used anywhere. To use it, we need to create a ClusterCacheTracker and call NewClusterCacheReconciler within main.go, pass the ClusterCacheTracker into the controllers that need access to remote caches (e.g. MHC), and then, from within each controller, call ClusterCacheTracker.Watch when reconciling its object. Tracking which clusters have already been watched, and deduplicating watches, is handled within the ClusterCacheTracker, so no tracking is needed within individual controllers.

However, the deduplication will not work properly if each Watch call passes a different set of predicates. I'm not sure whether that's a major problem; I'm open to suggestions for a rewrite if there's a good way to fix it and we think it's important.
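
For illustration, the wiring described above might look roughly like the following. The constructor, the reconciler's fields, and the Watch signature here are assumptions made for the sake of the sketch (as is the nodeToMachineHealthCheck mapper), not necessarily the exact API in this PR:

```go
// main.go (sketch): create a single ClusterCacheTracker and register the
// reconciler that evicts workload cluster caches when Clusters are deleted.
tracker, err := remote.NewClusterCacheTracker(ctrl.Log.WithName("remote").WithName("ClusterCacheTracker"), mgr)
if err != nil {
	setupLog.Error(err, "unable to create cluster cache tracker")
	os.Exit(1)
}
if err := (&remote.ClusterCacheReconciler{
	Client:  mgr.GetClient(),
	Log:     ctrl.Log.WithName("remote").WithName("ClusterCacheReconciler"),
	Tracker: tracker,
}).SetupWithManager(mgr); err != nil {
	setupLog.Error(err, "unable to create cluster cache reconciler")
	os.Exit(1)
}

// In a controller's Reconcile (e.g. MachineHealthCheck): register a watch on the
// workload cluster. The tracker deduplicates repeated registrations, so the
// controller doesn't need to remember which clusters it has already watched.
if err := r.Tracker.Watch(ctx, remote.WatchInput{
	Cluster: util.ObjectKey(cluster),
	Watcher: r.controller,
	Kind:    &corev1.Node{},
	EventHandler: &handler.EnqueueRequestsFromMapFunc{
		ToRequests: handler.ToRequestsFunc(r.nodeToMachineHealthCheck), // hypothetical mapper
	},
}); err != nil {
	return ctrl.Result{}, err
}
```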

@k8s-ci-robot added the priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) labels Apr 8, 2020
@k8s-ci-robot added the size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.) label Apr 8, 2020
@detiber
Member

detiber commented Apr 8, 2020

It does not currently check for connectivity failures to workload clusters.

This definitely brings up some interesting questions...

  • Should it be the responsibility of the cache tracker to ensure that users won't be retrieving "stale" resources from the cache?
  • Or should it be the responsibility of the consumer (potentially with help from a helper utility) to determine if they care about the Node info being "stale"?

@vincepri
Member

vincepri commented Apr 8, 2020

/milestone v0.3.4

@k8s-ci-robot added this to the v0.3.4 milestone Apr 8, 2020
@ncdc
Contributor

ncdc commented Apr 8, 2020

I could see us adding goroutine(s) that periodically attempt to do a live query against each workload cluster's apiserver, maybe?

@vincepri
Member

vincepri commented Apr 8, 2020

I'd leave it as a client responsibility to decide what to do when connectivity cannot be established to a workload cluster. In some cases we might want a controller not to error if we can't establish contact; in other cases we might want to return an error and go into a backoff loop.

Given that all use cases aren't clear yet, I'd leave it to clients to decide and bring up requirements in the future.

@ncdc
Contributor

ncdc commented Apr 8, 2020

I think it should be this cluster cache manager's responsibility to evict the workload cache if connectivity is down. Otherwise, clients (controllers, really) will be unaware they're not getting updated data.

@vincepri
Member

vincepri commented Apr 8, 2020

What if the loss of connectivity is temporary (e.g. during an upgrade)? Would we have the ability to recover?

@ncdc
Contributor

ncdc commented Apr 8, 2020

We could code it to retry up to $conditions (n times, some timeout, etc.) before evicting. After that, it would require either an event (create/update/delete) or a full (management cluster cache) resync to get the cache recreated.
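
Something like this, purely as a sketch; the interval, failure threshold, and the probe/evict hooks are all made up for illustration:

```go
// Sketch only: probe a workload cluster on a fixed interval and evict its cache
// after too many consecutive failures. probe and evict would be supplied by the
// tracker (e.g. a lightweight apiserver request and the cache teardown).
func healthCheckLoop(ctx context.Context, probe func(context.Context) error, evict func()) {
	const (
		interval    = 10 * time.Second // how often to probe
		maxFailures = 6                // ~1 minute of consecutive failures before evicting
	)
	failures := 0
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := probe(ctx); err != nil {
				failures++
				if failures >= maxFailures {
					// Give up and evict. Recreating the cache then requires an event
					// (create/update/delete) or a full management cluster resync to
					// trigger a reconcile that calls Watch again.
					evict()
					return
				}
				continue
			}
			failures = 0 // healthy again, reset the counter
		}
	}
}
```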

@JoelSpeed
Contributor Author

JoelSpeed commented Apr 8, 2020

A benefit(?) of @ncdc's suggestion above is that the Watch would then fail, and the client (controller) would be aware that there are connectivity issues, which it can handle in its own manner.

@vincepri
Member

vincepri commented Apr 8, 2020

That sounds good. I do think we need to think about recovering from connectivity errors.

@ncdc
Contributor

ncdc commented Apr 8, 2020

@JoelSpeed how would the watch fail?

@JoelSpeed
Contributor Author

@ncdc When calling newClusterCache (assuming the cache has been deleted because of the connectivity failure), cache.New attempts discovery of resources from the target cluster's API while creating a NewDiscoveryRESTMapper, because we provide no Mapper through the options. If discovery fails because the API server is still unavailable, that error should be propagated and returned by Watch.
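
Roughly this path, as a minimal sketch assuming controller-runtime's cache package (the error handling around it is illustrative):

```go
// No Mapper is set in the options, so cache.New builds a REST mapper via API
// discovery against the workload cluster's apiserver. If that apiserver is
// unreachable, the error surfaces here and is returned to the caller of Watch.
remoteCache, err := cache.New(restConfig, cache.Options{
	Scheme: scheme,
})
if err != nil {
	return errors.Wrap(err, "error creating cache for remote cluster")
}
// ... start remoteCache and register the requested watch against it ...
```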

@ncdc
Contributor

ncdc commented Apr 8, 2020

@JoelSpeed ah, that makes sense. Thanks for clarifying! My suggestion was about a different section of the code, though: a goroutine owned by the cluster cache manager/tracker/reconciler. The retries would be hidden from the other controllers. So a controller would only notice once it was reconciling something and called Watch after the cache had already been evicted, i.e. after the retry period had timed out.

@JoelSpeed
Contributor Author

@ncdc Ack, yeah, sorry; I follow, and I think we are on the same page, I just didn't explain it very well. I meant that after an eviction, as you've said, the controller would need some event to trigger a reconcile, and then Watch would fail. So the client controller wouldn't know that the remote cluster is broken until it next reconciles and calls Watch.

@JoelSpeed
Contributor Author

Do we want to try to tackle the connection loss in this PR, or should that be considered a follow-up?

@ncdc
Contributor

ncdc commented Apr 14, 2020

I like the idea of having PRs be as complete as possible, but will defer to @vincepri if he feels differently.

@vincepri
Member

Connection loss seems like an integral part of this package; +1 to what Andy said.

@JoelSpeed
Contributor Author

JoelSpeed commented Apr 14, 2020

Ack, I can have a go at @ncdc's suggestion. Does anyone have opinions on how long the timeout should be (1m, 5m?), how long the polling interval should be (10s, 30s?), whether these should be user configurable (flags to the manager?), and what a "safe" but lightweight way to check connectivity to the cluster would be (poll /healthz maybe)?
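
For the last point, I was imagining something as simple as this; the function, timeout, and client here are entirely illustrative:

```go
// Lightweight connectivity probe: hit the cluster's /healthz endpoint with a
// short timeout; any error counts as a failed check.
// Note: DoRaw takes a context in client-go >= v0.18; older versions take no arguments.
func healthz(ctx context.Context, restClient rest.Interface) error {
	_, err := restClient.Get().
		AbsPath("/healthz").
		Timeout(5 * time.Second).
		DoRaw(ctx)
	return err
}
```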

@vincepri
Member

@JoelSpeed This is what CAPA is currently using to health check: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/9913ac8dbb96ed9bcffa6eba75b886cd4ef8c20b/pkg/cloud/services/elb/loadbalancer.go#L263-L269. We can work off of that.

@vincepri
Member

@JoelSpeed any updates on this PR?

@JoelSpeed
Contributor Author

@vincepri Sorry, I got stuck on the health checking; it was more challenging to work out what to do and how to test it than I thought it would be. I've pushed some health-checking code, though I'm not 100% sure about it. PTAL.

@ncdc
Contributor

ncdc commented May 4, 2020

LGTM, thanks!

@JoelSpeed
Contributor Author

That was quick, thanks @ncdc! Squashed

@ncdc
Contributor

ncdc commented May 4, 2020

/lgtm

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label May 4, 2020
@vincepri
Member

vincepri commented May 4, 2020

/test pull-cluster-api-test

@ncdc
Contributor

ncdc commented May 4, 2020

/hold

@k8s-ci-robot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label May 4, 2020
@ncdc
Contributor

ncdc commented May 4, 2020

Something is flaky in the new tests:

• Failure [0.452 seconds]
ClusterCache Reconciler suite
/home/prow/go/src/sigs.k8s.io/cluster-api/controllers/remote/cluster_cache_reconciler_test.go:42
  When running the ClusterCacheReconciler
  /home/prow/go/src/sigs.k8s.io/cluster-api/controllers/remote/cluster_cache_reconciler_test.go:43
    when clusters are deleted
    /home/prow/go/pkg/mod/github.com/onsi/ginkgo@v1.12.0/extensions/table/table.go:92
      when no clusters deleted [It]
      /home/prow/go/pkg/mod/github.com/onsi/ginkgo@v1.12.0/extensions/table/table_entry.go:43
      Failed after 0.000s.
      
      Expected
          <bool>: true
      to be false
      /home/prow/go/src/sigs.k8s.io/cluster-api/controllers/remote/cluster_cache_reconciler_test.go:199

@JoelSpeed
Contributor Author

Ack, will investigate. I didn't run the tests much after fixing the races, so I may have introduced a problem.

@ncdc
Contributor

ncdc commented May 4, 2020

/lgtm cancel

@k8s-ci-robot removed the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label May 4, 2020
@JoelSpeed
Contributor Author

I've spent the last 90 mins investigating this to work out where this could have flaked.

For this test to fail, something had to call Stop on the clusterCache, and there are only two places where that happens: the health checker and the reconciler.

If it was the health checker, that would imply the health check failed, but it's health checking the testenv API, so that shouldn't have happened, and I would expect it to have manifested before this point.

If it was the reconciler, that would imply the cluster didn't exist when we called reconcile. Again, I would have thought that would have come up before, but I've added an extra check to make sure the cluster exists in the cache before we call that.
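
The extra check is roughly along these lines; the names and timeout are illustrative:

```go
// Wait until the Cluster is actually visible through the cached management
// client before triggering a reconcile, so a not-yet-synced Cluster isn't
// mistaken for a deleted one (which would stop its cache).
Eventually(func() error {
	return k8sClient.Get(ctx, client.ObjectKey{Namespace: testNamespace, Name: testClusterName}, &clusterv1.Cluster{})
}, 5*time.Second).Should(Succeed())
```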

I played around with timings for this and added a big sleep at the beginning and was unable to reproduce the flake, so I think this should fix it.

I'd suggest we run the tests a few times on CI and see if it flakes again before squash/merge

@benmoss

benmoss commented May 4, 2020

Investigating this too, this might be of help: https://gist.github.com/benmoss/a83b50cbc38d7b1c99a1a6387b43d0ad

edit: nevermind that was based on an older checkout of this PR

@benmoss

benmoss commented May 4, 2020

I ran the tests 20 times and they didn't fail. There's a compilation error making it fail right now, though:

cluster-api/controllers/remote/cluster_cache_reconciler_test.go:78:19: no new variables on left side of :=
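
That error just means every variable on the left of := already exists in scope; a generic illustration (the function name is made up):

```go
cluster, err := createTestCluster() // first use of := declares cluster and err
// Re-using := on the same variables is what triggers "no new variables on left side of :=":
//   cluster, err := createTestCluster()
cluster, err = createTestCluster() // fix: plain assignment reuses the existing variables
```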

@vincepri
Member

vincepri commented May 4, 2020

Have you rebased everything on master? CI does that for us; I just want to make sure your last commit is also rebased.

@JoelSpeed
Contributor Author

Oops, I messed up my commit staging. I'll fix it up, rebase on the latest master, and re-push.

@JoelSpeed
Contributor Author

/test pull-cluster-api-test
/test pull-cluster-api-capd-e2e

@JoelSpeed
Contributor Author

/test pull-cluster-api-test

Just want to run this a few times to see if it flakes again in CI

@vincepri
Member

vincepri commented May 5, 2020

LGTM

@ncdc @benmoss ?

@JoelSpeed need to squash

@ncdc
Contributor

ncdc commented May 5, 2020

Squash away!

Add a remote cluster cache manager that owns the full lifecycle of
caches against workload clusters. Controllers that need to watch
resources in workload clusters should use this functionality.

Signed-off-by: Andy Goldstein <goldsteina@vmware.com>
@ncdc
Contributor

ncdc commented May 5, 2020

/lgtm
/hold cancel

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label and removed the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label May 5, 2020
@k8s-ci-robot merged commit 5d7c145 into kubernetes-sigs:master May 5, 2020
@JoelSpeed deleted the workload-cluster-cache branch May 5, 2020 16:17