external.Get should get from cache #1663

xrmzju · 2019-10-28T05:56:46Z

What steps did you take and what happened:
I created about 60k machines in my environment. Every time cluster-api restarts or relists, it takes a long time to process all of the items. According to the metrics recorded by Prometheus, the bottleneck is the machine controller.

I tracked down the slow functions by adding extra logging to the machine controller. The main culprit is the external.Get func: each call takes about 5 seconds, and it is called from both reconcileBootstrap and reconcileInfrastructure. client.Get() uses the unstructuredClient to fetch unstructured objects, which goes through the dynamicResourceClient and sends a request to the apiserver on every Get. The request rate is limited by the client's QPS (default 20). After I raised the QPS to 200, client.Get() took about 0.5s.

Anything else you would like to add:
Since we already list-watch external objects in cluster-api, I think external.Get should take advantage of that and read from the cache. This would also reduce the number of requests sent to the apiserver.

/kind bug


xrmzju · 2019-10-28T07:08:45Z

is there a way to finish this?

xrmzju · 2019-10-28T13:36:10Z

> is there a way to finish this?

The reason is that the machine controller uses a DelegatingReader, which serves Get and List requests for unstructured types through the ClientReader (requests go directly to the apiserver), while requests for any other type of object use the CacheReader. I fixed this by passing a NewClientFunc option when calling manager.New. The newClientFunc returns a client that uses the CacheReader for Get and List, and a direct client for write operations.

```go
func newClientFunc(cache cache.Cache, config *rest.Config, options client.Options) (client.Client, error) {
	// Create the Client for Write operations.
	c, err := client.New(config, options)
	if err != nil {
		return nil, err
	}

	return &client.DelegatingClient{
		Reader:       cache,
		Writer:       c,
		StatusClient: c,
	}, nil
}
```
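The read/write split in the snippet above can be sketched without any Kubernetes dependencies. This is only an illustration of the delegating pattern: `reader`, `writer`, `apiServer`, `localCache`, and `delegatingClient` are hypothetical stand-ins, and the cache is pre-populated by hand rather than kept in sync by a watch.

```go
package main

import (
	"errors"
	"fmt"
)

// reader and writer are simplified, hypothetical stand-ins for the read and
// write halves of a Kubernetes client.
type reader interface {
	Get(key string) (string, error)
}

type writer interface {
	Update(key, value string) error
}

// apiServer stands in for the remote API server and counts round trips.
type apiServer struct {
	objects  map[string]string
	requests int
}

func (s *apiServer) Get(key string) (string, error) {
	s.requests++
	v, ok := s.objects[key]
	if !ok {
		return "", errors.New("not found: " + key)
	}
	return v, nil
}

func (s *apiServer) Update(key, value string) error {
	s.requests++
	s.objects[key] = value
	return nil
}

// localCache stands in for an informer-backed cache: reads stay in-process.
type localCache struct{ objects map[string]string }

func (c *localCache) Get(key string) (string, error) {
	v, ok := c.objects[key]
	if !ok {
		return "", errors.New("not cached: " + key)
	}
	return v, nil
}

// delegatingClient routes reads to one backend and writes to another.
type delegatingClient struct {
	reader
	writer
}

func main() {
	srv := &apiServer{objects: map[string]string{"machine-1": "v1"}}
	cache := &localCache{objects: map[string]string{"machine-1": "v1"}}
	c := delegatingClient{reader: cache, writer: srv}

	for i := 0; i < 3; i++ {
		if _, err := c.Get("machine-1"); err != nil {
			panic(err)
		}
	}
	fmt.Println("requests after 3 reads:", srv.requests)

	if err := c.Update("machine-1", "v2"); err != nil {
		panic(err)
	}
	fmt.Println("requests after 1 write:", srv.requests)
}
```

One consequence of this design is that a read issued right after a write can return the old value, because the cache only changes when the watch delivers the update.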

ncdc · 2019-10-28T13:39:05Z

I was about to paste the same thing, from kubernetes-sigs/controller-runtime#615 (comment) 😄

ncdc · 2019-10-28T13:42:40Z

In the linked controller-runtime issue, they say they don't use the cache by default for unstructured objects out of concern that you might accidentally/unknowingly end up caching all the resources in a cluster, and so it's up to someone (i.e., us) to override that. Given that we're already starting shared informers for all our external types, we should take advantage of the cache. I'm 👍 to making this change.

xrmzju · 2019-10-28T13:46:30Z

> In the linked controller-runtime issue, they say they don't use the cache by default for unstructured objects out of concern that you might accidentally/unknowingly end up caching all the resources in a cluster, and so it's up to someone (i.e., us) to override that. Given that we're already starting shared informers for all our external types, we should take advantage of the cache. I'm 👍 to making this change.

Nice. I can't believe I didn't find that issue; I have been struggling all day to fix this...

ncdc · 2019-10-28T13:56:08Z

@detiber @vincepri @chuckha @liztio WDYT?

vincepri · 2019-10-28T13:57:46Z

I haven't read the whole linked issue yet, but how do we prevent caching all resources in a cluster?

ncdc · 2019-10-28T14:07:45Z

Every reconciler we create results in caching all the resources of that specific type (e.g. the ClusterReconciler results in a shared informer that caches all Cluster resources). Additionally, any .Watches() calls that we make with a specific source.Kind result in a cache of that kind. So we're already caching all Clusters, MachineDeployments, MachineSets, Machines. And when we reconcileExternal to watch the infra & bootstrap resources, we're caching those too.

In other words, we will currently cache every typed resource we use the Manager's Client for, plus everything unstructured we use reconcileExternal for.

The thing that we're missing is using the cache against unstructured types.

Do we have any code that reads unstructured objects that we don't want cached?

xrmzju · 2019-10-28T14:09:59Z

> I haven't read the whole linked issue yet, but how do we prevent caching all resources in a cluster?

InformersMap only starts an informer for a specific GVK when Get is invoked for that GVK.
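A minimal sketch of that lazy behavior, assuming a simplified model: `gvk`, `informer`, and `lazyInformers` are hypothetical stand-ins and not the controller-runtime types. The point is only that nothing is cached for a kind until something asks for it.

```go
package main

import (
	"fmt"
	"sync"
)

// gvk is a simplified stand-in for a group/version/kind triple.
type gvk struct{ Group, Version, Kind string }

// informer is a placeholder for a shared informer and its local store.
type informer struct{ store map[string]string }

// lazyInformers creates an informer for a kind the first time it is requested
// and reuses it on every later request.
type lazyInformers struct {
	mu      sync.Mutex
	byKind  map[gvk]*informer
	started int // number of informers created so far
}

func (l *lazyInformers) get(kind gvk) *informer {
	l.mu.Lock()
	defer l.mu.Unlock()
	if inf, ok := l.byKind[kind]; ok {
		return inf
	}
	if l.byKind == nil {
		l.byKind = make(map[gvk]*informer)
	}
	inf := &informer{store: make(map[string]string)}
	l.byKind[kind] = inf
	l.started++
	return inf
}

func main() {
	// Example kinds only; any external type would behave the same way.
	infra := gvk{"infrastructure.cluster.x-k8s.io", "v1alpha2", "AWSMachine"}
	boot := gvk{"bootstrap.cluster.x-k8s.io", "v1alpha2", "KubeadmConfig"}

	var informers lazyInformers
	informers.get(infra)
	informers.get(infra) // reuses the existing informer
	informers.get(boot)
	fmt.Println("informers started:", informers.started)
}
```

Kinds that are never read never get an informer, so the set of cached resources is bounded by the kinds the controllers actually touch.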

xrmzju · 2019-10-28T14:51:16Z

Any ideas for also caching the remote cluster's Nodes and Secrets? We list Nodes too frequently, and it is really expensive. @vincepri @ncdc @detiber

xrmzju · 2019-10-28T14:53:20Z

> Any ideas for also caching the remote cluster's Nodes and Secrets? We list Nodes too frequently, and it is really expensive. @vincepri @ncdc @detiber

We list Nodes on every reconcile in order to find the Node with the matching providerID:

cluster-api/controllers/machine_controller_noderef.go

Line 103 in f7e0486

nodeList, err := client.Nodes().List(listOpt)
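To illustrate why this gets expensive, the sketch below contrasts scanning a full node list on every lookup with building a providerID index once. It is a simplified model under stated assumptions: `node`, `findByScan`, and `indexByProviderID` are hypothetical, and the index is just one possible direction, not something the thread settled on.

```go
package main

import "fmt"

// node is a simplified stand-in for a Kubernetes Node.
type node struct {
	Name       string
	ProviderID string
}

// findByScan mirrors the list-then-scan pattern: every lookup walks the
// whole list, and in the real controller also pays for a remote List call.
func findByScan(nodes []node, providerID string) (node, bool) {
	for _, n := range nodes {
		if n.ProviderID == providerID {
			return n, true
		}
	}
	return node{}, false
}

// indexByProviderID builds a lookup table once so later lookups are O(1).
func indexByProviderID(nodes []node) map[string]node {
	idx := make(map[string]node, len(nodes))
	for _, n := range nodes {
		idx[n.ProviderID] = n
	}
	return idx
}

func main() {
	nodes := []node{
		{Name: "node-a", ProviderID: "example:///zone-1/i-001"},
		{Name: "node-b", ProviderID: "example:///zone-1/i-002"},
	}

	if n, ok := findByScan(nodes, "example:///zone-1/i-002"); ok {
		fmt.Println("scan found:", n.Name)
	}

	idx := indexByProviderID(nodes)
	if n, ok := idx["example:///zone-1/i-001"]; ok {
		fmt.Println("index found:", n.Name)
	}
}
```

An index like this would have to be kept current by a watch on the remote cluster, which is the part the separate issue would need to address.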

ncdc · 2019-10-28T14:56:03Z

Let's file a separate issue for that?

detiber · 2019-10-28T15:15:49Z

I like the idea of adding the caching for unstructured resources. If for some reason we end up needing to avoid caching a specific resource type in the future I think we can cross that bridge when we get there.

vincepri · 2019-10-28T15:20:12Z

/priority important-soon
/help
/milestone v0.3.0

k8s-ci-robot · 2019-10-28T15:20:14Z

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/priority important-soon
/help
/milestone v0.3.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 28, 2019

xrmzju changed the title ~~external.Get took a long time~~ external.Get should get from cache Oct 28, 2019

k8s-ci-robot added this to the v0.3.0 milestone Oct 28, 2019

k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 28, 2019

k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Oct 28, 2019

xrmzju mentioned this issue Oct 29, 2019

:running use a client reads from cache for controller manager #1673

Merged

ncdc assigned xrmzju Oct 29, 2019

k8s-ci-robot closed this as completed in #1673 Oct 30, 2019
