Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

external.Get should get from cache #1663

Closed
xrmzju opened this issue Oct 28, 2019 · 15 comments · Fixed by #1673
Closed

external.Get should get from cache #1663

xrmzju opened this issue Oct 28, 2019 · 15 comments · Fixed by #1673
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone

Comments

@xrmzju
Copy link
Member

xrmzju commented Oct 28, 2019

What steps did you take and what happened:
i create about 60k machines in my env, but everytime cluster-api restart or relist, it tooks a long time to process the whole item, the bottleneck is machinecontroller from metrics record by prometheus
work_duration
workqueue_depth

functions that took long times were dig out through add extra logs in machine controller: mainly because of external.Get func. it spend about 5 seconds every time we call it(reconcile Bootstrap and reconcileInfraStructure). client.Get() uses unstructuredClient to get unstructured obj, which will use dynamicResourceClient and send request to apiserver in every Get action. the request speed is limited by the QPS of the client(default is 20). i scaled the qps to 200, and client.Get() took about 0.5s.

Anything else you would like to add:
since we have list-watch external obj in cluster-api, i think external.Get func should take advantage of it and Get from the cache. it will also reduce the num of requests sent to apiserver

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 28, 2019
@xrmzju xrmzju changed the title external.Get took a long time external.Get should get from cache Oct 28, 2019
@xrmzju
Copy link
Member Author

xrmzju commented Oct 28, 2019

is there a way to finish this?

@xrmzju
Copy link
Member Author

xrmzju commented Oct 28, 2019

is there a way to finish this?

the reason is machine controller uses DelegatingReader,who Get and List requests for unstructured types using ClientReader(request directly to apiserver) while requests for any other type of object with use the CacheReader。i fix this by using NewClientFunc options when calling manager.New。 the newClientFunc returns a client uses cacheReader to do get and list, while clientReader to do write action.

func newClientFunc(cache cache.Cache, config *rest.Config, options client.Options) (client.Client, error) {
	// Create the Client for Write operations.
	c, err := client.New(config, options)
	if err != nil {
		return nil, err
	}

	return &client.DelegatingClient{
		Reader:       cache,
		Writer:       c,
		StatusClient: c,
	}, nil
}
```go

@ncdc
Copy link
Contributor

ncdc commented Oct 28, 2019

I was about to paste the same thing, from kubernetes-sigs/controller-runtime#615 (comment) 😄

@ncdc
Copy link
Contributor

ncdc commented Oct 28, 2019

In the linked controller-runtime issue, they say they don't use the cache by default for unstructured objects out of concern that you might accidentally/unknowingly end up caching all the resources in a cluster, and so it's up to someone (i.e., us) to override that. Given that we're already starting shared informers for all our external types, we should take advantage of the cache. I'm 👍 to making this change.

@xrmzju
Copy link
Member Author

xrmzju commented Oct 28, 2019

In the linked controller-runtime issue, they say they don't use the cache by default for unstructured objects out of concern that you might accidentally/unknowingly end up caching all the resources in a cluster, and so it's up to someone (i.e., us) to override that. Given that we're already starting shared informers for all our external types, we should take advantage of the cache. I'm 👍 to making this change.

nice.. i can't believe i did not find the issue, i have been struggling whole day to fix this...

@ncdc
Copy link
Contributor

ncdc commented Oct 28, 2019

@detiber @vincepri @chuckha @liztio WDYT?

@vincepri
Copy link
Member

I haven't read the whole linked issue yet, but how do we prevent caching all resources in a cluster?

@ncdc
Copy link
Contributor

ncdc commented Oct 28, 2019

Every reconciler we create results in caching all the resources of that specific type (e.g. the ClusterReconciler results in a shared informer that caches all Cluster resources). Additionally, any .Watches() calls that we make with a specific source.Kind result in a cache of that kind. So we're already caching all Clusters, MachineDeployments, MachineSets, Machines. And when we reconcileExternal to watch the infra & bootstrap resources, we're caching those too.

In other words, we will currently cache every typed resource we use the Manager's Client for, plus everything unstructured we use reconcileExternal for.

The thing that we're missing is using the cache against unstructured types.

Do we have any code that reads unstructured objects that we don't want cached?

@xrmzju
Copy link
Member Author

xrmzju commented Oct 28, 2019

I haven't read the whole linked issue yet, but how do we prevent caching all resources in a cluster?

InformersMap only start a informer for specific gvk when the Get func invoked

@xrmzju
Copy link
Member Author

xrmzju commented Oct 28, 2019

any idea to cache remote Cluster's nodes && secrets also? we list nodes too frequently, it is really expensive @vincepri @ncdc @detiber

@xrmzju
Copy link
Member Author

xrmzju commented Oct 28, 2019

any idea to cache remote Cluster's nodes && secrets also? we list nodes too frequently, it is really expensive @vincepri @ncdc @detiber

we list nodes for every reconcile in order to find the node with same providerid

nodeList, err := client.Nodes().List(listOpt)

@ncdc
Copy link
Contributor

ncdc commented Oct 28, 2019

Let's file a separate issue for that?

@detiber
Copy link
Member

detiber commented Oct 28, 2019

I like the idea of adding the caching for unstructured resources. If for some reason we end up needing to avoid caching a specific resource type in the future I think we can cross that bridge when we get there.

@vincepri
Copy link
Member

/priority important-soon
/help
/milestone v0.3.0

@k8s-ci-robot k8s-ci-robot added this to the v0.3.0 milestone Oct 28, 2019
@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 28, 2019
@k8s-ci-robot
Copy link
Contributor

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/priority important-soon
/help
/milestone v0.3.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Oct 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants