node processor's global HTTP request timeout frequently exceeded in large clusters #387

Open
jlamillan opened this issue May 24, 2022 · 1 comment · May be fixed by #389

jlamillan commented May 24, 2022

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

CCM Version: latest

What happened?

In large clusters (250+ nodes), the CCM node processor's global HTTP request timeout of 10 seconds frequently expires before the processor can iterate through all the VNICs in the compartment.

What you expected to happen?

The default timeout should scale with the number of nodes, or it should be configurable. At minimum, it should be large enough to avoid spurious context deadline exceeded errors, which lead to further problems such as nodes failing to be processed.

How to reproduce it (as minimally and precisely as possible)?

Run the OCI CCM against a large cluster (250+ nodes).

Anything else we need to know?

We first bumped the timeout to 1 minute, but that was not enough. We ultimately had to bump the default timeout to 5 minutes to resolve the issue.

Error log

oci-cloud-controller-manager-2gh4z oci-cloud-controller-manager 2022-05-24T18:58:47.397Z	ERROR	oci/node_info_controller.go:227	Failed to map provider ID to instance ID	{"component": "cloud-controller-manager", "node": "ashburn-ops-node-rtb-prd-ad2-public-vnic-180924", "error": "GetInstanceByNodeName: context deadline exceeded", "errorVerbose": "context deadline exceeded\ngithub.com/oracle/oci-cloud-controller-manager/pkg/oci/client.(*client).GetVNIC\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/oci/client/networking.go:54\ngithub.com/oracle/oci-cloud-controller-manager/pkg/oci/client.(*client).GetInstanceByNodeName\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/oci/client/compute.go:190\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*CloudProvider).InstanceID\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/instances.go:171\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.getInstanceByNode\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:225\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).processItem\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:162\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).processNextItem\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:139\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).runWorker\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:124\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).Run\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:119\nruntime.goexit\n\t/tools/go/src/runtime/asm_amd64.s:1374\nGetInstanceByNodeName\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*CloudProvider).InstanceID\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/instances.go:176\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.getInstanceByNode\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:225\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).processItem\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:162\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).processNextItem\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:139\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).runWorker\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:124\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).Run\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:119\nruntime.goexit\n\t/tools/go/src/runtime/asm_amd64.s:1374"}
AkarshES (Contributor) commented Nov 9, 2022

The loop over the VNICs is only needed if the display name of the compute instance does not match the node name in the Kubernetes cluster. We have not seen others hit this issue, most likely because their instance display names match their node names.
