
Reduce startup latency by removing some unneeded sleeps #2104

Merged · 1 commit · Oct 11, 2022

Conversation

bwagner5
Contributor

@bwagner5 bwagner5 commented Oct 6, 2022

What type of PR is this?

cleanup

Which issue does this PR fix:
N/A

What does this PR do / Why do we need it:
This PR reduces the startup latency of the VPC CNI by 2-4 seconds, which gets a K8s node into the Ready state faster.

When checking whether kube-apiserver is accessible, the `wait.PollInfinite(2*time.Second, func()...)` helper was used, which sleeps before taking any action and only then executes the func. Since kube-apiserver is generally already accessible when the CNI starts, this sleep was delaying a quick, successful request. I switched to `wait.PollImmediateInfinite`, which executes the func first and sleeps only if it was unsuccessful.

There was also a sleep when checking whether the ipamd daemon had started, but it was unnecessary since there's already a 1-second connection timeout in grpc-health-probe.

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:

Before removing the sleeps:

## aws-node logs:

2022-10-06T22:39:55.171648812Z stdout F {"level":"info","ts":"2022-10-06T22:39:55.171Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
2022-10-06T22:39:55.172662433Z stdout F {"level":"info","ts":"2022-10-06T22:39:55.172Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
2022-10-06T22:39:55.183080631Z stdout F {"level":"info","ts":"2022-10-06T22:39:55.182Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
2022-10-06T22:39:55.18412313Z stdout F {"level":"info","ts":"2022-10-06T22:39:55.183Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
2022-10-06T22:39:57.190165337Z stdout F {"level":"info","ts":"2022-10-06T22:39:57.189Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
2022-10-06T22:39:59.196139932Z stdout F {"level":"info","ts":"2022-10-06T22:39:59.195Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
2022-10-06T22:39:59.21347371Z stdout F {"level":"info","ts":"2022-10-06T22:39:59.213Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
2022-10-06T22:39:59.216869945Z stdout F {"level":"info","ts":"2022-10-06T22:39:59.216Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
2022-10-06T22:39:59.217669859Z stdout F {"level":"info","ts":"2022-10-06T22:39:59.217Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

## ipamd logs:

{"level":"info","ts":"2022-10-06T22:39:55.194Z","caller":"logger/logger.go:52","msg":"Constructed new logger instance"}
{"level":"info","ts":"2022-10-06T22:39:55.195Z","caller":"eniconfig/eniconfig.go:61","msg":"Initialized new logger as an existing instance was not found"}
{"level":"info","ts":"2022-10-06T22:39:55.203Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-10-06T22:39:55.203Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}
{"level":"info","ts":"2022-10-06T22:39:57.220Z","caller":"wait/wait.go:211","msg":"Successful communication with the Cluster! Cluster Version is: v1.22+. git version: v1.22.13-eks-15b7512. git tree state: clean. commit: 94138dfbea757d7aaf3b205419578ef186dd5efb. platform: linux/amd64"}

After removing the sleeps:

## aws-node logs:

2022-10-06T22:33:09.610064979Z stdout F {"level":"info","ts":"2022-10-06T22:33:09.609Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
2022-10-06T22:33:09.61105197Z stdout F {"level":"info","ts":"2022-10-06T22:33:09.610Z","caller":"entrypoint.sh","msg":"Copying CNI plugin binaries ... "}
2022-10-06T22:33:09.623162424Z stdout F {"level":"info","ts":"2022-10-06T22:33:09.622Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
2022-10-06T22:33:09.633499311Z stdout F {"level":"info","ts":"2022-10-06T22:33:09.633Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
2022-10-06T22:33:09.63456762Z stdout F {"level":"info","ts":"2022-10-06T22:33:09.634Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
2022-10-06T22:33:10.639838542Z stdout F {"level":"info","ts":"2022-10-06T22:33:10.639Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
2022-10-06T22:33:11.644939741Z stdout F {"level":"info","ts":"2022-10-06T22:33:11.644Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
2022-10-06T22:33:11.660966276Z stdout F {"level":"info","ts":"2022-10-06T22:33:11.660Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
2022-10-06T22:33:11.664213944Z stdout F {"level":"info","ts":"2022-10-06T22:33:11.664Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
2022-10-06T22:33:11.665077664Z stdout F {"level":"info","ts":"2022-10-06T22:33:11.664Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

## ipamd logs

{"level":"info","ts":"2022-10-06T22:33:09.645Z","caller":"logger/logger.go:52","msg":"Constructed new logger instance"}
{"level":"info","ts":"2022-10-06T22:33:09.645Z","caller":"eniconfig/eniconfig.go:61","msg":"Initialized new logger as an existing instance was not found"}
{"level":"info","ts":"2022-10-06T22:33:09.653Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-10-06T22:33:09.654Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}
{"level":"info","ts":"2022-10-06T22:33:09.665Z","caller":"wait/wait.go:211","msg":"Successful communication with the Cluster! Cluster Version is: v1.22+. git version: v1.22.13-eks-15b7512. git tree state: clean. commit: 94138dfbea757d7aaf3b205419578ef186dd5efb. platform: linux/amd64"}

Graph with latency reduction:

[Screenshot: graph showing the startup latency reduction (Screen Shot 2022-10-06 at 18 22 14)]

Testing done on this change:

See above

Automation added to e2e:

N/A

Will this PR introduce any new dependencies?:

Nope

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
N/A

Does this change require updates to the CNI daemonset config files to work?:

No

Does this PR introduce any user-facing change?:

No

Improve the startup time of the vpc-cni

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@bwagner5 bwagner5 requested a review from a team as a code owner October 6, 2022 22:50
@jdn5126 jdn5126 self-assigned this Oct 6, 2022
Contributor

@jdn5126 jdn5126 left a comment


LGTM, thanks for the explanation and cleanup

@jdn5126
Contributor

jdn5126 commented Oct 6, 2022

Looks like `make check-format` is failing against this change

@jayanthvn
Contributor

jayanthvn commented Oct 6, 2022

It will be because of the go1.19 version in the workflows.

@achevuru achevuru self-requested a review October 6, 2022 23:21
@bwagner5
Contributor Author

bwagner5 commented Oct 6, 2022

➜  amazon-vpc-cni-k8s git:(rm-unneeded-sleeps) make check-format
➜  amazon-vpc-cni-k8s git:(rm-unneeded-sleeps) echo $?
0
➜  amazon-vpc-cni-k8s git:(rm-unneeded-sleeps) go version
go version go1.19 darwin/amd64

Contributor

@achevuru achevuru left a comment


Intention was to add an additional 1 sec between retries for the grpc probe. Ideally, retry should only happen if there is a delay of some sort during bootup and we've seen issues under load especially.

@bwagner5
Contributor Author

bwagner5 commented Oct 7, 2022

> Intention was to add an additional 1 sec between retries for the grpc probe. Ideally, retry should only happen if there is a delay of some sort during bootup and we've seen issues under load especially.

Not opposed to keeping the 1-second sleep on the grpc probe, but I would like to understand what it's protecting. I'm not sure how the sleep would help under load, since it's only executed on startup, while the service isn't accessible yet. There is an implicit one-second pause when the connection is attempted, which seems sufficient.

@achevuru
Contributor

achevuru commented Oct 7, 2022

> Intention was to add an additional 1 sec between retries for the grpc probe. Ideally, retry should only happen if there is a delay of some sort during bootup and we've seen issues under load especially.
>
> Not opposed to keeping the 1 sec sleep on the grpc probe but would like to understand what it's protecting. I'm not sure how the sleep would help under load since it's only executed on startup and the service isn't accessible yet. There is an implicit one sec pause when the connection is attempted which seems sufficient.

During a node bootup (or restart), the CNI is one of the initial pods to come up, so load won't affect it; but a restart on a running node that is under resource stress can contribute to delays, which is one of the reasons we had to increase the liveness/readiness (exec) probe timeouts to 10s. Now, this is a forever loop waiting on a particular condition during startup, and this sleep I guess helps make the probe less aggressive, with the downside of up to 1s of additional startup latency. 1s is the worst-case scenario, and CNI initialization can take a while anyway (we set initialDelaySeconds to 60s for probes), so maybe this is alright.

@jayanthvn
Contributor

> ➜ amazon-vpc-cni-k8s git:(rm-unneeded-sleeps) make check-format
> ➜ amazon-vpc-cni-k8s git:(rm-unneeded-sleeps) echo $?
> 0
> ➜ amazon-vpc-cni-k8s git:(rm-unneeded-sleeps) go version
> go version go1.19 darwin/amd64

Yeah, I tried patching your change and running `make format`. Seems like a workflow issue.

@jayanthvn
Contributor

We should be good to merge, need to look into the unit test workflow.
