Add timeout to SDK k8s client #3070

zmerlynn · 2023-04-05T00:14:16Z

This seems to help with (many of?) the flakes we're seeing in CI by forcing the informer to retry lists, rather than the SDK dying after 30s of hanging.

markmandel

Seems like a good change no matter what 👍🏻

google-oss-prow · 2023-04-05T00:42:56Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: markmandel, zmerlynn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [markmandel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow · 2023-04-05T00:53:28Z

New changes are detected. LGTM label has been removed.

agones-bot · 2023-04-05T01:47:44Z

Build Succeeded 👏

Build Id: 8a612f44-eae4-49b7-ad09-35feb7880ab3

The following development artifacts have been built, and will exist for the next 30 days:

image: us-docker.pkg.dev/agones-images/ci/agones-controller:1.31.0-b5511ec-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.31.0-b5511ec-linux-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-ping:1.31.0-b5511ec-amd64
image: us-docker.pkg.dev/agones-images/ci/agones-allocator:1.31.0-b5511ec-amd64
Linux C++ SDK (build): agonessdk-1.31.0-b5511ec-amd64-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.31.0-b5511ec-amd64.zip

A preview of the website (the last 30 builds are retained):

https://b5511ec-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/3070/head:pr_3070 && git checkout pr_3070
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-b5511ec-amd64

The SDK client only ever accesses small amounts of data (single-object list / event updates), so latency of more than a couple of seconds is excessive. We also need to keep a relatively tight timeout during initialization to give the informer a chance to retry: the SDK won't reply to /healthz checks until the informer has synced once, and our liveness configuration only allows 9s before a positive /healthz.

The problem addressed by googleforgames#3070 is that on an indeterminate basis, we are seeing containers start without networking fully available. Once networking seems to work, it works fine.

However, the fix in googleforgames#3070 introduced a downside: heavy watch traffic, because I didn't quite understand that it would also block the hanging GET of the watch. See googleforgames#3106.

Instead of timing out the whole client, let's use an initial-probe approach and instead block on a successful GET (with a reasonable timeout) before we try to start informers.

Fixes googleforgames#3106

* Revert #3070, wait on networking a different way

The problem addressed by #3070 is that on an indeterminate basis, we are seeing containers start without networking fully available. Once networking seems to work, it works fine.

However, the fix in #3070 introduced a downside: heavy watch traffic, because I didn't quite understand that it would also block the hanging GET of the watch. See #3106.

Instead of timing out the whole client, let's use an initial-probe approach and instead block on a successful GET (with a reasonable timeout) before we try to start informers.

Along the way: fix nil pointer deref when TestPingHTTP fails

Fixes #3106

zmerlynn assigned markmandel Apr 5, 2023

google-oss-prow bot requested review from aLekSer and EricFortin April 5, 2023 00:14

google-oss-prow bot added the size/XS label Apr 5, 2023

zmerlynn requested review from markmandel and removed request for EricFortin and aLekSer April 5, 2023 00:14

Add timeout to SDK k8s client

33aff15

zmerlynn force-pushed the tight-client-timeout branch from f0fc75d to 33aff15 Compare April 5, 2023 00:21

markmandel approved these changes Apr 5, 2023

google-oss-prow bot added the lgtm label Apr 5, 2023

google-oss-prow bot added the approved label Apr 5, 2023

zmerlynn enabled auto-merge (squash) April 5, 2023 00:44

Merge branch 'main' into tight-client-timeout

b5511ec

google-oss-prow bot removed the lgtm label Apr 5, 2023

zmerlynn merged commit 29ec00d into googleforgames:main Apr 5, 2023

Kalaiselvi84 added the kind/feature New features for Agones label Apr 10, 2023

Kalaiselvi84 added this to the 1.31.0 milestone Apr 10, 2023

zmerlynn deleted the tight-client-timeout branch April 17, 2023 21:06

zmerlynn mentioned this pull request Apr 17, 2023

Excessive WATCH load from Agones 1.31 #3106

Closed

zmerlynn mentioned this pull request Apr 17, 2023

Revert #3070, wait on networking a different way #3107

Merged
