Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend e2e queue timings / Disable testing on Autopilot 1.26 #3059

Merged
merged 2 commits into from
Apr 3, 2023

Conversation

zmerlynn
Copy link
Collaborator

@zmerlynn zmerlynn commented Mar 31, 2023

This PR reworks the e2e timeouts to allow for more time for a given build to wait to run e2es, but tightens the e2e deadline slightly:

  • Tighten the per-e2e-configuration testcase to 1.5h. e2es are coming in close to an hour in some cases but now that we're not running consul, we don't need it as high as 2h. I don't think it's worth tightening this all the way to an hour, though it would probably work.

    • Also drops the queueTtl for the CI sub-builds, these should not be queued for long since we serialize e2es now.
  • Extends the e2e-wait-to-become-leader timeout to 3h. In higher traffic times, we're hitting this limit often now, which only results in a vicious cycle of retrying PRs. Instead wait longer to become leader.

  • Bumps the global timeout to 5h after aggregating: 3h (e2e-wait-to-become-leader) + 1.5h (e2e timeout) + 0.5h (everything else)

  • Remove vestigates of consul - it's no longer running anywhere.

  • Disables testing on Autopilot 1.26 to work around CI flaky on Autopilot 1.26 #3058. Will reenable in Rework game server health initial delay handling #3046.

@google-oss-prow google-oss-prow bot added the lgtm label Mar 31, 2023
@zmerlynn zmerlynn enabled auto-merge (squash) March 31, 2023 21:25
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 99b6ccdf-3e84-4532-9760-0e2cba467a32

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: a2550e3b-5312-4b8f-a749-ae3618ee84ed

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3059/head:pr_3059 && git checkout pr_3059
  • helm install agones ./install/helm/agones --namespace agones-system --agones.image.release=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-81c99ad-amd64

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: f40cfb94-2f8e-4473-9d47-5e99e6a357f6

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 02966393-ed9a-4faa-a624-b32afc20c9b1

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: ca82581f-ca96-45b0-8796-a6da0744aa61

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

This PR reworks the e2e timeouts to allow for more time for a given
build to wait to run e2es, but tightens the e2e deadline slightly:

* Tighten the per-e2e-configuration testcase to 1.5h. e2es are coming
in close to an hour in some cases but now that we're not running
consul, we don't need it as high as 2h. I don't think it's worth
tightening this all the way to an hour, though it would probably work.
  * Also drops the queueTtl for the CI sub-builds, these should not be
  queued for long since we serialize e2es now.

* Extends the e2e-wait-to-become-leader timeout to 3h. In higher
traffic times, we're hitting this limit often now, which only results
in a vicious cycle of retrying PRs. Instead wait longer to become
leader.

* Bumps the global timeout to 5h after aggregating:
  3h (e2e-wait-to-become-leader) + 1.5h (e2e timeout) + 0.5h (everything else)

* Remove vestigates of consul - it's no longer running anywhere.
@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Apr 3, 2023
@zmerlynn zmerlynn changed the title Extend e2e queue timings Extend e2e queue timings / Disable testing on Autopilot 1.26 Apr 3, 2023
@google-oss-prow google-oss-prow bot added the lgtm label Apr 3, 2023
@zmerlynn zmerlynn disabled auto-merge April 3, 2023 22:40
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gongmax, zmerlynn
Once this PR has been reviewed and has the lgtm label, please assign ericfortin for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zmerlynn zmerlynn enabled auto-merge (squash) April 3, 2023 22:40
@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: a8173545-828e-44bd-a45c-c10faf4e06b3

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3059/head:pr_3059 && git checkout pr_3059
  • helm install agones ./install/helm/agones --namespace agones-system --agones.image.release=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-d713c7a-amd64

@zmerlynn zmerlynn merged commit 23e14d5 into googleforgames:main Apr 3, 2023
@zmerlynn zmerlynn deleted the extend-timeouts branch April 3, 2023 23:43
zmerlynn added a commit to zmerlynn/agones that referenced this pull request Apr 3, 2023
zmerlynn added a commit that referenced this pull request Apr 4, 2023
* Rework health check handling of InitialDelaySeconds

See #2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds
* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in #3059)
zmerlynn added a commit to zmerlynn/agones that referenced this pull request Apr 5, 2023
This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
zmerlynn added a commit to zmerlynn/agones that referenced this pull request Apr 5, 2023
This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
zmerlynn added a commit to zmerlynn/agones that referenced this pull request Apr 5, 2023
This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
zmerlynn added a commit to zmerlynn/agones that referenced this pull request Apr 6, 2023
This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
zmerlynn added a commit that referenced this pull request Apr 6, 2023
* Rework game server health initial delay handling

This is a redrive of #3046, which was reverted in #3068

Rework health check handling of InitialDelaySeconds. See
#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in #3059)

* Close consistency race in syncGameServerRequestReadyState:
If the SDK and controller win the race to update the Pod with the
GameServerReadyContainerIDAnnotation before kubelet even gets a chance
to add the running containers to the Pod, the controller may update
the pod with an empty annotation, which then confuses further runs.

* Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix #2445 as well
@Kalaiselvi84 Kalaiselvi84 added the area/tests Unit tests, e2e tests, anything to make sure things don't break label Apr 10, 2023
@Kalaiselvi84 Kalaiselvi84 added this to the 1.31.0 milestone Apr 10, 2023
Kalaiselvi84 pushed a commit to Kalaiselvi84/agones that referenced this pull request Apr 11, 2023
…orgames#3059)

* Extend e2e queue timings

This PR reworks the e2e timeouts to allow for more time for a given
build to wait to run e2es, but tightens the e2e deadline slightly:

* Tighten the per-e2e-configuration testcase to 1.5h. e2es are coming
in close to an hour in some cases but now that we're not running
consul, we don't need it as high as 2h. I don't think it's worth
tightening this all the way to an hour, though it would probably work.
  * Also drops the queueTtl for the CI sub-builds, these should not be
  queued for long since we serialize e2es now.

* Extends the e2e-wait-to-become-leader timeout to 3h. In higher
traffic times, we're hitting this limit often now, which only results
in a vicious cycle of retrying PRs. Instead wait longer to become
leader.

* Bumps the global timeout to 5h after aggregating:
  3h (e2e-wait-to-become-leader) + 1.5h (e2e timeout) + 0.5h (everything else)

* Remove vestigates of consul - it's no longer running anywhere.

* Stop testing on Autopilot 1.26 until after googleforgames#3046
Kalaiselvi84 pushed a commit to Kalaiselvi84/agones that referenced this pull request Apr 11, 2023
* Rework health check handling of InitialDelaySeconds

See googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds
* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
Kalaiselvi84 pushed a commit to Kalaiselvi84/agones that referenced this pull request Apr 11, 2023
* Rework game server health initial delay handling

This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)

* Close consistency race in syncGameServerRequestReadyState:
If the SDK and controller win the race to update the Pod with the
GameServerReadyContainerIDAnnotation before kubelet even gets a chance
to add the running containers to the Pod, the controller may update
the pod with an empty annotation, which then confuses further runs.

* Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix googleforgames#2445 as well
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/tests Unit tests, e2e tests, anything to make sure things don't break lgtm size/M
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants