Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky: TestGameServerReserve #1414

Merged

Conversation

markmandel
Copy link
Member

After testing, it seems that the K8s API servers can get blocked for an indeterminate period during tests - a period that can be longer than the 10 seconds in which our GameServer goes from Ready->Reserverd->Ready - which caused the previous version of the test to fail.

Updated the test to only look at the event stream to ensure the e2e test state is verified. While it would be ideal to look at the state itself, we will assume that unit tests cover that the Event stream is accurate to the GameServer state being changed.

Fixes #1276

@markmandel markmandel added kind/bug These are bugs. area/tests Unit tests, e2e tests, anything to make sure things don't break labels Mar 15, 2020
@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: e57e2aa2-e537-494f-9dbc-0414da0c9709

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1414/head:pr_1414 && git checkout pr_1414
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.5.0-6b324e0

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 5dd2eaa8-4b4f-48f5-aad5-ac52407f0a58

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1414/head:pr_1414 && git checkout pr_1414
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.5.0-6b324e0

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 86c704d1-6f7d-4ed9-a446-80b7f1f6124c

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1414/head:pr_1414 && git checkout pr_1414
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.5.0-6b324e0

@markmandel markmandel force-pushed the flaky/TestGameServerReserve-4 branch from 6b324e0 to 6ec632f Compare March 16, 2020 04:17
@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: c841821c-3613-45e4-a1e2-fae28ad51b84

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1414/head:pr_1414 && git checkout pr_1414
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.5.0-6ec632f

@markmandel markmandel marked this pull request as ready for review March 16, 2020 04:37
@roberthbailey
Copy link
Member

Thinking about this a bit -- it seems like the old way didn't work because it was looking for the "edges" of state transitions (which could be missed) and the new way is relying on the event stream having correct data persisted (which isn't guaranteed).

Would it be better to step back and think of a different way for the program under test to communicate it's status to the test runner? Maybe setting an annotation on itself or having the test running read the container logs?

@markmandel
Copy link
Member Author

Thinking about this a bit -- it seems like the old way didn't work because it was looking for the "edges" of state transitions (which could be missed)

100% agreed 👍 that's exactly what happens, it gets missed.

and the new way is relying on the event stream having correct data persisted (which isn't guaranteed).

Not sure I understand this point -- why isn't the event stream not guaranteed to have the correct data? -- assuming it is covered appropriately by unit tests (which I believe it is), it should have appropriate guarantees, no?

@roberthbailey
Copy link
Member

I was trying to find something in the Kubernetes docs to substantiate this, but alas I didn't see anything.

My understanding is that Kubernetes events are not guaranteed. They can be dropped if the control plane is overloaded. They can be coalesced if duplicates are seen. I've always heard that you should never write a controller that relies on events for it's control logic, since you can't rely on events showing up, staying around, or having the correct cardinality. So I'm equally leery of writing test logic that relies on them.

@markmandel
Copy link
Member Author

Gotcha! I've yet to see the event stream not be consistent -- but it definitely does go away after a while, so I can see some concern there. I wonder how long events are held into for.

That being said -- I'm struggling to work out a better way to test this. This is the best idea I've got so far.

Does it make sense to run with this for now, but put a big comment on it re:potential flakiness down the road, and if we hit that, then we re-review to see if we can come up with a better answer?

@markmandel
Copy link
Member Author

I've just had 4 almost consecutive of these failures on #1409 - I'm at the point where I may just disable this test if this isn't a valid solution, it's giving me that much frustration.

@markmandel markmandel closed this Mar 17, 2020
@markmandel
Copy link
Member Author

Oops, hit the wrong button! ☺️

@markmandel markmandel reopened this Mar 17, 2020
@roberthbailey
Copy link
Member

Does it make sense to run with this for now, but put a big comment on it re:potential flakiness down the road, and if we hit that, then we re-review to see if we can come up with a better answer?

I'm ok with that, since I've sufficiently communicated the potential for flakes to you.

@markmandel markmandel force-pushed the flaky/TestGameServerReserve-4 branch from 6ec632f to 4cb7198 Compare March 17, 2020 23:07
After testing, it seems that the K8s API servers can get blocked for an
indeterminate period during tests - a period that can be longer than the
10 seconds in which our GameServer goes from Ready->Reserverd->Ready -
which caused the previous version of the test to fail.

Updated the test to only look at the event stream to ensure the e2e test
state is verified. While it would be ideal to look at the state itself,
we will assume that unit tests cover that the Event stream is accurate
to the GameServer state being changed.

Fixes googleforgames#1276
@markmandel markmandel force-pushed the flaky/TestGameServerReserve-4 branch from 4cb7198 to 347145f Compare March 17, 2020 23:08
@markmandel
Copy link
Member Author

I went to see how long events get persisted for - the only thing I could find was this:
kubernetes/kubernetes#52521

Which seems to indicate that events are persisted for an hour (at least by default). Seems like it should be long enough 🤷‍♂️

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: b94ebbf2-378a-4d87-b0ff-420037cbad2a

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1414/head:pr_1414 && git checkout pr_1414
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.5.0-347145f

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: aca0c388-b73f-4ab3-8eaa-e675cbed11b8

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1414/head:pr_1414 && git checkout pr_1414
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.5.0-9e5a612

@roberthbailey
Copy link
Member

1 hour is the default TTL, but Kubernetes service providers are free to change the default (and without doing some sort of benchmark you wouldn't know if or when they did). But I agree that it should be good enough for now.

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: markmandel, roberthbailey

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [markmandel,roberthbailey]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@roberthbailey roberthbailey merged commit 2875c5f into googleforgames:master Mar 18, 2020
@markmandel markmandel deleted the flaky/TestGameServerReserve-4 branch March 18, 2020 14:11
@markmandel markmandel added this to the 1.5.0 milestone Apr 7, 2020
ilkercelikyilmaz pushed a commit to ilkercelikyilmaz/agones that referenced this pull request Oct 23, 2020
After testing, it seems that the K8s API servers can get blocked for an
indeterminate period during tests - a period that can be longer than the
10 seconds in which our GameServer goes from Ready->Reserverd->Ready -
which caused the previous version of the test to fail.

Updated the test to only look at the event stream to ensure the e2e test
state is verified. While it would be ideal to look at the state itself,
we will assume that unit tests cover that the Event stream is accurate
to the GameServer state being changed.

Fixes googleforgames#1276
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved area/tests Unit tests, e2e tests, anything to make sure things don't break kind/bug These are bugs. lgtm size/M
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky: TestGameServerReserve
4 participants