-
Notifications
You must be signed in to change notification settings - Fork 799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flaky: TestGameServerReserve #1414
Flaky: TestGameServerReserve #1414
Conversation
Build Succeeded 👏 Build Id: e57e2aa2-e537-494f-9dbc-0414da0c9709 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Build Succeeded 👏 Build Id: 5dd2eaa8-4b4f-48f5-aad5-ac52407f0a58 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Build Succeeded 👏 Build Id: 86c704d1-6f7d-4ed9-a446-80b7f1f6124c The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
6b324e0
to
6ec632f
Compare
Build Succeeded 👏 Build Id: c841821c-3613-45e4-a1e2-fae28ad51b84 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Thinking about this a bit -- it seems like the old way didn't work because it was looking for the "edges" of state transitions (which could be missed) and the new way is relying on the event stream having correct data persisted (which isn't guaranteed). Would it be better to step back and think of a different way for the program under test to communicate it's status to the test runner? Maybe setting an annotation on itself or having the test running read the container logs? |
100% agreed 👍 that's exactly what happens, it gets missed.
Not sure I understand this point -- why isn't the event stream not guaranteed to have the correct data? -- assuming it is covered appropriately by unit tests (which I believe it is), it should have appropriate guarantees, no? |
I was trying to find something in the Kubernetes docs to substantiate this, but alas I didn't see anything. My understanding is that Kubernetes events are not guaranteed. They can be dropped if the control plane is overloaded. They can be coalesced if duplicates are seen. I've always heard that you should never write a controller that relies on events for it's control logic, since you can't rely on events showing up, staying around, or having the correct cardinality. So I'm equally leery of writing test logic that relies on them. |
Gotcha! I've yet to see the event stream not be consistent -- but it definitely does go away after a while, so I can see some concern there. I wonder how long events are held into for. That being said -- I'm struggling to work out a better way to test this. This is the best idea I've got so far. Does it make sense to run with this for now, but put a big comment on it re:potential flakiness down the road, and if we hit that, then we re-review to see if we can come up with a better answer? |
I've just had 4 almost consecutive of these failures on #1409 - I'm at the point where I may just disable this test if this isn't a valid solution, it's giving me that much frustration. |
Oops, hit the wrong button! |
I'm ok with that, since I've sufficiently communicated the potential for flakes to you. |
6ec632f
to
4cb7198
Compare
After testing, it seems that the K8s API servers can get blocked for an indeterminate period during tests - a period that can be longer than the 10 seconds in which our GameServer goes from Ready->Reserverd->Ready - which caused the previous version of the test to fail. Updated the test to only look at the event stream to ensure the e2e test state is verified. While it would be ideal to look at the state itself, we will assume that unit tests cover that the Event stream is accurate to the GameServer state being changed. Fixes googleforgames#1276
4cb7198
to
347145f
Compare
I went to see how long events get persisted for - the only thing I could find was this: Which seems to indicate that events are persisted for an hour (at least by default). Seems like it should be long enough 🤷♂️ |
Build Succeeded 👏 Build Id: b94ebbf2-378a-4d87-b0ff-420037cbad2a The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Build Succeeded 👏 Build Id: aca0c388-b73f-4ab3-8eaa-e675cbed11b8 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
1 hour is the default TTL, but Kubernetes service providers are free to change the default (and without doing some sort of benchmark you wouldn't know if or when they did). But I agree that it should be good enough for now. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: markmandel, roberthbailey The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
After testing, it seems that the K8s API servers can get blocked for an indeterminate period during tests - a period that can be longer than the 10 seconds in which our GameServer goes from Ready->Reserverd->Ready - which caused the previous version of the test to fail. Updated the test to only look at the event stream to ensure the e2e test state is verified. While it would be ideal to look at the state itself, we will assume that unit tests cover that the Event stream is accurate to the GameServer state being changed. Fixes googleforgames#1276
After testing, it seems that the K8s API servers can get blocked for an indeterminate period during tests - a period that can be longer than the 10 seconds in which our GameServer goes from Ready->Reserverd->Ready - which caused the previous version of the test to fail.
Updated the test to only look at the event stream to ensure the e2e test state is verified. While it would be ideal to look at the state itself, we will assume that unit tests cover that the Event stream is accurate to the GameServer state being changed.
Fixes #1276