resolve flaky e2e test #3616
Conversation
Build Failed 😱 Build Id: 600e2c27-3b31-48b4-ac8a-8eb5e34d8565. To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Flake:
@igooch FYI.
Force-pushed from d1c648a to 4fa9b50 (Compare).
Build Succeeded 👏 Build Id: 579a876b-04bc-4e33-bb5b-41a435e2704f The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
@@ -111,7 +111,7 @@ func waitForAgonesExtensionsRunning(ctx context.Context) error {
 		return true, err
 	}
 
-	if len(list.Items) != 2 {
+	if len(list.Items) < 2 {
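The change above relaxes the wait condition from requiring exactly two extensions pods to requiring at least two. A minimal sketch of the difference, using plain ints rather than the real corev1.PodList (function and variable names here are illustrative, not from the Agones codebase):

```go
package main

import "fmt"

// readyToProceed mimics the patched condition: instead of requiring the
// pod count to equal the replica count exactly, it only requires a minimum.
// With the old check (podCount == replicas), a leftover terminating pod
// briefly pushing the count above the replica count would fail the wait.
func readyToProceed(podCount, replicas int) bool {
	return podCount >= replicas
}

func main() {
	fmt.Println(readyToProceed(2, 2)) // true
	fmt.Println(readyToProceed(3, 2)) // true: an extra (e.g. terminating) pod no longer fails the wait
	fmt.Println(readyToProceed(1, 2)) // false: still waiting for pods to come up
}
```

As the discussion below concludes, relaxing the condition papers over the symptom; the later fix filters out terminating pods instead.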
Isn't the expected behavior that there are two extension pods?
Correct me if I'm wrong, but I think there should be more than one extensions pod (https://github.com/googleforgames/agones/blob/main/test/e2e/extensions/high_availability_test.go#L46). We could even extend it to more than 2 pods.
In our default setup, there should be 2 replicas for the extensions Deployment: https://github.com/googleforgames/agones/blob/main/install/yaml/install.yaml#L17151. This is true for the e2e test cluster. So I am wondering what the real edge case is here, i.e. in what case does the replica count go above 2 and make this test flaky?
Interesting question! I haven't found the real edge case yet, and I also couldn't replicate it locally with 2 replicas.
Curious what made you think it was stuck in this section?
One recommendation I would also make is to add a bunch of logging here while the test is in this section, so even if this doesn't solve the flakiness, it should allow us to see what is going on.
For example: how many pods are there on each loop? What are their names? Do we need to know anything else about them?
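The logging suggestion above could take roughly this shape: a per-iteration dump of pod count and per-pod state. The `pod` struct and function names below are simplified stand-ins, not the real client-go types or Agones helpers:

```go
package main

import "fmt"

// pod is a simplified stand-in for corev1.Pod, carrying just the fields
// the suggested logging would print. Names are illustrative only.
type pod struct {
	Name        string
	Phase       string
	Terminating bool
}

// describePods builds the kind of per-loop diagnostic suggested above:
// how many pods exist, their names, and whether each is running or terminating.
func describePods(attempt int, pods []pod) string {
	out := fmt.Sprintf("attempt %d: %d extensions pods\n", attempt, len(pods))
	for _, p := range pods {
		out += fmt.Sprintf("  pod=%s phase=%s terminating=%v\n", p.Name, p.Phase, p.Terminating)
	}
	return out
}

func main() {
	pods := []pod{
		{Name: "agones-extensions-abc", Phase: "Running"},
		{Name: "agones-extensions-def", Phase: "Running", Terminating: true},
	}
	fmt.Print(describePods(1, pods))
}
```

In the real test this would print on every polling iteration, so a CI failure log would show exactly when and why the pod count diverged from the expected replica count.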
I couldn't follow: if the default replica count for extensions is 2, then how would it have created 4 pods? Even if we delete 1 pod, it will immediately create another to meet the default of 2 replicas.
Since I was not able to replicate the failure with 2 replicas, I manually increased the count to 4 so that I could replicate the error locally. That worked; I then added a few logs and found it was failing exactly in this if condition
(https://github.com/googleforgames/agones/blob/main/test/e2e/extensions/high_availability_test.go#L114).
So I realize now there is no need to change the if condition;
rather, I should try to find out why and how it ends up with 4 pods.
The test already has sufficient logs printing the pod names and other information. However, I added more logs for my own understanding and ran the test multiple times with the default replica count, but I couldn't replicate the failure and didn't find anything weird.
Assumption: either some process is running in parallel that modifies the replica count and, due to a delay in sync, reports 4 pods; or maybe a rolling update is happening somewhere?
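The rolling-update half of this assumption is mechanically plausible: with the default RollingUpdate strategy and a non-zero maxSurge, a Deployment's total pod count (old ReplicaSet plus new) can briefly exceed spec.replicas. A rough heuristic, sketched with a simplified stand-in for appsv1.DeploymentStatus (the field names mirror the real type, but the struct itself is illustrative):

```go
package main

import "fmt"

// deploymentStatus is a simplified stand-in for appsv1.DeploymentStatus.
// During a rolling update, Replicas (all pods owned by the Deployment,
// old + new ReplicaSet) can briefly exceed the desired count, which would
// explain observing 4 pods while spec.replicas is 2.
type deploymentStatus struct {
	DesiredReplicas int32 // from spec.replicas
	Replicas        int32 // total pods across old and new ReplicaSets
	UpdatedReplicas int32 // pods on the newest ReplicaSet
}

// rollingUpdateInProgress is a rough heuristic: more total pods than
// desired means old and new ReplicaSets are overlapping.
func rollingUpdateInProgress(s deploymentStatus) bool {
	return s.Replicas > s.DesiredReplicas
}

func main() {
	mid := deploymentStatus{DesiredReplicas: 2, Replicas: 4, UpdatedReplicas: 2}
	fmt.Println(rollingUpdateInProgress(mid)) // true: surge pods present
}
```

Logging these status fields alongside the pod list would confirm or rule out this hypothesis the next time the flake occurs in CI.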
So some thoughts:
We could improve the logging here, and wait for it to fail again in CI. I'm particularly interested in seeing the event stream and/or the status values for the Deployment if the replica count goes above 2, and maybe also listing out all the pod names and whether each is running/terminating or not.
Handy for logging events:
agones/test/e2e/framework/framework.go, line 829 in 589f77b:
func (f *Framework) LogEvents(t *testing.T, log *logrus.Entry, namespace string, objOrRef k8sruntime.Object) {
OH, I just had an idea! (and it could be good to add to the logging too, to confirm)
agones/test/e2e/extensions/high_availability_test.go, lines 131 to 134 in 589f77b:
func getAgoneseExtensionsPods(ctx context.Context) (*corev1.PodList, error) {
	opts := metav1.ListOptions{LabelSelector: labels.Set{"agones.dev/role": "extensions"}.String()}
	return framework.KubeClient.CoreV1().Pods("agones-system").List(ctx, opts)
}
What if some of the Pods are stuck Terminating for a while? This function doesn't filter out Terminating pods (those where Pod.ObjectMeta.DeletionTimestamp.IsZero() is false) -- should we be only looking at Pods that are running?
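One way to express that filter, sketched with a simplified local pod type rather than the real corev1.Pod (in client-go, a non-nil ObjectMeta.DeletionTimestamp is what marks a pod as terminating; the helper name here is illustrative):

```go
package main

import "fmt"

// pod is a simplified stand-in for corev1.Pod: a non-nil DeletionTimestamp
// means the pod is being deleted, and Phase reports its lifecycle state.
type pod struct {
	Name              string
	DeletionTimestamp *string // nil means the pod is not being deleted
	Phase             string
}

// filterRunning keeps only pods that are Running and not terminating,
// i.e. the filtering proposed above for getAgoneseExtensionsPods.
func filterRunning(pods []pod) []pod {
	var out []pod
	for _, p := range pods {
		if p.DeletionTimestamp == nil && p.Phase == "Running" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	ts := "2024-01-30T10:00:00Z"
	pods := []pod{
		{Name: "ext-1", Phase: "Running"},
		{Name: "ext-2", Phase: "Running"},
		{Name: "ext-3", Phase: "Running", DeletionTimestamp: &ts}, // stuck terminating
	}
	fmt.Println(len(filterRunning(pods))) // 2: the terminating pod is excluded
}
```

With this filter in place, a pod stuck Terminating no longer inflates the count that the wait loop compares against the replica count.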
Sure! I have added logs and also filtered out the terminating pods, so only running pods are returned.
Could you please review it and let me know if any other changes are required?
Build Succeeded 👏 Build Id: 13a128ef-de38-4e16-86a8-108eab7d7fe0 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
Force-pushed from 66faebd to 35d07e6 (Compare).
Build Failed 😱 Build Id: 2d8f9f68-83ad-4e23-a531-57e0836afa16. To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😱 Build Id: 55cc2638-95f1-458a-9936-a07f4f8b0bca. To get permission to view the Cloud Build view, join the agones-discuss Google Group.
So I'm noticing this error in the logs, which seems consistent:
Rerunning to see if it's a flake.
Build Failed 😱 Build Id: 172f7f0f-f62e-46fd-b8d4-7172eba5fde9. To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Same error again. Looks like something in this PR is causing a consistent failure on that test.
I think I might be onto something with looking at Pods being terminated, as I was just checking a cluster:
And there are a bunch of controller pods that are left at "Completed" for a long time.
Yes, and this consistent failure is only happening in CI/CD.
I couldn't see
Not sure - I expect it's intermittent, hence the flakiness. But just showing that it can happen! Would definitely explain the flakiness.
test/e2e/framework/framework.go (Outdated)
deployment, err := f.KubeClient.AppsV1().Deployments(namespace).Get(ctx, objMeta.GetName(), metav1.GetOptions{})
require.NoError(t, err, "error getting deployment")
@markmandel,
It's failing exactly here per the CI/CD error: it couldn't find the Deployment.
Could you please help me confirm whether this is the right way of getting the Deployment details?
Maybe a silly question - why are you getting a Deployment? LogEvents
works on any K8s object and is used all around the test suites, so you just put something singularly test-specific in a function that is a test-wide utility - probably not a good idea!
I'd suggest moving the extra logging into the flaky test itself.
> Maybe a silly question - why are you getting a Deployment? LogEvents works on any K8s object and is used all around the test suites, so you just put something singularly test-specific in a function that is a test-wide utility - probably not a good idea! I'd suggest moving the extra logging into the flaky test itself.

I think I misunderstood the statement mentioned here: #3616 (comment)
The Deployment is not needed. I have added the LogEvents call in the latest commit.
Shall we also add LogEvents after deleting one of the extensions pods here?: https://github.com/googleforgames/agones/pull/3616/files#diff-bbe16f0894837a78bfd6ea977aa01c4cc44a35fe2dcdaa2066d051d43aac0335R58
Force-pushed from e14e2aa to e714c84 (Compare).
Build Succeeded 👏 Build Id: 0aacc373-5446-488e-9ba0-2922b1a10d1d The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
Running this again - but not seeing any issues 👍🏻
Build Failed 😱 Build Id: 4c6d6773-392c-4264-af65-24f11d9d13bb. To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Seems like different flakes in the last failure. Rerunning.
Build Failed 😱 Build Id: ff2606c1-e165-4f8a-9045-21611e6bc84b. To get permission to view the Cloud Build view, join the agones-discuss Google Group.
gke-autopilot-1.26 failed two times in a row. See #3553. Restarting build. @ashutosji, before approving this, I'm wondering if you've done a 100-count (or similar) run of the
Build Succeeded 👏 Build Id: 297b6767-1cd7-499f-8e37-27ebdf1a7f4b The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
Previously, I had run it a couple of times manually. I didn't get the above failure errors.
That's a weird one! My only thought would be on: agones/test/e2e/extensions/high_availability_test.go, lines 99 to 114 in b0f8c4b
All the
I think let's get that in, and we'll merge this as is and keep an eye on things to see if this flake happens again - sound good?
Force-pushed from b0f8c4b to 58f2681 (Compare).
Yes, this sounds reasonable.
Build Succeeded 👏 Build Id: 70aca732-b3b5-4990-9be9-8c0dbfff92cc The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
Let's merge this and see how we go!
What type of PR is this?
What this PR does / Why we need it: This fixes the flaky e2e test TestGameServerCreationAfterDeletingOneExtensionsPod.
Which issue(s) this PR fixes: #3599
Closes #3599
Special notes for your reviewer: