
resolve flaky e2e test #3616

Merged
merged 8 commits into googleforgames:main
Mar 6, 2024

Conversation

ashutosji
Contributor

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking
/kind bug
/kind cleanup
/kind documentation
/kind feature
/kind hotfix
/kind release

What this PR does / Why we need it: This fixes the flaky e2e test TestGameServerCreationAfterDeletingOneExtensionsPod.

Which issue(s) this PR fixes: #3599

Closes #3599

Special notes for your reviewer:

github-actions bot added the kind/bug and size/XS labels on Jan 30, 2024
@agones-bot
Collaborator

Build Failed 😱

Build Id: 600e2c27-3b31-48b4-ac8a-8eb5e34d8565

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Member

Flake:

--- FAIL: TestSDKServerAddListValue (31.98s)
    --- FAIL: TestSDKServerAddListValue/Add_multiple_values_past_capacity (1.12s)
        sdkserver_test.go:1477: 
            	Error Trace:	/go/src/agones.dev/agones/pkg/sdkserver/sdkserver_test.go:1477
            	Error:      	Not equal: 
            	            	expected: v1.ListStatus{Capacity:10, Values:[]string{"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"}}
            	            	actual  : v1.ListStatus{Capacity:10, Values:[]string{"one", "two", "three", "four", "five", "six", "seven", "eight", "nine"}}
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -2,3 +2,3 @@
            	            	  Capacity: (int64) 10,
            	            	- Values: ([]string) (len=10) {
            	            	+ Values: ([]string) (len=9) {
            	            	   (string) (len=3) "one",
            	            	@@ -11,4 +11,3 @@
            	            	   (string) (len=5) "eight",
            	            	-  (string) (len=4) "nine",
            	            	-  (string) (len=3) "ten"
            	            	+  (string) (len=4) "nine"
            	            	  }
            	Test:       	TestSDKServerAddListValue/Add_multiple_values_past_capacity
{"filePath":"/tmp/gameserver-3963978651","message":"Reading GameServer configuration","severity":"info",

@igooch FYI.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 579a876b-04bc-4e33-bb5b-41a435e2704f

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3616/head:pr_3616 && git checkout pr_3616
  • helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.39.0-dev-4fa9b50-amd64

@@ -111,7 +111,7 @@ func waitForAgonesExtensionsRunning(ctx context.Context) error {
 		return true, err
 	}
 
-	if len(list.Items) != 2 {
+	if len(list.Items) < 2 {
Collaborator

Isn't the expected behavior that there are two extension pods?

Contributor Author

Correct me if I am wrong, but I think there should be more than one extensions pod (https://github.com/googleforgames/agones/blob/main/test/e2e/extensions/high_availability_test.go#L46). We could even extend it to more than 2 pods.

Collaborator

In our default setup there should be 2 replicas for the extensions deployment: https://github.com/googleforgames/agones/blob/main/install/yaml/install.yaml#L17151. This is true for the e2e test cluster. So I am wondering what the real edge case is here, i.e. in what case does the replica count go above 2 and make this test flaky?

Contributor Author

Interesting question! I haven't found the real edge case yet, and I couldn't replicate it locally with 2 replicas.

Member

Curious what made you think it was stuck in this section?

One recommendation I would also make is to add a bunch of logging here while the test is in this section, so that even if this doesn't solve the flakiness, it should let us see what is going on.

For example: how many pods are there on each loop? What are their names? Do we need to know anything else about them?
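A minimal sketch of the kind of per-iteration logging being suggested - not code from this PR; it assumes the wait loop uses k8s.io/apimachinery's wait package and a logrus entry, and the function and field names are illustrative:

// logExtensionsPodsWhileWaiting polls until two extensions pods exist, logging
// the pod count and each pod's name, phase, and terminating state on every
// iteration. getAgoneseExtensionsPods is the existing helper in this test file.
// Assumed imports: context, time, github.com/sirupsen/logrus,
// k8s.io/apimachinery/pkg/util/wait.
func logExtensionsPodsWhileWaiting(ctx context.Context, log *logrus.Entry) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			list, err := getAgoneseExtensionsPods(ctx)
			if err != nil {
				return true, err
			}
			log.WithField("count", len(list.Items)).Info("polling extensions pods")
			for i := range list.Items {
				pod := &list.Items[i]
				log.WithField("pod", pod.ObjectMeta.Name).
					WithField("phase", pod.Status.Phase).
					WithField("terminating", !pod.ObjectMeta.DeletionTimestamp.IsZero()).
					Info("extensions pod state")
			}
			return len(list.Items) == 2, nil
		})
}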

Contributor Author

I don't get it: if the default replica count for extensions is 2, how could it have created 4 pods? Even if we delete 1 pod, it will immediately create 1 pod to meet the default replica count of 2.
Since I was not able to replicate the failure with 2 replicas, I manually increased the count to 4 so that I could replicate the error locally. That worked; I then added a few logs and found it was failing exactly in this if condition (https://github.com/googleforgames/agones/blob/main/test/e2e/extensions/high_availability_test.go#L114).

So I realize now there is no need to change the if condition; rather, I should try to find out why and how it ends up with 4 pods.

The test already has sufficient logs printing the pod names and other information. However, I added more logs for my own understanding and ran the test multiple times with the default replica count, but I couldn't replicate the failure and didn't find anything unusual.
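For reference, one standard way to bump the replica count locally to reproduce this (the exact command used isn't shown in this thread; this assumes the default agones-system install):

kubectl scale deployment agones-extensions -n agones-system --replicas=4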

Contributor Author

Assumption: either some process is running in parallel that modifies the replica count and, due to a delay in sync, we see 4 pods - or maybe a rolling update is happening somewhere?

Member

So some thoughts:

We could improve the logging here and wait for it to fail again in CI. I'm particularly interested in seeing the event stream and/or the status values for the Deployment if the replica count goes above 2, and maybe also listing out all the pod names and whether they are running/terminating.

Handy for logging events:

func (f *Framework) LogEvents(t *testing.T, log *logrus.Entry, namespace string, objOrRef k8sruntime.Object) {

Oh, I just had an idea! (and it could be good to add to the logging too, to confirm)

func getAgoneseExtensionsPods(ctx context.Context) (*corev1.PodList, error) {
	opts := metav1.ListOptions{LabelSelector: labels.Set{"agones.dev/role": "extensions"}.String()}
	return framework.KubeClient.CoreV1().Pods("agones-system").List(ctx, opts)
}

What if some of the Pods are stuck Terminating for a while? This function doesn't filter out Terminating pods (Pod.ObjectMeta.DeletionTimestamp.IsZero()) -- should we only be looking at Pods that are Running?
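A minimal sketch of that filter, assuming the existing lister keeps its shape - the helper name is illustrative; the field checks (DeletionTimestamp, Status.Phase) are standard k8s.io/api/core/v1:

// getRunningExtensionsPods returns only extensions pods that are in the Running
// phase and not marked for deletion. Illustrative name; corev1 is k8s.io/api/core/v1.
func getRunningExtensionsPods(ctx context.Context) ([]corev1.Pod, error) {
	list, err := getAgoneseExtensionsPods(ctx)
	if err != nil {
		return nil, err
	}
	running := make([]corev1.Pod, 0, len(list.Items))
	for _, pod := range list.Items {
		// A nil DeletionTimestamp (IsZero) means the pod is not terminating.
		if pod.ObjectMeta.DeletionTimestamp.IsZero() && pod.Status.Phase == corev1.PodRunning {
			running = append(running, pod)
		}
	}
	return running, nil
}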

Contributor Author

Sure! I have added the logs and also filtered out the terminating pods, so only running pods are returned.
Could you please review it and let me know if any other changes are required?

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 13a128ef-de38-4e16-86a8-108eab7d7fe0

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3616/head:pr_3616 && git checkout pr_3616
  • helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.39.0-dev-66faebd-amd64

@agones-bot
Collaborator

Build Failed 😱

Build Id: 2d8f9f68-83ad-4e23-a531-57e0836afa16

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 55cc2638-95f1-458a-9936-a07f4f8b0bca

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Member

So I'm noticing this error in the logs, which seems consistent:

=== PAUSE TestPlayerConnectWithCapacityZero
=== CONT  TestPlayerConnectWithCapacityZero
time="2024-02-20 11:40:52.475" level=info msg="GameServer created, waiting for Ready" gs=game-serverl9tmb test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:40:52.508" level=info msg="Waiting for states to match" awaitingState=Ready currentState=PortAllocation gs=game-serverl9tmb test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:40:52.519" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Creating gs=game-serverl9tmb test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:40:53.526" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-serverl9tmb test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:40:54.526" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-serverl9tmb test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:40:55.528" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-serverl9tmb test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:40:56.529" level=info msg="Waiting for states to match" awaitingState=Ready currentState=Scheduled gs=game-serverl9tmb test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:40:57.527" level=info msg="GameServer states match" awaitingState=Ready currentState=Ready gs=game-serverl9tmb test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:40:57.527" level=info msg="GameServer Ready" gs=game-serverl9tmb
time="2024-02-20 11:40:57.527" level=info msg="Sending Player Connect" fields.msg="PLAYER_CONNECT 1"
time="2024-02-20 11:41:07.535" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:36043->34.168.52.62:7080: i/o timeout" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:17.540" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:57794->34.168.52.62:7080: i/o timeout" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:19.543" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:54965->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:21.543" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:56663->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:23.542" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:48930->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:35.541" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:45718->34.168.52.62:7080: i/o timeout" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:35.544" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:44650->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:37.542" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:43455->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:39.541" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:46106->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:41.541" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:43639->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:43.542" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:54143->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:45.543" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:35807->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:47.542" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:57623->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:49.542" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:35493->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:51.542" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:53580->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:53.541" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:60256->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:55.541" level=info msg="Could not read from address" address="34.168.52.62:7080" error="read udp 192.168.10.2:56268->34.168.52.62:7080: read: connection refused" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:57.529" level=info msg="Failed to send UDP packet to GameServer. Dumping Events!" gs=game-serverl9tmb status="{State:Ready Ports:[{Name:udp-port Port:7080}] Address:34.168.52.62 Addresses:[{Type:InternalIP Address:10.138.0.53} {Type:ExternalIP Address:34.168.52.62} {Type:Hostname Address:gke-standard-e2e-test-cluster-default-1139ab64-q6w9}] NodeName:gke-standard-e2e-test-cluster-default-1139ab64-q6w9 ReservedUntil:<nil> Players:0xc00069ad20 Counters:map[] Lists:map[] Eviction:0xc000376c40}" test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:57.529" level=info msg="Dumping Events:" kind= test=TestPlayerConnectWithCapacityZero
time="2024-02-20 11:41:57.555" level=info msg="Event!" lastTimestamp="2024-02-20 11:40:52 +0000 UTC" message="Port allocated" reason=PortAllocation test=TestPlayerConnectWithCapacityZero type=Normal
time="2024-02-20 11:41:57.556" level=info msg="Event!" lastTimestamp="2024-02-20 11:40:52 +0000 UTC" message="Pod game-serverl9tmb created" reason=Creating test=TestPlayerConnectWithCapacityZero type=Normal
time="2024-02-20 11:41:57.556" level=info msg="Event!" lastTimestamp="2024-02-20 11:40:52 +0000 UTC" message="Address and port populated" reason=Scheduled test=TestPlayerConnectWithCapacityZero type=Normal
time="2024-02-20 11:41:57.556" level=info msg="Event!" lastTimestamp="2024-02-20 11:40:56 +0000 UTC" message="SDK state change" reason=RequestReady test=TestPlayerConnectWithCapacityZero type=Normal
time="2024-02-20 11:41:57.556" level=info msg="Event!" lastTimestamp="2024-02-20 11:40:56 +0000 UTC" message="SDK.Ready() complete" reason=Ready test=TestPlayerConnectWithCapacityZero type=Normal
time="2024-02-20 11:41:57.556" level=info msg="Event!" lastTimestamp="2024-02-20 11:40:58 +0000 UTC" message="Issue with Gameserver pod" reason=Unhealthy test=TestPlayerConnectWithCapacityZero type=Warning
    framework.go:844: 
        	Error Trace:	/go/src/agones.dev/agones/test/e2e/framework.go:844
        	            				/go/src/agones.dev/agones/test/e2e/framework.go:557
        	            				/go/src/agones.dev/agones/test/e2e/framework.go:533
        	            				/go/src/agones.dev/agones/test/e2e/gameserver_test.go:1221
        	Error:      	Received unexpected error:
        	            	deployments.apps "game-serverl9tmb" not found
        	Test:       	TestPlayerConnectWithCapacityZero
        	Messages:   	error getting deployment
--- FAIL: TestPlayerConnectWithCapacityZero (65.16s)
FAIL test/e2e.TestPlayerConnectWithCapacityZero (re-run 1) (65.16s)

Rerunning to see if it's a flake.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 172f7f0f-f62e-46fd-b8d4-7172eba5fde9

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Member

Same error again. Looks like something in this PR is causing a consistent failure on that test.

@markmandel
Member

I think I might be onto something with looking at Pods being terminated, as I was just checking a cluster:

NAME                                 READY   STATUS      RESTARTS   AGE
agones-allocator-76fcf5fc68-k74b4    1/1     Running     0          55m
agones-allocator-76fcf5fc68-ljxqm    1/1     Running     0          55m
agones-allocator-76fcf5fc68-xr24h    1/1     Running     0          55m
agones-controller-55745d798f-hbbv6   0/1     Completed   0          21h
agones-controller-5ccdcbf89d-h4vzd   0/1     Completed   0          3h55m
agones-controller-6bd8fb644-zh2cc    0/1     Completed   0          12h
agones-controller-76dd77859c-lmc7s   1/1     Running     0          55m
agones-controller-76dd77859c-xqc7j   1/1     Running     0          55m
agones-extensions-7b8cf5cd4f-cmq9z   1/1     Running     0          55m
agones-extensions-7b8cf5cd4f-jj28p   1/1     Running     0          55m
agones-ping-7fd56986c6-f9rtx         1/1     Running     0          3h49m
agones-ping-7fd56986c6-wh7st         1/1     Running     0          3h55m

And there are a bunch of controller pods that are left at "Completed" for a long time.

@ashutosji
Contributor Author

> Same error again. Looks like something in this PR is causing a consistent failure on that test.

Yes, and this consistent failure is only happening in CI/CD.
I tried it locally and it passed without any issue, as you can see in the screenshot below:
https://screenshot.googleplex.com/6mQ52cCXV8gCnXc

@ashutosji
Contributor Author

> I think I might be onto something with looking at Pods being terminated, as I was just checking a cluster:
>
> NAME                                 READY   STATUS      RESTARTS   AGE
> agones-allocator-76fcf5fc68-k74b4    1/1     Running     0          55m
> agones-allocator-76fcf5fc68-ljxqm    1/1     Running     0          55m
> agones-allocator-76fcf5fc68-xr24h    1/1     Running     0          55m
> agones-controller-55745d798f-hbbv6   0/1     Completed   0          21h
> agones-controller-5ccdcbf89d-h4vzd   0/1     Completed   0          3h55m
> agones-controller-6bd8fb644-zh2cc    0/1     Completed   0          12h
> agones-controller-76dd77859c-lmc7s   1/1     Running     0          55m
> agones-controller-76dd77859c-xqc7j   1/1     Running     0          55m
> agones-extensions-7b8cf5cd4f-cmq9z   1/1     Running     0          55m
> agones-extensions-7b8cf5cd4f-jj28p   1/1     Running     0          55m
> agones-ping-7fd56986c6-f9rtx         1/1     Running     0          3h49m
> agones-ping-7fd56986c6-wh7st         1/1     Running     0          3h55m
>
> And there are a bunch of controller pods that are left at "Completed" for a long time.

I couldn't see the Completed status of the controller pods: https://screenshot.googleplex.com/yef57cUjXZ7jXje.
May I know where exactly to find them?

@markmandel
Member

Not sure - I expect it's intermittent, hence the flakiness. But I'm just showing that it can happen!

It would definitely explain the flakiness.

Comment on lines 843 to 844
deployment, err := f.KubeClient.AppsV1().Deployments(namespace).Get(ctx, objMeta.GetName(), metav1.GetOptions{})
require.NoError(t, err, "error getting deployment")
Contributor Author

ashutosji commented Feb 21, 2024

@markmandel,
This is exactly where it's failing, per the CI/CD error: it couldn't find the deployment.
Could you please tell me if this is the right way of getting the deployment details?

Member

Maybe a silly question - why are you getting a Deployment? LogEvents works on any K8s object and is used all around the test suites, so you just put something singularly test-specific in a function that is a test-wide utility - probably not a good idea!

I'd suggest moving the extra logging into the flaky test itself.
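For reference, a sketch of what that might look like inside the flaky test itself, using the LogEvents signature quoted earlier - the framework, log, and gs variables are assumed from the test's existing setup, not shown in this PR:

// Hypothetical call inside TestGameServerCreationAfterDeletingOneExtensionsPod:
// dump the event stream for the GameServer under test when something goes wrong.
framework.LogEvents(t, log, gs.ObjectMeta.Namespace, gs)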

Contributor Author

ashutosji commented Feb 22, 2024

> Maybe a silly question - why are you getting a Deployment? LogEvents works on any K8s object and is used all around the test suites, so you just put something singularly test-specific in a function that is a test-wide utility - probably not a good idea!
>
> I'd suggest moving the extra logging into the flaky test itself.

I think I misunderstood the statement mentioned here: #3616 (comment)
The Deployment is not needed. I have added the LogEvents call in the latest commit.

Shall we also add LogEvents after deleting one of the extensions pods, here?: https://github.com/googleforgames/agones/pull/3616/files#diff-bbe16f0894837a78bfd6ea977aa01c4cc44a35fe2dcdaa2066d051d43aac0335R58

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 0aacc373-5446-488e-9ba0-2922b1a10d1d

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3616/head:pr_3616 && git checkout pr_3616
  • helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.39.0-dev-e714c84-amd64

@markmandel
Member

Running this again - but not seeing any issues 👍🏻

@agones-bot
Collaborator

Build Failed 😱

Build Id: 4c6d6773-392c-4264-af65-24f11d9d13bb

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Member

Seems like different flakes in the last failure. Rerunning.

@agones-bot
Collaborator

Build Failed 😱

Build Id: ff2606c1-e165-4f8a-9045-21611e6bc84b

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Member

gke-autopilot-1.26

    gameserverallocation_test.go:1417: 
        	Error Trace:	/go/src/agones.dev/agones/test/e2e/gameserverallocation_test.go:1417
        	Error:      	Not equal: 
        	            	expected: 100
        	            	actual  : 98
        	Test:       	TestGameServerAllocationDuringMultipleAllocationClients

Two times in a row. See #3553. Restarting build.

@ashutosji before approving this, I'm wondering if you've done a 100-count (or similar) run of the TestGameServerCreationAfterDeletingOneExtensionsPod e2e test, to see if anything fails?

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 297b6767-1cd7-499f-8e37-27ebdf1a7f4b

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3616/head:pr_3616 && git checkout pr_3616
  • helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.39.0-dev-b0f8c4b-amd64

@ashutosji
Contributor Author

ashutosji commented Feb 29, 2024

> gke-autopilot-1.26
>
>     gameserverallocation_test.go:1417: 
>         	Error Trace:	/go/src/agones.dev/agones/test/e2e/gameserverallocation_test.go:1417
>         	Error:      	Not equal: 
>         	            	expected: 100
>         	            	actual  : 98
>         	Test:       	TestGameServerAllocationDuringMultipleAllocationClients
>
> Two times in a row. See #3553. Restarting build.
>
> @ashutosji before approving this, I'm wondering if you've done a 100-count (or similar) run of the TestGameServerCreationAfterDeletingOneExtensionsPod e2e test, to see if anything fails?

Previously, I had run it a couple of times manually.
But this time I ran it using this command:
go test -run TestGameServerCreationAfterDeletingOneExtensionsPod -count 100

I didn't get the failure errors above.
The test ran successfully 26 times, but after that it became flaky and threw the error below:

panic: test timed out after 10m0s
running tests:
        TestGameServerCreationAfterDeletingOneExtensionsPod (11s)

goroutine 1350 [running]:
testing.(*M).startAlarm.func1()
        /usr/lib/google-golang/src/testing/testing.go:2380 +0x385
created by time.goFunc
        /usr/lib/google-golang/src/time/sleep.go:177 +0x2d
goroutine 1324 [select]:
github.com/stretchr/testify/assert.Eventually({0x1e18300, 0xc00014c340}, 0xc0005090e0, 0x45d964b800, 0x3b9aca00, {0x0, 0x0, 0x0})
        /agones/vendor/github.com/stretchr/testify/assert/assertions.go:1744 +0x1aa
github.com/stretchr/testify/require.Eventually({0x1e1daf8, 0xc00014c340}, 0xc0005090e0, 0x45d964b800, 0x3b9aca00, {0x0, 0x0, 0x0})
        /agones/vendor/github.com/stretchr/testify/require/require.go:361 +0xca
agones.dev/agones/test/e2e/extensions.deleteAgonesExtensionsPod({0x1e30ea8, 0x2bf3220}, 0xc00014c340, 0x1)
        /agones/test/e2e/extensions/high_availability_test.go:109 +0x25e
agones.dev/agones/test/e2e/extensions.TestGameServerCreationAfterDeletingOneExtensionsPod(0xc00014c340)
        /agones/test/e2e/extensions/high_availability_test.go:58 +0x5d7
testing.tRunner(0xc00014c340, 0x1ca3648)
        /usr/lib/google-golang/src/testing/testing.go:1689 +0xfb
created by testing.(*T).Run in goroutine 1
        /usr/lib/google-golang/src/testing/testing.go:1742 +0x390
exit status 2
FAIL    agones.dev/agones/test/e2e/extensions   601.289s
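Worth noting about that panic: go test enforces a default 10-minute timeout on the whole test binary, and 26 runs in the 601s shown above is roughly 23s per run, so a -count 100 run would hit the 10m0s deadline around run 27 even if no individual run ever failed. Raising the deadline with the standard -timeout flag would separate a genuine flake from a harness timeout, e.g.:

go test -run TestGameServerCreationAfterDeletingOneExtensionsPod -count 100 -timeout 120m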

@markmandel
Member

markmandel commented Mar 4, 2024

That's a weird one!

My only thought would be on:

func deleteAgonesExtensionsPod(ctx context.Context, t *testing.T, waitForExtensions bool) {
	list, err := getAgoneseExtensionsPods(ctx)
	assert.NoError(t, err)
	policy := metav1.DeletePropagationBackground
	podToDelete := list.Items[1]
	err = framework.KubeClient.CoreV1().Pods("agones-system").Delete(ctx, podToDelete.ObjectMeta.Name,
		metav1.DeleteOptions{PropagationPolicy: &policy})
	assert.NoError(t, err)
	if waitForExtensions {
		require.Eventually(t, func() bool {
			_, err := framework.KubeClient.CoreV1().Pods("agones-system").Get(ctx, podToDelete.ObjectMeta.Name, metav1.GetOptions{})
			return err != nil
		}, 5*time.Minute, time.Second)
	}
}

All the assert.NoError(...) should really be require.NoError(...). I wonder if something in there is throwing an error and then passing by the assertion (which won't stop the test), so we end up in a weird panic and the original cause is lost.

I think let's get that in, and we'll merge this as-is and keep an eye on things to see if this flake happens again - sound good?
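For illustration, the behavioral difference being described - standard testify semantics, not code from this PR; the test below fails deliberately to show the difference:

package extensions_test

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestAssertVsRequire(t *testing.T) {
	err := errors.New("listing pods failed")

	// assert.NoError marks the test as failed but execution continues, so a
	// later line such as list.Items[1] can still panic on unexpected state.
	assert.NoError(t, err)

	// require.NoError calls t.FailNow() on error, stopping this test function
	// immediately and preserving the original cause instead of a later panic.
	require.NoError(t, err)

	t.Log("never reached when err != nil")
}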

@ashutosji
Contributor Author

> That's a weird one!
>
> My only thought would be on:
>
> func deleteAgonesExtensionsPod(ctx context.Context, t *testing.T, waitForExtensions bool) {
> 	list, err := getAgoneseExtensionsPods(ctx)
> 	assert.NoError(t, err)
> 	policy := metav1.DeletePropagationBackground
> 	podToDelete := list.Items[1]
> 	err = framework.KubeClient.CoreV1().Pods("agones-system").Delete(ctx, podToDelete.ObjectMeta.Name,
> 		metav1.DeleteOptions{PropagationPolicy: &policy})
> 	assert.NoError(t, err)
> 	if waitForExtensions {
> 		require.Eventually(t, func() bool {
> 			_, err := framework.KubeClient.CoreV1().Pods("agones-system").Get(ctx, podToDelete.ObjectMeta.Name, metav1.GetOptions{})
> 			return err != nil
> 		}, 5*time.Minute, time.Second)
> 	}
> }
>
> All the assert.NoError(...) should really be require.NoError(...). I wonder if something in there is throwing an error and then passing by the assertion (which won't stop the test), so we end up in a weird panic and the original cause is lost.
>
> I think let's get that in, and we'll merge this as-is and keep an eye on things to see if this flake happens again - sound good?

Yes, this sounds reasonable.
I have replaced assert with require.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 70aca732-b3b5-4990-9be9-8c0dbfff92cc

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3616/head:pr_3616 && git checkout pr_3616
  • helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.39.0-dev-58f2681-amd64

Member

markmandel left a comment

Let's merge this and see how we go!

@markmandel markmandel merged commit 6e08bd8 into googleforgames:main Mar 6, 2024
4 checks passed

Successfully merging this pull request may close these issues.

Flaky: e2e TestGameServerCreationAfterDeletingOneExtensionsPod