[flake] InterpreterOperation InterpretHealth testing #3640

Closed
XiShanYongYe-Chang opened this issue Jun 6, 2023 · 9 comments

@XiShanYongYe-Chang (Member)

Which jobs are flaking:

Resource interpreter customization testing when Apply single ResourceInterpreterCustomization without DependencyInterpretation operation InterpreterOperation InterpretHealth testing [It] InterpretHealth testing

Which test(s) are flaking:

https://github.com/karmada-io/karmada/actions/runs/5177126570/jobs/9327065064?pr=3621

Reason for failure:

Analyzing...

Anything else we need to know:

Deployment.apps "deploy-7m46p" is invalid: status.readyReplicas: Invalid value: 3: cannot be greater than status.replicas

Full Stack Trace
    github.com/karmada-io/karmada/test/e2e/framework.UpdateDeploymentStatus.func1()
    	/home/runner/work/karmada/karmada/test/e2e/framework/deployment.go:57 +0x1aa
    github.com/karmada-io/karmada/test/e2e/framework.UpdateDeploymentStatus({0x45022d0, 0xc0006ceb60}, 0xc000b18a00)
    	/home/runner/work/karmada/karmada/test/e2e/framework/deployment.go:48 +0x14a
    github.com/karmada-io/karmada/test/e2e.glob..func33.1.8.2.1(0x3)
    	/home/runner/work/karmada/karmada/test/e2e/resourceinterpreter_test.go:615 +0x37f
    github.com/karmada-io/karmada/test/e2e.glob..func33.1.8.2.3()
    	/home/runner/work/karmada/karmada/test/e2e/resourceinterpreter_test.go:636 +0x96
    github.com/karmada-io/karmada/test/e2e.glob..func33.1.8.2()
    	/home/runner/work/karmada/karmada/test/e2e/resourceinterpreter_test.go:635 +0x19b

@yike21 (Member) commented Jun 7, 2023

/assign

@XiShanYongYe-Chang (Member, Author)

Hi @yike21, thanks a lot!

@yike21 (Member) commented Jun 7, 2023

Hi! I am interested in this issue and would like to work on it.

I found some similar test runs: the failed one and two successful ones, succeed1 and succeed2.
Compare the time interval between STEP: deployment healthy and STEP: Removing Deployment:

In the failed run, the interval is about 7 minutes, from 13:18:35.68 to 13:25:40.737, and the test failed:

STEP: deployment healthy @ 06/05/23 13:18:35.68
STEP: Update Deployment(karmadatest-4hblr/deploy-7m46p) status @ 06/05/23 13:18:40.726
[FAILED] in [It] - /home/runner/work/karmada/karmada/test/e2e/framework/deployment.go:57 @ 06/05/23 13:25:40.737
STEP: Removing Deployment(karmadatest-4hblr/deploy-7m46p) @ 06/05/23 13:25:40.737

In succeed1, it is about 5 seconds, from 02:42:29.611 to 02:42:34.656, and the test succeeded:

STEP: deployment healthy @ 06/07/23 02:42:29.611
STEP: Update Deployment(karmadatest-2rrtd/deploy-9wq8d) status @ 06/07/23 02:42:34.656
STEP: Removing Deployment(karmadatest-2rrtd/deploy-9wq8d) @ 06/07/23 02:42:34.676

In succeed2, it is about 5 seconds, from 03:08:14.001 to 03:08:19.063, and the test succeeded:

STEP: deployment healthy @ 06/07/23 03:08:14.001
STEP: Update Deployment(karmadatest-hpdrt/deploy-bf2xw) status @ 06/07/23 03:08:19.063
STEP: Removing Deployment(karmadatest-hpdrt/deploy-bf2xw) @ 06/07/23 03:08:19.105

The time interval corresponds to the following code:
- code1: "deployment healthy", which marks the beginning of STEP: deployment healthy.
- code2: framework.UpdateDeploymentStatus, which tries to update the deployment in the member cluster with memberDeployment.Status.ReadyReplicas = readyReplicas; here readyReplicas is 3.
- code3: gomega.Eventually, which retries the deployment status update in the member cluster with a pollTimeout of 7 minutes and a pollInterval of 5 seconds. (A sketch of this pattern follows the list.)
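
To make the failure mode concrete, here is a minimal sketch of that update-status pattern, assuming client-go and gomega; the helper name and exact arguments are illustrative, not the real code in test/e2e/framework/deployment.go:

```go
package framework

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// updateDeploymentStatusSketch illustrates the pattern: poll every 5s for up
// to 7min, read the deployment from the member cluster, force
// .status.readyReplicas to the desired value, and push it with UpdateStatus.
func updateDeploymentStatusSketch(client kubernetes.Interface, deployment *appsv1.Deployment, readyReplicas int32) {
	gomega.Eventually(func() error {
		memberDeployment, err := client.AppsV1().Deployments(deployment.Namespace).Get(context.TODO(), deployment.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// If the member cluster has created fewer than readyReplicas pods so far,
		// the API server rejects this write with "status.readyReplicas: ...
		// cannot be greater than status.replicas" -- the error seen in the failed run.
		memberDeployment.Status.ReadyReplicas = readyReplicas
		_, err = client.AppsV1().Deployments(deployment.Namespace).UpdateStatus(context.TODO(), memberDeployment, metav1.UpdateOptions{})
		return err
	}, 7*time.Minute, 5*time.Second).Should(gomega.Succeed())
}
```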

Considering that the error message is Deployment.apps "deploy-7m46p" is invalid: status.readyReplicas: Invalid value: 3: cannot be greater than status.replicas, which means the UpdateStatus operation failed, the likely reason is that the member cluster has created fewer than 3 pods. If the code sets .status.readyReplicas to 3 at that point, we get the above error message.

This leads to another question: why hasn't the member cluster in the test environment managed to create 3 pods within 7 minutes? I'm not sure; maybe insufficient resources prevented scheduling, or there is some other specific reason.

Thanks!

yike21 removed their assignment Jun 7, 2023

@XiShanYongYe-Chang (Member, Author)

Good analysis 👍
I do suspect that not all of the deployment's replicas have started on the member cluster. We can check the member cluster's logs for more information.
It's a shame that I didn't download the relevant logs, which makes the analysis harder. We may need to grab the logs the next time the same error occurs.

After reading the logic of this test case, I am a little confused. Theoretically, the deployment status in the member cluster should be updated automatically; we should not need to update the status actively. This operation may be redundant.

@yike21 (Member) commented Jun 7, 2023

> Good analysis 👍 I do suspect that not all of the deployment's replicas have started on the member cluster. We can check the member cluster's logs for more information. It's a shame that I didn't download the relevant logs, which makes the analysis harder. We may need to grab the logs the next time the same error occurs.
>
> After reading the logic of this test case, I am a little confused. Theoretically, the deployment status in the member cluster should be updated automatically; we should not need to update the status actively. This operation may be redundant.

Thanks!
I agree with you. Maybe we can change the test case from an "update .status.readyReplicas to 3" operation to a "wait until .status.readyReplicas == 3" operation, which is more reasonable.
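
For illustration, here is a rough sketch of what such a "wait" helper could look like, again assuming client-go and gomega; the helper name is hypothetical and this is not the actual karmada framework code:

```go
package framework

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitDeploymentReadyReplicasSketch waits for the deployment controller in the
// member cluster to report the expected number of ready replicas, instead of
// writing .status.readyReplicas from the test itself.
func waitDeploymentReadyReplicasSketch(client kubernetes.Interface, namespace, name string, expected int32) {
	gomega.Eventually(func() (int32, error) {
		deployment, err := client.AppsV1().Deployments(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return 0, err
		}
		return deployment.Status.ReadyReplicas, nil
	}, 7*time.Minute, 5*time.Second).Should(gomega.Equal(expected))
}
```

Waiting for the controller-reported value avoids racing the deployment controller in the member cluster, which is the race that appears to have produced this flake.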

@XiShanYongYe-Chang (Member, Author)

> I agree with you. Maybe we can change the test case from an "update .status.readyReplicas to 3" operation to a "wait until .status.readyReplicas == 3" operation, which is more reasonable.

Yes, we can do a cleanup. The problem described in this issue can continue to be tracked.

@yike21 (Member) commented Jun 7, 2023

> I agree with you. Maybe we can change the test case from an "update .status.readyReplicas to 3" operation to a "wait until .status.readyReplicas == 3" operation, which is more reasonable.
>
> Yes, we can do a cleanup. The problem described in this issue can continue to be tracked.

It sounds good! I'm willing to do it. 🚀

@XiShanYongYe-Chang (Member, Author)

This issue hasn't come up in a while, so let's close it first.
/close
Thanks for your contribution! @yike21

@karmada-bot (Collaborator)

@XiShanYongYe-Chang: Closing this issue.

In response to this:

> This issue hasn't come up in a while, so let's close it first.
> /close
> Thanks for your contribution! @yike21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
