[flake] InterpreterOperation InterpretHealth testing #3640

Closed
XiShanYongYe-Chang opened this issue Jun 6, 2023 · 9 comments

@XiShanYongYe-Chang (Member)

Which jobs are flaking:

Resource interpreter customization testing when Apply single ResourceInterpreterCustomization without DependencyInterpretation operation InterpreterOperation InterpretHealth testing [It] InterpretHealth testing

Which test(s) are flaking:

https://github.com/karmada-io/karmada/actions/runs/5177126570/jobs/9327065064?pr=3621

Reason for failure:

Analyzing...

Anything else we need to know:

Deployment.apps "deploy-7m46p" is invalid: status.readyReplicas: Invalid value: 3: cannot be greater than status.replicas

Full Stack Trace
    github.com/karmada-io/karmada/test/e2e/framework.UpdateDeploymentStatus.func1()
    	/home/runner/work/karmada/karmada/test/e2e/framework/deployment.go:57 +0x1aa
    github.com/karmada-io/karmada/test/e2e/framework.UpdateDeploymentStatus({0x45022d0, 0xc0006ceb60}, 0xc000b18a00)
    	/home/runner/work/karmada/karmada/test/e2e/framework/deployment.go:48 +0x14a
    github.com/karmada-io/karmada/test/e2e.glob..func33.1.8.2.1(0x3)
    	/home/runner/work/karmada/karmada/test/e2e/resourceinterpreter_test.go:615 +0x37f
    github.com/karmada-io/karmada/test/e2e.glob..func33.1.8.2.3()
    	/home/runner/work/karmada/karmada/test/e2e/resourceinterpreter_test.go:636 +0x96
    github.com/karmada-io/karmada/test/e2e.glob..func33.1.8.2()
    	/home/runner/work/karmada/karmada/test/e2e/resourceinterpreter_test.go:635 +0x19b

@yike21 (Member) commented Jun 7, 2023

/assign

@XiShanYongYe-Chang (Member, Author)

Hi @yike21, thanks a lot!

@yike21 (Member) commented Jun 7, 2023

Hi! I am interested in this issue and would like to work on it.

I found some similar test runs: the failed one and two successful ones, succeed1 and succeed2.
Compare the time interval between STEP: deployment healthy and STEP: Removing Deployment:

In the failed run, the interval is about 7 minutes, from 13:18:35.68 to 13:25:40.737, and the test failed:

STEP: deployment healthy @ 06/05/23 13:18:35.68
STEP: Update Deployment(karmadatest-4hblr/deploy-7m46p) status @ 06/05/23 13:18:40.726
[FAILED] in [It] - /home/runner/work/karmada/karmada/test/e2e/framework/deployment.go:57 @ 06/05/23 13:25:40.737
STEP: Removing Deployment(karmadatest-4hblr/deploy-7m46p) @ 06/05/23 13:25:40.737

In succeed1, it is about 5 seconds, from 02:42:29.611 to 02:42:34.656, and the test succeeded:

STEP: deployment healthy @ 06/07/23 02:42:29.611
STEP: Update Deployment(karmadatest-2rrtd/deploy-9wq8d) status @ 06/07/23 02:42:34.656
STEP: Removing Deployment(karmadatest-2rrtd/deploy-9wq8d) @ 06/07/23 02:42:34.676

In succeed2, it is about 5 seconds, from 03:08:14.001 to 03:08:19.063, and the test succeeded:

STEP: deployment healthy @ 06/07/23 03:08:14.001
STEP: Update Deployment(karmadatest-hpdrt/deploy-bf2xw) status @ 06/07/23 03:08:19.063
STEP: Removing Deployment(karmadatest-hpdrt/deploy-bf2xw) @ 06/07/23 03:08:19.105

The time interval corresponds to the following code:
- code1: "deployment healthy", which marks the beginning of STEP: deployment healthy.
- code2: framework.UpdateDeploymentStatus, which tries to update the deployment in the member cluster with memberDeployment.Status.ReadyReplicas = readyReplicas; here readyReplicas is 3.
- code3: gomega.Eventually, which retries the deployment status update in the member cluster with a pollTimeout of 7 minutes and a pollInterval of 5 seconds. (A sketch of this pattern follows the list.)
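
To make the failure mode concrete, here is a minimal sketch of that update-status pattern, assuming client-go and gomega; the helper name and exact arguments are illustrative, not the real code in test/e2e/framework/deployment.go:

```go
package framework

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// updateDeploymentStatusSketch illustrates the pattern: poll every 5s for up
// to 7min, read the deployment from the member cluster, force
// .status.readyReplicas to the desired value, and push it with UpdateStatus.
func updateDeploymentStatusSketch(client kubernetes.Interface, deployment *appsv1.Deployment, readyReplicas int32) {
	gomega.Eventually(func() error {
		memberDeployment, err := client.AppsV1().Deployments(deployment.Namespace).Get(context.TODO(), deployment.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// If the member cluster has created fewer than readyReplicas pods so far,
		// the API server rejects this write with "status.readyReplicas: ...
		// cannot be greater than status.replicas" -- the error seen in the failed run.
		memberDeployment.Status.ReadyReplicas = readyReplicas
		_, err = client.AppsV1().Deployments(deployment.Namespace).UpdateStatus(context.TODO(), memberDeployment, metav1.UpdateOptions{})
		return err
	}, 7*time.Minute, 5*time.Second).Should(gomega.Succeed())
}
```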

Considering that the error message is Deployment.apps "deploy-7m46p" is invalid: status.readyReplicas: Invalid value: 3: cannot be greater than status.replicas, which means the UpdateStatus operation failed, the likely reason is that the member cluster has created fewer than 3 pods. If the code sets .status.readyReplicas to 3 at that point, we get the above error message.

This leads to another question: why hasn't the member cluster in the test environment managed to create 3 pods within 7 minutes? I'm not sure; maybe insufficient resources prevented scheduling, or there is some other specific reason.

Thanks!

yike21 removed their assignment Jun 7, 2023

@XiShanYongYe-Chang (Member, Author)

Good analysis 👍
I do suspect that not all of the deployment's replicas have started on the member cluster. We can check the member cluster's logs for more information.
It's a shame that I didn't download the relevant logs, which makes the analysis harder. We may need to grab the logs the next time the same error occurs.

After reading the logic of this test case, I am a little confused. Theoretically, the deployment status in the member cluster should be updated automatically; we should not need to update the status actively. This operation may be redundant.

@yike21 (Member) commented Jun 7, 2023

> Good analysis 👍 I do suspect that not all of the deployment's replicas have started on the member cluster. We can check the member cluster's logs for more information. It's a shame that I didn't download the relevant logs, which makes the analysis harder. We may need to grab the logs the next time the same error occurs.
>
> After reading the logic of this test case, I am a little confused. Theoretically, the deployment status in the member cluster should be updated automatically; we should not need to update the status actively. This operation may be redundant.

Thanks!
I agree with you. Maybe we can change the test case from an "update .status.readyReplicas to 3" operation to a "wait until .status.readyReplicas == 3" operation, which is more reasonable.
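
For illustration, here is a rough sketch of what such a "wait" helper could look like, again assuming client-go and gomega; the helper name is hypothetical and this is not the actual karmada framework code:

```go
package framework

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitDeploymentReadyReplicasSketch waits for the deployment controller in the
// member cluster to report the expected number of ready replicas, instead of
// writing .status.readyReplicas from the test itself.
func waitDeploymentReadyReplicasSketch(client kubernetes.Interface, namespace, name string, expected int32) {
	gomega.Eventually(func() (int32, error) {
		deployment, err := client.AppsV1().Deployments(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return 0, err
		}
		return deployment.Status.ReadyReplicas, nil
	}, 7*time.Minute, 5*time.Second).Should(gomega.Equal(expected))
}
```

Waiting for the controller-reported value avoids racing the deployment controller in the member cluster, which is the race that appears to have produced this flake.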

@XiShanYongYe-Chang (Member, Author)

> I agree with you. Maybe we can change the test case from an "update .status.readyReplicas to 3" operation to a "wait until .status.readyReplicas == 3" operation, which is more reasonable.

Yes, we can do a cleanup. The problem described in this issue can continue to be tracked.

@yike21 (Member) commented Jun 7, 2023

> I agree with you. Maybe we can change the test case from an "update .status.readyReplicas to 3" operation to a "wait until .status.readyReplicas == 3" operation, which is more reasonable.
>
> Yes, we can do a cleanup. The problem described in this issue can continue to be tracked.

It sounds good! I'm willing to do it. 🚀

@XiShanYongYe-Chang (Member, Author)

This issue hasn't come up in a while, so let's close it first.
/close
Thanks for your contribution! @yike21

@karmada-bot (Collaborator)

@XiShanYongYe-Chang: Closing this issue.

In response to this:

> This issue hasn't come up in a while, so let's close it first.
> /close
> Thanks for your contribution! @yike21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
