[YUNIKORN-2294] Flaky E2E Test: "Verify_Hard_GS_Failed_State" polling short-lived "Failing" application status #759
This will very likely do nothing to fix this issue. Testing for transient states is not reliable, full stop. Instead, it would be better to use the REST API to grab the state transition log and look for the failing state. This is persistent for as long as the application object is visible.
Even better would be to simply ignore the failing status and check for externally visible behavior instead of internal implementation. If the application fails, the pods will be killed off (at least if I remember correctly). Set up an infinite timed pause pod with immediate termination timeout, and wait for the pod to terminate (for say 30s). If the pod terminates, the test passes. If a timeout occurs, the test fails.
Alternatively, if pods aren't automatically killed, we can wait for the application to disappear from the REST API (as that would have indicated failure as well). In either case, checking explicitly for FAILING state is unnecessary.
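The "wait for the pod to terminate, fail on timeout" idea above could be sketched as a small polling helper. This is only an illustration: `waitFor` and `podTerminated` are hypothetical names, and the condition is simulated with a timer instead of a real Kubernetes query.

```go
package main

import (
	"fmt"
	"time"
)

// waitFor polls cond every interval until it returns true or the timeout
// elapses. It reports whether cond became true in time.
func waitFor(cond func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	// Hypothetical stand-in for "has the pause pod terminated?".
	// A real e2e test would ask the Kubernetes API here instead.
	start := time.Now()
	podTerminated := func() bool {
		return time.Since(start) > 20*time.Millisecond
	}

	// 30s matches the timeout suggested in the comment above.
	if waitFor(podTerminated, 30*time.Second, 5*time.Millisecond) {
		fmt.Println("pod terminated: test passes")
	} else {
		fmt.Println("timed out: test fails")
	}
}
```

The check only observes externally visible behavior, so it does not depend on catching the application while it is in the transient state.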
I agree, waiting for a state which only exists for a very short period of time is just asking for trouble in an e2e test. This cannot be tested reliably on such a high level. Why can't we just wait for the terminal "Failed" state?
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #759      +/-   ##
==========================================
+ Coverage   69.52%   71.30%   +1.77%
==========================================
  Files          50       43       -7
  Lines        7990     7600     -390
==========================================
- Hits         5555     5419     -136
+ Misses       2247     1979     -268
- Partials      188      202      +14
Hi @craigcondit, @pbacsko,
However, the result is a list and the appID is not unique, so I have to iterate over it.
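A minimal sketch of that iteration, assuming each entry in the list carries an application ID and a state transition log. The types `appInfo` and `stateEntry` and the helper `passedThrough` are hypothetical stand-ins, not the actual structures used in the PR.

```go
package main

import "fmt"

// stateEntry and appInfo are simplified, hypothetical stand-ins for the
// objects returned by the REST API; the real field names may differ.
type stateEntry struct {
	Time  int64
	State string
}

type appInfo struct {
	ApplicationID string
	StateLog      []stateEntry
}

// passedThrough reports whether any application with the given ID has the
// wanted state recorded in its state transition log. Because the ID is not
// unique within the list, every matching entry is inspected.
func passedThrough(apps []appInfo, appID, state string) bool {
	for _, app := range apps {
		if app.ApplicationID != appID {
			continue
		}
		for _, entry := range app.StateLog {
			if entry.State == state {
				return true
			}
		}
	}
	return false
}

func main() {
	// Sample data for illustration only.
	apps := []appInfo{
		{ApplicationID: "app-1", StateLog: []stateEntry{{1, "Running"}, {2, "Completed"}}},
		{ApplicationID: "app-1", StateLog: []stateEntry{{3, "Failing"}, {4, "Failed"}}},
	}
	fmt.Println(passedThrough(apps, "app-1", "Failing"))
}
```

Since the log is persistent, the lookup does not have to race against the short-lived state itself.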
@@ -21,7 +21,7 @@ module github.com/apache/yunikorn-k8shim
 go 1.20

 require (
-	github.com/apache/yunikorn-core v0.0.0-20240103094035-ba62c5db9f61
+	github.com/apache/yunikorn-core v0.0.0-20240105094327-77e19f6aca27
The core version has to be updated to use the latest API:
- /ws/v1/partition/default/applications/completed
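For illustration, a query against that endpoint might look roughly like the sketch below. The JSON key `applicationID`, the type `completedApp`, and the helpers `fetchCompleted` and `newFakeScheduler` are assumptions made for this example. A local fake server stands in for the scheduler so the snippet runs on its own.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// completedPath is the endpoint named in the comment above.
const completedPath = "/ws/v1/partition/default/applications/completed"

// completedApp models only the field this sketch needs.
// The JSON key is an assumption, not taken from the PR.
type completedApp struct {
	ApplicationID string `json:"applicationID"`
}

// fetchCompleted GETs the completed-applications list from baseURL and
// decodes it as a JSON array.
func fetchCompleted(baseURL string) ([]completedApp, error) {
	resp, err := http.Get(baseURL + completedPath)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var apps []completedApp
	if err := json.NewDecoder(resp.Body).Decode(&apps); err != nil {
		return nil, err
	}
	return apps, nil
}

// newFakeScheduler starts a local test server that serves body on
// completedPath and 404 elsewhere. It is a hypothetical stand-in for
// a running scheduler.
func newFakeScheduler(body string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != completedPath {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, body)
	}))
}

func main() {
	srv := newFakeScheduler(`[{"applicationID":"app-1"},{"applicationID":"app-1"}]`)
	defer srv.Close()

	apps, err := fetchCompleted(srv.URL)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	for _, app := range apps {
		fmt.Println("completed application:", app.ApplicationID)
	}
}
```

In a real test the base URL would point at the scheduler's REST service instead of the fake server.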
Please address the remaining issues.
Current changes look good, I'll merge when the e2e tests finish.
+1
What is this PR for?
2024/01/06 Update
What type of PR is it?
Todos
NA
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-2294
How should this be tested?
All existing tests should pass.
Screenshots (if appropriate)
NA
Questions:
NA