[YUNIKORN-2294] Flaky E2E Test: "Verify_Hard_GS_Failed_State" polling short-lived "Failing" application status #759

chenyulin0719 · 2024-01-05T10:18:27Z

What is this PR for?

Change the app state polling interval from 300 ms to 100 ms.

2024/01/06 Update

Change to check 'Failed' appInfo instead of 'Failing' appInfo.

What type of PR is it?

Todos

NA

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2294

How should this be tested?

All existing tests should pass.

Screenshots (if appropriate)

NA

Questions:

NA

craigcondit

This will very likely do nothing to fix this issue. Testing for transient states is not reliable, full stop. Instead, it would be better to use the REST API to grab the state transition log and look for the failing state. This is persistent for as long as the application object is visible.

Even better would be to simply ignore the failing status and check for externally visible behavior instead of internal implementation. If the application fails, the pods will be killed off (at least if I remember correctly). Set up an infinite timed pause pod with immediate termination timeout, and wait for the pod to terminate (for say 30s). If the pod terminates, the test passes. If a timeout occurs, the test fails.

Alternatively, if pods aren't automatically killed, we can wait for the application to disappear from the REST API (as that would have indicated failure as well). In either case, checking explicitly for FAILING state is unnecessary.
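The "wait for an externally visible outcome, fail on timeout" approach described above could be sketched roughly as follows. This is a minimal illustration, not the e2e framework's code: `waitFor`, `ErrTimeout`, and the `podTerminated` stand-in are hypothetical names, and a real test would replace the stand-in with a check against the Kubernetes API or the YuniKorn REST API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTimeout is returned when the condition never became true in time.
// (Hypothetical name, introduced for this sketch only.)
var ErrTimeout = errors.New("timed out waiting for condition")

// waitFor calls cond every interval until it reports true, returns an
// error, or the timeout would be exceeded by the next sleep.
func waitFor(interval, timeout time.Duration, cond func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := cond()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return ErrTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	// Stand-in for "has the pause pod terminated?". Here the condition
	// simply becomes true on the third check.
	checks := 0
	podTerminated := func() (bool, error) {
		checks++
		return checks >= 3, nil
	}

	err := waitFor(10*time.Millisecond, time.Second, podTerminated)
	fmt.Println("pod terminated before timeout:", err == nil)
}
```

Because the condition describes a persistent outcome (the pod is gone, or the application has reached a terminal state), the polling interval only affects how quickly the test finishes, not whether it can miss the event.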

pbacsko

I agree, waiting for a state which only exists for a very short period of time is just asking for trouble in an e2e test. This cannot be tested reliably on such a high level. Why can't we just wait for the terminal "Failed" state?

codecov · 2024-01-05T22:25:31Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6689749) 69.52% compared to head (5c165af) 71.30%.
Report is 5 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #759      +/-   ##
==========================================
+ Coverage   69.52%   71.30%   +1.77%     
==========================================
  Files          50       43       -7     
  Lines        7990     7600     -390     
==========================================
- Hits         5555     5419     -136     
+ Misses       2247     1979     -268     
- Partials      188      202      +14


chenyulin0719 · 2024-01-06T01:58:52Z

Hi @craigcondit, @pbacsko,
I just realized that after YUNIKORN-2235, we can fetch the 'Failed' appInfo through the API:

/ws/v1/partition/default/applications/completed

However, the result is a list and the appID is not unique, so I have to iterate over it.
New commits are available now.
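As a rough illustration of the iteration mentioned above, the lookup could be sketched like this. The `appEntry` type, its JSON field names, and `findAppInState` are assumptions made for this example and are not taken from the PR; the actual helper lives in test/e2e/framework/helpers/yunikorn/rest_api_utils.go and may look different.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// appEntry is a hypothetical, trimmed-down view of one element of the
// completed-applications list. The JSON field names are assumptions.
type appEntry struct {
	ApplicationID string `json:"applicationID"`
	State         string `json:"applicationState"`
}

// findAppInState scans the whole list, because the same application ID
// may appear more than once, and returns the first entry that matches
// both the ID and the wanted state.
func findAppInState(apps []appEntry, appID, state string) (appEntry, bool) {
	for _, app := range apps {
		if app.ApplicationID == appID && app.State == state {
			return app, true
		}
	}
	return appEntry{}, false
}

func main() {
	// Sample payload with a duplicated application ID (made-up data).
	payload := []byte(`[
		{"applicationID": "app-1", "applicationState": "Completed"},
		{"applicationID": "app-1", "applicationState": "Failed"}
	]`)

	var apps []appEntry
	if err := json.Unmarshal(payload, &apps); err != nil {
		panic(err)
	}

	_, found := findAppInState(apps, "app-1", "Failed")
	fmt.Println("found failed app-1:", found)
}
```

Matching on both the ID and the state avoids picking up an older entry with the same ID that ended in a different state.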

chenyulin0719 · 2024-01-06T02:05:52Z

go.mod

@@ -21,7 +21,7 @@ module github.com/apache/yunikorn-k8shim
 go 1.20

 require (
-	github.com/apache/yunikorn-core v0.0.0-20240103094035-ba62c5db9f61
+	github.com/apache/yunikorn-core v0.0.0-20240105094327-77e19f6aca27


The core version has to be updated to use the latest API:

/ws/v1/partition/default/applications/completed

pbacsko

Please address the remaining issues.

test/e2e/framework/helpers/yunikorn/rest_api_utils.go

pbacsko · 2024-01-09T15:27:44Z

Current changes look good, I'll merge when the e2e tests finish.

pbacsko

+1

[YUNIKORN-2294] Flaky E2E Test: "Verify_Hard_GS_Failed_State" polling short-lived "Failing" application status

dfaf8aa

craigcondit requested changes Jan 5, 2024

craigcondit assigned chenyulin0719 Jan 5, 2024

pbacsko requested changes Jan 5, 2024

chenyulin0719 added 2 commits January 6, 2024 01:10

Check 'Failed' appInfo instead of 'Failing' appInfo

e91e2f2

Handle duplicated completed appID

e5090f7

chenyulin0719 commented Jan 6, 2024

chenyulin0719 requested review from craigcondit and pbacsko January 6, 2024 02:54

pbacsko requested changes Jan 8, 2024

test/e2e/framework/helpers/yunikorn/rest_api_utils.go (outdated, resolved)

set interval to 1 sec

5c165af

chenyulin0719 requested a review from pbacsko January 9, 2024 02:42

pbacsko approved these changes Jan 9, 2024

pbacsko closed this in dbfae6e Jan 9, 2024
