[YUNIKORN-2294] Flaky E2E Test: "Verify_Hard_GS_Failed_State" polling short-lived "Failing" application status #759

Closed
chenyulin0719 wants to merge 4 commits

@chenyulin0719 chenyulin0719 commented Jan 5, 2024

What is this PR for?

  • Change the app state polling interval from 300ms to 100ms

2024/01/06 Update

  • Check the 'Failed' appInfo instead of the 'Failing' appInfo.

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

NA

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2294

How should this be tested?

All existing tests should pass

Screenshots (if appropriate)

NA

Questions:

NA

@craigcondit craigcondit left a comment

This will very likely do nothing to fix this issue. Testing for transient states is not reliable, full stop. Instead, it would be better to use the REST API to grab the state transition log and look for the failing state. This is persistent for as long as the application object is visible.
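
A minimal sketch of the state-log idea, assuming the application object returned by the REST API carries a `stateLog` array whose entries have `time` and `applicationState` fields. Those JSON field names, the struct types, and the helper `hasVisitedState` are illustrative assumptions, not code from this PR; verify them against the REST API documentation before relying on them.

```go
package main

import "encoding/json"

// stateLogEntry mirrors one entry of an application's state transition log.
// The JSON field names are assumptions for illustration.
type stateLogEntry struct {
	Time             int64  `json:"time"`
	ApplicationState string `json:"applicationState"`
}

// appInfo holds only the fields this sketch needs (hypothetical type).
type appInfo struct {
	ApplicationID string          `json:"applicationID"`
	StateLog      []stateLogEntry `json:"stateLog"`
}

// hasVisitedState reports whether the application ever entered the given
// state, according to the state log in the JSON response body.
func hasVisitedState(body []byte, state string) (bool, error) {
	var app appInfo
	if err := json.Unmarshal(body, &app); err != nil {
		return false, err
	}
	for _, entry := range app.StateLog {
		if entry.ApplicationState == state {
			return true, nil
		}
	}
	return false, nil
}
```

Because the log is persistent, the check can run at any point while the application object is still visible, so it does not depend on polling speed.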

Even better would be to simply ignore the failing status and check for externally visible behavior instead of internal implementation. If the application fails, the pods will be killed off (at least if I remember correctly). Set up a pause pod that runs indefinitely with an immediate termination timeout, and wait for the pod to terminate (for, say, 30s). If the pod terminates, the test passes. If a timeout occurs, the test fails.

Alternatively, if pods aren't automatically killed, we can wait for the application to disappear from the REST API (as that would have indicated failure as well). In either case, checking explicitly for FAILING state is unnecessary.

@pbacsko pbacsko left a comment

I agree, waiting for a state which only exists for a very short period of time is just asking for trouble in an e2e test. This cannot be tested reliably on such a high level. Why can't we just wait for the terminal "Failed" state?

codecov bot commented Jan 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6689749) 69.52% compared to head (5c165af) 71.30%.
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #759      +/-   ##
==========================================
+ Coverage   69.52%   71.30%   +1.77%     
==========================================
  Files          50       43       -7     
  Lines        7990     7600     -390     
==========================================
- Hits         5555     5419     -136     
+ Misses       2247     1979     -268     
- Partials      188      202      +14     

chenyulin0719 commented Jan 6, 2024

Hi @craigcondit, @pbacsko,
I just realized that after YUNIKORN-2235, we can fetch 'Failed' appInfo through the API:

  • /ws/v1/partition/default/applications/completed

However, the result is a list and the appID is not unique, so I have to iterate over it.
New commits are available now.
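
The iteration described above might look roughly like this. It is a sketch only: the type `completedApp`, the helper `findAppInState`, and the JSON field names `applicationID` and `applicationState` are assumptions for illustration and are not taken from the commits in this PR.

```go
package main

import "encoding/json"

// completedApp holds the fields needed from one entry of the completed
// applications list. The JSON field names are assumptions.
type completedApp struct {
	ApplicationID    string `json:"applicationID"`
	ApplicationState string `json:"applicationState"`
}

// findAppInState scans the list for an entry matching both the application
// ID and the state. It returns nil if there is no match. Several entries may
// share an ID, so the state is part of the match rather than checked after
// taking the first entry with that ID.
func findAppInState(body []byte, appID, state string) (*completedApp, error) {
	var apps []completedApp
	if err := json.Unmarshal(body, &apps); err != nil {
		return nil, err
	}
	for i := range apps {
		if apps[i].ApplicationID == appID && apps[i].ApplicationState == state {
			return &apps[i], nil
		}
	}
	return nil, nil
}
```

The `body` argument is expected to be the response from `/ws/v1/partition/default/applications/completed`.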

@@ -21,7 +21,7 @@ module github.com/apache/yunikorn-k8shim
go 1.20

require (
- github.com/apache/yunikorn-core v0.0.0-20240103094035-ba62c5db9f61
+ github.com/apache/yunikorn-core v0.0.0-20240105094327-77e19f6aca27
@chenyulin0719 chenyulin0719 Jan 6, 2024

The core version has to be updated to use the latest API:

  • /ws/v1/partition/default/applications/completed

@pbacsko pbacsko left a comment

Please address the remaining issues.

test/e2e/framework/helpers/yunikorn/rest_api_utils.go (outdated, resolved)
@chenyulin0719 chenyulin0719 requested a review from pbacsko January 9, 2024 02:42
pbacsko commented Jan 9, 2024

Current changes look good, I'll merge when the e2e tests finish.

@pbacsko pbacsko left a comment

+1

@pbacsko pbacsko closed this in dbfae6e Jan 9, 2024