Improve staleness logic by improving error handling and retrying a subset of errors #32
Conversation
Many thanks to @anandswaminathan for helping debug and fix the pflags issue. For anyone who may run into this in their local dev:
Force-pushed from a48ec6a to 1fe6929.
@mwylde @anandswaminathan After various fixes, I think this PR is now in a state to be reviewed. PTAL when you have a chance.
This is great! A few comments, but overall my biggest concern is that in the current design we have to be very careful to avoid applications getting stuck. For example, in some local testing of the PR I saw a case where killing the JM during savepointing caused the application to get stuck midway with
ts="2019-06-20T18:01:38-07:00" level=error msg="Check savepoint status failed with response {\"errors\":[\"Operation not found under key: org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@e0ff9448\"]}" app_name=operator-test-app-ha ns=default phase=Savepointing src="api.go:248"
This is especially dangerous because there's currently no way for a user to interrupt a deploy, so they will have to delete and recreate the application.
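For illustration, here is a minimal sketch of how this particular failure mode could be detected so the state machine can break out of the polling loop instead of hanging. All of the names and types below are hypothetical, not the operator's actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// SavepointStatusResponse is a simplified, hypothetical view of the body
// returned when polling a savepoint trigger: either a status or a list of
// error strings.
type SavepointStatusResponse struct {
	Status string
	Errors []string
}

// savepointOperationLost reports whether the response indicates that the
// JobManager no longer knows about the savepoint operation (for example
// because it was restarted mid-savepoint). Polling the same trigger id will
// never succeed in that case, so the savepoint should be treated as failed.
func savepointOperationLost(resp SavepointStatusResponse) bool {
	for _, e := range resp.Errors {
		if strings.Contains(e, "Operation not found under key") {
			return true
		}
	}
	return false
}

func main() {
	resp := SavepointStatusResponse{Errors: []string{
		"Operation not found under key: org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@e0ff9448",
	}}
	fmt.Println("savepoint lost:", savepointOperationLost(resp)) // true
}
```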
One thing I have noticed with both Operator SDK and controller-runtime is that on occasion callbacks can arrive in a much faster loop than the regular cadence. What this can translate to is that you might exhaust your maxRetries in a very small interval. You will need to track time and have logic along the lines of the sketch below.
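A minimal sketch of that kind of time-aware retry accounting (the type and field names here are made up for illustration, not taken from the operator):

```go
package main

import (
	"fmt"
	"time"
)

// retryTracker only counts a retry if enough wall-clock time has passed since
// the previous attempt, so a burst of reconcile callbacks cannot burn through
// maxRetries in a few milliseconds.
type retryTracker struct {
	maxRetries    int
	minInterval   time.Duration
	retryCount    int
	lastRetryTime time.Time
}

// shouldRetry reports whether another attempt is allowed right now and, if so,
// records it. A call that arrives before minInterval has elapsed is ignored
// rather than counted against the budget.
func (r *retryTracker) shouldRetry(now time.Time) bool {
	if r.retryCount >= r.maxRetries {
		return false // budget exhausted
	}
	if !r.lastRetryTime.IsZero() && now.Sub(r.lastRetryTime) < r.minInterval {
		return false // too soon; don't consume a retry
	}
	r.retryCount++
	r.lastRetryTime = now
	return true
}

func main() {
	t := &retryTracker{maxRetries: 3, minInterval: 10 * time.Second}
	now := time.Now()
	fmt.Println(t.shouldRetry(now))                       // true  (first attempt)
	fmt.Println(t.shouldRetry(now.Add(1 * time.Second)))  // false (too soon, not counted)
	fmt.Println(t.shouldRetry(now.Add(15 * time.Second))) // true  (second attempt)
}
```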
@mwylde Thanks for the initial and the second review! I'm going to respond to comments individually and resolve ones that are outdated.
Things I'm planning to address in the next pass:
Changelog: it's perhaps easier to look at the entire diff than the per-commit diffs.
As of yesterday, I've addressed all comments afaict. PTAL when you have a chance @mwylde @anandswaminathan
…itjob, default time based application staleness, no error handling in running phase
PTAL @mwylde when you have a chance!
👍 This is great!
Thanks for the thorough review @mwylde!
This PR removes the dependency on staleness duration. The application is now rolled back based on the error observed (and retried if applicable). It also makes error handling a bit more structured so that we can more easily specify retryable errors.
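A rough sketch of the shape of this error handling. The names LastSeenError, RetryCount and shouldRollback come from the changelog below; everything else (the error type, fields, and function bodies) is an assumption for illustration, not the PR's actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

// FlinkApplicationError is an illustrative error type recording whether an
// error is retryable and when it was last observed.
type FlinkApplicationError struct {
	Message     string
	IsRetryable bool
	LastSeen    time.Time
}

func (e *FlinkApplicationError) Error() string { return e.Message }

const maxRetries = 5 // assumed limit for the example

// shouldRollback mirrors the idea described in the changelog: roll back only
// when the last seen error is not retryable, or when retrying it has already
// exhausted the retry budget.
func shouldRollback(lastSeenError *FlinkApplicationError, retryCount int) bool {
	if lastSeenError == nil {
		return false // no error observed, nothing to roll back
	}
	if lastSeenError.IsRetryable && retryCount < maxRetries {
		return false // keep retrying
	}
	return true // non-retryable, or retries exhausted
}

func main() {
	err := &FlinkApplicationError{Message: "savepoint trigger failed", IsRetryable: true, LastSeen: time.Now()}
	fmt.Println(shouldRollback(err, 2)) // false: retryable and under the limit
	fmt.Println(shouldRollback(err, 5)) // true: retry budget exhausted
}
```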
Changelog
- Adds a LastSeenError and a RetryCount to the FlinkApplicationStatus
- Adds shouldRollback() in flink_state_machine.go to check whether the LastSeenError can be retried and is within the maxRetries limit

Note: go generate on Config.go seems to be failing with: