-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow disabling Savepoint during updates #184
Conversation
/ptal (👀) @mwylde @glaksh100 @anandswaminathan @kumare3 @tweise |
docs/state_machine.mmd
Outdated
ClusterStarting -- Create fails --> DeployFailed | ||
|
||
Savepointing --> SubmittingJob | ||
Savepointing -- Savepoint fails --> Recovering | ||
Cancelling --> SubmittingJob |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maghamravi Not a fan of this approach.
Ideally I would not change the existing statemachine as much as possible. If just cancel is needed without savepointing, my recommendation would be to introduce a new state, say "Cancelling" that is reached from Savepointing if savepoint is disabled in spec. I believe "Recovering" is also not associated with Cancelling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I think we clubbed "Cancel" and "Savepoint" (two independent actions) into one called "Savepointing" which IMO should be "Cancelling".
Dunno if it's just me(I think?), but when I see the state "Savepointing', I am under the impression that the operator is triggering a "Savepoint" on the job (https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs-jobid-savepoints ) and not Cancelling it (with the cancels option)
If the consensus is to have "Cancel" be called from Savepointing in cases when "savepoint" is disabled, though I disagree, I am happy to commit!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. There are two things here:
- Better name for the Savepointing state.
- Separating the intention of this PR to a separate phase/state.
I merely am recommending (2) now. We can always revisit (1) and update the state names later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anandswaminathan Made the necessary changes!
Updated the PR with the following
|
cc @tweise @glaksh100 This change is small but you might need to add a check for your Green-Blue update. @maghamravi This looks good. You explanation for (3) is not always true. The ForceCancel call can fail if Jobmanager is restarting or for several other reasons. We should never assume and make progression during failure. Always rollback, and fail the deployment on failure. Add to
|
fmt.Sprintf("Could not cancel existing job: %s", reason)) | ||
application.Status.RetryCount = 0 | ||
application.Status.JobStatus.JobID = "" | ||
s.updateApplicationPhase(application, v1beta1.FlinkApplicationSubmittingJob) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be FlinkApplicationRollingBackJob
We can't move ahead and submit job in the second cluster, if the first cluster is still running the job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Definitely better to err on the side of caution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
@@ -81,6 +81,132 @@ func TestHandleStartingClusterStarting(t *testing.T) { | |||
assert.Nil(t, err) | |||
} | |||
|
|||
func TestHandleNewOrCreateWithSavepointDisabled(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, but can you also add an integration test? Given this is a completely different flow, would be good to fully test out the successful and failure paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
return statusChanged, nil | ||
} | ||
|
||
err := s.flinkController.ForceCancel(ctx, application, application.Status.DeployHash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a potential case here where we get an error back from this call (e.g., a timeout), but the job does end up cancelled. We can handle that by first checking whether the job is running, and if not, moving on to SubmittingJob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my testing, I saw two kinds of errors
- status code 404 when the job is already in Cancelled status
- status code 400 for a bad request like the job with Id doesn't exist.
It definitely doesn't hurt to make that extra call to check if the job is running or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, in general for operators the pattern is (1) query the current state of the world; (2) make calls to update the world to the desired state. On each iteration of the reconciliation loop you don't really know what the state is until you query it (even the status might out of date or missing updates).
@@ -220,6 +221,7 @@ const ( | |||
FlinkApplicationSubmittingJob FlinkApplicationPhase = "SubmittingJob" | |||
FlinkApplicationRunning FlinkApplicationPhase = "Running" | |||
FlinkApplicationSavepointing FlinkApplicationPhase = "Savepointing" | |||
FlinkApplicationCancelling FlinkApplicationPhase = "Cancelling" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also update state_machine.md with the details of this new state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
@anandswaminathan @mwylde Addressed all review comments! |
+1 LGTM |
addressed review comments!
The PR
SavepointDisabled
to the spec to govern if Savepoints are needed during an update.FlinkApplicationSavepointing
toFlinkApplicationCancelling
.deleteMode: ForceCancel
if they prefer not having Savepoints taken during Delete/cancel.