Allow disabling Savepoint during updates #184

Merged
merged 9 commits on Mar 17, 2020

Conversation

maghamravi
Contributor

The PR

  1. Adds support for a new field SavepointDisabled to the spec to govern whether Savepoints are needed during an update (see the sketch after this list).
  2. Renames the state / phase FlinkApplicationSavepointing to FlinkApplicationCancelling.
  3. Recovery workflow (if and when Cancel fails) remains the same.
  4. No changes to the Delete workflow. Users will have to set deleteMode: ForceCancel if they prefer not to have Savepoints taken during Delete/cancel.
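
A minimal sketch of how the new spec field could look on the application spec type; the surrounding struct, field placement, and json tag are illustrative assumptions rather than the exact code in this PR:

package v1beta1

// FlinkApplicationSpec (excerpt, illustrative): the new flag controls whether
// an update takes a savepoint before the running job is cancelled.
type FlinkApplicationSpec struct {
	// ... existing deployment fields ...

	// SavepointDisabled, when true, makes updates cancel the job directly
	// instead of savepointing first; Delete behaviour is unchanged.
	SavepointDisabled bool `json:"savepointDisabled,omitempty"`
}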

@maghamravi maghamravi reopened this Mar 6, 2020
@maghamravi
Contributor Author

/ptal (👀) @mwylde @glaksh100 @anandswaminathan @kumare3 @tweise

ClusterStarting -- Create fails --> DeployFailed

Savepointing --> SubmittingJob
Savepointing -- Savepoint fails --> Recovering
Cancelling --> SubmittingJob
Contributor

@maghamravi Not a fan of this approach.

Ideally I would change the existing state machine as little as possible. If just a cancel is needed without savepointing, my recommendation would be to introduce a new state, say "Cancelling", that is reached from Savepointing if the savepoint is disabled in the spec. I believe "Recovering" is also not associated with Cancelling.

Contributor Author

Honestly, I think we clubbed "Cancel" and "Savepoint" (two independent actions) into one state called "Savepointing", which IMO should be "Cancelling".

Dunno if it's just me, but when I see the state "Savepointing", I am under the impression that the operator is triggering a "Savepoint" on the job (https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs-jobid-savepoints) and not cancelling it (with the cancel option).

If the consensus is to have "Cancel" be called from Savepointing in cases where "savepoint" is disabled, though I disagree, I am happy to commit!

Contributor

Agreed. There are two things here:

  1. Better name for the Savepointing state.
  2. Separating the intent of this PR into a separate phase/state.

I am merely recommending (2) for now. We can always revisit (1) and update the state names later.

Contributor Author

@anandswaminathan Made the necessary changes!

@maghamravi
Contributor Author

Updated the PR with the following

  1. Created a new state "Cancelling".
  2. From the "ClusterStarting" state, based on the "SavepointDisabled" field in the spec, the state machine transitions to either "Cancelling" or "Savepointing" (see the sketch after this list).
  3. Though cancelling a job (without a savepoint) rarely fails, I have it go through the max retries. If we are still unlucky, we move to the "Submitting" phase, as by then the currently running job (if any) is already dysfunctional.
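
A minimal, self-contained sketch of the branch described in (2); the types below are simplified stand-ins for the operator's real ones, not the PR's actual code:

package main

import "fmt"

// Simplified stand-ins for the operator's phase and application types.
type FlinkApplicationPhase string

const (
	Savepointing FlinkApplicationPhase = "Savepointing"
	Cancelling   FlinkApplicationPhase = "Cancelling"
)

type FlinkApplication struct {
	SavepointDisabled bool // the new spec field introduced by this PR
}

// nextPhaseAfterClusterStarting mirrors the decision in (2): skip savepointing
// and go straight to Cancelling when the spec disables savepoints.
func nextPhaseAfterClusterStarting(app FlinkApplication) FlinkApplicationPhase {
	if app.SavepointDisabled {
		return Cancelling
	}
	return Savepointing
}

func main() {
	fmt.Println(nextPhaseAfterClusterStarting(FlinkApplication{SavepointDisabled: true})) // Cancelling
}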

@anandswaminathan
Contributor

anandswaminathan commented Mar 10, 2020

cc @tweise

@glaksh100 This change is small but you might need to add a check for your Green-Blue update.

@maghamravi This looks good.

Your explanation for (3) is not always true. The ForceCancel call can fail if the Jobmanager is restarting or for several other reasons. We should never assume success and make progress during a failure. Always roll back, and fail the deployment on failure.

Add to

  1. https://github.com/lyft/flinkk8soperator/blob/master/docs/crd.md
  2. Validation here: https://github.com/lyft/flinkk8soperator/blob/master/deploy/crd.yaml#L23
  3. Integration test for this change - as you will be the first to use this.

fmt.Sprintf("Could not cancel existing job: %s", reason))
application.Status.RetryCount = 0
application.Status.JobStatus.JobID = ""
s.updateApplicationPhase(application, v1beta1.FlinkApplicationSubmittingJob)
Contributor

@anandswaminathan anandswaminathan Mar 10, 2020

This should be FlinkApplicationRollingBackJob

We can't move ahead and submit the job in the second cluster if the first cluster is still running the job.
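
Roughly, the failure branch in the snippet above would then end like this (a sketch against that snippet; only the phase constant named in this comment is new):

// On exhausting ForceCancel retries, do not clear the job ID and move on to
// SubmittingJob; roll back so the old job is not left running next to a new one.
application.Status.RetryCount = 0
s.updateApplicationPhase(application, v1beta1.FlinkApplicationRollingBackJob)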

Contributor Author

Good point! Definitely better to err on the side of caution.

Contributor Author

done!

@@ -81,6 +81,132 @@ func TestHandleStartingClusterStarting(t *testing.T) {
assert.Nil(t, err)
}

func TestHandleNewOrCreateWithSavepointDisabled(t *testing.T) {
Contributor

This is great, but can you also add an integration test? Given this is a completely different flow, it would be good to fully test out the successful and failure paths.

Contributor Author

done!

return statusChanged, nil
}

err := s.flinkController.ForceCancel(ctx, application, application.Status.DeployHash)
Contributor

I think there's a potential case here where we get an error back from this call (e.g., a timeout), but the job does end up cancelled. We can handle that by first checking whether the job is running, and if not, moving on to SubmittingJob.
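
For example, something along these lines (a sketch: jobIsRunning is a hypothetical helper rather than the operator's actual API, and statusUnchanged is assumed to be the counterpart of the statusChanged value seen above):

// Query the job first; if it is already gone, treat the cancel as done.
running, err := jobIsRunning(ctx, application) // hypothetical helper querying the JobManager
if err == nil && !running {
	// Nothing left to cancel; move straight on to submitting the new job.
	s.updateApplicationPhase(application, v1beta1.FlinkApplicationSubmittingJob)
	return statusChanged, nil
}
if err := s.flinkController.ForceCancel(ctx, application, application.Status.DeployHash); err != nil {
	return statusUnchanged, err
}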

Contributor Author

In my testing, I saw two kinds of errors:

  1. status code 404 when the job is already in Cancelled status
  2. status code 400 for a bad request, e.g. when a job with that ID doesn't exist.

It definitely doesn't hurt to make that extra call to check if the job is running or not.

Contributor

Yeah, in general for operators the pattern is (1) query the current state of the world; (2) make calls to update the world to the desired state. On each iteration of the reconciliation loop you don't really know what the state is until you query it (even the status might be out of date or missing updates).
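
As a generic, self-contained illustration of that observe-then-act loop (the names are made up for the example; this is not the operator's code):

package main

import "fmt"

// Observed is whatever we learn by querying the cluster on this iteration.
type Observed struct {
	JobRunning bool
}

// reconcile follows the pattern: (1) query the current state of the world,
// (2) act to move it toward the desired state. It never trusts cached status.
func reconcile(observe func() Observed, cancelJob, submitJob func()) {
	state := observe() // step 1: query
	if state.JobRunning {
		cancelJob() // step 2: act on what we actually saw
		return
	}
	submitJob()
}

func main() {
	reconcile(
		func() Observed { return Observed{JobRunning: false} },
		func() { fmt.Println("cancel running job") },
		func() { fmt.Println("submit new job") },
	)
}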

@@ -220,6 +221,7 @@ const (
FlinkApplicationSubmittingJob FlinkApplicationPhase = "SubmittingJob"
FlinkApplicationRunning FlinkApplicationPhase = "Running"
FlinkApplicationSavepointing FlinkApplicationPhase = "Savepointing"
FlinkApplicationCancelling FlinkApplicationPhase = "Cancelling"
Contributor

Can you also update state_machine.md with the details of this new state?

Contributor Author

done!

@maghamravi
Contributor Author

@anandswaminathan @mwylde Addressed all review comments!

@anandswaminathan
Contributor

+1 LGTM
cc @mwylde @glaksh100

mwylde
mwylde previously approved these changes Mar 16, 2020
docs/crd.md (outdated review comment, resolved)
@maghamravi maghamravi requested a review from mwylde March 17, 2020 16:13