
Use Flink API to retrieve savepoint name #22

Open · wants to merge 2 commits into master

Conversation

@Nick-Triller (Contributor) commented Nov 9, 2018

Thank you for this project.

This PR changes the code so that the Flink API endpoint /jobs/:jobid/savepoints/:triggerid is used to retrieve the name of the latest savepoint when updating a job.
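To illustrate the idea, here is a minimal Go sketch of extracting the savepoint location from the response of GET /jobs/:jobid/savepoints/:triggerid. The field names (`status.id`, `operation.location`) follow the Flink REST API documentation; the function names and sample values are hypothetical, not this PR's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// savepointStatusResponse mirrors the documented shape of the Flink
// savepoint-status endpoint: GET /jobs/:jobid/savepoints/:triggerid.
type savepointStatusResponse struct {
	Status struct {
		ID string `json:"id"` // "IN_PROGRESS" or "COMPLETED"
	} `json:"status"`
	Operation struct {
		Location string `json:"location"` // full savepoint path (s3://, hdfs://, file://, ...)
	} `json:"operation"`
}

// extractSavepointLocation returns the savepoint path once the async
// operation has completed, or an error if it is still in progress.
func extractSavepointLocation(body []byte) (string, error) {
	var resp savepointStatusResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	if resp.Status.ID != "COMPLETED" {
		return "", fmt.Errorf("savepoint not completed yet: %s", resp.Status.ID)
	}
	return resp.Operation.Location, nil
}

func main() {
	// Sample response body, as the Flink REST API would return it.
	body := []byte(`{"status":{"id":"COMPLETED"},"operation":{"location":"s3://bucket/savepoints/savepoint-abc123"}}`)
	loc, err := extractSavepointLocation(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(loc) // s3://bucket/savepoints/savepoint-abc123
}
```

Because the location comes back in the API response, no shared volume is needed between the deployer and the Flink cluster.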

How I imagine the deployer API should be used:

| Intent | Deployer command(s) | Deployer behaviour |
| --- | --- | --- |
| Update a running job with a new version (using an up-to-date savepoint) | `update` with `job-name-base` | Deployer cancels the job and creates a savepoint in one operation, then starts the new job from that savepoint. |
| Restart a job from an existing savepoint | `cancel` (not implemented yet) + `deploy` with `savepoint-path` | Deployer cancels the running job, then starts a new job from the savepoint given as a CLI argument. |
| Start a job without using a savepoint | `deploy` | Deployer starts the new job without a savepoint. |

I completely removed the logic that retrieved the most recent savepoint through the file system. Using the Flink API is a more flexible approach because you don't need to mount a volume containing the snapshots, which allows different storage solutions for savepoints / checkpoints, such as blob stores like AWS S3 or Google Cloud Storage.

Also, the job name base mechanism didn't seem to work yet; it should now work as advertised.

Docker image: https://hub.docker.com/r/nicktriller/flink-deployer/

@codecov-io

Codecov Report

Merging #22 into master will decrease coverage by 2.44%.
The diff coverage is 95.23%.


@@            Coverage Diff            @@
##           master     #22      +/-   ##
=========================================
- Coverage   55.85%   53.4%   -2.45%     
=========================================
  Files          12      11       -1     
  Lines         478     440      -38     
=========================================
- Hits          267     235      -32     
+ Misses        172     168       -4     
+ Partials       39      37       -2
| Impacted Files | Coverage Δ |
| --- | --- |
| cmd/cli/main.go | 27.18% <0%> (-0.78%) ⬇️ |
| cmd/cli/operations/update_job.go | 88.73% <100%> (-0.61%) ⬇️ |
| cmd/cli/operations/deploy.go | 75% <100%> (-1.32%) ⬇️ |
| cmd/cli/flink/savepoint.go | 77.77% <100%> (+0.63%) ⬆️ |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 56a4d94...7b7556a.

@mrooding

Hi Nick,

Thanks for the contribution. Niels and I have had quite a few debates about how to properly implement savepoint retrieval. Referring to your table, option 2 would work if cancel returned the full savepoint path. If it doesn't, restarting from an existing savepoint would require the caller of the deploy action to know the full path to the savepoint, meaning an external system would need to store the savepoint path and pass it to the deployer. That's something we didn't want to impose on our users right away.

The same problem applies to supporting a deploy with a savepoint path without cancelling first: there's no way of knowing the full path to the savepoint.

@joshuavijay
Contributor

@mrooding I think Nick's intentions were correct: we should be retrieving the savepoint path from the Flink API. There are edge cases where two consecutive savepoints are triggered, or savepoints are triggered for different jobs (parallel invocations of flink-deployer, or one from flink-deployer and one from the Flink UI), in which simply retrieving the latest savepoint from the file system will not give the savepoint required to restart the job.

Nick's changes also support every savepoint scheme that Flink supports now or in the future (s3, hdfs) without having to implement each scheme separately, as in the other PR created by kerinin. Currently flink-deployer only supports the local file system, whereas most Flink deployments these days run in Kubernetes, where there is no local storage. What we need is part of Nick's changes merged into the latest code base:

  1. Leave the deploy action alone.
  2. The update action will not call terminate; instead it triggers a savepoint and cancels the job as part of that trigger (Nick's change).
  3. Monitoring the savepoint returns the savepoint location in the response. Because we monitor by trigger id, there is no chance of retrieving the wrong savepoint.
  4. Update no longer needs to call retrieveLatestSavepoint.
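Steps 2 and 3 above can be sketched in Go against the documented Flink REST API: POST /jobs/:jobid/savepoints with `"cancel-job": true` takes a savepoint and cancels the job in one operation, and the returned request id is then polled on GET /jobs/:jobid/savepoints/:triggerid. The JSON field names follow the Flink docs; the function names are illustrative, not the deployer's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// savepointTriggerRequest is the body of POST /jobs/:jobid/savepoints.
// Setting CancelJob makes Flink take the savepoint and cancel the job together.
type savepointTriggerRequest struct {
	TargetDirectory string `json:"target-directory,omitempty"`
	CancelJob       bool   `json:"cancel-job"`
}

// savepointTriggerResponse carries the trigger id used for polling
// GET /jobs/:jobid/savepoints/:triggerid.
type savepointTriggerResponse struct {
	RequestID string `json:"request-id"`
}

// buildCancelWithSavepointBody builds the trigger request body.
func buildCancelWithSavepointBody(targetDir string) ([]byte, error) {
	return json.Marshal(savepointTriggerRequest{
		TargetDirectory: targetDir,
		CancelJob:       true,
	})
}

// parseTriggerID extracts the trigger id to poll until the savepoint
// status is COMPLETED and the response contains the savepoint location.
func parseTriggerID(body []byte) (string, error) {
	var resp savepointTriggerResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	return resp.RequestID, nil
}

func main() {
	body, _ := buildCancelWithSavepointBody("s3://bucket/savepoints")
	fmt.Println(string(body)) // {"target-directory":"s3://bucket/savepoints","cancel-job":true}

	id, _ := parseTriggerID([]byte(`{"request-id":"2f7fbc49"}`))
	fmt.Println(id) // 2f7fbc49
}
```

Because the trigger id ties the poll to one specific savepoint request, concurrent savepoints (from parallel deployers or the Flink UI) cannot be confused with each other.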

What do you think? If Nick is busy I can do it.

@Nick-Triller
Contributor Author

Nick-Triller commented Sep 17, 2019

Hi @joshuavijay!
Thanks for your input. I haven't worked much with Flink in 2019, so the details of the relevant mechanisms aren't fresh in my mind. Feel free to reuse any parts that are useful to you.
