
Create an optional mechanism to avoid duplicate jobs #80

Open
mucahitkantepe opened this issue Aug 3, 2018 · 4 comments

Comments

mucahitkantepe commented Aug 3, 2018

We create Kubernetes pods to run Spydra, which submits a job to Dataproc. Sometimes our pods are removed; we automatically recreate the pod (Spydra), and it submits the same job again. As a result, duplicate jobs end up running in Dataproc. Those jobs may take hours, which costs a lot.

I think we could add an optional mechanism to avoid this by labeling jobs: before submitting a job, check whether there is already a job with that label whose status is DONE. If there is, we should not submit the job and can throw an exception instead.
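For illustration, the check could be as simple as listing jobs by label before submitting. A rough sketch with plain gcloud (the task_id label, the query and the cluster name are placeholders, and I am assuming the --filter and --labels flags behave as documented):

$ gcloud beta dataproc jobs list --filter="labels.task_id=task-x AND status.state=DONE"
# If the list above is non-empty, refuse to submit and throw; otherwise:
$ gcloud beta dataproc jobs submit spark-sql --labels=task_id=task-x -e "select 1" --cluster my-cluster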

karth295 (Contributor) commented Aug 6, 2018

Dataproc already has a mechanism for this -- the job id. You cannot have Dataproc jobs with duplicate ids. As long as you don't delete jobs after they finish, this can be used to avoid submitting the same job multiple times.

$ gcloud beta dataproc jobs submit spark-sql --id=bar -e "select 1" --cluster my-cluster
<lots-of-output>

$ gcloud beta dataproc jobs submit spark-sql --id=bar -e "select 1" --cluster my-cluster
ERROR: (gcloud.beta.dataproc.jobs.submit.spark-sql) ALREADY_EXISTS: Already exists: Failed to submit job: Job projects/myproject/regions/us-central1/jobs/bar

mucahitkantepe (Author) commented:

@karth295 Thanks for your answer, but there is another case: what if the job was submitted before and failed for some reason? We want to submit it again to retry. With the job-id approach, it will not be possible to resubmit.

My scenario is the following (a shell sketch of this check comes after the two scenarios):

- Submit the job for the first time, Job(id=1, labels=["task_id" -> "task-x"])
     - Check if there is a job with that label; if it is
         - running, throw an exception
         - done, throw an exception
         - otherwise, submit
     - Result: it will be submitted.

- After some time, the job fails

- The job is resubmitted as Job(id=2, labels=["task_id" -> "task-x"])
     - Check if there is a job with that label; if it is
         - running, throw an exception
         - done, throw an exception
         - otherwise, submit
     - Result: it will be submitted, because even though it was submitted before, it failed and we want to rerun it.

The other scenario:

- Submit the job for the first time, Job(id=1, labels=["task_id" -> "task-x"])
     - Check result: submit

- After some time, the job is done

- The job is resubmitted as Job(id=2, labels=["task_id" -> "task-x"])
    - Check result: it will not be resubmitted, because there is already a job with that label which is done
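Putting the two scenarios together, a minimal sketch of the check-then-submit logic in plain shell (the label value, query and cluster name are placeholders, and I am assuming gcloud's --filter, --format and --labels flags behave as documented):

# Placeholder label value; in practice this would come from the job configuration.
TASK_ID="task-x"

# Jobs carrying this label that are still running or already succeeded block a new submit.
blocking=$(gcloud beta dataproc jobs list \
    --filter="labels.task_id=${TASK_ID} AND (status.state=RUNNING OR status.state=DONE)" \
    --format="value(reference.jobId)")

if [ -n "${blocking}" ]; then
    echo "A job labeled task_id=${TASK_ID} is already running or done: ${blocking}" >&2
    exit 1
fi

# No running or successful job with this label (it either never ran or failed), so submit.
gcloud beta dataproc jobs submit spark-sql --labels=task_id=${TASK_ID} -e "select 1" --cluster my-cluster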

karth295 (Contributor) commented Aug 7, 2018

Ah, fair enough.

Another solution to consider is using restartable jobs and letting Dataproc re-run jobs on failure. You can specify a request_id (docs) so that when your pod is recreated, it adopts the same job.

That may or may not work for you, depending on what else your pod needs to do when it's recreated.
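Roughly, the submit side of that would look something like this; a minimal sketch (the cluster name and job are placeholders, and the flag spelling is how I read the gcloud docs):

$ gcloud beta dataproc jobs submit spark-sql -e "select 1" --cluster my-cluster \
    --max-failures-per-hour=3
# Dataproc then re-runs the driver on failure (here up to 3 times per hour).
# The request_id is a field on the SubmitJobRequest in the Dataproc API itself:
# submitting again with the same request_id returns the existing job instead of
# creating a duplicate, which is what lets a recreated pod adopt the same job.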

mucahitkantepe (Author) commented:

@karth295 Thanks, using restartable jobs together with request_id is a good approach, but it does not fit my case. Our jobs are not graceful enough.
