
prevent "Job X has multiple successful attempts" error #4378

Closed
jrhizor opened this issue Jun 28, 2021 · 3 comments
Labels: priority/high (High priority), type/bug (Something isn't working)

jrhizor commented Jun 28, 2021

There have now been several occurrences of this bug:

While there is a workaround for manually deleting the problematic record, this issue blocks all future syncs until it is manually resolved.

We need to figure out the root cause here.

cgardens commented Jul 2, 2021

Summary

@jrhizor and I dug into this today. We think the root cause is a race condition between the JobSubmitter and the moment a submitted job actually begins working. The key clue is that the log paths in these cases are the same across multiple attempts.

We believe the "Job X has multiple successful attempts" error and the "Job X is in status pending but has an attempt in status running" error are caused by the same root cause.

The Bug

Here's the mechanics behind the problem:

  • We create a LifecycledCallable in the JobSubmitter.
  • The onStart of the LifecycledCallable is in charge of changing the status of the Job from pending to running. Note: it hasn't been executed yet.
  • It is then submitted to a separate thread pool (the worker thread pool).
  • When the thread pool gets to that LifecycledCallable, onStart is called.

The problem is that if the worker thread pool is highly subscribed, the submitted LifecycledCallable may not begin executing for a little while. If it has to wait too long, the next time the JobSubmitter runs, it will pick up the same job, because it has no way of knowing that the job is already queued to run. Boom.
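
To make the sequence concrete, here is a minimal, self-contained sketch of the race. The names (SubmitterRaceDemo, pollOnce, the in-memory JOBS map) are hypothetical stand-ins for the JobSubmitter / LifecycledCallable interaction, not the actual Airbyte code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SubmitterRaceDemo {

  enum Status { PENDING, RUNNING }

  // In-memory stand-in for the jobs table.
  static final Map<Long, Status> JOBS = new ConcurrentHashMap<>(Map.of(1L, Status.PENDING));

  // Small pool: when it is saturated, submitted callables wait before "onStart" ever runs.
  static final ExecutorService WORKER_POOL = Executors.newFixedThreadPool(1);

  public static void main(String[] args) throws InterruptedException {
    // Occupy the only worker thread so newly submitted work just sits in the queue.
    WORKER_POOL.submit(() -> sleep(2000));

    pollOnce();          // submits job 1; it stays PENDING because the status flip hasn't run yet
    Thread.sleep(500);
    pollOnce();          // still sees job 1 as PENDING and submits it again -> duplicate attempt

    WORKER_POOL.shutdown();
    WORKER_POOL.awaitTermination(10, TimeUnit.SECONDS);
  }

  // Stand-in for the JobSubmitter poll loop.
  static void pollOnce() {
    JOBS.forEach((id, status) -> {
      if (status == Status.PENDING) {
        System.out.println("JobSubmitter: submitting job " + id);
        WORKER_POOL.submit(() -> {
          // onStart equivalent: only runs once a worker thread is free.
          JOBS.put(id, Status.RUNNING);
          System.out.println("worker: attempt created for job " + id);
          sleep(1000);
        });
      }
    });
  }

  static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```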

Possible Solutions

  • Use a concurrent hash set to keep track of which jobs are being processed in the worker thread pool. If a job is in that thread pool (even if it hasn't started), don't submit it again (a minimal sketch follows this list). The downsides are that this is a bit of a hack and that it presents a head-of-line blocking problem: the JobSubmitter will keep trying to submit the same job until it starts running, and we will just keep throwing the duplicate out, which means all other work will be blocked until that job starts running. This may be tolerable, because if the job is queued but not running, there's no capacity to run new jobs anyway, so submitting more jobs doesn't actually move things forward.
  • Make temporal do it. This is probably too big a project to tackle given the speed with which we want to resolve this issue.
  • Add an onEnqueue method (or equivalent) to LifecycledCallable so that the code that creates the attempt runs in the JobSubmitter and not in the worker pool. We haven't fully thought through the safety of doing this yet.
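
For the first option, a rough sketch of what the concurrent-set deduplication could look like. DedupingSubmitter and its methods are hypothetical illustrations, assuming the JobSubmitter funnels all submissions through a single helper like this:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;

public class DedupingSubmitter {

  // Jobs that are currently queued or running in the worker pool.
  private final Set<Long> inFlight = ConcurrentHashMap.newKeySet();
  private final ExecutorService workerPool;

  DedupingSubmitter(ExecutorService workerPool) {
    this.workerPool = workerPool;
  }

  void submit(long jobId, Runnable work) {
    // add() returns false if the job is already queued or running, so each job is
    // submitted at most once regardless of how long it waits in the pool.
    if (!inFlight.add(jobId)) {
      return;
    }
    workerPool.submit(() -> {
      try {
        work.run();
      } finally {
        // Only allow resubmission once the attempt has actually finished.
        inFlight.remove(jobId);
      }
    });
  }
}
```

This removes the duplicate submission at the cost of the head-of-line behaviour described in the first bullet.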

Future Work

  • The whole point of Temporal is that we do not want to spend time reinventing / debugging this stuff ourselves. Another point in favor of completing the migration to Temporal.

@davinchia fyi

@davinchia

We are seeing this manifest here too.

davinchia commented Jul 12, 2021

I think this can also be solved by increasing MAX_WORKERS. Today this is 4, which lines up with the behaviour we are seeing in the logs, where repeated submission happens on the 5th job.

Extract from the above linked logs:

2021-07-11 08:39:15 INFO  enqueuing pending job for scope: 3f7f34d8-cb7c-4e65-9ef8-17afba7927de
2021-07-11 08:39:15 INFO  enqueuing pending job for scope: d4d39f9c-95be-4774-bf94-0f2f091e1473
2021-07-11 08:39:15 INFO  enqueuing pending job for scope: d2758017-248c-49f4-8174-5ed068fe65e5
2021-07-11 08:39:15 INFO  enqueuing pending job for scope: c9f3d7bb-824e-4136-8068-fa9dc87d3729
2021-07-11 08:39:15 INFO  enqueuing pending job for scope: c1b81c58-ce9b-465b-a57e-2393cbce7790
2021-07-11 08:39:15 INFO  enqueuing pending job for scope: f2cfe9ea-a11e-4585-9d81-91f36153f0ed
2021-07-11 08:39:15 INFO  enqueuing pending job for scope: 17be9cc9-26e8-4150-931d-abf34b4750fc
2021-07-11 08:39:15 INFO  enqueuing pending job for scope: aba7f2c4-5cf1-4eb0-8f33-e3277f6cbbb0
2021-07-11 08:39:15 INFO  enqueuing pending job for scope: 9fe76f7d-7ef8-409b-81f5-aebca66142e4
2021-07-11 08:39:15 INFO  enqueuing pending job for scope: b8a26fc1-2253-4659-ae39-2a0a141fa53c
2021-07-11 08:39:15 INFO  enqueuing pending job for scope: 4f7f8bb8-ddda-4556-a771-7a1aea114728
# 1st
2021-07-11 08:39:15 INFO  Job-Submitter Summary. Submitted job with scope 3f7f34d8-cb7c-4e65-9ef8-17afba7927de
# 2nd
2021-07-11 08:39:21 INFO  Job-Submitter Summary. Submitted job with scope d4d39f9c-95be-4774-bf94-0f2f091e1473
# 3rd
2021-07-11 08:39:26 INFO  Job-Submitter Summary. Submitted job with scope d2758017-248c-49f4-8174-5ed068fe65e5
# 4th
2021-07-11 08:39:31 INFO  Job-Submitter Summary. Submitted job with scope c9f3d7bb-824e-4136-8068-fa9dc87d3729
2021-07-11 08:39:37 INFO  Job-Submitter Summary. Submitted job with scope c9f3d7bb-824e-4136-8068-fa9dc87d3729
2021-07-11 08:39:42 INFO  Job-Submitter Summary. Submitted job with scope c9f3d7bb-824e-4136-8068-fa9dc87d3729
2021-07-11 08:39:47 INFO  Job-Submitter Summary. Submitted job with scope c9f3d7bb-824e-4136-8068-fa9dc87d3729
2021-07-11 08:39:52 INFO  Job-Submitter Summary. Submitted job with scope c9f3d7bb-824e-4136-8068-fa9dc87d3729
2021-07-11 08:39:57 INFO  Job-Submitter Summary. Submitted job with scope c9f3d7bb-824e-4136-8068-fa9dc87d3729
2021-07-11 08:40:03 INFO  Job-Submitter Summary. Submitted job with scope c9f3d7bb-824e-4136-8068-fa9dc87d3729
2021-07-11 08:40:08 INFO  Job-Submitter Summary. Submitted job with scope c1b81c58-ce9b-465b-a57e-2393cbce7790
2021-07-11 08:40:13 INFO  Job-Submitter Summary. Submitted job with scope c1b81c58-ce9b-465b-a57e-2393cbce7790

Here it's the 4th, but I think that's because there is a running job taking up a slot. I think we should also make max-workers an injectable env var (a rough sketch follows).
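
A rough sketch of reading the pool size from an env var. WorkerPoolFactory and the parsing details are hypothetical; only the MAX_WORKERS name and the current default of 4 come from the discussion above:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerPoolFactory {

  static ExecutorService createWorkerPool() {
    // Fall back to today's hardcoded value of 4 if the env var isn't set.
    final int maxWorkers = Integer.parseInt(
        System.getenv().getOrDefault("MAX_WORKERS", "4"));
    return Executors.newFixedThreadPool(maxWorkers);
  }
}
```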
