Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve only failed jobs #136

Closed
morgoth opened this issue Sep 14, 2020 · 4 comments · Fixed by #145
Closed

Preserve only failed jobs #136

morgoth opened this issue Sep 14, 2020 · 4 comments · Fixed by #145
Labels
enhancement New feature or request

Comments

@morgoth
Copy link
Collaborator

morgoth commented Sep 14, 2020

Looks like currently there is no configuration option to preserve only failed jobs and delete successfully finished ones, as preserve_job_records will preserve all.

I think it would be very helpful to have such functionality out of the box (or by config) to be possible to easily investigate failed job errors.

Is this easy to do on my own, with currently available GoodJob API?

@bensheldon bensheldon added the enhancement New feature or request label Sep 15, 2020
@bensheldon
Copy link
Owner

@morgoth I think this is a great feature request and could see it implemented in GoodJob.

I looked at how easy it would be to implement outside of GoodJob; GoodJob isn't currently instrumented to make it easy :-( So there are two options:

  1. Monkeypatch GoodJob::Job#perform by wrapping the original method and then deleting the job if the job is finished and there isn't an error.
  2. Write your own rake task that deletes finished/unerrored jobs and delete them regularly. GoodJob::Job.where(errored_at: nil).where.not(finished_at: nil).delete_all

@morgoth
Copy link
Collaborator Author

morgoth commented Sep 16, 2020

@bensheldon How would you see it implemented?

I was thinking about changing config values for GoodJob.preserve_job_records to be something like :never, :always, :on_error
and then changing the perform method signature to something like def perform(preserve_record: GoodJob.preserve_job_records, reperform_on_standard_error: GoodJob.reperform_jobs_on_standard_error)

Then I guess we would also need to set finished_at on such errored record so it is not retried again.

Do you have some different idea on your mind?

@6temes
Copy link

6temes commented Sep 16, 2020

Sidekiq has an interesting way to manage the failed jobs:

image

"Failed jobs" means the number of times a job has failed, even if, maybe, it eventually ran successfully after several retries.

"Dead jobs" is a queue where the jobs that have been retried failed more than N number of times go.

Usually, when you have a problem with a job, you catch it in the retries queue because they usually retry during a couple of times, but when Sidekiq gives up, you can find them in "Dead jobs."

I think that this is a really useful pattern.

So, my point is that, keeping jobs that have ran successfully can be optional, but dead jobs should never be removed.

@bensheldon
Copy link
Owner

bensheldon commented Sep 16, 2020

@morgoth Yes! That's very close to my initial thoughts. Some suggested adjustments:

  • The Boolean signature for GoodJob.preserve_job_records= needs to be preserved for compatibility. I think the smallest change would be to allow nil, False, True, and the new one :on_error.

  • For GoodJob::Job#perform, I've been thinking that we can remove the arguemtns entirely and use GoodJob globals within the #perform itself. I parameterized #perform originally because I wasn't sure how the globals would evolve, but I think it's cleaner to remove them altogether. The method definition would become simply def perform.

  • I think we could deprecate the predicate GoodJob.preserve_job_records? and instead ask that the value of the accessor GoodJob.preserve_job_records be used directly.

  • Then I guess we would also need to set finished_at on such errored record so it is not retried again.

    Exactly! I spent an uncomfortable amount of time staring at this part of the code recently, but it's where I think the changes would take place

    if rescued_error && reperform_on_standard_error
    save!
    else
    self.finished_at = Time.current
    if destroy_after
    destroy!
    else
    save!
    end
    end

@6temes Thanks for sharing how Sidekiq organizes them. That's really helpful. I like the idea of Failed vs Dead.

I'm trying to think about the analogous data with GoodJob and it's complicated. A wrinkle of GoodJob is that an ActiveJob job, when retried, will generate a new GoodJob Job in the database. In other words, there is not a 1-to-1 correspondence between an ActiveJob Job and a GoodJob Job.

To determine a "Dead" ActiveJob Job would require identifying Errored GoodJob Jobs that don't have a newer matching enqueued job. That's doable, but it might be unsatisfyingly messy. I'm imagining, for example, that a GoodJob::Job, if it is a retried job, could store a reference to the previously errored job that generated it. To give an example of the messiness, currently tracking the error state on retries/discards requires passing global state around:

initializer "good_job.active_job_notifications" do
ActiveSupport::Notifications.subscribe "enqueue_retry.active_job" do |event|
GoodJob::CurrentExecution.error_on_retry = event.payload[:error]
end
ActiveSupport::Notifications.subscribe "discard.active_job" do |event|
GoodJob::CurrentExecution.error_on_discard = event.payload[:error]
end
end

I have appetite for using global state to enable these features, but, well, it's complicated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants