Support manual task retries? #117

wlandau · 2024-05-28T13:02:32Z

Previously, the dispatcher automatically retried tasks which did not get a chance to finish. 6a72261 removes this feature and causes these tasks to return with an "error 19" code.

Without automatic retries, the dispatcher does not have to hold on to data from unresolved tasks, so I can see how this would increase memory efficiency. However, it is a setback for {crew} because there is no longer a way to recover from crashed workers. Workers can crash for perfectly normal and expected reasons: for example, if a spike in spot instance pricing causes an AWS Batch worker to exit. We want to retry the task in those cases.

Would it be possible to manually retry a mirai object? If so, crew would be able to easily resubmit the task when it tries to collect the result and instead gets the "error 19" code. It might also be good to take this opportunity to limit the number of retries for a task, so keeping track of the number of tries for each task would also be valuable.

The text was updated successfully, but these errors were encountered:

wlandau · 2024-05-28T13:20:34Z

Thinking about something like a retry_mirai() function (doesn’t have to be called that) which simply accepts a mirai object and resubmits it to its compute profile.

This could also complement the current stop_mirai(), and it would be convenient in a variety of situations where the user wants to retry on certain kinds of errors but does not want to have to re-enter the R command or data to call mirai() from scratch.

shikokuchuo · 2024-05-28T13:34:53Z

I see your point, and this is actually what the NNG-level retry system is good for. Unfortunately, if it returns an error then the original data is freed and can't be re-sent.

If I try and implement a manual system it would seem to require another copy of the data to hang on to, or if not then I'd have to make my own fork of NNG, which prevents usage of a system-installed library (and is not a route I want to go down unless I have to).

Perhaps I have to enable auto-retry as an option. I can default to 'on' for now, and then if crew can set this once the new mirai is out, I can then switch the default to 'off' in the next version.

wlandau · 2024-05-28T13:42:06Z

Perhaps I have to enable auto-retry as an option. I can default to 'on' for now, and then if crew can set this once the new mirai is out, I can then switch the default to 'off' in the next version.

That seems ideal, thanks!

If the user opts out of automatic retries, will this free up memory on the dispatcher? From crew's side, this would be one reason a user might want to opt out. There have been reports in the past of OOM errors on the dispatcher.

Were there any issues with bugs and automatic retries, or was 6a72261 motivated for a different reason?

shikokuchuo · 2024-05-28T13:46:55Z

If the user opts out of automatic retries, will this free up memory on the dispatcher? From crew's side, this would be one reason a user might want to opt out. There have been reports in the past of OOM errors on the dispatcher.

Doesn't help unfortunately. It should be possible to free the data early and I've requested this from GDA and he agrees, but he has many higher priority items on his plate...

Were there any issues with bugs and automatic retries, or was 6a72261 motivated for a different reason?

Purely for user ergonomics, as it won't be apparent that a task has crashed unless you know to check status(). No bugs and not an issue to re-enable.

shikokuchuo · 2024-05-28T14:26:28Z

There is now a 'retry' argument at dispatcher() implemented in bd7fd6c. Defaults to TRUE for now.

wlandau · 2024-05-29T14:22:16Z

Thanks @shikokuchuo, rerunning https://github.com/wlandau/crew/blob/main/tests/local/test-backlog_crash.R confirms that this works on my end. When the next release of mirai is out, I will expose the retry argument in crew to set a default of TRUE and allow debugging.

shikokuchuo closed this as completed May 28, 2024

wlandau mentioned this issue May 29, 2024

set retry = TRUE in mirai::dispatcher() wlandau/crew#170

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support manual task retries? #117

Support manual task retries? #117

wlandau commented May 28, 2024

wlandau commented May 28, 2024

shikokuchuo commented May 28, 2024

wlandau commented May 28, 2024

shikokuchuo commented May 28, 2024

shikokuchuo commented May 28, 2024

wlandau commented May 29, 2024

Support manual task retries? #117

Support manual task retries? #117

Comments

wlandau commented May 28, 2024

wlandau commented May 28, 2024

shikokuchuo commented May 28, 2024

wlandau commented May 28, 2024

shikokuchuo commented May 28, 2024

shikokuchuo commented May 28, 2024

wlandau commented May 29, 2024