Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support manual task retries? #117

Closed
wlandau opened this issue May 28, 2024 · 6 comments
Closed

Support manual task retries? #117

wlandau opened this issue May 28, 2024 · 6 comments

Comments

@wlandau
Copy link

wlandau commented May 28, 2024

Previously, the dispatcher automatically retried tasks which did not get a chance to finish. 6a72261 removes this feature and causes these tasks to return with an "error 19" code.

Without automatic retries, the dispatcher does not have to hold on to data from unresolved tasks, so I can see how this would increase memory efficiency. However, it is a setback for {crew} because there is no longer a way to recover from crashed workers. Workers can crash for perfectly normal and expected reasons: for example, if a spike in spot instance pricing causes an AWS Batch worker to exit. We want to retry the task in those cases.

Would it be possible to manually retry a mirai object? If so, crew would be able to easily resubmit the task when it tries to collect the result and instead gets the "error 19" code. It might also be good to take this opportunity to limit the number of retries for a task, so keeping track of the number of tries for each task would also be valuable.

@wlandau
Copy link
Author

wlandau commented May 28, 2024

Thinking about something like a retry_mirai() function (doesn’t have to be called that) which simply accepts a mirai object and resubmits it to its compute profile.

This could also complement the current stop_mirai(), and it would be convenient in a variety of situations where the user wants to retry on certain kinds of errors but does not want to have to re-enter the R command or data to call mirai() from scratch.

@shikokuchuo
Copy link
Owner

I see your point, and this is actually what the NNG-level retry system is good for. Unfortunately, if it returns an error then the original data is freed and can't be re-sent.

If I try and implement a manual system it would seem to require another copy of the data to hang on to, or if not then I'd have to make my own fork of NNG, which prevents usage of a system-installed library (and is not a route I want to go down unless I have to).

Perhaps I have to enable auto-retry as an option. I can default to 'on' for now, and then if crew can set this once the new mirai is out, I can then switch the default to 'off' in the next version.

@wlandau
Copy link
Author

wlandau commented May 28, 2024

Perhaps I have to enable auto-retry as an option. I can default to 'on' for now, and then if crew can set this once the new mirai is out, I can then switch the default to 'off' in the next version.

That seems ideal, thanks!

If the user opts out of automatic retries, will this free up memory on the dispatcher? From crew's side, this would be one reason a user might want to opt out. There have been reports in the past of OOM errors on the dispatcher.

Were there any issues with bugs and automatic retries, or was 6a72261 motivated for a different reason?

@shikokuchuo
Copy link
Owner

If the user opts out of automatic retries, will this free up memory on the dispatcher? From crew's side, this would be one reason a user might want to opt out. There have been reports in the past of OOM errors on the dispatcher.

Doesn't help unfortunately. It should be possible to free the data early and I've requested this from GDA and he agrees, but he has many higher priority items on his plate...

Were there any issues with bugs and automatic retries, or was 6a72261 motivated for a different reason?

Purely for user ergonomics, as it won't be apparent that a task has crashed unless you know to check status(). No bugs and not an issue to re-enable.

@shikokuchuo
Copy link
Owner

There is now a 'retry' argument at dispatcher() implemented in bd7fd6c. Defaults to TRUE for now.

@wlandau
Copy link
Author

wlandau commented May 29, 2024

Thanks @shikokuchuo, rerunning https://github.com/wlandau/crew/blob/main/tests/local/test-backlog_crash.R confirms that this works on my end. When the next release of mirai is out, I will expose the retry argument in crew to set a default of TRUE and allow debugging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants