-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support manual task retries? #117
Comments
Thinking about something like a retry_mirai() function (doesn’t have to be called that) which simply accepts a mirai object and resubmits it to its compute profile. This could also complement the current stop_mirai(), and it would be convenient in a variety of situations where the user wants to retry on certain kinds of errors but does not want to have to re-enter the R command or data to call mirai() from scratch. |
I see your point, and this is actually what the NNG-level retry system is good for. Unfortunately, if it returns an error then the original data is freed and can't be re-sent. If I try and implement a manual system it would seem to require another copy of the data to hang on to, or if not then I'd have to make my own fork of NNG, which prevents usage of a system-installed library (and is not a route I want to go down unless I have to). Perhaps I have to enable auto-retry as an option. I can default to 'on' for now, and then if |
That seems ideal, thanks! If the user opts out of automatic retries, will this free up memory on the dispatcher? From Were there any issues with bugs and automatic retries, or was 6a72261 motivated for a different reason? |
Doesn't help unfortunately. It should be possible to free the data early and I've requested this from GDA and he agrees, but he has many higher priority items on his plate...
Purely for user ergonomics, as it won't be apparent that a task has crashed unless you know to check |
There is now a 'retry' argument at |
Thanks @shikokuchuo, rerunning https://github.com/wlandau/crew/blob/main/tests/local/test-backlog_crash.R confirms that this works on my end. When the next release of |
Previously, the dispatcher automatically retried tasks which did not get a chance to finish. 6a72261 removes this feature and causes these tasks to return with an "error 19" code.
Without automatic retries, the dispatcher does not have to hold on to data from unresolved tasks, so I can see how this would increase memory efficiency. However, it is a setback for {crew} because there is no longer a way to recover from crashed workers. Workers can crash for perfectly normal and expected reasons: for example, if a spike in spot instance pricing causes an AWS Batch worker to exit. We want to retry the task in those cases.
Would it be possible to manually retry a
mirai
object? If so,crew
would be able to easily resubmit the task when it tries to collect the result and instead gets the "error 19" code. It might also be good to take this opportunity to limit the number of retries for a task, so keeping track of the number of tries for each task would also be valuable.The text was updated successfully, but these errors were encountered: