Catch terminated/killed workers #583
Hi, thanks so much for developing this package! @bguillod and I are using the future package for parallelization (indirectly via furrr). In some cases we get an error, whereas in other cases we only get a warning.

Neither of these is nicely reproducible. In either case, we presume that the child processes were killed due to out-of-memory issues. We're running the process(es) in a Docker container, where the manager is killing the processes. Since we have no certain way of knowing whether the host that ends up running the code will run into these OOM issues, we want to be able to catch these errors: we want to avoid incomplete output due to missing furrr chunks. Our questions are the following:
Hi, I'm glad to hear you find the package useful. I refer to such errors as "orchestration" errors, in contrast to run-time errors (e.g. log("a")). All orchestration errors signaled are of class FutureError, which you can catch when calling value(), e.g.

library(future)
plan(multicore)
## Emulate a crashed multicore worker
f <- future({ tools::pskill(Sys.getpid()); 42 })
v <- value(f)
#> Error: Failed to retrieve the result of MulticoreFuture (<none>)
#> from the forked worker (on localhost; PID 25336). Post-mortem
#> diagnostic: No process exists with this PID, i.e. the forked
#> localhost worker is no longer alive
#> In addition: Warning message:
#> In mccollect(jobs = jobs, wait = TRUE) :
#> 1 parallel job did not deliver a result
## Handle above error
v <- tryCatch(value(f), FutureError = identity)
if (inherits(v, "FutureError")) {
## Do something special for orchestration errors, e.g.
stop("Something went wrong")
}

Now, to the tricky part. When a parallel worker has crashed, and because everything should work the same regardless of future backend, one basically needs to treat the state of R (and its parallel setup) as broken, with the solution being to restart everything ... just like when the state of a sequential R session gets corrupted.

Remember, all code written using futures should work the same way regardless of future backend. As soon as we sidestep from that, the code moves one step down in the parallelization stack, where it makes assumptions about the backend. At that point, the developer becomes responsible for much more that the future framework otherwise takes care of. For my, and other developers', sanity, this is why I insist that everything should work the same regardless of future backend.

I can see, and have ideas for, how we can improve on some of this later on, but it is really tricky to get working in a robust way. A common feature request is to be able to terminate long-running futures, e.g. via a time-out mechanism. This might require being able to literally kill the underlying worker (also remotely). So, in order to support that use case, a lot of the above infrastructure has to be implemented. For example, the first step is to support killing workers, to avoid leaving stray processes behind that consume CPU slots/cycles, e.g. futureverse/parallelly#33. Oh well ... things will be improved, but it's on a long-term roadmap. I hope this explains the complexity of what it takes to handle orchestration errors other than by restarting everything.
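For illustration only, a minimal sketch of the "restart everything" approach could look like the following (the multisession backend, the retry-once policy, and the slow_task() function are assumptions for this example, not part of the original discussion):

library(future)
plan(multisession, workers = 2)

## Hypothetical task that may crash its worker, e.g. due to OOM
slow_task <- function(x) { Sys.sleep(1); x^2 }

run_once <- function() {
  f <- future(slow_task(21))
  tryCatch(value(f), FutureError = identity)
}

v <- run_once()
if (inherits(v, "FutureError")) {
  ## Orchestration error: treat the parallel setup as broken,
  ## reset the plan to relaunch the workers, then retry once
  plan(sequential)
  plan(multisession, workers = 2)
  v <- run_once()
}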
I'm a bit concerned about this; are you 100% sure they are not coupled with errors, as in the above example? Because, if there are warnings like these that are not coupled with signaled FutureError:s, then there's a bug in future.
So, unless the above is true and there's a bug in future, then, no, results should never ever be compromised. That's part of the core design of the future framework: everything should give the same results regardless of sequential or parallel processing, and regardless of where things are processed. If something fails because a parallel worker cannot complete the calculations, then there is an error. This is also true for higher-level packages. For example, the API of the future.apply package is designed to behave just like base R's apply functions. The same should be true for furrr. BTW, I thought about muffling these warnings when they are coupled with FutureError:s (#425 (comment)). However, for now, I decided to leave them in, because they might carry some additional information for troubleshooting purposes.
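To make that concrete, here is a small illustration (assuming a multisession backend) of the fact that a map-style call either returns complete results or signals a FutureError that can be caught; it never silently returns partial output:

library(future)
library(future.apply)
plan(multisession, workers = 2)

res <- tryCatch(
  future_lapply(1:4, function(x) x^2),
  FutureError = identity
)

if (inherits(res, "FutureError")) {
  ## An orchestration error occurred; no partial results are
  ## returned, so decide here whether to restart the backend
  ## and retry, or to abort
  stop("Parallel backend failed: ", conditionMessage(res))
}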