Catch terminated/killed workers #583
Hi, thanks so much for developing this package! @bguillod and I are using the future package for parallelization (indirectly via furrr). In some cases we get an error, whereas in other cases we only get a warning.

Neither of these is nicely reproducible. In either case, we presume that the child processes were killed due to out-of-memory issues. We're running the process(es) in a Docker container, where the manager is killing the processes. Since we have no certain way of knowing whether the host that ends up running the code will run into these OOM issues, we want to be able to catch these errors: we want to avoid incomplete output due to missing furrr chunks. Our questions are the following:
Hi, I'm glad to hear you find the package useful. I refer to such errors as "orchestration" errors, in contrast to run-time errors (e.g. log("a")). All orchestration errors signaled are of class FutureError, which you can catch when calling value(), e.g.

library(future)
plan(multicore)
## Emulate a crashed multicore worker
f <- future({ tools::pskill(Sys.getpid()); 42 })
v <- value(f)
#> Error: Failed to retrieve the result of MulticoreFuture (<none>)
#> from the forked worker (on localhost; PID 25336). Post-mortem
#> diagnostic: No process exists with this PID, i.e. the forked
#> localhost worker is no longer alive
#> In addition: Warning message:
#> In mccollect(jobs = jobs, wait = TRUE) :
#> 1 parallel job did not deliver a result
## Handle above error
v <- tryCatch(value(f), FutureError = identity)
if (inherits(v, "FutureError")) {
## Do something special for orchestration errors, e.g.
stop("Something went wrong")
}

Now, to the tricky part. When a parallel worker has crashed, and because everything should work the same regardless of future backend, one basically needs to treat the state of R (and its parallel setup) as broken, with the solution being to restart everything ... just like when the state of a sequential R session gets corrupted.

Remember, all code written using futures should work the same way regardless of future backend. As soon as we sidestep from that, the code moves one step down in the parallelization stack, where it makes assumptions about the backend. At that point, the developer becomes responsible for much more that the future framework otherwise takes care of. For my, and other developers', sanity, this is why I insist that everything should work the same regardless of future backend.

I can see, and have ideas for, how we can improve on some of this later on, but it is really tricky to get working in a robust way. A common feature request is to be able to terminate long-running futures, e.g. via a time-out mechanism. This might require being able to literally kill the underlying worker (also remotely). So, in order to support that use case, a lot of the above infrastructure has to be implemented. For example, the first step is to support killing workers, to avoid leaving stray processes behind that consume CPU slots/cycles, e.g. futureverse/parallelly#33. Oh well ... things will be improved, but it's on a long-term roadmap. I hope this explains the complexity of what it takes to handle orchestration errors other than by restarting everything.
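For illustration only, a minimal sketch of the "restart everything" approach could look like the following (the multisession backend, the retry-once policy, and the slow_task() function are assumptions for this example, not part of the original discussion):

library(future)
plan(multisession, workers = 2)

## Hypothetical task that may crash its worker, e.g. due to OOM
slow_task <- function(x) { Sys.sleep(1); x^2 }

run_once <- function() {
  f <- future(slow_task(21))
  tryCatch(value(f), FutureError = identity)
}

v <- run_once()
if (inherits(v, "FutureError")) {
  ## Orchestration error: treat the parallel setup as broken,
  ## reset the plan to relaunch the workers, then retry once
  plan(sequential)
  plan(multisession, workers = 2)
  v <- run_once()
}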
I'm a bit concerned about this; are you 100% sure they are not coupled with errors, as in the above example? Because, if there are warnings like these that are not coupled with signaled FutureError:s, then there's a bug in future.
So, unless the above is true and there's a bug in future, then, no, results should never ever be compromised. That's part of the core design of the future framework: everything should give the same results regardless of sequential or parallel processing, and regardless of where things are processed. If something fails because a parallel worker cannot complete the calculations, then there is an error. This is also true for higher-level packages. For example, the API of the future.apply package is designed to behave just like base R's apply functions. The same should be true for furrr. BTW, I thought about muffling these warnings when they are coupled with FutureError:s (#425 (comment)). However, for now, I decided to leave them in, because they might carry some additional information for troubleshooting purposes.
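To make that concrete, here is a small illustration (assuming a multisession backend) of the fact that a map-style call either returns complete results or signals a FutureError that can be caught; it never silently returns partial output:

library(future)
library(future.apply)
plan(multisession, workers = 2)

res <- tryCatch(
  future_lapply(1:4, function(x) x^2),
  FutureError = identity
)

if (inherits(res, "FutureError")) {
  ## An orchestration error occurred; no partial results are
  ## returned, so decide here whether to restart the backend
  ## and retry, or to abort
  stop("Parallel backend failed: ", conditionMessage(res))
}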