Divide and conquer with workaround for batch submission #717
-
From what I understand, when working with computer clusters, the first approach is to try batchtools via future.batchtools. The system my coauthor is working with is HTCondor, which unfortunately isn't currently supported (see futureverse/future.batchtools#29 and mllg/batchtools#68). I don't have access to HTCondor myself, so I won't be able to work on adding support. I thus find myself trying to come up with a workaround. There are a few details specific to our workflow and use of future:
From what I understand (I'm new to computing clusters and HPC), it's common to request many single-core jobs from the batch system; this is recommended in various sources (e.g., https://jepusto.github.io/Designing-Simulations-in-R/parallel-processing.html). The challenge, then, is how to set up the seed on each instance. The goal is to achieve numerical reproducibility, using future, no matter how the code is run. I believe I need to generate the seeds using future, and each instance would fast-forward to the appropriate seed. For this, perhaps I can use … After figuring out the seed issue, I just need to retrieve the saved .Rds files, import them, and merge them together. In theory, this should lead to numerical reproducibility. Has anyone done something like this? Is the strategy at least reasonable (given the constraints)? I'm working on a general Monte Carlo R package that relies on future. The above details should be sufficient, but for completeness the full package is here: montetools.
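In case it helps, here is a rough sketch of the fast-forwarding step I have in mind, using L'Ecuyer-CMRG streams via `parallel::nextRNGStream()`. The names `my_fcn()` and `out_dir` are placeholders for our actual task function and output directory, and I'm not certain this reproduces the exact seed sequence that future generates internally:

```r
library(parallel)  # for nextRNGStream()

## Run the k-th task in a single-core job: fast-forward to the
## k-th L'Ecuyer-CMRG stream, then save the result as an .rds file.
run_task_k <- function(k, X, out_dir) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(0xBEEF)
  seed <- .Random.seed
  if (k > 1) for (i in seq_len(k - 1)) seed <- nextRNGStream(seed)
  assign(".Random.seed", seed, envir = globalenv())
  res <- my_fcn(X[[k]])  # placeholder for the actual Monte Carlo task
  saveRDS(res, file.path(out_dir, sprintf("task-%04d.rds", k)))
}

## Later, on the submit node: retrieve and merge the saved files.
files <- sort(list.files(out_dir, pattern = "^task-.*\\.rds$",
                         full.names = TRUE))
results <- lapply(files, readRDS)
```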
Replies: 2 comments 1 reply
-
@HenrikBengtsson No problem if you don't have time to look at the details of this, but I'd be curious whether the approach I suggest at least sounds reasonable to you.
-
Using:

```r
set.seed(0xBEEF)
y <- future_lapply(X, FUN = my_fcn, future.seed = TRUE)
```

should be 100% reproducible, i.e. there is no need to orchestrate the initial random seeds (`.Random.seed`) yourself.

If you're concerned about some tasks failing and not wanting to rerun everything from scratch, you can use memoization for `my_fcn()`. The gist:

```r
my_fcn <- function(x) {
  file <- x_to_rds(x)
  ## Already processed?
  if (already_exists(file)) return(file)
  ## Otherwise, run the analysis
  file <- full_run(x)
  file
}
```

Yes, this would be a bit wasteful on the job scheduler, because you're requesting jobs for steps that will be skipped. Right now, we don't have a mechanism to avoid this. If we could run …

If you really want to pre-generate your own list of random seeds:

```r
old_plan <- plan(sequential)
set.seed(0xBEEF)
seeds <- future_lapply(X, FUN = function(x) get(".Random.seed", envir = globalenv()), future.seed = TRUE)
plan(old_plan)
```

You can then use these seeds in your calls as:

```r
y <- future_lapply(X, FUN = my_fcn, future.seed = seeds)
```

and if you only want to process a subset of `X`:

```r
idxs <- c(1, 3, 8)
y[idxs] <- future_lapply(X[idxs], FUN = my_fcn, future.seed = seeds[idxs])
```
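A quick way to convince yourself that the pre-generated seeds give backend-independent results — a sketch, assuming `X`, `my_fcn()`, and the `seeds` list from above are already defined:

```r
plan(sequential)
y1 <- future_lapply(X, FUN = my_fcn, future.seed = seeds)

plan(multisession)
y2 <- future_lapply(X, FUN = my_fcn, future.seed = seeds)

stopifnot(identical(y1, y2))  # same results regardless of backend
```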