Identical RNG state for each task, despite setting different seeds in different tasks #265
Comments
FYI @HenrikBengtsson |
"The winter is coming" ... or parallel RNG is really hard when it comes to covering all scenarios. I've got a bit of a backlog and this is really hard, so I don't have time to dive into it right now. There could be a simple reason and fix for your case, or it could be something bigger that requires design changes. FWIW, there's also futureverse/future.apply#108, which may or may not be related. |
There's a root in the DAG, correct? If so, could you define a unique walk-through of the DAG from the root and outwards? That would allow you to order the nodes in a deterministic way (as long as the DAG does not change). With that, you could generate a unique set of RNG streams for your DAG so that they are assigned to the nodes in a deterministic way. In this sense, lapply/map/foreach uses a linear DAG where "next" is obvious (but one could come up with other walk-throughs that would also work, e.g. reverse) |
Yes. To rephrase: the recursiveness of Take a simple
graph LR
data --> summary
The topological sort is trivial:
library(igraph)
graph <- graph_from_literal(data-+summary)
names(topo_sort(graph))
#> [1] "data" "summary"
To use recursive L'Ecuyer streams, the stream of But then what if the user adds a new
graph LR
data --> model
data --> summary
The topological sort changes:
graph <- graph_from_literal(data-+model, data-+summary)
names(topo_sort(graph))
#> [1] "data" "model" "summary"
If streams are assigned recursively in topological order, then In addition, there are two DAGs now: an explicit DAG for the intended dependency relationships and an implicit DAG for the extra dependency relationships induced by the RNG streams.
graph LR
data --> model
data --> summary
model --> summary
In the general case, even medium-sized DAGs would contort into bizarre, unpredictable, disruptive abominations. To prevent the whole paradigm of If at some point there is a way to generate safer deterministic seeds independently of one another, I will switch |
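The concern about order-dependent streams can be made concrete with a small sketch. It assumes streams are handed out with parallel::nextRNGStream() in topological order, reusing the two orderings printed by igraph above. `assign_streams()` is a hypothetical helper written for this illustration, not a function from targets or future.

```r
## Hypothetical helper: give each node one L'Ecuyer-CMRG stream,
## in the order the nodes are listed
assign_streams <- function(node_order, seed = 42L) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  stream <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
  streams <- list()
  for (node in node_order) {
    streams[[node]] <- stream
    ## Jump ahead to the start of the next stream
    stream <- parallel::nextRNGStream(stream)
  }
  streams
}

## Topological orders copied from the igraph output above
before <- assign_streams(c("data", "summary"))
after <- assign_streams(c("data", "model", "summary"))

## The root keeps its stream ...
same_data <- identical(before$data, after$data)
## ... but 'summary' gets a different stream once 'model' is added,
same_summary <- identical(before$summary, after$summary)
## because 'model' now occupies the position 'summary' used to have
model_took_over <- identical(before$summary, after$model)
```

Under these assumptions, adding an unrelated node changes the RNG state of an existing node, even though none of that node's real dependencies changed.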
But the original issue in this thread is more serious: different tasks in the same |
Just realized #265 (comment) had typos in key places. Now fixed. |
A bit more context for others who might jump in: |
(sorry @DavisVaughan for adding noise here; @wlandau, feel free to move this over to another issue of yours if you think there's a better place to discuss this)
I might be missing something, but the idea that we use for map-reduce calls in Futureverse is to pre-generate the RNG streams for all "tasks". This is expensive when there are many tasks, but I don't think there's another way to achieve this. Here's the gist:
## Imaginary tasks
X <- 1:20
## Tasks are processed in random order.
## Can also skip already done tasks.
## Result will be the same regardless.
idxs <- sample.int(length(X), size = length(X))
## Pre-generate deterministic RNG streams for _all_ tasks
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
seeds <- list()
seeds[[1]] <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
for (kk in 2:length(X)) seeds[[kk]] <- parallel::nextRNGStream(seeds[[kk-1]])
## Process tasks in the order given above, with fully deterministic RNG seeds
y <- rep(NA_real_, times = length(X))
for (kk in idxs) {
## Use deterministic RNG stream for this task
seed <- seeds[[kk]]
assign(".Random.seed", value = seed, envir = globalenv(), inherits = FALSE)
y[kk] <- rnorm(n = 1L)
}
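One way to check the claim that the result is the same regardless of processing order is to wrap the gist in a function and run it with two different orders. This is only a sketch: `run_tasks()` is a hypothetical helper name introduced here, and it changes the RNG kind of the session as a side effect.

```r
## Hypothetical helper: process tasks in the order given by 'idxs',
## each with its own pre-generated L'Ecuyer-CMRG stream
run_tasks <- function(idxs, n_tasks = 20L, seed = 42L) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  ## Pre-generate one stream per task, independent of 'idxs'
  seeds <- vector("list", n_tasks)
  seeds[[1]] <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
  for (kk in seq_len(n_tasks)[-1]) {
    seeds[[kk]] <- parallel::nextRNGStream(seeds[[kk - 1]])
  }
  y <- rep(NA_real_, times = n_tasks)
  for (kk in idxs) {
    ## Restore the stream that belongs to task 'kk' before drawing
    assign(".Random.seed", value = seeds[[kk]], envir = globalenv())
    y[kk] <- rnorm(n = 1L)
  }
  y
}

y_forward <- run_tasks(1:20)
y_reverse <- run_tasks(20:1)
stopifnot(identical(y_forward, y_reverse))
```

Because each task's stream is fixed by its index alone, the order of processing has no effect on the values stored in `y`.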
Moved to ropensci/targets#1139, starting with ropensci/targets#1139 (comment). |
At least the part about why |
From #251 and from vignette("parallel", package = "parallel"), I understand the desire to assign widely-spaced L'Ecuyer RNG streams to parallel workers. L'Ecuyer and parallel::nextRNGStream() minimize the risk of overlapping sequences. Unfortunately, for targets and tarchetypes, there is no such thing as a "next stream" because tasks run in a DAG instead of a linear sequence. In addition, targets has special responsibilities when it comes to reproducibility, so each target must have its own reproducible RNG state which does not depend on the parallel backend or the RNG state of another target. See ropensci/targets#1139 and https://books.ropensci.org/targets/random.html.

All this is to say, my packages rely on the ability to call set.seed() from inside a task.

In ropensci/tarchetypes#156, @solmos noticed that in the context of tarchetypes, future_map() is forcing each task inside a worker to have the same RNG state. This is a different problem than #251 because set.seed() is called within the task. See below for a reprex. Oddly enough, the results are incorrect in my local R session but correct when I call it inside the actual reprex package.

I would expect to see:
Session info: