The current default of future.seed=FALSE is not set in stone, but I need to think about it more. Ideally, if we could detect whether parallel RNG is needed or not, it could be set automatically. But I doubt that will ever be possible: it would require annotating all functions to specify whether they use RNGs or not. There was also the discussion of detecting when RNGs were indeed used even if future.seed=FALSE. If detected, a warning or even an error could be produced. This would not be too hard to implement and the overhead would be minimal. This should prevent calling future_lapply() et al. without future.seed=TRUE when it is truly needed. (This has been on my radar for a while.)
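The detection idea above can be illustrated in base R by comparing the RNG state before and after evaluating an expression. This is only a minimal sketch of the principle, not how future or future.apply implement it; `rng_used()` is a hypothetical helper name introduced here for illustration.

```r
# Hypothetical helper (not part of future/future.apply):
# returns TRUE if evaluating `expr` changed the global RNG state.
rng_used <- function(expr) {
  genv <- globalenv()
  before <- genv$.Random.seed  # NULL if the RNG has not been initialised yet
  force(expr)                  # evaluate the (lazily passed) expression
  after <- genv$.Random.seed
  !identical(before, after)
}

no_rng   <- rng_used(sum(1:10))  # deterministic code leaves the state untouched
with_rng <- rng_used(rnorm(1))   # drawing a random number advances the state
```

A check of this kind only costs one comparison of the seed vector per evaluated expression, which is consistent with the expectation above that the overhead would be minimal.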
Regarding a dynamic setting of future.seed: Would it make sense to check which future::plan is requested? If it is a parallel one, turn it on internally by default - if not, leave it off.
This way users would use the default RNG kind when using plan(sequential) and the "L'Ecuyer-CMRG" one in parallel scenarios.
Due to future.seed = TRUE, both would magically work with just set.seed() and there is no overhead when it is not needed.
When it comes to scenarios where reproducibility is not wanted but only speed: It would be great if users could turn future.seed off on their side and not rely on what a package developer set it to within the package.
Hence I'd like an option to override future.seed at the user level when setting the future::plan(). This is independent of all the other ideas in here.
With all the options from above, practical scenarios could look as follows:
Parallel processes are reproducible by default because "L'Ecuyer-CMRG" is used via a dynamic future.seed argument
No overhead for sequential runs (future.seed = FALSE always). If a sequential plan detected future.seed = TRUE, a warning could be issued
If speed matters more than reproducibility, users can turn off the latter by setting future::plan(<plan>, future.seed = FALSE), which would take precedence over any settings downstream in any future_*apply() call
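The three scenarios above could be expressed as a small resolution rule. The sketch below only illustrates the proposal and is not existing behaviour: `resolve_future_seed()` and the option name `"myproposal.future.seed"` are hypothetical, and the option stands in for the proposed (non-existing) `future.seed` argument to `plan()`. The only real API referenced is `future::nbrOfWorkers()`, used as the default way to tell a parallel plan from a sequential one.

```r
# Hypothetical helper illustrating the proposal; not part of any package.
#   requested:     the future.seed value set by the package developer (or NULL)
#   workers:       number of workers of the current plan
#   user_override: hypothetical user-level setting that wins over everything
resolve_future_seed <- function(requested = NULL,
                                workers = future::nbrOfWorkers(),
                                user_override = getOption("myproposal.future.seed")) {
  # Scenario 3: a user-level setting takes precedence over downstream settings
  if (!is.null(user_override)) return(user_override)

  # Scenario 2: sequential plans never pay the seed overhead
  if (workers <= 1L) {
    if (isTRUE(requested)) {
      warning("future.seed = TRUE was requested under a sequential plan")
    }
    return(FALSE)
  }

  # Scenario 1: parallel plans are reproducible by default
  if (is.null(requested)) TRUE else requested
}

resolve_future_seed(workers = 4L)                          # parallel default
resolve_future_seed(workers = 1L)                          # sequential default
resolve_future_seed(TRUE, workers = 4L, user_override = FALSE)  # user wins
```

Passing `workers` explicitly, as in the calls above, means the default `future::nbrOfWorkers()` is never evaluated, so the sketch can be tried without the future package installed.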
The overall design objective is that futures should give the exact same results regardless of backend. I believe that is one of the core strengths of the Future API. It minimizes surprises and helps developers and users focus on the task/analysis at hand without having to worry about various ifs and buts. This should also explain why I'm hesitant/conservative about introducing features to plan() where the user can potentially break the intention that the developer had in mind. Having said that, I'm constantly trying to figure out ways to allow for adjustments without breaking this objective.
Regarding RNG in the map-reduce pattern: it is known that the current, very conservative approach that future.apply takes, which pre-generates an RNG seed for each element processed, is time consuming. This overhead can be ignored in very long-running tasks, but for quicker ones it becomes a show stopper. There is an open future.apply issue (SPEED: Add support for per-chunk/per-future seeds future.apply#20) which would open up for producing a single RNG seed per future. This would break perfect reproducibility, but would still be statistically sound. This is what parallel::mclapply() does by default. This approach I believe is safe to introduce, because it is in the control of the developer and not the user. This should solve your slowness issues.
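The difference between per-element and per-chunk seeds can be seen with base R's L'Ecuyer-CMRG streams. The following is a simplified sketch under stated assumptions, not future.apply's actual implementation: `draw()` and `draw_chunk()` are hypothetical helpers, and the "chunks" are processed sequentially to keep the example self-contained. Note that it switches the session's RNG kind and overwrites the global RNG state.

```r
RNGkind("L'Ecuyer-CMRG")
set.seed(42)

# Pre-generate one independent RNG stream per element (the costly part).
n <- 6
streams <- vector("list", n)
s <- .Random.seed
for (i in seq_len(n)) {
  streams[[i]] <- s
  s <- parallel::nextRNGStream(s)
}

# Per-element seeds: element i always starts from its own stream.
draw <- function(i) {
  assign(".Random.seed", streams[[i]], envir = globalenv())
  rnorm(1)
}
one_chunk  <- vapply(1:6, draw, numeric(1))
two_chunks <- c(vapply(1:3, draw, numeric(1)), vapply(4:6, draw, numeric(1)))

# Per-chunk seeds: the stream is set once per chunk, then drawn from in order.
draw_chunk <- function(idx, stream) {
  assign(".Random.seed", stream, envir = globalenv())
  vapply(idx, function(i) rnorm(1), numeric(1))
}
per_chunk_a <- draw_chunk(1:6, streams[[1]])
per_chunk_b <- c(draw_chunk(1:3, streams[[1]]), draw_chunk(4:6, streams[[2]]))

identical(one_chunk, two_chunks)    # per-element: independent of chunking
identical(per_chunk_a, per_chunk_b) # per-chunk: depends on how data is split
```

With per-element seeds the results are the same however the elements are split across workers, at the cost of generating `n` streams up front. With per-chunk seeds only one stream per chunk is needed, but the values for elements 4 to 6 change when the chunking changes, which is the loss of perfect reproducibility mentioned above.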
Quoting from our mail conversation