About the future.seed default #365

Open
pat-s opened this issue Mar 18, 2020 · 1 comment

pat-s commented Mar 18, 2020

Quoting from our mail conversation:

The current default, future.seed=FALSE, is not set in stone, but I need to think about it more. Ideally, if we could detect whether parallel RNG is needed or not, it could be set automatically. But I doubt that will ever be possible: it would require annotating all functions to specify whether they use RNGs or not. There was also the discussion of detecting when RNGs were in fact used even though future.seed=FALSE. If detected, a warning or even an error could be produced. This would not be too hard to implement and the overhead would be minimal. This should prevent calling future_lapply() et al. without future.seed=TRUE when it is truly needed. (This has been on my radar for a while.)
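
The detection idea in the quote can be illustrated with a minimal sketch in base R, assuming it is enough to compare .Random.seed before and after evaluating an expression. The helper uses_rng() is hypothetical and is not part of future or future.apply:

```r
# Hypothetical helper (not part of future or future.apply): reports whether
# evaluating `expr` changed the state of R's random number generator.
uses_rng <- function(expr) {
  genv <- globalenv()
  read_seed <- function() {
    if (exists(".Random.seed", envir = genv, inherits = FALSE)) {
      get(".Random.seed", envir = genv, inherits = FALSE)
    } else {
      NULL
    }
  }
  before <- read_seed()
  force(expr)  # the argument is a promise, so it is evaluated here
  after <- read_seed()
  !identical(before, after)
}

uses_rng(sum(1:10))  # FALSE: no random numbers drawn
uses_rng(rnorm(1))   # TRUE: the RNG state advanced
```

A check like this could trigger the warning or error mentioned above whenever a future draws random numbers without a parallel-safe seed.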

Regarding a dynamic setting of future.seed: Would it make sense to check which future::plan is requested? If it is a parallel one, turn it on internally by default - if not, leave it off.

This way users would use the default RNG kind when using plan(sequential) and the "L'Ecuyer-CMRG" one in parallel scenarios.
Due to future.seed = TRUE, both would magically work with just set.seed() and there is no overhead when it is not needed.

For scenarios where speed matters and reproducibility does not: it would be great if users could turn future.seed off on their side and not have to rely on whatever the package developers set it to within the package.
Hence I'd like an option to override future.seed at the user level when setting the future::plan(). This is unrelated to all the other ideas in here.

With all the options from above, practical scenarios could look as follows:

  • Parallel processes are reproducible by default because "L'Ecuyer-CMRG" is used via a dynamic future.seed argument
  • No overhead for sequential runs (future.seed = FALSE always). If a sequential plan detected future.seed = TRUE, a warning could be issued
  • If speed matters more than reproducibility, users can turn off the latter by setting future::plan(<plan>, future.seed = FALSE), which will take precedence over any settings downstream in any future_*apply() call
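
The precedence described in these scenarios can be written down as a small decision function. This is only a sketch of the proposal: resolve_future_seed() and its arguments are hypothetical names, and plan() does not accept a future.seed argument at the time of this issue.

```r
# Hypothetical helper illustrating the proposed precedence; none of these
# names exist in future or future.apply.
#   plan_is_parallel: TRUE if the active plan runs futures in parallel
#   call_seed:        future.seed passed by the developer to future_*apply()
#   plan_seed:        future.seed set by the user on the plan (proposed)
resolve_future_seed <- function(plan_is_parallel, call_seed = NULL, plan_seed = NULL) {
  seed <- if (!is.null(plan_seed)) {
    plan_seed          # 1. a user-level setting on the plan wins
  } else if (!is.null(call_seed)) {
    call_seed          # 2. otherwise use what the developer asked for
  } else {
    plan_is_parallel   # 3. otherwise: on for parallel, off for sequential
  }
  if (!plan_is_parallel && isTRUE(seed)) {
    warning("future.seed = TRUE on a sequential plan adds overhead")
  }
  seed
}

resolve_future_seed(TRUE)                                       # TRUE
resolve_future_seed(FALSE)                                      # FALSE
resolve_future_seed(TRUE, call_seed = TRUE, plan_seed = FALSE)  # FALSE
```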
Collaborator

HenrikBengtsson commented Mar 30, 2020

Some quick comments:

  1. The overall design objective is that futures should give the exact same results regardless of backend. I believe that is one of the core strengths of the Future API. It minimizes surprises and helps developers and users focus on the task/analysis at hand without having to worry about various ifs and whats. This should also explain why I'm hesitant/conservative about introducing features to plan() where the user can potentially break the intention that the developer had in mind. Having said that, I'm constantly trying to figure out ways to allow for adjustments without breaking this objective.

  2. Regarding RNG in the map-reduce pattern: it is known that the current, very conservative, approach that future.apply takes, which pre-generates an RNG seed for each element processed, is time consuming. This overhead can be ignored in very long-running tasks, but for quicker ones it becomes a show stopper. There is an open future.apply issue (future.apply#20, "SPEED: Add support for per-chunk/per-future seeds") which would allow producing a single RNG seed per future. This would break perfect reproducibility, but would still be statistically sound. This is what parallel::mclapply() does by default. I believe this approach is safe to introduce, because it is in the control of the developer and not the user. This should solve your slowness issues.
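
The trade-off between per-element and per-chunk seeds can be sketched with base R's parallel::nextRNGStream(). This is an illustration of the two strategies under simplified assumptions (chunks are processed in a plain loop), not the code future.apply uses; make_streams(), per_element() and per_chunk() are names made up for the example.

```r
# Base R only. A sketch of the two seeding strategies, not future.apply's code.
old_kind <- RNGkind("L'Ecuyer-CMRG")
set.seed(42)
base_seed <- get(".Random.seed", envir = globalenv())

# Derive n consecutive L'Ecuyer-CMRG streams starting from `seed`.
make_streams <- function(n, seed) {
  streams <- vector("list", n)
  for (i in seq_len(n)) {
    streams[[i]] <- seed
    seed <- parallel::nextRNGStream(seed)
  }
  streams
}

# Install a stream as the current RNG state.
use_stream <- function(seed) assign(".Random.seed", seed, envir = globalenv())

n <- 6

# Per-element seeds: element i always uses stream i, however it is chunked.
per_element <- function(chunks) {
  streams <- make_streams(n, base_seed)
  out <- numeric(n)
  for (chunk in chunks) {
    for (i in chunk) {
      use_stream(streams[[i]])
      out[i] <- rnorm(1)
    }
  }
  out
}

# Per-chunk seeds: one stream per chunk, shared by the elements in that chunk.
per_chunk <- function(chunks) {
  streams <- make_streams(length(chunks), base_seed)
  out <- numeric(n)
  for (k in seq_along(chunks)) {
    use_stream(streams[[k]])
    for (i in chunks[[k]]) out[i] <- rnorm(1)
  }
  out
}

two_chunks <- list(1:3, 4:6)
three_chunks <- list(1:2, 3:4, 5:6)

same_per_element <- identical(per_element(two_chunks), per_element(three_chunks))
same_per_chunk <- identical(per_chunk(two_chunks), per_chunk(three_chunks))

RNGkind(old_kind[1])  # restore the previous RNG kind
```

In this sketch same_per_element is TRUE, while same_per_chunk is expected to be FALSE: with per-chunk seeds, element 3 is the third draw of the first stream under one chunking and the first draw of the second stream under the other. Every draw still comes from a proper L'Ecuyer-CMRG stream, which is the sense in which the cheaper scheme remains statistically sound while giving up exact reproducibility across chunkings.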
