Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RNG: Protect against misuse, i.e. using random numbers without setting up proper parallel RNG #353

Closed
HenrikBengtsson opened this issue Dec 23, 2019 · 6 comments

Comments

@HenrikBengtsson
Copy link
Collaborator

Proposal

Have the future framework detect when random numbers are used/generated/produced although no parallel RNG streams are in place. Make this check optional via an option. For example,

library(future.apply)
plan(multisession, workers = 2)

options(future.rng.onMisuse = "ignore")
set.seed(42)
y <- future_lapply(1:2, FUN = rnorm)
str(y)
## List of 2
##  $ : num -1.07
##  $ : num [1:2] 2.85 1.15

options(future.rng.onMisuse = "error")
set.seed(42)
y <- future_lapply(1:2, FUN = rnorm)
## Error: Future expression used the random number generator (RNG) without using having set up a proper parallel RNG seed

set.seed(42)
y <- future_lapply(1:2, FUN = rnorm, future.seed = TRUE)
str(y)
## List of 2
##  $ : num -0.169
##  $ : num [1:2] 0.649 1.193

The analogue for basic futures would be:

f <- future(rnorm(2))
v <- value(f)
## Error: Future expression used the random number generator (RNG) without using having set up a proper parallel RNG seed

RNGkind("L'Ecuyer-CMRG")
set.seed(42)
seed <- parallel::nextRNGStream(.Random.seed)
f <- future(rnorm(2), seed = seed)
v <- value(f)

where

  • future.rng.onMisuse: If random numbers are used in futures, then parallel (L'Ecuyer-CMRG) RNG should be used in order to get statistical sound RNGs. The defaults in the future framework assume that no random number generation (RNG) is taken place in the future expression because L'Ecuyer-CMRG RNGs come with an unnecessary overhead if not needed. To protect against mistakes, the future framework attempts to detect when random numbers are used despite L'Ecuyer-CMRG RNGs are not in place. If this is detected, and future.rng.onMisuse = "error", then an informative error message is produced. If "warning", then a warning message is produced. If "ignore"`, no check is performed.

BTW

The last future example above, reminds me that it would be convenient if we could use regular fixed RNG seed to generate L'Ecuyer-CMRG seeds, e.g.

f <- future(rnorm(2), seed = 42)

as well as

set.seed(42)
f <- future(rnorm(2), seed = TRUE)

Especially the latter would be useful since that is a common use pattern with future.apply.

See also

This idea was triggered by a Twitter discussion on 2019-11-11 (https://twitter.com/henrikbengtsson/status/1194007939725479937).

@HenrikBengtsson
Copy link
Collaborator Author

This is now implemented in the develop branch.

For example, one, proper, way to use RNG is futures is to use:

library(future)
options(future.rng.onMisuse = "warning")

f <- future(rnorm(1), seed=TRUE)
value(f)
## [1] 1.685746

If we had used

f <- future(rnorm(1), seed=FALSE)

we'd get:

value(f)
## Warning message:
## UNRELIABLE VALUE: Detected that random numbers were generated while future
## ('<none>') was resolved. Because future argument 'seed' was set to 'FALSE',
## those random numbers may not be statistical sound. To fix this, specify
## argument '[future.]seed', e.g. 'seed=TRUE'. To disable this check, set
## option 'future.rng.onMisuse' to "ignore". 
## [1] 0.2068498

The current default is:

[1] -0.4169564
f <- future(rnorm(1))
value(f)
## [1] 1.365723

which, for backward-compatible reasons, is equivalent too:

f <- future(rnorm(1), seed=NULL)
value(f)
## [1] 1.685746

Using seed=NULL, is the same as seed=FALSE without the RNG validation.

@HenrikBengtsson
Copy link
Collaborator Author

Further updates on detecting misuse of parallel RNG; with

remotes::install_github("HenrikBengtsson/future@develop")

and setting

options(future.rng.onMisuse = "warning")   ## or "error"

or

R_FUTURE_RNG_ON_MISUSE=warning

in, say, ~/.Renviron, we will now warning elsewhere in the future ecosystem.

For example, with future.apply:

> y <- future.apply::future_lapply(1:2, FUN = function(x) { rnorm(1) })
Warning message:
UNRELIABLE VALUE: Detected that random numbers were generated while future ('future_lapply-1')
was resolved. Because future argument 'seed' was set to 'FALSE', those random numbers
may not be statistical sound. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'.
To disable this check, set option 'future.rng.onMisuse' to "ignore".
>

To avoid this, we must explicitly request to use parallel RNG;

y <- future.apply::future_lapply(1:2, FUN = function(x) { rnorm(1) }, future.seed = TRUE)
>

Similar for foreach with doFuture;

> library(foreach)
> doFuture::registerDoFuture()
> y <- foreach(x = 1:2) %dopar% { rnorm(1); x }
Warning message:
UNRELIABLE VALUE: Detected that random numbers were generated while future ('<none>')
was resolved. Because future argument 'seed' was set to 'FALSE', those random numbers
may not be statistical sound. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'. 
To disable this check, set option 'future.rng.onMisuse' to "ignore". 
> 

The correct way here is (for now*) to use doRNG:

> library(foreach)
> doFuture::registerDoFuture()
> y <- foreach(x = 1:2) %dopar% { rnorm(1); x }
> doRNG::registerDoRNG()
> library(doRNG) ## https://github.com/renozao/doRNG/issues/13
> y <- foreach(x = 1:2) %dopar% { rnorm(1); x }
Warning message:
In list(args = (1:2)(), argnames = "x", evalenv = <environment>,  :
  Foreach loop had changed the current RNG type: RNG was restored to same type, next state
>

Comment: That warning comes from doRNG and is tracked in futureverse/doFuture#42

(*) I might add built-in parallel RNG support to doFuture in next-next-release.

For furrr, we have:

> y <- furrr::future_map(1:2, function(x) { rnorm(1); x })
Warning message:
UNRELIABLE VALUE: Detected that random numbers were generated while future ('<none>')
was resolved. Because future argument 'seed' was set to 'FALSE', those random numbers
may not be statistical sound. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'. 
To disable this check, set option 'future.rng.onMisuse' to "ignore". 
>

which the correction being:

> y <- furrr::future_map(1:2, function(x) { rnorm(1); x }, .options = furrr::future_options(seed = TRUE))
>

Obviously, if RNG is not used, there will be no warning, e.g.

> y <- future.apply::future_lapply(1:2, FUN = function(x) { x })
>

Roadmap

  1. In the next release, I'll have future.rng.onMisuse default to "ignore" to avoid introducing too many warnings to the end user all at once.
  2. Tell downstream packages to fix this mistake. I've already run some revdep checks with R_FUTURE_RNG_ON_MISUSE=error and found that some packages are missing seed = TRUE.
  3. Have the future package set future.rng.onMisuse="error" on load if (and only if) R CMD check is running. This will force downstream packages to fix the problem and prevent new ones from introducing the problem.
  4. Make future.rng.onMisuse="warning" the new default. This will help fix existing user scripts and maybe make users report problems they detect elsewhere.

I don't know when it's safe to set future.rng.onMisuse="error" as the default because we might get semi-false-positives such as the RNG state being updated when a package is loaded/attached, e.g. r-lib/withr#124 (comment).

cc/ h/t @pat-s

@pat-s
Copy link

pat-s commented Jan 14, 2020

This is a huge step forward :)

Roadmap

I like it a lot, especially setting future.rng.onMisuse="error" for pkg devs so that the setting populates!

Warning message content

Warning message:
UNRELIABLE VALUE: Detected that random numbers were generated while future ('future_lapply-1')
was resolved. Because future argument 'seed' was set to 'FALSE', those random numbers
may not be statistical sound. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'.
To disable this check, set option 'future.rng.onMisuse' to "ignore".

I am wondering if this is descriptive enough to the average user to take the right action/understand what#s going on. Maybe a little bit of extra context would help here to simplify taking action?

Warning message:
UNRELIABLE VALUE: Detected that random numbers were generated (either via set.seed() or by setting no seed at all) while future ('future_lapply-1')
was resolved. Because future argument 'seed' was set to 'FALSE', those random numbers
may not be statistical sound. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'. This will ensure reproducibility by using the "L'Ecuyer-CMRG"RNG kind in the background.
To disable this check, set option 'future.rng.onMisuse' to "ignore".

Maybe this warning could even include a link to a FAQ'isch like page which again links to more information about the whole topic.


I know you do not like the term "reproducible" in this context but my naming for the option would be something like

  • future.rng.reproducible
  • future.check.parallel.rng

Thanks again for the good work!

@HenrikBengtsson
Copy link
Collaborator Author

Thanks for the suggestions - yeah, that message is a bit clunky/technical. What about:

UNRELIABLE VALUE: Future ('future_lapply-1') unexpectedly generated random numbers without specifying argument '[future.]seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify argument argument '[future.]seed', e.g. 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, set option 'future.rng.onMisuse' to "ignore".

I'm avoiding mentioning set.seed() because it is ok to set that outside of the future, but not inside, and explaining that difference makes it more complicated.

I like the option name future.check.parallel.rng better. OTH, options with onNnn names makes is at bit cleared that they take "actions" as alternatives. I'll think about it.

@pat-s
Copy link

pat-s commented Jan 14, 2020

I'm avoiding mentioning set.seed() because it is ok to set that outside of the future, but not inside, and explaining that difference makes it more complicated.

Oh yeah, that might trigger confusion whether set.seed() might be something "bad" in fact.

OTH, options with onNnn names makes is at bit cleared that they take "actions" as alternatives. I'll think about it.

Agree. It's up to you, both have their pros and cons :)

@HenrikBengtsson
Copy link
Collaborator Author

FYI, use of RNG by mistake will produce a warning by default in the next release of {future}, cf. commit df11e64

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants