RNG: Protect against misuse, i.e. using random numbers without setting up proper parallel RNG #353

HenrikBengtsson · 2019-12-23T03:07:49Z

Proposal

Have the future framework detect when random numbers are used/generated/produced although no parallel RNG streams are in place. Make this check optional via an option. For example,

library(future.apply)
plan(multisession, workers = 2)

options(future.rng.onMisuse = "ignore")
set.seed(42)
y <- future_lapply(1:2, FUN = rnorm)
str(y)
## List of 2
##  $ : num -1.07
##  $ : num [1:2] 2.85 1.15

options(future.rng.onMisuse = "error")
set.seed(42)
y <- future_lapply(1:2, FUN = rnorm)
## Error: Future expression used the random number generator (RNG) without using having set up a proper parallel RNG seed

set.seed(42)
y <- future_lapply(1:2, FUN = rnorm, future.seed = TRUE)
str(y)
## List of 2
##  $ : num -0.169
##  $ : num [1:2] 0.649 1.193

The analogue for basic futures would be:

f <- future(rnorm(2))
v <- value(f)
## Error: Future expression used the random number generator (RNG) without using having set up a proper parallel RNG seed

RNGkind("L'Ecuyer-CMRG")
set.seed(42)
seed <- parallel::nextRNGStream(.Random.seed)
f <- future(rnorm(2), seed = seed)
v <- value(f)

where

future.rng.onMisuse: If random numbers are used in futures, then parallel (L'Ecuyer-CMRG) RNG should be used in order to get statistical sound RNGs. The defaults in the future framework assume that no random number generation (RNG) is taken place in the future expression because L'Ecuyer-CMRG RNGs come with an unnecessary overhead if not needed. To protect against mistakes, the future framework attempts to detect when random numbers are used despite L'Ecuyer-CMRG RNGs are not in place. If this is detected, and future.rng.onMisuse = "error", then an informative error message is produced. If "warning", then a warning message is produced. If "ignore"`, no check is performed.

BTW

The last future example above, reminds me that it would be convenient if we could use regular fixed RNG seed to generate L'Ecuyer-CMRG seeds, e.g.

f <- future(rnorm(2), seed = 42)

as well as

set.seed(42)
f <- future(rnorm(2), seed = TRUE)

Especially the latter would be useful since that is a common use pattern with future.apply.

value(f)
## Warning message:
## UNRELIABLE VALUE: Detected that random numbers were generated while future
## ('<none>') was resolved. Because future argument 'seed' was set to 'FALSE',
## those random numbers may not be statistical sound. To fix this, specify
## argument '[future.]seed', e.g. 'seed=TRUE'. To disable this check, set
## option 'future.rng.onMisuse' to "ignore". 
## [1] 0.2068498

The current default is:

[1] -0.4169564
f <- future(rnorm(1))
value(f)
## [1] 1.365723

which, for backward-compatible reasons, is equivalent too:

f <- future(rnorm(1), seed=NULL)
value(f)
## [1] 1.685746

Using seed=NULL, is the same as seed=FALSE without the RNG validation.

…E' [#353]

HenrikBengtsson · 2020-01-10T03:23:40Z

Further updates on detecting misuse of parallel RNG; with

remotes::install_github("HenrikBengtsson/future@develop")

and setting

options(future.rng.onMisuse = "warning")   ## or "error"

or

R_FUTURE_RNG_ON_MISUSE=warning

in, say, ~/.Renviron, we will now warning elsewhere in the future ecosystem.

For example, with future.apply:

> y <- future.apply::future_lapply(1:2, FUN = function(x) { rnorm(1) })
Warning message:
UNRELIABLE VALUE: Detected that random numbers were generated while future ('future_lapply-1')
was resolved. Because future argument 'seed' was set to 'FALSE', those random numbers
may not be statistical sound. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'.
To disable this check, set option 'future.rng.onMisuse' to "ignore".
>

To avoid this, we must explicitly request to use parallel RNG;

y <- future.apply::future_lapply(1:2, FUN = function(x) { rnorm(1) }, future.seed = TRUE)
>

Similar for foreach with doFuture;

> library(foreach)
> doFuture::registerDoFuture()
> y <- foreach(x = 1:2) %dopar% { rnorm(1); x }
Warning message:
UNRELIABLE VALUE: Detected that random numbers were generated while future ('<none>')
was resolved. Because future argument 'seed' was set to 'FALSE', those random numbers
may not be statistical sound. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'. 
To disable this check, set option 'future.rng.onMisuse' to "ignore". 
>

The correct way here is (for now*) to use doRNG:

> library(foreach)
> doFuture::registerDoFuture()
> y <- foreach(x = 1:2) %dopar% { rnorm(1); x }
> doRNG::registerDoRNG()
> library(doRNG) ## https://github.com/renozao/doRNG/issues/13
> y <- foreach(x = 1:2) %dopar% { rnorm(1); x }
Warning message:
In list(args = (1:2)(), argnames = "x", evalenv = <environment>,  :
  Foreach loop had changed the current RNG type: RNG was restored to same type, next state
>

Comment: That warning comes from doRNG and is tracked in futureverse/doFuture#42

(*) I might add built-in parallel RNG support to doFuture in next-next-release.

For furrr, we have:

> y <- furrr::future_map(1:2, function(x) { rnorm(1); x })
Warning message:
UNRELIABLE VALUE: Detected that random numbers were generated while future ('<none>')
was resolved. Because future argument 'seed' was set to 'FALSE', those random numbers
may not be statistical sound. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'. 
To disable this check, set option 'future.rng.onMisuse' to "ignore". 
>

which the correction being:

> y <- furrr::future_map(1:2, function(x) { rnorm(1); x }, .options = furrr::future_options(seed = TRUE))
>

Obviously, if RNG is not used, there will be no warning, e.g.

> y <- future.apply::future_lapply(1:2, FUN = function(x) { x })
>

Roadmap

In the next release, I'll have future.rng.onMisuse default to "ignore" to avoid introducing too many warnings to the end user all at once.
Tell downstream packages to fix this mistake. I've already run some revdep checks with R_FUTURE_RNG_ON_MISUSE=error and found that some packages are missing seed = TRUE.
Have the future package set future.rng.onMisuse="error" on load if (and only if) R CMD check is running. This will force downstream packages to fix the problem and prevent new ones from introducing the problem.
Make future.rng.onMisuse="warning" the new default. This will help fix existing user scripts and maybe make users report problems they detect elsewhere.

I don't know when it's safe to set future.rng.onMisuse="error" as the default because we might get semi-false-positives such as the RNG state being updated when a package is loaded/attached, e.g. r-lib/withr#124 (comment).

cc/ h/t @pat-s

pat-s · 2020-01-14T16:43:09Z

This is a huge step forward :)

Roadmap

I like it a lot, especially setting future.rng.onMisuse="error" for pkg devs so that the setting populates!

Warning message content

Warning message:
UNRELIABLE VALUE: Detected that random numbers were generated while future ('future_lapply-1')
was resolved. Because future argument 'seed' was set to 'FALSE', those random numbers
may not be statistical sound. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'.
To disable this check, set option 'future.rng.onMisuse' to "ignore".

I am wondering if this is descriptive enough to the average user to take the right action/understand what#s going on. Maybe a little bit of extra context would help here to simplify taking action?

Warning message:
UNRELIABLE VALUE: Detected that random numbers were generated (either via set.seed() or by setting no seed at all) while future ('future_lapply-1')
was resolved. Because future argument 'seed' was set to 'FALSE', those random numbers
may not be statistical sound. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'. This will ensure reproducibility by using the "L'Ecuyer-CMRG"RNG kind in the background.
To disable this check, set option 'future.rng.onMisuse' to "ignore".

Maybe this warning could even include a link to a FAQ'isch like page which again links to more information about the whole topic.

I know you do not like the term "reproducible" in this context but my naming for the option would be something like

future.rng.reproducible
future.check.parallel.rng

Thanks again for the good work!

HenrikBengtsson · 2020-01-14T17:22:05Z

Thanks for the suggestions - yeah, that message is a bit clunky/technical. What about:

UNRELIABLE VALUE: Future ('future_lapply-1') unexpectedly generated random numbers without specifying argument '[future.]seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify argument argument '[future.]seed', e.g. 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, set option 'future.rng.onMisuse' to "ignore".

I'm avoiding mentioning set.seed() because it is ok to set that outside of the future, but not inside, and explaining that difference makes it more complicated.

I like the option name future.check.parallel.rng better. OTH, options with onNnn names makes is at bit cleared that they take "actions" as alternatives. I'll think about it.

pat-s · 2020-01-14T20:16:02Z

I'm avoiding mentioning set.seed() because it is ok to set that outside of the future, but not inside, and explaining that difference makes it more complicated.

Oh yeah, that might trigger confusion whether set.seed() might be something "bad" in fact.

OTH, options with onNnn names makes is at bit cleared that they take "actions" as alternatives. I'll think about it.

Agree. It's up to you, both have their pros and cons :)

…ved to future orchestration errors [#353]

…353]

HenrikBengtsson · 2020-07-16T18:35:09Z

FYI, use of RNG by mistake will produce a warning by default in the next release of {future}, cf. commit df11e64

HenrikBengtsson added enhancement feature request labels Dec 23, 2019

HenrikBengtsson added this to the Next release milestone Dec 23, 2019

HenrikBengtsson added a commit that referenced this issue Dec 24, 2019

Add support for seed=42 and seed=TRUE/FALSE [#353]

9bdc242

HenrikBengtsson added a commit that referenced this issue Dec 24, 2019

FutureResult gained field 'rng' [#353]

375bb3e

HenrikBengtsson added a commit that referenced this issue Dec 26, 2019

Now value() detects if a future used RNG without specifying 'seed' [#353

73c8fb1

]

HenrikBengtsson added a commit that referenced this issue Dec 26, 2019

AD HOC: Avoid RNG checking when doRNG/%dorng% is in use [#353]

75f00da

HenrikBengtsson added a commit that referenced this issue Dec 29, 2019

Make seed = FALSE (was seed = NULL) the new default [#353]

68c996b

HenrikBengtsson added a commit that referenced this issue Dec 29, 2019

Option 'future.rng.onMisuse' default to env var 'R_FUTURE_RNG_ONMISUS…

4cbe404

…E' [#353]

HenrikBengtsson added a commit that referenced this issue Dec 29, 2019

Travis CI: Test with R_FUTURE_RNG_ONMISUSE=error [#353]

27549b5

HenrikBengtsson closed this as completed Jan 1, 2020

HenrikBengtsson added a commit that referenced this issue Jan 15, 2020

Don't produce Future{Error,Warning}:s on RNG mistakes; they are reser…

df1a1e6

…ved to future orchestration errors [#353]

HenrikBengtsson mentioned this issue Jan 16, 2020

Forgotten future.seed = TRUE in future_lapply() tlverse/origami#48

Closed

HenrikBengtsson added the RNG label Mar 30, 2020

HenrikBengtsson added a commit that referenced this issue Jul 16, 2020

R_FUTURE_RNG_ONMISUSE/future.rng.onMisuse now defaults to "warning" [#…

df11e64

…353]

a-torgovitsky mentioned this issue Sep 8, 2020

Something is wrong with the seed conroylau/lpinfer#82

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RNG: Protect against misuse, i.e. using random numbers without setting up proper parallel RNG #353

RNG: Protect against misuse, i.e. using random numbers without setting up proper parallel RNG #353

HenrikBengtsson commented Dec 23, 2019

HenrikBengtsson commented Dec 28, 2019

HenrikBengtsson commented Jan 10, 2020

pat-s commented Jan 14, 2020

HenrikBengtsson commented Jan 14, 2020

pat-s commented Jan 14, 2020

HenrikBengtsson commented Jul 16, 2020

RNG: Protect against misuse, i.e. using random numbers without setting up proper parallel RNG #353

RNG: Protect against misuse, i.e. using random numbers without setting up proper parallel RNG #353

Comments

HenrikBengtsson commented Dec 23, 2019

Proposal

BTW

See also

HenrikBengtsson commented Dec 28, 2019

HenrikBengtsson commented Jan 10, 2020

Roadmap

pat-s commented Jan 14, 2020

Roadmap

Warning message content

HenrikBengtsson commented Jan 14, 2020

pat-s commented Jan 14, 2020

HenrikBengtsson commented Jul 16, 2020