
More divergences with composed warmup-then-sample than combined warmup-and-sample #966

Closed
mike-lawrence opened this issue Jan 2, 2021 · 23 comments

Comments

@mike-lawrence

mike-lawrence commented Jan 2, 2021

Take the 8schools_ncp example:

library(cmdstanr)
example_program = 'schools_ncp.stan'
example_data = 'schools_ncp.data.json'
tmp <- file.path(tempdir(), example_program)
if (!file.exists(tmp)) {
  file.copy(system.file(example_program, package = "cmdstanr"), tmp)
}
mod <- cmdstan_model(tmp)
data_file <- system.file(example_data, package = "cmdstanr")

seed = 1
iter_warmup = 1e3
iter_sample = 1e3
parallel_chains = 4

#first the typical combined warmup-and-sample run:
both = mod$sample(
  data = data_file
  , chains = parallel_chains
  , parallel_chains = parallel_chains
  , refresh = 0
  , show_messages = FALSE
  , seed = seed
  , iter_warmup = iter_warmup
  , iter_sampling = iter_sample
)

Which yields no divergences. Now an attempt at doing warmup and sampling separately, using the adapted info from the former in the latter:

warmup = mod$sample(
  data = data_file
  , chains = parallel_chains
  , parallel_chains = parallel_chains
  , refresh = 0
  , show_messages = FALSE
  , seed = seed
  , iter_warmup = iter_warmup
  , save_warmup = TRUE # for inits
  , sig_figs = 18
  , iter_sampling = 0
)

get_sampling_inits_from_warmup = function(chain_id){
	warmup_draws = warmup$draws(inc_warmup = TRUE)
	final_warmup_value = warmup_draws[iter_warmup,chain_id,]
	init_list = as.list(final_warmup_value)
	names(init_list) = dimnames(final_warmup_value)[[3]]
	init_list = init_list[names(init_list)!='lp__']
	return(init_list)
}

samples = mod$sample(
  data = data_file
  , chains = parallel_chains
  , parallel_chains = parallel_chains
  , refresh = 0
  , show_messages = FALSE
  , seed = seed
  , iter_warmup = 0
  , adapt_engaged = FALSE
  , inv_metric = warmup$inv_metric(matrix = FALSE)
  , step_size = warmup$metadata()$step_size_adaptation
  , iter_sampling = iter_sample
  , init = get_sampling_inits_from_warmup
)

And now we get divergences from the sampling run. I have more rigorous testing of this detailed here, but the gist is that the warmup-then-sample approach somehow generates more divergences than the combined warmup-and-sample approach, despite all the inputs being, as far as I can examine, as expected. I don't believe this is an issue with cmdstanr's passing of the pertinent info to cmdstan, because I've checked the init jsons and output csvs and they have all the expected content.

@mike-lawrence
Author

mike-lawrence commented Jan 3, 2021

Actually, here's some data and a model that seem to show the difference more reliably:
divergence_demo.zip

And here's a script to loop over lots of seeds, showing that with nearly every one the number of divergences in the composed approach is far higher than those in the standard approach:
composed_divergences.zip

@mike-lawrence
Author

mike-lawrence commented Jan 3, 2021

And here are the functions I used to confirm that the inputs are as expected in the csv files (implying that this is not a cmdstanr problem):

check_inputs = function(chain){
	sampling_step_size = as.numeric(strsplit(readLines(samples$output_files()[chain])[42],'=')[[1]][2])
	warmup_step_size = warmup$metadata()$step_size_adaptation[chain]
	step_size_ok = dplyr::near(
		sampling_step_size
		, warmup_step_size
	)
	
	temp = readLines(samples$output_files()[chain])[44]
	temp = stringr::str_replace(temp,'#','')
	sampling_inv_metric = as.numeric(unlist(strsplit(temp,',')))
	warmup_inv_metric = unlist(warmup$inv_metric(matrix = FALSE)[chain])
	inv_metric_ok = all(dplyr::near(
		sampling_inv_metric
		, warmup_inv_metric
	))
	
	samples_inits = 
		(
			tibble::tibble(
				v = readLines(samples$metadata()$init[chain])
			)
			%>% dplyr::slice(2:(n()-1))
			%>% tidyr::separate(
				v
				, into = c('par','value')
				, sep = ':'
			)
			%>% dplyr::mutate(
				value = as.numeric(stringr::str_replace(value,',',''))
			)
			%>% dplyr::pull(value)
		)
	warmup_inits = as.numeric(get_sampling_inits_from_warmup(chain))
	inits_ok = all(dplyr::near(samples_inits,warmup_inits))
	return(tibble::tibble(chain,step_size_ok,inv_metric_ok,inits_ok))
}

purrr::map_dfr(1:parallel_chains,check_inputs)

@mitzimorris
Member

yes, this happens. it's not a bug - it's an artifact of the sampler trying to take the specified initial parameter values and meet model constraints. also, even when specifying the seed, the RNG is in a different state.

@mitzimorris
Member

reopening - @bbbales2 et al can investigate.

@mitzimorris mitzimorris reopened this Jan 3, 2021
@mike-lawrence
Author

Here's a gist containing the stan file, the code to generate data and the code to loop over seeds, generating new data and attempting to sample in the two ways for each seed.

Data and model are a standard hierarchical setup: 10 subjects, each observed with common Gaussian measurement noise 10 times in each of 2 conditions, with the subjects' intercept and condition effect modelled as multivariate normal.

@mike-lawrence
Author

Huh, running that just crashed my rstudio, but only once it was about halfway done. I'll try again and see if it continues to occur...

@mike-lawrence
Author

Ran it a few times now and no more crashes, so possibly the crash was unrelated (though I didn't have much else happening on my computer at the time of the crash; plenty of free RAM and CPU).

@mike-lawrence
Author

Note: confirmed to occur with cmdstanpy as well as cmdstanr, reinforcing that this is core to cmdstan itself. I also confirmed it persists when using the development version of cmdstan.

@ahartikainen
Contributor

Should we extract the cmd used for calling cmdstan?

In cmdstanpy

fit.runset.cmd

@mitzimorris
Member

there's a slight complication because the command is composed on the fly w/ the chain id - cmdstan_call(chain_id)
or something like that

@mike-lawrence
Author

@mitzimorris @jgabry @rok-cesnovar Tagging you as core interface devs. @bbbales2 investigated this, using my hierarchical model, where I observe the most reliably increased divergence rates. He confirmed that the interfaces are creating an init json with the proper info, but found that if he tweaked the structure of these jsons, the increased-divergences behaviour goes away.

Here's what both interfaces produce for the inits json:

{
  "chol_corr[1,1]": 1,
  "chol_corr[2,1]": -0.469573042026972,
  "chol_corr[1,2]": 0,
  "chol_corr[2,2]": 0.882893627908559,
  "mean_coef_[1]": -0.0305049133852781,
  "mean_coef_[2]": 0.164671727959854,
  "sd_coef_[1]": 0.00665194669853791,
  "sd_coef_[2]": 0.148205452770937,
  "multi_normal_helper[1,1]": -0.0216987316133846,
  "multi_normal_helper[2,1]": 0.724235080770058,
  "multi_normal_helper[1,2]": 0.43239986307413,
  "multi_normal_helper[2,2]": 1.75514903811538,
  "multi_normal_helper[1,3]": -0.691867572248652,
  "multi_normal_helper[2,3]": -2.66409204621652,
  "multi_normal_helper[1,4]": -0.395599105436074,
  "multi_normal_helper[2,4]": -0.231732460746632,
  "multi_normal_helper[1,5]": 0.439963477396995,
  "multi_normal_helper[2,5]": -0.460156658379562,
  "multi_normal_helper[1,6]": 1.46417143002824,
  "multi_normal_helper[2,6]": -1.2165899374329,
  "multi_normal_helper[1,7]": 1.54988112300058,
  "multi_normal_helper[2,7]": -0.90213563970599,
  "multi_normal_helper[1,8]": 0.740397448247888,
  "multi_normal_helper[2,8]": 1.32996651536118,
  "multi_normal_helper[1,9]": 0.456738742343979,
  "multi_normal_helper[2,9]": -0.306703562089387,
  "multi_normal_helper[1,10]": -0.530873450669662,
  "multi_normal_helper[2,10]": 1.49202297709515,
  "noise_": 0.981719532981492
}

And here's the alternative structure that he observes as solving the issue:

{
    "chol_corr": [[1, 0], [-0.469573042026972, 0.882893627908559]],
    "mean_coef_": [-0.0305049133852781, 0.164671727959854],
    "sd_coef_": [0.00665194669853791, 0.148205452770937],
    "multi_normal_helper": [[-0.0216987316133846, 0.43239986307413, -0.691867572248652, -0.395599105436074, 0.439963477396995, 1.46417143002824, 1.54988112300058, 0.740397448247888, 0.456738742343979, -0.530873450669662],
			    [0.724235080770058, 1.75514903811538, -2.66409204621652, -0.231732460746632, -0.460156658379562, -1.2165899374329, -0.90213563970599, 1.32996651536118, -0.306703562089387, 1.49202297709515]],
    "noise_": 0.981719532981492
}

So is this something that needs to be addressed in the interfaces, or in cmdstan?
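For reference, the regrouping from the flat per-element form to the nested form can be sketched in plain Python (the helper name `regroup_inits` is mine, not part of either interface; it assumes complete 1-based `name[row,col]` keys and only handles the scalar, vector, and matrix shapes appearing in the json above):

```python
import re
from collections import defaultdict

def regroup_inits(flat):
    """Regroup flat keys like 'chol_corr[2,1]' into nested lists."""
    grouped = defaultdict(dict)  # name -> {index tuple -> value}
    out = {}
    for key, value in flat.items():
        m = re.fullmatch(r"(\w+)\[([\d,]+)\]", key)
        if m is None:  # plain scalar, e.g. 'noise_'
            out[key] = value
            continue
        idx = tuple(int(i) for i in m.group(2).split(","))
        grouped[m.group(1)][idx] = value
    for name, entries in grouped.items():
        ndim = len(next(iter(entries)))
        dims = tuple(max(ix[d] for ix in entries) for d in range(ndim))
        if ndim == 1:  # vector
            out[name] = [entries[(i + 1,)] for i in range(dims[0])]
        else:          # matrix, row-major
            out[name] = [[entries[(r + 1, c + 1)] for c in range(dims[1])]
                         for r in range(dims[0])]
    return out

flat = {"chol_corr[1,1]": 1, "chol_corr[2,1]": -0.47,
        "chol_corr[1,2]": 0, "chol_corr[2,2]": 0.88,
        "mean_coef_[1]": -0.03, "mean_coef_[2]": 0.16,
        "noise_": 0.98}
print(regroup_inits(flat))
```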

@rok-cesnovar
Member

In that case it's a bug in the interfaces. The second example is how it should be.

@ahartikainen
Contributor

At least CmdStanPy doesn't create multidimensional objects yet.
Is there a possibility to fix this on the CmdStan side?

@rok-cesnovar
Member

rok-cesnovar commented Jan 13, 2021

If we change this in CmdStan, that will not be backwards compatible.

@bbbales2
Member

Maybe we could print warnings about unused elements of the JSON in cmdstan? That way the first json would at least emit a ton of messages and someone would know something is up.

These initial conditions have always been a bit annoying -- there's no part of the process that gives you feedback and says you set your inits right, which always leaves me wondering.

At least CmdStanPy doesn't create multidimensional objects yet.

Is this difficulty from parsing output draws and then trying to format them back into multidimensional objects again?

@rok-cesnovar
Member

Maybe we could print warnings about unused elements of the JSON in cmdstan?

We need to warn that only a part of a parameter is being initialized from supplied data. But we can only do that in cmdstan and it probably isn't easy.

Is this difficulty from parsing output draws and then trying to format them back into multidimensional objects again?

Yes, essentially the extract functionality of rstan is missing. If you pass a manually constructed matrix as init in a sample call, it will work fine. I am going to close this and we can figure it out in cmdstanr/py; we already have issues open for it, they just might need a bigger priority.

@mitzimorris
Member

In that case it's a bug in the interfaces. The second example is how it should be.

could @bbbales2 or @rok-cesnovar explain exactly what the bug in the interfaces is here?
what elements of JSON are unused and why?

@bbbales2
Member

I don't think it's a bug so much as we should probably give some better feedback about what is happening.
If someone is trying to initialize the parameter chol_corr in their model and they supply this:

{
  "chol_corr[1,1]": 1,
  "chol_corr[2,1]": -0.469573042026972,
  "chol_corr[1,2]": 0,
  "chol_corr[2,2]": 0.882893627908559
}

cmdstan silently accepts it but still keeps random initialization for chol_corr (it doesn't recognize that the user is actually trying to initialize chol_corr).

If you supply this instead it will work:

{
    "chol_corr": [[1, 0], [-0.469573042026972, 0.882893627908559]]
}

@bbbales2
Member

And the reason I think we should give better feedback is it took us a lot of time to figure out what was going wrong here, even if we were doing something that, by the book, was simply wrong. Hard to debug.

@mike-lawrence
Author

Certainly I think that if inits are supplied but are found to be in an unexpected format, a warning should be emitted rather than silently using new random inits.

FYI, here's a super kludgey update to my get_inits() function that at least makes things work for vectors and matrices (so far as I've tested, which admittedly hasn't been much yet):

get_inits = function(chain_id){
  warmup_draws = warmup$draws(inc_warmup = TRUE)
  final_warmup_value = warmup_draws[iter_warmup,chain_id,2:(dim(warmup_draws)[3])]
  (
    final_warmup_value
    %>% tibble::as_tibble(
      .name_repair = function(names){
        dimnames(final_warmup_value)$variable
      }
    )
    %>% tidyr::pivot_longer(cols=dplyr::everything())
    %>% tidyr::separate(
      name
      , into = c('variable','index')
      , sep = "\\["
      , fill = 'right'
    )
    %>% dplyr::group_split(variable)
    %>% purrr::map(
      .f = function(x){
        (
          x
          %>% dplyr::mutate(
            index = stringr::str_replace(index,']','')
          )
          %>% tidyr::separate(
            index
            , into = c('first','second')
            , sep = ","
            , fill = 'right'
      , convert = TRUE
          )
          %>% dplyr::arrange(first,second)
          %>% (function(x){
            out = list()
            if(all(is.na(x$second))){
              out[[1]] = x$value
            }else{
              out[[1]] = matrix(
                x$value
                , nrow = max(x$first)
                , ncol = max(x$second)
              )
            }
            names(out) = x$variable[1]
            return(out)
          })
        )         
      }
    )
    %>% unlist(recursive=F)
    %>% return()
  )
}

@mitzimorris
Member

many thanks, now I understand.
this is a fundamental problem for the following reasons:

  1. there is no requirement that the user supply initial parameter values for all parameters
  2. the stan::io::var_context object can contain extra variables
  3. the CmdStan interface doesn't provide a way to interact with the instantiated model directly - if it did, then it would be possible to follow the approach taken by standalone generated quantities to identify the model parameters.

the best way to address this problem is better documentation - I think that's what we're suggesting here, correct?

@bbbales2
Member

Wow, that combination of #2 and #3 is a problem. Ouch ouch ouch. I'd like some sort of more active fix than docs (since it tricked me).

Best I can think of off the cuff is cmdstanr/cmdstanpy error if a variable name created in the inits isn't a valid Stan variable name. For instance a[5] can't be a Stan variable name.
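A rough sketch of that check in Python (the helper name `check_init_names` is hypothetical; it only tests the basic identifier shape, a letter followed by letters, digits, and underscores, which is enough to catch keys like a[5]):

```python
import re

# Basic shape of a Stan identifier: a letter, then letters, digits, and
# underscores. (Stan additionally forbids names ending in '__' and reserved
# words; this sketch checks only the shape.)
STAN_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*\Z")

def check_init_names(inits):
    """Return the init keys that cannot possibly be Stan variable names."""
    return [name for name in inits if not STAN_IDENTIFIER.match(name)]

print(check_init_names({"chol_corr[1,1]": 1.0, "noise_": 0.98}))
```

An interface could error (or at least warn) whenever this returns a non-empty list, before cmdstan silently ignores those keys.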

@mitzimorris
Member

Best I can think of off the cuff is cmdstanr/cmdstanpy error if a variable name created in the inits isn't a valid Stan variable name. For instance a[5] can't be a Stan variable name.

brilliant and easy!
