Add $variables() #519

Merged
merged 19 commits into from
Aug 17, 2021

rok-cesnovar
Member

Summary

This PR will attempt to add $variables(). First we need to settle what will be the args (if any) and what it will return.

Current suggestion is:

library(cmdstanr)

code <- "
data {
  int<lower=0> N;
  int<lower=0> K;
  int<lower=0,upper=1> y[N];
  matrix[N, K] X;
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  target += normal_lpdf(alpha | 0, 1);
  target += normal_lpdf(beta | 0, 1);
  target += bernoulli_logit_glm_lpmf(y | X, alpha, beta);
}
generated quantities {
  vector[N] log_lik;
  for (n in 1:N) log_lik[n] = bernoulli_logit_lpmf(y[n] | alpha + X[n] * beta);
}
"

mod <- cmdstan_model(write_stan_file(code))
mod$variables()
$data
$data$N
$data$N$type
[1] "int"

$data$N$dimensions
[1] 0


$data$K
$data$K$type
[1] "int"

$data$K$dimensions
[1] 0


$data$y
$data$y$type
[1] "int"

$data$y$dimensions
[1] 1


$data$X
$data$X$type
[1] "real"

$data$X$dimensions
[1] 2



$parameters
$parameters$alpha
$parameters$alpha$type
[1] "real"

$parameters$alpha$dimensions
[1] 0


$parameters$beta
$parameters$beta$type
[1] "real"

$parameters$beta$dimensions
[1] 1



$transformed_parameters
NULL

$generated_quantities
$generated_quantities$log_lik
$generated_quantities$log_lik$type
[1] "real"

$generated_quantities$log_lik$dimensions
[1] 1
> names(mod$variables())
[1] "data"                   "parameters"             "transformed_parameters" "generated_quantities"  
> names(mod$variables()$data)
[1] "N" "K" "y" "X"
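
A nested list like the one above can be awkward to scan. The following is a minimal sketch, assuming the return value keeps the block/variable/`type`/`dimensions` layout shown in the suggestion, of flattening it into one data frame. The `vars` object is a hand-built stand-in for `mod$variables()` (no CmdStan call is made), and `flatten_variables()` is a hypothetical helper, not part of cmdstanr.

```r
# Hand-built stand-in for the list printed above.
vars <- list(
  data = list(
    N = list(type = "int", dimensions = 0),
    K = list(type = "int", dimensions = 0),
    y = list(type = "int", dimensions = 1),
    X = list(type = "real", dimensions = 2)
  ),
  parameters = list(
    alpha = list(type = "real", dimensions = 0),
    beta = list(type = "real", dimensions = 1)
  ),
  transformed_parameters = NULL,
  generated_quantities = list(
    log_lik = list(type = "real", dimensions = 1)
  )
)

# Hypothetical helper: one row per variable, empty blocks are skipped.
flatten_variables <- function(vars) {
  rows <- lapply(names(vars), function(block) {
    block_vars <- vars[[block]]
    if (length(block_vars) == 0) return(NULL)
    data.frame(
      block = block,
      name = names(block_vars),
      type = unname(vapply(block_vars, function(v) v$type, character(1))),
      dimensions = unname(vapply(block_vars, function(v) v$dimensions, numeric(1))),
      stringsAsFactors = FALSE
    )
  })
  # Drop the NULL entries produced by empty blocks before binding.
  do.call(rbind, Filter(Negate(is.null), rows))
}

tbl <- flatten_variables(vars)
print(tbl)
```

The table form makes it easy to filter, for example `tbl[tbl$block == "data", ]` for the inputs only.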

Copyright and Licensing

Please list the copyright holder for the work you are submitting
(this will be you or your assignee, such as a university or company):
Rok Češnovar

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the following licenses:

@codecov-commenter

codecov-commenter commented Jun 18, 2021

Codecov Report

Merging #519 (e6646f8) into master (0872512) will decrease coverage by 1.12%.
The diff coverage is 96.00%.

❗ Current head e6646f8 differs from pull request most recent head fa98252. Consider uploading reports for the commit fa98252 to get more accurate results

@@            Coverage Diff             @@
##           master     #519      +/-   ##
==========================================
- Coverage   92.77%   91.65%   -1.13%     
==========================================
  Files          12       12              
  Lines        3047     3031      -16     
==========================================
- Hits         2827     2778      -49     
- Misses        220      253      +33     
Impacted Files   Coverage Δ
R/model.R        91.20% <95.83%>  (-0.54%) ⬇️
R/csv.R          98.21% <100.00%> (-0.45%) ⬇️
R/install.R      63.27% <0.00%>   (-5.56%) ⬇️
R/run.R          94.07% <0.00%>   (-1.65%) ⬇️
R/utils.R        90.71% <0.00%>   (-1.43%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0872512...fa98252. Read the comment docs.

@jgabry
Member

jgabry commented Jun 18, 2021

First we need to settle what will be the args (if any) and what it will return.

I like your current suggestion. Just a few thoughts:

  • Will the use of the name "dimensions" here be confusing? For example if we have vector[100] x then fit$metadata()$stan_variable_dims$x would say 100 but variables() would say "dimensions" is 1. Both make sense, but they're using "dimensions" (or "dims" in the case of metadata()) to mean two different things.

  • I don't think we absolutely need any arguments but we could add an argument block. That way we could do mod$variables("parameters") instead of mod$variables()$parameters. But that doesn't really matter too much.
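
The two meanings of "dimensions" in the first bullet can be made concrete with a small base R sketch. The `sizes` list is hypothetical: it records the declared extent of each dimension for three made-up variables, and representing a scalar as an empty vector is an assumption of this sketch rather than something either method is known to do.

```r
# Hypothetical declared sizes for three Stan variables:
#   real alpha;  vector[100] x;  matrix[10, 4] X;
sizes <- list(alpha = integer(0), x = 100L, X = c(10L, 4L))

# Meaning 1: the number of dimensions (the sense used by variables()).
n_dims <- vapply(sizes, length, integer(1))

# Meaning 2: the extent of each dimension (the sense used by the fit metadata),
# summarised here as the total number of elements.
n_elements <- vapply(sizes, function(s) as.integer(prod(s)), integer(1))

print(n_dims)
print(n_elements)
```

For `x`, the first meaning gives 1 and the second gives 100, which is exactly the mismatch the bullet describes.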

And a few questions more related to CmdStan and stanc3 than CmdStanR:

  • Is the "--info" option documented anywhere for stanc3? I couldn't find it so I opened an issue: CmdStan: New command line option in 2.27 is not documented docs#372

  • Is there any way to get more detailed type information from stanc3? That is, if I have a simplex or correlation matrix, is it possible to get it to give me those types instead of just "real"?

@rok-cesnovar
Member Author

Will the use of the name "dimensions" here be confusing

Could be, yes. Good call. Any other suggestions? dimension_length? dims_length? Anything else?

we could add an argument block

I like it!

Is the "--info" option documented anywhere for stanc3?

That is missing yes. Thanks for the issue.

Is there any way to get more detailed type information from stanc3?

Not at the moment, but we can definitely add to what --info returns.

@jgabry
Member

jgabry commented Jun 18, 2021

Could be, yes. Good call. Any other suggestions? dimension_length? dims_length? Anything else?

Maybe n_dims or num_dims?

Alternatively, we could keep dimensions here and instead change stan_variable_dims to stan_variable_sizes or stan_variable_elements?

@rok-cesnovar
Member Author

rok-cesnovar commented Jul 10, 2021

This is now ready for review finally. Changes:

  • removed the block arg
  • added basic docs
  • changed stan_variable_dims to stan_variable_sizes
  • reorganized the return a bit

)
variables <- jsonlite::read_json(out_file, na = "null")
variables$data <- variables$inputs
variables$inputs <- NULL
Member Author

I think data is a better name here than inputs.
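
One side effect of the assign-then-delete rename in the snippet above is that the renamed element moves to the end of the list. The sketch below compares it with renaming in place. The `variables` list is a hand-built stand-in for the parsed JSON, and its contents are assumptions made for illustration.

```r
# Stand-in for the parsed JSON; the contents are assumptions for illustration.
variables <- list(
  inputs = list(N = list(type = "int", dimensions = 0)),
  parameters = list(theta = list(type = "real", dimensions = 0))
)

# Approach from the diff: assign a new element, then drop the old one.
by_assignment <- variables
by_assignment$data <- by_assignment$inputs
by_assignment$inputs <- NULL
names(by_assignment)   # "parameters" "data"

# Alternative: rename in place, which keeps the original position.
in_place <- variables
names(in_place)[names(in_place) == "inputs"] <- "data"
names(in_place)        # "data" "parameters"
```

Both versions hold the same content, so name-based access such as `$data` behaves identically. The order only matters for printing or positional indexing.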

expect_equal(mod$variables()$parameters$theta$type, "real")
expect_equal(mod$variables()$parameters$theta$dimensions, 0)
expect_equal(length(mod$variables()[["transformed parameters"]]), 0)
expect_equal(length(mod$variables()[["generated quantities"]]), 0)
Member Author

Should we rename these two to a name with an underscore?

@mitzimorris
Member

mitzimorris commented Jul 14, 2021

regarding names - CmdStanPy and CmdStanR have somewhat diverged on the methods for the CmdStanMCMC object w/r/t the functions which return the draws for the sampler output variables.

I would love to come up with a good and consistent set of names for these functions on both the CmdStanModel and CmdStanMCMC objects. the difference in dimensionality may be confusing to users: a Stan variable as declared in the model has N dimensions, but in the sampler output it has N+1 dimensions, because it is an array of length draws where each array element has N dimensions. that's as concisely as I can put it, but is it concise or confusing?
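
The N versus N+1 point can be illustrated with plain base R arrays. This is only a sketch of the idea: it does not reflect how CmdStanR or CmdStanPy actually store draws, and the variable `m` and the number of draws are made up.

```r
n_draws <- 5

# One value of a variable declared as `matrix[2, 3] m` has two dimensions.
single_value <- matrix(0, nrow = 2, ncol = 3)
length(dim(single_value))  # 2

# Stacking n_draws such values adds a leading "draws" dimension.
draws <- array(0, dim = c(n_draws, dim(single_value)))
length(dim(draws))         # 3
dim(draws)                 # 5 2 3
```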

Also, CmdStanMCMC object in CmdStanPy makes a distinction between sampler_vars corresponding to lp__ and friends, and stan_vars which are only the model variables which are output by the write_array function, i.e., variables in the parameters, transformed parameters, and generated quantities block. OTOH, from the model stan_variables returns data and transformed data variables as well. Getting the names and dimensions of the data variables is useful. Getting the names and dimensions of the transformed data variables is not particularly useful. Should we make a distinction between data variables and output variables?

@rok-cesnovar
Member Author

that's as concisely as I can put it, but is it concise or confusing?

I think it's concise.

Also, CmdStanMCMC object in CmdStanPy makes a distinction between sampler_vars corresponding to lp__ and friends, and stan_vars which are only the model variables which are output by the write_array function, i.e., variables in the parameters, transformed parameters, and generated quantities block.

Yes, CmdStanMCMC does this similarly, just with $draws(), which returns draws for all the parameters, transformed parameters, and generated quantities, but also includes lp__. The rest of the MCMC diagnostics are returned by sampler_diagnostics. At the time, lp__ was considered to be more of a regular variable than a sampler diagnostic, which I tend to agree with.

Should we make a distinction between data variables and output variables?

I agree that transformed data names are not relevant, as they occur in neither the input nor the output. They are not present here.

The purpose of $variables() in CmdStanModel is to return information on all the input and output model variables. It does not return any values (that would be part of CmdStanMCMC); it just returns the names, dimensionalities, and types of the variables (whether they require integers or reals).

@mitzimorris
Member

mitzimorris commented Jul 14, 2021

The purpose of $variables() in CmdStanModel is to return information on all the input and output model variables. It does not return any values (that would be part of CmdStanMCMC); it just returns the names, dimensionalities, and types of the variables (whether they require integers or reals).

right, this is good and useful. as it's a method on the CmdStanModel object, the name model_variables would be both long and redundant, nonetheless, variables seems vague. what about splitting this into data_vars and output_vars or something like that?

@rok-cesnovar
Member Author

Sure, no issue in splitting, we discussed this in the issue I think as well.

Options:
a) $data_variables(), $output_variables()
b) $input_variables(), $output_variables()
c) $data_vars(), $output_vars()
d) $input_vars(), $output_vars()

Not a huge fan of the vars shorthand, but it seems that is where cmdstanpy is headed, so maybe we should follow as well. I don't have a strong preference though. @jgabry might have one?
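
For comparison, here is a rough sketch of what the split in option (a) could look like when layered on top of a single `variables()`-style list. Both functions are hypothetical free-standing helpers, not cmdstanr methods, and `vars` is a hand-built stand-in for the model's variable information.

```r
# Hand-built stand-in for a variables()-style list.
vars <- list(
  data = list(
    N = list(type = "int", dimensions = 0),
    K = list(type = "int", dimensions = 0)
  ),
  parameters = list(alpha = list(type = "real", dimensions = 0)),
  generated_quantities = list(log_lik = list(type = "real", dimensions = 1))
)

# Hypothetical helper: the input side is just the data block.
data_variables <- function(vars) vars[["data"]]

# Hypothetical helper: the output side is every non-data block that is present.
output_variables <- function(vars) {
  output_blocks <- c("parameters", "transformed_parameters", "generated_quantities")
  vars[intersect(output_blocks, names(vars))]
}

names(data_variables(vars))
names(output_variables(vars))
```

Either naming scheme exposes the same information, so the choice is mostly about discoverability.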

@mitzimorris
Member

mitzimorris commented Jul 14, 2021

I'm happy to go with whatever consensus we can find.
I have no objection to changing vars to variables and actually prefer the latter.
similarly, can see arguments for either data or input.
the inference methods all have argument data and the block is named data, but inasmuch as everything's data, calling it input data also makes sense.

discuss on forums?

until we get to 1.0, we can change names - but only if it's for the better.

@jgabry
Member

jgabry commented Aug 5, 2021

Sure, no issue in splitting, we discussed this in the issue I think as well.

Options:
a) $data_variables(), $output_variables()
b) $input_variables(), $output_variables()
c) $data_vars(), $output_vars()
d) $input_vars(), $output_vars()

Not a huge fan of the vars shorthand, but it seems that is where cmdstanpy is headed, so maybe we should follow as well. I don't have a strong preference though. @jgabry might have one?

Sorry for the delay on this! I totally missed these comments earlier.

I think I prefer variables over vars. And I slightly favor data instead of inputs.

@jgabry
Member

jgabry commented Aug 5, 2021

I think I prefer variables over vars. And I slightly favor data instead of inputs.

But neither preference is super super strong

@mitzimorris
Member

cmdstanpy has stan_variables and method_variables
the latter is for things like lp__
and stan_variables are the model variables that are written to the Stan CSV file - i.e.,
parameters, transformed parameters, and generated quantities.

@mitzimorris
Member

are data_variables always available? in CmdStanPy, if you reconstitute an object from a Stan CSV file, you don't have access to the input data.

@jgabry
Member

jgabry commented Aug 5, 2021

are data_variables always available? in CmdStanPy, if you reconstitute an object from a Stan CSV file, you don't have access to the input data.

Good point. I think this is true for CmdStanR too

@rok-cesnovar
Member Author

are data_variables always available?

mod$data_variables() and mod$output_variables() will only return the information on the input and output model variables, not the actual values. So this will return names, dimensionalities and types. Given that it's part of the model class, it's not possible to return anything else.

Not sure why availability is the question here? Maybe I am misunderstanding, which is always an option :)

@mitzimorris
Member

mitzimorris commented Aug 6, 2021 via email

@jgabry
Member

jgabry commented Aug 6, 2021

Not sure why availability is the question here? Maybe I am misunderstanding, which is always an option :)

No you're right. In my haste I got confused, which tends to happen :)

@jgabry
Member

jgabry commented Aug 6, 2021

given that the model code is available, why are these methods needed?

I'm not sure it's absolutely needed, but it can be helpful. For example, it would allow us to solve this: #513

@mitzimorris
Member

mitzimorris commented Aug 6, 2021

I'm not sure it's absolutely needed, but it can be helpful. For example, it would allow us to solve this: #513

if that's the case, then you should add methods. data_variables, parameter_variables, and generated_quantities variables. and maybe transformed_data and transformed_params too.

how about variables as function name, and arg block allowing folks to examine just the data block, etc.

@rok-cesnovar
Member Author

how about variables as function name, and arg block allowing folks to examine just the data block, etc.

Yes, this is basically what is currently implemented. I had the block arg initially, but I felt it was redundant, as one can simply do mod$variables()$data or mod$variables()[["data"]].
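
To make the trade-off concrete, this is roughly what a `block` argument would add over plain list indexing. `get_variables()` is a hypothetical wrapper written for this sketch, not the implementation that was removed from the PR, and `vars` is a hand-built stand-in.

```r
# Hand-built stand-in for a variables()-style list.
vars <- list(
  data = list(N = list(type = "int", dimensions = 0)),
  parameters = list(alpha = list(type = "real", dimensions = 0))
)

# Hypothetical wrapper: `block = NULL` returns everything, otherwise one block.
get_variables <- function(vars, block = NULL) {
  if (is.null(block)) return(vars)
  # match.arg() gives a clear error for an unknown block name.
  block <- match.arg(block, choices = names(vars))
  vars[[block]]
}

identical(get_variables(vars, "parameters"), vars$parameters)
identical(get_variables(vars, "parameters"), vars[["parameters"]])
```

The only behavioral difference in this sketch is validation: an unknown block name raises an error in the wrapper, while `vars$typo` silently returns `NULL`.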

Member

@jgabry jgabry left a comment

Looks good, just the one comment about the missing transformed parameters.

```{r variable-type-dims}
variables$data$J
variables$data$sigma
variables$transformed_parameters$theta
```
Member

@rok-cesnovar When I run this I get NULL for the transformed parameters even though it should have theta. Do you get that too?

Member Author

Yeah, sorry, should have used

variables$`transformed parameters`
variables$`generated quantities`
variables[["generated quantities"]]

Member Author

This was leftover from an earlier version when I replaced the space with underscore. I am not sure which one is better. Space is more similar to the actual block name in the Stan model, underscore is easier to use with $. I am fine with either.
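
The difference between the two naming styles shows up as soon as `$` is used. The sketch below uses a hand-built list with space-separated block names (its contents are assumptions for illustration) and then renames the blocks with `gsub()`.

```r
# Hand-built stand-in with space-separated block names.
variables <- list(
  data = list(),
  parameters = list(),
  `transformed parameters` = list(theta = list(type = "real", dimensions = 1)),
  `generated quantities` = list()
)

variables$transformed_parameters       # NULL: no element has this name
variables$`transformed parameters`     # works, but needs backticks
variables[["transformed parameters"]]  # works with a quoted string

# Replace spaces with underscores so plain `$` access works.
names(variables) <- gsub(" ", "_", names(variables), fixed = TRUE)
variables$transformed_parameters$theta$type  # "real"
```

This also explains the `NULL` reported in the review: `$` with an underscore name does not match an element whose name contains a space.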

Member

underscore is easier to use with $. I am fine with either.

I think I prefer the underscore if that's ok with you. Sorry for the extra work!

Member Author

Not a problem at all. Done.

Member

@jgabry jgabry left a comment

Thanks! Changes look good. I'll go ahead and merge now.

@jgabry jgabry merged commit 58f0980 into master Aug 17, 2021
@jgabry jgabry deleted the add_model_variables branch August 17, 2021 18:00

4 participants