Add vignette on benchmarking model options #695

jamesmbaazam · 2024-06-13T19:07:51Z

Description

This PR closes #629.

Initial submission checklist

My PR is based on a package issue and I have explicitly linked it.
I have tested my changes locally (using devtools::test() and devtools::check()).
I have added or updated unit tests where necessary.
I have updated the documentation if required and rebuilt docs if yes (using devtools::document()).
I have followed the established coding standards (and checked using lintr::lint_package()).
I have added a news item linked to this PR.

After the initial Pull Request

I have reviewed Checks for this PR and addressed any issues as far as I am able.

jamesmbaazam · 2024-06-17T14:42:32Z

As an update, the {rstan} models are working fine but the {cmdstanr} models are giving various errors that I will report on.

Error when using laplace() method from {cmdstanr}.

library(EpiNow2)
library(cmdstanr)
# Set the number of cores to use
options(mc.cores = 4)

# Generation time
generation_time <- Gamma(
  shape = Normal(1.3, 0.3),
  rate = Normal(0.37, 0.09),
  max = 14
)

# Incubation period
incubation_period <- LogNormal(
  meanlog = Normal(1.6, 0.05),
  sdlog = Normal(0.5, 0.05),
  max = 14
)

# Reporting delay
reporting_delay <- LogNormal(
  meanlog = 0.5,
  sdlog = 0.5,
  max = 10
)

# Combine the incubation period and reporting delay into one delay
delay <- incubation_period + reporting_delay

# Observation model options
obs <- obs_opts(
  scale = list(mean = 0.1, sd = 0.025),
  return_likelihood = TRUE
)

# Run model
epinow(
  data = example_confirmed,
  generation_time = generation_time_opts(generation_time),
  delays = delay_opts(delay),
  obs = obs,
  horizon = 0,
  rt = NULL,
  stan = stan_opts(
    method = "laplace",
    backend = "cmdstanr"
  )
)

Error in fit_model_approximate(args, id = id) : 
  Approximate inference failed due to: Error: 'jacobian' argument to optimize and laplace must match!
laplace was called with jacobian=TRUE
optimize was run with jacobian=TRUE

Not an informative error message from {cmdstanr}, I should say.

After playing around with the example above and the example in the laplace() documentation, using different combinations of mode (NULL or unspecified) and jacobian, there seems to be an erroneous interaction between the two arguments. It seems that the option to set mode = NULL may not be implemented (I am yet to confirm this).

Moreover, according to the documentation of laplace(), setting mode = NULL and jacobian = TRUE/FALSE (through stan_opts() in EpiNow2) should work, but I get the same error. It seems we may have to run optimize() first and pass the output to laplace().

Am I missing anything? Thoughts?? @sbfnk @seabbs

sbfnk · 2024-07-12T09:40:13Z

I can't reproduce this - do you need to update cmdstanr or cmdstan (using cmdstanr::install_cmdstan()) possibly?

My versions are:

devtools::package_info("cmdstanr") |>
  dplyr::filter(package == "cmdstanr")
#>  package  * version    date (UTC) lib source
#>  cmdstanr   0.8.1.9000 2024-06-23 [1] Github (stan-dev/cmdstanr@9878dda)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
cmdstanr::cmdstan_version()
#> [1] "2.35.0"

Created on 2024-07-12 with reprex v2.1.0

jamesmbaazam · 2024-07-12T15:15:43Z

Thanks. It's fixed now after updating.

jamesmbaazam · 2024-07-15T12:30:38Z

I have now pushed the vignette with the run of all the models using MCMC (a472439).

To do:

Add a model using "vb" from cmdstanr.
- UPDATE: This now runs and I've added a custom function to extract the samples for downstream analyses.
Run the {cmdstanr} approximate methods ("pathfinder" and "laplace").
- Blockers: the two methods are not even initialising. (cc: @sbfnk @seabbs)
- UPDATE: I have now gotten to the root of one of the issues and created a separate issue for it: "pathfinder fails for large case reports" (#728).

Settings and errors so far

Pathfinder errors

`num_paths = 1`

stan = stan_opts(
  method = "pathfinder",
  backend = "cmdstanr",
  num_paths = 1
)
# Error in fit_model_approximate(args, id = id) : 
#   Approximate inference failed due to: Optimization terminated with error: Line search failed to achieve a sufficient decrease, no more progress can be made Optimization failed to start, pathfinder cannot be run.

`num_paths > 1`


stan = stan_opts(
  method = "pathfinder",
  backend = "cmdstanr",
  num_paths = 2
)

# Error in fit_model_approximate(args, id = id) : 
# Approximate inference failed due to: No pathfinders ran successfully

Increase `trials` from default of 10 to 50 (with multipathfinder default)

stan = stan_opts(
  method = "pathfinder",
  backend = "cmdstanr",
  trials = 50
)

# Error in fit_model_approximate(args, id = id) : 
# Approximate inference failed due to: No pathfinders ran successfully

jamesmbaazam · 2024-07-24T17:11:14Z

I noticed something interesting about pathfinder struggling to fit large case reports and have created a separate issue for it here #728.

@sbfnk The only thing in the way of this vignette getting merged is the struggle to get pathfinder and laplace working. If we don't want to delay this any further, we could finalise this version, mention that a future enhancement will include those two methods, and get this merged, then add them as an enhancement when I've figured them out. All the models including vb from {rstan} and {cmdstanr} are working currently. I will precompile them all and push.

jamesmbaazam · 2024-07-25T13:04:27Z

From meeting with Seb today:

Finalise current vignette by running all models, leaving out Pathfinder and Laplace for future enhancement.
Add more metrics alongside run time:
- total CRPS for estimates on complete data, partial data, and forecasting.
- CRPS at the last time point (real-time performance)

kaitejohnson · 2025-01-30T16:14:50Z

@jamesmbaazam Would it be possible to share a rendered HTML, since I don't think we've set anything up yet to post a preview of the articles as a comment (though we could)?

jamesmbaazam · 2025-01-30T17:47:58Z

> @jamesmbaazam Would it be possible to share a rendered HTML, since I don't think we've set anything up yet to post a preview of the articles as a comment (though we could)?

We usually pre-compile the vignettes from the Rmd.orig version to Rmd, which can be knit locally for review.
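
A minimal sketch of that pre-compilation step, for anyone who wants to render the vignette locally. The helper name `precompile_vignette()` is hypothetical (it is not part of the package), and the sketch assumes {knitr} is installed and that the `.Rmd.orig` file sits next to the `.Rmd` it produces.

```r
# Hypothetical helper: knit the .Rmd.orig source into the .Rmd that is shipped,
# so that the expensive model runs only happen once, on the author's machine.
precompile_vignette <- function(orig) {
  if (!file.exists(orig)) {
    stop("Cannot find ", orig, call. = FALSE)
  }
  # Drop the trailing ".orig" to get the output file name
  output <- sub("\\.orig$", "", orig)
  knitr::knit(input = orig, output = output)
  invisible(output)
}

# Example call (path is an assumption about the repository layout):
# precompile_vignette("vignettes/benchmarks.Rmd.orig")
```

The resulting `.Rmd` contains the already-evaluated output, so it can then be knit to HTML quickly without re-running the models.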

sbfnk

Wow, this is a lot of great work!

General comment is that it has elements of a vignette and elements of a paper so it's not quite clear where it stands on the distinction between the two. I would say for a vignette it's a bit too complex defining functions calling other functions etc. - for a paper it definitely has some great new content but the style is more like a walkthrough/vignette.

With that in mind I have focused on code correctness / implementation in my review so far. We can discuss the text later perhaps.

Anyway, very cool stuff and really interesting!

vignettes/benchmarks.Rmd.orig

sbfnk · 2025-01-31T08:57:34Z

vignettes/benchmarks.Rmd.orig

+- `infections_true`: the infections by date of infection, and
+- `reported_cases_true`: the reported cases by date of report.
+```{r extract-true-data}
+R_true <- forecast$summarised[variable == "R"]$median


Why do you do 10 samples and the median for R but not infections? I think for a fair comparison of the methods you'd just sample 1 and then take this noisy trajectory as truth data (possibly doing it 10 times overall but this requires 10 times the computation).

I will update but just to note that this is from the synthetic evaluation scripts so might have to update there too?

Also, just to understand, are some of the methods/models more sensitive to the number of samples?

vignettes/benchmarks.Rmd.orig

sbfnk · 2025-01-31T09:09:10Z

vignettes/benchmarks.Rmd.orig

+
+These are the components of each model.
+```{r model-components,echo = FALSE}
+model_components <- dplyr::tribble(


same as above for constructing this

vignettes/benchmarks.Rmd.orig

sbfnk · 2025-01-31T09:15:23Z

vignettes/benchmarks.Rmd.orig

+  if (!inherits(estimates, c("matrix"))) return(rep(NA_real_, length(truth)))
+  # Assumes that the estimates object is structured with the samples as rows
+  shortest_obs_length <- min(ncol(estimates), length(truth))
+  reduced_truth <- tail(truth, shortest_obs_length)


shouldn't this be head? I.e. your estimates correspond to the initial snapshot of the data (starting at 1) not the end (at least for now).

sbfnk · 2025-01-31T09:15:30Z

vignettes/benchmarks.Rmd.orig

+  shortest_obs_length <- min(ncol(estimates), length(truth))
+  reduced_truth <- tail(truth, shortest_obs_length)
+  estimates_transposed <- t(estimates) # transpose to have samples as columns
+  reduced_estimates <- tail(estimates_transposed, shortest_obs_length)


Same here head vs tail
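
To make the head/tail distinction concrete, here is a small base R sketch with made-up numbers. The `truth` and `estimates` objects are hypothetical stand-ins, not the vignette's data, and the sketch assumes (as suggested above) that the estimates cover the first time points of the truth series.

```r
# Hypothetical truth series for 10 time points
truth <- 1:10

# Hypothetical estimates: 3 samples (rows) for the first 6 time points (columns)
estimates <- matrix(rep(1:6, each = 3), nrow = 3)

shortest_obs_length <- min(ncol(estimates), length(truth))

# tail() keeps the LAST time points of the truth, which is misaligned here
truth_tail <- tail(truth, shortest_obs_length)
# head() keeps the FIRST time points, which lines up with the estimates
truth_head <- head(truth, shortest_obs_length)

# Transpose so that time points are rows and samples are columns
estimates_transposed <- t(estimates)
reduced_estimates <- head(estimates_transposed, shortest_obs_length)

truth_tail
truth_head
```

With `tail()`, time points 5 to 10 of the truth would be scored against estimates for time points 1 to 6. When the two series have the same length the choice makes no difference, which is why the problem can go unnoticed.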

sbfnk · 2025-01-31T09:15:52Z

vignettes/benchmarks.Rmd.orig

+      log10(reduced_truth),
+      log10(reduced_estimates)


only works if not ever zero (so perhaps fine if it works)
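
A quick illustration of the zero problem, using made-up counts. The `+ 1` offset shown is only one possible workaround and is an assumption, not something the vignette does.

```r
# Hypothetical case counts containing a zero
reduced_truth <- c(0, 3, 10, 100)

# log10(0) is -Inf, which would propagate into the score
log_truth <- log10(reduced_truth)

# Adding an offset before taking logs keeps everything finite
log_truth_offset <- log10(reduced_truth + 1)

any_non_finite <- any(!is.finite(log_truth))
all_finite_offset <- all(is.finite(log_truth_offset))
```

Any offset changes the scale on which the score is computed, so it would need to be applied to both the truth and the estimates.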

sbfnk · 2025-01-31T09:16:29Z

vignettes/benchmarks.Rmd.orig

+  estimates_transposed <- t(estimates) # transpose to have samples as columns
+  reduced_estimates <- tail(estimates_transposed, shortest_obs_length)
+  return(
+    crps_sample(


if you're only using crps_sample() from scoringutils would it make sense to use scoringRules instead? As the scoringutils version is just a thin wrapper.

(how are we going to goose downloads with that approach ;) )

More seriously I think this indicates we should just use more of the features of scoringutils

Yep, ultimately related also to #618
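
For readers unfamiliar with what is being computed here, the sample-based CRPS can be written in a few lines of base R. This is a hand-rolled sketch of the standard empirical estimator for a single observation, not the code used by scoringutils or scoringRules, and `crps_from_samples()` is a hypothetical name.

```r
# Empirical CRPS for one observation y and a vector of predictive samples:
# E|X - y| - 0.5 * E|X - X'|
crps_from_samples <- function(y, samples) {
  term1 <- mean(abs(samples - y))
  term2 <- mean(abs(outer(samples, samples, "-")))
  term1 - 0.5 * term2
}

set.seed(1)
draws <- rnorm(1000, mean = 0, sd = 1)

score_centred <- crps_from_samples(0, draws) # truth near the forecast centre
score_far <- crps_from_samples(3, draws)     # truth far from the forecast
```

Lower values are better, so `score_far` comes out larger than `score_centred`.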

sbfnk · 2025-01-31T09:20:00Z

vignettes/benchmarks.Rmd.orig

+
+```{r process-crps}
+# Process CRPS for Rt
+rt_crps <- process_crps(results, "R", R)


I wonder if it would be good to look at the estimate on the day of the inference for a real-time measure and then e.g. the 7-day forecast rather than the whole "estimate from partial data" / "forecast" time series. Debatable perhaps, but if you use the whole series I think it might make more sense to use the multivariate CRPS / energy score (which is not yet implemented in scoringutils but is available in scoringRules::es_sample()).
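
As a rough illustration of the energy score idea, here is a base R sketch that scores a whole trajectory at once. It is a hand-rolled version of the usual sample estimator, not the scoringRules implementation, and `energy_score()` is a hypothetical name. It assumes the samples are stored with one row per time point and one column per sample.

```r
# Energy score for one multivariate observation y (length d) and a
# d x m matrix of predictive samples (one column per sample):
# E||X - y|| - 0.5 * E||X - X'||
energy_score <- function(y, samples) {
  m <- ncol(samples)
  # Mean Euclidean distance between each sample and the observation
  term1 <- mean(sqrt(colSums((samples - y)^2)))
  # Mean Euclidean distance over all pairs of samples
  pairwise <- as.matrix(dist(t(samples)))
  term2 <- sum(pairwise) / m^2
  term1 - 0.5 * term2
}

set.seed(1)
truth <- rep(1, 7)                                        # made-up 7-day truth
samples <- matrix(rnorm(7 * 100, mean = 1), nrow = 7)     # made-up forecast draws
es <- energy_score(truth, samples)
```

In one dimension this reduces to the sample CRPS, so the two scores are directly comparable in spirit, but the energy score also accounts for the dependence between time points within each sampled trajectory.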

sbfnk · 2025-01-31T09:24:25Z

I think we can push this to the next release to give us a bit more time to review content and scope - no need to rush it into 1.7.0 (which is otherwise pretty much ready to go).

kaitejohnson

@jamesmbaazam Mostly a lot of questions, feel free to ignore any of these as I am sure you have already discussed a lot of it.

Overall this is a really nice analysis. From an outside perspective, I am really interested in how the approximate methods compare in terms of forecast performance to MCMC. Also, it wasn't clear that these were evaluated only on the out-of-sample data, and if not, separately evaluating the out-of-sample forecast performance seems important (but I may have missed this).

vignettes/benchmarks.Rmd.orig

kaitejohnson · 2025-01-31T09:34:47Z

vignettes/benchmarks.Rmd.orig

+# Rt prior
+rt_prior_default <- Normal(2, 0.1)
+# Number of cores
+options(mc.cores = 6)


Add a comment that you set this to 6 because your computer has 8 cores, or something along those lines.

or do some kind of core detection probably.

This is addressed in the considerations section but happy to repeat it here. See 655148a.

I think if we can avoid hard coding this it would be good.
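
One way to avoid the hard-coded value is sketched below, using the {parallel} package that ships with R. Leaving two cores free is an arbitrary choice made for this example, not a recommendation from the package.

```r
# detectCores() can return NA on some platforms, so fall back to one core
available_cores <- parallel::detectCores()
if (is.na(available_cores)) available_cores <- 1L

# Leave two cores free for the rest of the system, but never go below one
n_cores <- max(1L, available_cores - 2L)
options(mc.cores = n_cores)
```

Because timings depend on the number of cores, the vignette would still need to report the value actually used when the benchmarks were run.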

kaitejohnson · 2025-01-31T09:49:27Z

vignettes/benchmarks.Rmd.orig

+  R = R_noisy,
+  samples = 10
+)
+```


From a very outside perspective, it strikes me as a little bit odd to do a fit to the example_confirmed, and then create a separate true R(t), and then generate new simulated data, rather than just doing a forward simulation.

But I assume there are reasons for this choice that have already been discussed.

Aha we are basically being extremely clever and pragmatic but I agree we should tell people this

The idea is that by using the posterior for everything but the Rt trajectory you are getting a simulation that very accurately reflects a real world setting whilst still being able to know and change the true Rt trajectory

> The idea is that by using the posterior for everything but the Rt trajectory you are getting a simulation that very accurately reflects a real world setting whilst still being able to know and change the true Rt trajectory

Stealing this explanation.

kaitejohnson · 2025-01-31T09:53:54Z

vignettes/benchmarks.Rmd

+cases_traj
+```
+
+![plot of chunk plot-cases](benchmarks-plot-cases-1.png)


This comes up in the knitted Rmd under the plot as "plot of chunk plot-cases". Perhaps just label them as "Figure #" with a caption?

Yeah, that's weird. I'll remove all the captions (b327642).

kaitejohnson · 2025-01-31T10:14:05Z

vignettes/benchmarks.Rmd

+```
+
+![plot of chunk crps-plotting-rt-total](benchmarks-crps-plotting-rt-total-1.png)
+


Might be easier to read if you made this into a bar chart since we were just looking at these as colored lines.

or a point plot, I am irrationally anti-bar

kaitejohnson · 2025-01-31T10:15:14Z

vignettes/benchmarks.Rmd

+```
+
+![plot of chunk plot-rt-tv-crps-approx](benchmarks-plot-rt-tv-crps-approx-1.png)
+


Would be nice to have some explanation of why variational Bayes failed for nearly all of the models in the decline phase, and more generally why they are failing.

It would, but I think this might be out of scope here (i.e. it's a new research/doc issue IMO).

kaitejohnson · 2025-01-31T10:40:21Z

vignettes/benchmarks.Rmd

+Estimation in `{EpiNow2}` using the semi-mechanistic approaches (putting a prior on $R_t$) is often much slower than the non-mechanistic approach. The mechanistic model is slower because it models aspects of the processes and mechanisms that drive $R_t$ estimates using the renewal equation. The non-mechanistic model, on the other hand, runs much faster but does not use the renewal equation to generate infections. Because of this none of the options defining the behaviour of the reproduction number are available in this case, limiting its flexibility.
+
+It's worth noting that the non-mechanistic model in `{EpiNow2}` is equivalent to that used in the [`{EpiEstim}`](https://mrc-ide.github.io/EpiEstim/index.html) R package as they both estimate $R_t$ from case time series and the generation interval distribution.
+


Might be worth noting that it's likely that this is more sensitive to GI misspecification (I'd assume?). Also I am guessing the difference is that EpiEstim would require specifying just the mean delay, where the non-mechanistic EpiNow2 is able to incorporate the full delay distribution, which is likely to be important for fitting to more delayed data.

Could you clarify a bit what you mean by this? The renewal?

> Also I am guessing the difference is that EpiEstim would require specifying just the mean delay, where the non-mechanistic EpiNow2 is able to incorporate the full delay distribution, which is likely to be important for fitting to more delayed data

It doesn't take any delays at all and the approach to smoothing is different (here it is being fit). We should probably talk about this somewhere.

My understanding of EpiEstim is that it does a direct computation of the posterior estimate of R(t) based on the cases and the GI, assuming cases are just latent incident infections shifted by some mean delay.

So when its written that the non-mechanistic one is like EpiEstim, I assume it does something similar, but since the input has the user specify a full delay distribution, I (perhaps incorrectly) assumed it is using that full distribution (rather than just the mean).

What do you mean it doesn't take any delays?

EpiEstim doesn't do anything with delays and uses the serial interval, not the GT. I think what the text means here is that the underlying infection-generating process that EpiNow2 and EpiEstim use is the same, but not the rest of the model, and that is in fact only partially true because of differences in the latent model and differences in the uncertainty. Personally I would just nuke this paragraph.

Ok yeah I don't think I understand how the non-mechanistic model works then!

kaitejohnson · 2025-01-31T10:44:28Z

vignettes/benchmarks.Rmd

+  labs(caption = "Where a model is not shown, it means it failed to run")
+```
+
+![plot of chunk plot-infections-total-crps-approx](benchmarks-plot-infections-total-crps-approx-1.png)


Can you also plot a comparison of the MCMC vs approximate methods for R(t) estimation and infections? There is a lot of language in here around the approximate methods not being well-tested, but this exercise seems like an evaluation of them! Would be good to show those results directly I think

kaitejohnson · 2025-01-31T10:48:06Z

vignettes/benchmarks.Rmd.orig

+infections_crps_mcmc_plot +
+    facet_wrap(~epidemic_phase, ncol = 1)
+```
+


It isn't clear to me that all of these are being evaluated in both the calibration period and the forecast period, or just in the 7 day forecast period?

Doesn't this mix in-sample and out-of-sample evaluation? I am most interested in the out of sample evaluation performance (specifically for the infections)

jamesmbaazam · 2025-02-11T15:39:54Z

Agreement from today's f2f meeting with Seb

Scope

To evaluate the retrospective and real-time performance of select models using timing and proper scoring rules

Evaluations

Real-time performance - Choose a time point:
- For infections, use last day of the forecast 7-day horizon (7-day ahead performance)
- For Rt, use the 0-day forecast (interpreted as the nowcast estimate of the Rt, given the delay)
Retrospective estimation performance (excluding forecasts):
- Time-varying performance
- Total performance
Miscellaneous
- Reduce focus on code setup and hide code/functions.
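
The agreed evaluation points can be sketched in base R as follows. Everything here is illustrative: the dates and CRPS values are made up, the column names are hypothetical, and summing the per-date scores is only one possible reading of "total performance".

```r
# Hypothetical per-date scores around a forecast date, with a 7-day horizon
forecast_date <- as.Date("2020-04-21")
horizon <- 7
scores <- data.frame(
  date = seq(forecast_date - 20, forecast_date + horizon, by = "day"),
  crps = seq(0.1, by = 0.05, length.out = 28)
)

# Real-time performance: a single time point per target
nowcast_score <- scores$crps[scores$date == forecast_date]         # 0-day, for Rt
ahead_score <- scores$crps[scores$date == forecast_date + horizon] # 7-day ahead, for infections

# Retrospective performance: drop the forecast period entirely
retrospective <- scores[scores$date <= forecast_date, ]
time_varying_score <- retrospective$crps # one value per date
total_score <- sum(retrospective$crps)   # assumption: total = sum over dates
```

Keeping the forecast rows out of `retrospective` is what separates the in-sample from the out-of-sample evaluation.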

I'll work on this gradually this week, if time permits as I will be attending an internal event this week. This should be completed by early next week, if all goes well.

jamesmbaazam mentioned this pull request Jul 12, 2024

Issue 44: Approximate inference vignette epinowcast/epidist#69
