Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve simulate_*() functions: Rename columns + improve object names + comments #242

Merged
merged 56 commits into from
May 15, 2024

Conversation

jamesmbaazam
Copy link
Member

@jamesmbaazam jamesmbaazam commented May 4, 2024

This PR closes #238 and #175.

  • Please check if the PR fulfills these requirements
  • I have read the CONTRIBUTING guidelines
  • A new item has been added to NEWS.md
  • Tests for the changes have been added (for bug fixes / features)
  • Docs have been added / updated (for bug fixes / features)
  • Checks have been run locally and pass
  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

Function enhancements + Documentation improvements

  • What is the current behavior? (You can also link to an open issue here)

Currently, the <epichains> object returns columns with names infectee_id, sim_id, infector_id, and generation, and optionally, susc_pop and time (if pop and generation_time are specified respectively). However, these columns are confusing, swapped in interpretation, and not straightforward to explain as noted in the linked issues. The sim_id column is also not unique across the dataset, making it hard to interpret.

  • What is the new behavior (if this is a feature change)?

User-facing changes

  • The <epichains> object now returns columns with names chain, infector, infectee, generation, and optionally, time, if generation_time is specified. The susc_pop column has been removed as it was not deemed necessary to return.

  • The help file of simulate_chains() and simulate_chain_stats() also gain a new section providing a clear definition of what a "chain" is as used in the function.

  • The infectee column now contains a unique id for each infectee, which can link them to their infector and seeding index case.

  • The index_cases argument of simulate_chains() and simulate_chain_stats() has been renamed to n_chains to reflect the fact that the supplied number will simulate n independent chains, each starting with 1 individual.

Non-user-facing changes

Additionally, some of the objects in the code have been renamed and comments have been improved to make the code (hopefully) easier to read.

  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

NA (package unpublished).

  • Other information:

@jamesmbaazam
Copy link
Member Author

jamesmbaazam commented May 4, 2024

This PR also creates an opportunity to plot the network contained the returned <epichains> object using {epicontacts}. Here's a reprex.

reprex::reprex({
  library(epicontacts)
  # Install this PR and load the package
  pak::pkg_install("epiverse-trace/epichains#242")
  library(epichains)
  # Simulate an outbreak
  set.seed(32)
  outbreak <- simulate_chains(
    index_cases = 5,
    statistic = "size",
    offspring_dist = rpois,
    generation_time = function(n) rlnorm(n, meanlog = 0.58, sdlog = 1.58),
    lambda = 1.5,
    stat_max = 30
  )
 # Create an epicontacts object
  plot_df <- make_epicontacts(
    linelist = outbreak,
    contacts = outbreak,
    id = "infectee",
    from = "infector",
    to = "infectee",
    directed = TRUE
  )
  # Plot the epicontacts object
  plot(plot_df)
})

Screenshot 2024-05-04 at 15 23 22

The code used above can be converted to a simple S3 plot.epichains() method that takes an <epichains> object and extracts the right columns, passes them to epicontacts to do the plotting. This might be useful to the user.

@sbfnk
Copy link
Contributor

sbfnk commented May 8, 2024

Before I go into detailed review can I just ask if one change I spotted was intentional, as it affects interpretation of the results and what is the "correct" way of accounting.

Before (e.g. in bpmodels) each index case (/simulation) generated their own infectee ids, i.e. we could have

sim_id   infectee_id
     1             1
     1             2
     1             3
     2             1
     2             2

etc.

Now the infectee_id is shared across all index cases (/simulations)

index_case   infectee_id
         1             1
         1             2
         1             3
         2             4
         2             5

I think the way we want this relates to how we interpret the simulations (this also affects the sim_id vs index_case naming):

  • Option 1: The simulations are replicates of the same situation. The population size / depletion of susceptibles affects each of the simulations separately, and the stat_max affects each independently. This is what we want if using the simulations for inference (i.e. for estimation in the likelihood function) and is the existing set up. In that case I think it might make sense to keep the column name as sim_id
  • Option 2: The simulations are independent trees from different index cases in the same population. In that case I think population size / depletion of susceptibles affects the simulations concurrently, and the stat_max should probably apply to the sum of stats from each index case. This is a totally sensible set up for simulations but would, I think, need some more code changes. In that case it makes sense to call the column name index_case

It only occurred to me when reviewing this PR that we're conflating these two concepts (also noting that the first argument is called index_cases). We could support one of the two options (where option 1 requires least effort as it doesn't require any updating of the indexing) or perhaps both, but we should delineate clearly between them.

@jamesmbaazam
Copy link
Member Author

Now the infectee_id is shared across all index cases (/simulations)

Yes, this change was in response to your suggestions in the linked issues and my summary of how I understood it here #238 (comment).

For now, I'll revert to option 1, i.e., using the sim_id column name (so as not to delay the version release). We can revisit the second option in the future.

@jamesmbaazam
Copy link
Member Author

jamesmbaazam commented May 9, 2024

@sbfnk Here is what a reprex of what your comments here + in #238 (i.e., rename sim_id to infectee) would look like

library(epichains)
  set.seed(123)
  epc_out <- simulate_chains(
    index_cases = 5,
    statistic = "length",
    offspring_dist = rpois,
    stat_max = 100,
    lambda = 0.5
  )
  epc_out
#>    index_case infector infectee generation
#> 1           1       NA        1          1
#> 2           2       NA        1          1
#> 3           3       NA        1          1
#> 4           4       NA        1          1
#> 5           5       NA        1          1
#> 6           2        1        2          2
#> 7           4        1        2          2
#> 8           5        1        2          2
#> 9           5        1        3          2
#> 10          5        2        4          3
Created on 2024-05-09 with [reprex v2.1.0](https://reprex.tidyverse.org/)

The index_case column refers to the seeding index cases that remain active through the generations, and the previous sim_id column is renamed to infectee (in response to #238) with unique IDs within simulations but not shared across.

Does this reflect what you were suggesting above?

@sbfnk
Copy link
Contributor

sbfnk commented May 9, 2024

Does this reflect what you were suggesting above?

Yes except that I'd revert index_case to be called sim_id (and probably also rename the corresponding argument to nsims or similar - so they're not index cases within the same epidemic, but different epidemics. Does that make sense?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Column names switched in return value of simulate_chains
3 participants