Improve `simulate_*()` functions: Rename columns + improve object names + comments #242

jamesmbaazam · 2024-05-04T14:12:54Z

This PR closes #238 and #175.

Please check if the PR fulfills these requirements

I have read the CONTRIBUTING guidelines
A new item has been added to NEWS.md
Tests for the changes have been added (for bug fixes / features)
Docs have been added / updated (for bug fixes / features)
Checks have been run locally and pass

What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

Function enhancements + Documentation improvements

What is the current behavior? (You can also link to an open issue here)

Currently, the <epichains> object returns columns with names infectee_id, sim_id, infector_id, and generation, and optionally, susc_pop and time (if pop and generation_time are specified respectively). However, these columns are confusing, swapped in interpretation, and not straightforward to explain as noted in the linked issues. The sim_id column is also not unique across the dataset, making it hard to interpret.

What is the new behavior (if this is a feature change)?

User-facing changes

The <epichains> object now returns columns with names chain, infector, infectee, generation, and optionally, time, if generation_time is specified. The susc_pop column has been removed as it was not deemed necessary to return.
The help file of simulate_chains() and simulate_chain_stats() also gain a new section providing a clear definition of what a "chain" is as used in the function.
The infectee column now contains a unique id for each infectee, which can link them to their infector and seeding index case.
The index_cases argument of simulate_chains() and simulate_chain_stats() has been renamed to n_chains to reflect the fact that the supplied number will simulate n independent chains, each starting with 1 individual.

Non-user-facing changes

Additionally, some of the objects in the code have been renamed and comments have been improved to make the code (hopefully) easier to read.

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

NA (package unpublished).

Other information:

jamesmbaazam · 2024-05-04T14:24:50Z

This PR also creates an opportunity to plot the network contained in the returned <epichains> object using {epicontacts}. Here's a reprex.

reprex::reprex({
  library(epicontacts)
  # Install this PR and load the package
  pak::pkg_install("epiverse-trace/epichains#242")
  library(epichains)
  # Simulate an outbreak
  set.seed(32)
  outbreak <- simulate_chains(
    index_cases = 5,
    statistic = "size",
    offspring_dist = rpois,
    generation_time = function(n) rlnorm(n, meanlog = 0.58, sdlog = 1.58),
    lambda = 1.5,
    stat_max = 30
  )
  # Create an epicontacts object
  plot_df <- make_epicontacts(
    linelist = outbreak,
    contacts = outbreak,
    id = "infectee",
    from = "infector",
    to = "infectee",
    directed = TRUE
  )
  # Plot the epicontacts object
  plot(plot_df)
})

The code used above can be converted to a simple S3 plot.epichains() method that takes an <epichains> object, extracts the right columns, and passes them to epicontacts for plotting. This might be useful to the user.
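A minimal sketch of what such a method might look like, wrapping the reprex above. This is an illustration, not code from this PR: it assumes the object can be treated as a data frame with `infector` and `infectee` columns, that `infectee` is unique across the whole object, and that {epicontacts} is installed.

```r
# Hypothetical S3 method (not part of this PR): plot an <epichains> object
# as a transmission network by handing its columns to {epicontacts}.
plot.epichains <- function(x, ...) {
  if (!requireNamespace("epicontacts", quietly = TRUE)) {
    stop("Package 'epicontacts' is required for plotting.", call. = FALSE)
  }
  # Assumption: the object behaves like a data frame with these columns
  x_df <- as.data.frame(x)
  stopifnot(all(c("infector", "infectee") %in% names(x_df)))
  # Same call as in the reprex: the object serves as linelist and contacts
  net <- epicontacts::make_epicontacts(
    linelist = x_df,
    contacts = x_df,
    id = "infectee",
    from = "infector",
    to = "infectee",
    directed = TRUE
  )
  plot(net, ...)
}
```

With this defined, `plot(outbreak)` would dispatch to the method for any object whose class vector includes "epichains".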

sbfnk · 2024-05-08T12:40:25Z

Before I go into a detailed review, can I just ask whether one change I spotted was intentional? It affects the interpretation of the results and what the "correct" way of accounting is.

Before (e.g. in bpmodels) each index case (/simulation) generated their own infectee ids, i.e. we could have

sim_id   infectee_id
     1             1
     1             2
     1             3
     2             1
     2             2

etc.

Now the infectee_id is shared across all index cases (/simulations):

index_case   infectee_id
         1             1
         1             2
         1             3
         2             4
         2             5
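The relationship between the two numbering schemes above can be sketched in base R. This is only an illustration using the toy rows from the tables; it assumes the per-simulation ids run contiguously from 1 within each simulation, and the object names are made up for the example.

```r
# Toy data copied from the "before" table: ids restart in each simulation
per_sim <- data.frame(
  sim_id = c(1L, 1L, 1L, 2L, 2L),
  infectee_id = c(1L, 2L, 3L, 1L, 2L)
)

# Number of cases per simulation and the offset to add to each one
n_per_sim <- tabulate(per_sim$sim_id)
offset <- c(0L, cumsum(n_per_sim))[seq_along(n_per_sim)]

# "After" scheme: ids keep counting across simulations
shared <- data.frame(
  index_case = per_sim$sim_id,
  infectee_id = per_sim$infectee_id + offset[per_sim$sim_id]
)
shared
```

In the first scheme a case is only identified by the pair (sim_id, infectee_id); in the second, infectee_id alone is enough.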

I think the way we want this relates to how we interpret the simulations (this also affects the sim_id vs index_case naming):

Option 1: The simulations are replicates of the same situation. The population size / depletion of susceptibles affects each of the simulations separately, and the stat_max affects each independently. This is what we want if using the simulations for inference (i.e. for estimation in the likelihood function) and is the existing setup. In that case I think it might make sense to keep the column name as sim_id.
Option 2: The simulations are independent trees from different index cases in the same population. In that case I think population size / depletion of susceptibles affects the simulations concurrently, and the stat_max should probably apply to the sum of stats from each index case. This is a totally sensible setup for simulations but would, I think, need some more code changes. In that case it makes sense to call the column index_case.
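To make the difference concrete, here is a toy branching process in base R. It is not epichains code: the function names are hypothetical, `cap` only plays the role of stat_max, depletion of susceptibles is left out, and the last simulated generation may overshoot the cap.

```r
# Option 1: one replicate, simulated and capped on its own
simulate_replicate <- function(lambda, cap) {
  size <- 1
  active <- 1
  while (active > 0 && size < cap) {
    offspring <- sum(rpois(active, lambda))
    size <- size + offspring
    active <- offspring
  }
  size
}

# Option 2: all chains grow one generation at a time in the same epidemic,
# and the cap is compared with the summed size of all chains
simulate_concurrent <- function(n_index, lambda, cap) {
  sizes <- rep(1, n_index)
  active <- rep(1, n_index)
  while (sum(active) > 0 && sum(sizes) < cap) {
    offspring <- vapply(
      active,
      function(a) as.numeric(sum(rpois(a, lambda))),
      numeric(1)
    )
    sizes <- sizes + offspring
    active <- offspring
  }
  sizes
}

set.seed(1)
option_1 <- replicate(5, simulate_replicate(lambda = 1.5, cap = 30))
option_2 <- simulate_concurrent(n_index = 5, lambda = 1.5, cap = 30)
```

Under option 1 every replicate can grow to the cap by itself, whereas under option 2 the five chains share a single budget, so a large chain leaves less room for the others.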

It only occurred to me when reviewing this PR that we're conflating these two concepts (also noting that the first argument is called index_cases). We could support one of the two options (where option 1 requires least effort as it doesn't require any updating of the indexing) or perhaps both, but we should delineate clearly between them.

jamesmbaazam · 2024-05-08T21:24:04Z

> Now the infectee_id is shared across all index cases (/simulations)

Yes, this change was in response to your suggestions in the linked issues and my summary of how I understood it here #238 (comment).

For now, I'll revert to option 1, i.e., using the sim_id column name (so as not to delay the version release). We can revisit the second option in the future.

jamesmbaazam · 2024-05-09T13:22:58Z

@sbfnk Here is a reprex of what your comments here and in #238 (i.e., rename sim_id to infectee) would look like:

library(epichains)
set.seed(123)
epc_out <- simulate_chains(
  index_cases = 5,
  statistic = "length",
  offspring_dist = rpois,
  stat_max = 100,
  lambda = 0.5
)
epc_out
#>    index_case infector infectee generation
#> 1           1       NA        1          1
#> 2           2       NA        1          1
#> 3           3       NA        1          1
#> 4           4       NA        1          1
#> 5           5       NA        1          1
#> 6           2        1        2          2
#> 7           4        1        2          2
#> 8           5        1        2          2
#> 9           5        1        3          2
#> 10          5        2        4          3
Created on 2024-05-09 with [reprex v2.1.0](https://reprex.tidyverse.org/)

The index_case column refers to the seeding index cases that remain active through the generations, and the previous sim_id column is renamed to infectee (in response to #238) with unique IDs within simulations but not shared across.
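As a quick check of that description, the printed output above can be rebuilt as a plain data frame in base R. The helper `trace_ancestry()` is hypothetical and only for illustration; it relies on (index_case, infectee) identifying exactly one row.

```r
# Rows copied from the reprex output above
epc_df <- data.frame(
  index_case = c(1, 2, 3, 4, 5, 2, 4, 5, 5, 5),
  infector = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 2),
  infectee = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 4),
  generation = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3)
)

# infectee repeats across simulations, but the pair is unique
infectee_repeats <- anyDuplicated(epc_df$infectee) > 0
pair_is_unique <- anyDuplicated(epc_df[, c("index_case", "infectee")]) == 0

# Hypothetical helper: walk from a case back to the index case of its chain
trace_ancestry <- function(df, chain, id) {
  path <- id
  repeat {
    row <- df[df$index_case == chain & df$infectee == id, ]
    stopifnot(nrow(row) == 1L)
    if (is.na(row$infector)) break
    id <- row$infector
    path <- c(path, id)
  }
  path
}

trace_ancestry(epc_df, chain = 5, id = 4)
#> [1] 4 2 1
```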

Does this reflect what you were suggesting above?

sbfnk · 2024-05-09T19:36:39Z

> Does this reflect what you were suggesting above?

Yes, except that I'd revert index_case to be called sim_id (and probably also rename the corresponding argument to nsims or similar), so that they're not index cases within the same epidemic but different epidemics. Does that make sense?

jamesmbaazam added this to the v0.1.0 - First minor release milestone May 4, 2024

jamesmbaazam requested a review from sbfnk May 6, 2024 21:37

jamesmbaazam added 11 commits May 8, 2024 22:26

Update returned columns and clarify definitions

cdb07f3

Remove susceptible population column

b371b81

Don't sort output by sim_id and infector_id (old names)

7b806fc

Rearrange the final columns

353385f

Give infectees unique IDs

7bd0adc

Remove sim_id column

9e2799a

Rename objects and add improve comments

9af4ce2

Update function name in error message

e81f9e7

Fix tests: use new column names

7be433d

Remove susc_pop column-related tests

f53877e

Update snapshots

3d8950f

jamesmbaazam force-pushed the rename-columns branch from e43e2ff to 3d8950f Compare May 8, 2024 21:27

Automatic readme update

8526c49

jamesmbaazam added 10 commits May 10, 2024 13:05

Rename index_cases to nsims

3aa98ff

Reword function title

519ae20

Clean up description tag of simulate_chain_stats

70dbf75

Clean up description of simulate_chains()

5bc7379

Generate new snapshops

ddee692

Rename index_case_active column to sim_id

7fef0dc

Improve print message to align "simulations" wording

a0a7db3

Document nsims

c488cde

Replace index_cases with nsims

a7f78d5

Rename variables with index_case_* style to sim_*

bcdd7bb

jamesmbaazam and others added 14 commits May 14, 2024 22:58

Rename active_sim_ids to active_chain_ids

6f7db63

Inherit transmission chain definition section

0cd8af4

Improve definition of chain length

dfe3beb

Improve definition of nchains

e310184

Improve definition of chains

7d4acfe

Improve definition of chain length

ecdcd5c

Replace "index cases" with chains

77d8c48

Improve the function description section

a093ba5

Improve definition of the "chain" column

0c3802b

Improve comments to align with "chains" wording

fb4eb1c

Break long line

68278dd

Rename nchains to n_chains

2c72481

Use n_chains

0dfdd8c

Automatic readme update

1fbfda0

jamesmbaazam closed this May 14, 2024

jamesmbaazam reopened this May 14, 2024

jamesmbaazam added 10 commits May 15, 2024 12:58

Replace serial_dist with generation_time

7687640

Rename tree_df argument to sim_df

aa01bd7

Fix reference to old function

98628a6

Replace wording around "index cases" with "chains"

cee9084

Improve comment on size calculation

4b7c890

Improve details of head and tail function details

504e20d

serials_dist should be generation_time

f30d4b6

Fix swapped comments

3545c9f

Remove wording around "tree"; use "chains"

e94c1c1

New snapshots

c07c273

jamesmbaazam merged commit 51c89c0 into main May 15, 2024
8 checks passed

jamesmbaazam deleted the rename-columns branch May 15, 2024 12:30

jamesmbaazam mentioned this pull request May 16, 2024

columns in return values of simulate_tree #175

Closed

jamesmbaazam mentioned this pull request Jun 28, 2024

Clarify what n_chains > 1 means in context of susceptible depletion vs lapply() over simulate_chains() with n_chains > 1 #270

Closed
