Changes to `generate_sim_data` #169

ben18785 · 2024-03-19T21:50:43Z

Sorry @ntorresd -- I'm revisiting this one as I've genuinely found it hard to use this function.

To give an example, I am trying to replace the reliance on the simdata_large_epi file in a vignette by replacing it with a simulated dataset as per #160. I tried:

df <- generate_sim_data( sim_data=data.frame(age_mean_f=c(2, 7, 12, 17, 22, 27, 32, 37, 42, 47), tsur=2050), foi=rep(1.5, 47), sample_size_by_age = c(20, 25, 25, 25, 25, 25, 25, 25, 25, 30) )

which I thought should work since it allows 47 FOIs (which seems sensible...from 1 up until a max age of 47). From reading the function documentation, it's still not clear to me how to get this to work, and I think we should make various changes.

generate_sim_data is clunky and vague; simulate_serosurvey uses full words and is immediately understandable, and I propose this change of name.

In addition, I think we should consider changes to its arguments; current arguments for this include:

sim_data which is a vague name and I would propose changing it to serosurvey_characteristics. This currently has a column called age_mean_f which specifies "Age group markers" -- I am not sure what this means. I'd suggest we change this column to be named ages_surveyed; I would suggest changing tsur to year_survey. sample_size_by_age would actually work better as a column in this data frame since then it's guaranteed to be of the right length, and I would suggest renaming it n_sample
foi works fine as it is for either time- or age-varying FOIs, but it won't generalise to time- and age-varying FOIs. To handle this, I propose a change to it: we require that users supply a data frame with columns: age and foi for age-varying FOIs; year and foi for time-varying FOIs; and age, year and foi for age- and time-varying FOIs. This has the added benefit that we can check the users are supplying the right inputs for whichever type of model they have.

We also need to check the inputs to the function and give the user better error handling when they have supplied inappropriate arguments:

We should check whether FOI is of right size (much easier to do when we have a data frame input)
sample sizes by age (which I think can be handled much more nicely as described above as an input to the old sim_data argument)

I also propose changes to the output of the function:

I don't understand why we return a data frame that has: age_mean_f, age_min and age_max since, to the best of my knowledge, we don't have a way of simulating data for heterogeneous age cohorts (e.g. a group of individuals with ages between 10 and 20); we probably should have this
tsur could be year_survey
counts could be n_seropositive
total should be n_sample to mirror the function inputs
survey can just be removed (and I'd suggest removing it from the inputs)

The text was updated successfully, but these errors were encountered:

ntorresd · 2024-03-21T17:32:37Z

It's fine @ben18785, I also think we can improve these functions a lot to make them easier and more intuitive to use.

Just to clarify some points. As stated in the simdata_large_epi documentation (which is very incomplete, my bad), this serosurvey is intended to emulate a large epidemic ocurring during 3 years at some point in time. The right FoI to use in this case is thus something like:

foi_sim_large_epi <- c(rep(0, 32), rep(1.5, 3), rep(0, 15))

This is the red line in the corresponding example on the vignettes.

Another important point to note is the age structure of the serosurvey:

 $ total   : num  20 25 25 25 25 25 25 25 25 30
 $ counts  : int  0 3 2 12 6 11 10 10 18 17
 $ age_min : num  2 7 12 17 22 27 32 37 42 47
 $ age_max : num  2 7 12 17 22 27 32 37 42 47
 $ tsur    : num  2050 2050 2050 2050 2050 2050 2050 2050 2050 2050

At the time I generated the simulated datasets I didn't realized that the ages were not saved correctly; it should read:

 $ total   : num  20 25 25 25 25 25 25 25 25 30
 $ counts  : int  0 3 2 12 6 11 10 10 18 17
 $ age_min : num  1 6 11 16 21 26 31 36 41 46
 $ age_max : num  5 10 15 20 25 30 35 40 45 50
 $ tsur    : num  2050 2050 2050 2050 2050 2050 2050 2050 2050 2050

This doesn't affect the rest of the calculation because when computing the age group markers age_mean_f (which stands for the mean floor between age_min and age_max) in prepare_serodata it yields a valid result anyway.

On the other hand, note that this serosurvey is grouped by age. As you pointed out, we don't have a direct way to simulate heterogeneous age cohorts. The way we bypass this is simulating data for each age between min(age_min) and max(age_max) and then grouping the data. This looks like:

serodata_sim <- generate_sim_data(
  sim_data = data.frame(
    age=seq(1,50),
    tsur=2050),
  foi=foi_sim_large_epi,
  sample_size_by_age = c(rep(4,5), rep(5,40), rep(6,5))
)

'data.frame':	50 obs. of  7 variables:
 $ age    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ tsur   : num  2050 2050 2050 2050 2050 2050 2050 2050 2050 2050 ...
 $ age_min: int  1 2 3 4 5 6 7 8 9 10 ...
 $ age_max: int  1 2 3 4 5 6 7 8 9 10 ...
 $ counts : int  0 0 0 0 0 0 0 0 0 0 ...
 $ total  : num  4 4 4 4 4 5 5 5 5 5 ...
 $ survey : chr  "sim_data" "sim_data" "sim_data" "sim_data" ...

And then:

serodata_sim <- group_sim_data(serodata_sim, step=5)

which returns a prepared serosurvey with the characteristics we wanted (in retrospective I think we shouldn't use prepare_serodata by default here, but return a simpler output):

    age_group total counts tsur country   survey age_min age_max age_mean_f sample_size birth_year prev_obs prev_obs_lower prev_obs_upper
X       (0,5]    20      0 2050    None sim_data       1       5          3         250       2047     0.00      0.0000000      0.1684335
X.1    (5,10]    25      0 2050    None sim_data       6      10          8         250       2042     0.00      0.0000000      0.1371852
X.2   (10,15]    25      0 2050    None sim_data      11      15         13         250       2037     0.00      0.0000000      0.1371852
X.3   (15,20]    25     23 2050    None sim_data      16      20         18         250       2032     0.92      0.7396942      0.9901604
X.4   (20,25]    25     25 2050    None sim_data      21      25         23         250       2027     1.00      0.8628148      1.0000000
X.5   (25,30]    25     25 2050    None sim_data      26      30         28         250       2022     1.00      0.8628148      1.0000000
X.6   (30,35]    25     25 2050    None sim_data      31      35         33         250       2017     1.00      0.8628148      1.0000000
X.7   (35,40]    25     24 2050    None sim_data      36      40         38         250       2012     0.96      0.7964831      0.9989878
X.8   (40,45]    25     25 2050    None sim_data      41      45         43         250       2007     1.00      0.8628148      1.0000000
X.9   (45,50]    30     30 2050    None sim_data      46      50         48         250       2002     1.00      0.8842967      1.0000000

Baring this in mind:

generate_sim_data is clunky and vague; simulate_serosurvey uses full words and is immediately understandable, and I propose this change of name.

In addition, I think we should consider changes to its arguments; current arguments for this include:

sim_data which is a vague name and I would propose changing it to serosurvey_characteristics. This currently has a column called age_mean_f which specifies "Age group markers" -- I am not sure what this means. I'd suggest we change this column to be named ages_surveyed; I would suggest changing tsur to year_survey. sample_size_by_age would actually work better as a column in this data frame since then it's guaranteed to be of the right length, and I would suggest renaming it n_sample

foi works fine as it is for either time- or age-varying FOIs, but it won't generalise to time- and age-varying FOIs. To handle this, I propose a change to it: we require that users supply a data frame with columns: age and foi for age-varying FOIs; year and foi for time-varying FOIs; and age, year and foi for age- and time-varying FOIs. This has the added benefit that we can check the users are supplying the right inputs for whichever type of model they have.

I agree on changing the names of the functions and variables as you suggest. Some of them may be more troublesome than others, since we have to change them as well in the preloaded datasets in order for the R-CMD checks to pass (just as we need to do for #112). I suggest we open a separate issue and PR to address these changes of names.

We also need to check the inputs to the function and give the user better error handling when they have supplied inappropriate arguments:

We should check whether FOI is of right size (much easier to do when we have a data frame input)

sample sizes by age (which I think can be handled much more nicely as described above as an input to the old sim_data argument)

This was recently addressed by @jpavlich in #168. I merged it today, so please rebase your branch before continuing.

I also propose changes to the output of the function:

I don't understand why we return a data frame that has: age_mean_f, age_min and age_max since, to the best of my knowledge, we don't have a way of simulating data for heterogeneous age cohorts (e.g. a group of individuals with ages between 10 and 20); we probably should have this

tsur could be year_survey

counts could be n_seropositive

total should be n_sample to mirror the function inputs

survey can just be removed (and I'd suggest removing it from the inputs)

I think we should keep at least age_min and age_max since we can use them to generate grouped surveys as I explained above. I agree with you on the others, we should simplify the output as much as possible.

ben18785 · 2024-05-15T14:28:25Z

As discussed in in-person meeting, we will change this to be of the form:

feature_df <- data.frame(
    age_min=c(1, 6, 11), 
    age_max=c(5, 10, 20),
    sample_size=c(10, 15, 20),
    year_survey=c(2010, 2010, 2010)
)

serosurvey_time_example <- simulate_serosurvey(
  model="time",
  foi = data.frame(year=c(1990,1991,...,2009), foi=c(0.1, 0.2, ..., 0.3)),
  seroreversion=0,
  survey_features=feature_df
)

serosurvey_age_example <- simulate_serosurvey(
  model="age",
  foi = data.frame(age=c(1,2,...,20), foi=c(0.1, 0.2, ..., 0.3)),
  seroreversion=0.2,
  survey_features=feature_df
)

ages <- seq(1, 10, 1)
years <- seq(1990, 2000, 1)
foi_age_and_time <- expand_grid(year=years, age=ages) %>% 
  mutate(foi=0.1)

serosurvey_age_and_time_example <- simulate_serosurvey(
  model="age-time",
  foi = foi_age_and_time,
  seroreversion=0.2,
  survey_features=feature_df
)

But internally & externally we will have e.g. simulate_serosurvey_age_model which will be called by the helper function simulate_serosurvey.

…draft

…th the age-varying probs

…l_age_random needs sorting

…s not use an exposure matrix)

…ropositivities were the same

…draft

…th the age-varying probs

…l_age_random needs sorting

…s not use an exposure matrix)

…ropositivities were the same

ben18785 self-assigned this Mar 19, 2024

ben18785 added a commit that referenced this issue May 15, 2024

#169 move probability functionality

e4d7f01

ben18785 added a commit that referenced this issue May 15, 2024

#169 corrections to exposure matrix and simulating time models first …

e3391df

…draft

ben18785 added a commit that referenced this issue May 15, 2024

#169 chatgpt gave a useful tip about refactorising

e577a40

ben18785 added a commit that referenced this issue May 15, 2024

#169 removed redundant functions

1f14a31

ben18785 added a commit that referenced this issue May 15, 2024

#169 added all tests for simulate_serosurvey_time_model

0ad6e1b

ben18785 added a commit that referenced this issue May 16, 2024

#169 started to include docstrings

a480a7c

ben18785 added a commit that referenced this issue May 16, 2024

#169 added further docstrings and more tests incl seroreversion

f1fa164

ben18785 added a commit that referenced this issue May 16, 2024

#169 first draft of time-varying foi model with comprehensive unit tests

2ccf6ee

ben18785 added a commit that referenced this issue May 16, 2024

#169 started to add in age-varying simulation but there's an error wi…

338c6d1

…th the age-varying probs

ben18785 added a commit that referenced this issue May 16, 2024

#169 tests corrected for probability exact calcs

d4a3f37

ben18785 added a commit that referenced this issue May 16, 2024

#169 first draft of age-varying simulation model with full tests

2c31fb8

ben18785 added a commit that referenced this issue May 16, 2024

#169 added master simulate function. Bug for sample_size_by_individua…

b15dae2

…l_age_random needs sorting

ben18785 added a commit that referenced this issue May 16, 2024

#169 corrected probability_exact_time_varying function (which now doe…

30cef67

…s not use an exposure matrix)

ben18785 added a commit that referenced this issue May 16, 2024

#169 removed redundant age-varying probability calculator

bc5c705

ben18785 added a commit that referenced this issue May 16, 2024

#169 added validation functions for serosurveys

260d6bd

ben18785 added a commit that referenced this issue May 16, 2024

#169 corrected mistake which meant (previously) age and time model se…

445baa1

…ropositivities were the same

ben18785 added a commit that referenced this issue May 16, 2024

#169 working draft of age and time model

3ce4bbf

ben18785 added a commit that referenced this issue May 16, 2024

#169 now passing all tests for simulate_serosurvey

29981ec

ben18785 added a commit that referenced this issue May 16, 2024

#169 added tidyr calls to expand_grid

af6633b

ben18785 added a commit that referenced this issue May 16, 2024

#169 corrected partial argument match

8bb65e5

ben18785 added a commit that referenced this issue May 16, 2024

#169 added validation error for simulation

ef6b796

ben18785 linked a pull request Jul 19, 2024 that will close this issue

Simulation functions and vignette plus other things #199

Merged

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 move probability functionality

7a59136

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 corrections to exposure matrix and simulating time models first …

d79ca43

…draft

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 chatgpt gave a useful tip about refactorising

96ac735

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 removed redundant functions

f7466d6

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 added all tests for simulate_serosurvey_time_model

b40f33e

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 started to include docstrings

63f6795

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 added further docstrings and more tests incl seroreversion

06b4283

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 first draft of time-varying foi model with comprehensive unit tests

b401f1d

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 started to add in age-varying simulation but there's an error wi…

1de345f

…th the age-varying probs

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 tests corrected for probability exact calcs

48da875

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 first draft of age-varying simulation model with full tests

154aa73

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 added master simulate function. Bug for sample_size_by_individua…

360a9f5

…l_age_random needs sorting

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 corrected probability_exact_time_varying function (which now doe…

d9cab53

…s not use an exposure matrix)

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 removed redundant age-varying probability calculator

a64b8b7

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 added validation functions for serosurveys

fac653f

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 corrected mistake which meant (previously) age and time model se…

57c1442

…ropositivities were the same

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 working draft of age and time model

e73032f

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 now passing all tests for simulate_serosurvey

64542e8

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 added tidyr calls to expand_grid

b3fdf15

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 corrected partial argument match

ce42933

ntorresd pushed a commit that referenced this issue Jul 31, 2024

#169 added validation error for simulation

de14b10

ntorresd mentioned this issue Jul 31, 2024

Simulation functions and vignette plus other things #199

Merged

ntorresd closed this as completed Jul 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changes to `generate_sim_data` #169

Changes to `generate_sim_data` #169

ben18785 commented Mar 19, 2024 •

edited

Loading

ntorresd commented Mar 21, 2024

ben18785 commented May 15, 2024 •

edited

Loading

Changes to generate_sim_data #169

Changes to generate_sim_data #169

Comments

ben18785 commented Mar 19, 2024 • edited Loading

ntorresd commented Mar 21, 2024

ben18785 commented May 15, 2024 • edited Loading

Changes to `generate_sim_data` #169

Changes to `generate_sim_data` #169

ben18785 commented Mar 19, 2024 •

edited

Loading

ben18785 commented May 15, 2024 •

edited

Loading