Package structure #127

Karim-Mane · 2024-05-31T13:18:22Z

Looking at sim_linelist() , I think it might be worth considering the following format:

sim_linelist() |>
  add_date_contact() |>
  add_age() |>
  add_sex() |>
  add_hospitalisation() |>
  add_outcome() |>
  add_names()

add_age() and add_sex() could be combined into the same function like add_socio_demography(). This will make it easy to add more socio-demographics later on.

The advantage of this approach will be:

reduce the number of arguments in sim_linelist() and other functions
ability to choose columns adds flexibility in that only columns that are needed will be generated
easy maintenance for the future when adding new columns and/or more socio_demographic info for example. But also when updating/removing some features .

The text was updated successfully, but these errors were encountered:

joshwlambert · 2024-06-03T10:01:40Z

Thanks for raising @Karim-Mane and giving your reasoning on why this change would be useful.

I'm personally not in favour of making this change as I believe it will increase the complexity of the package, but not provide extra functionality. The user will be required to carry out more steps in construct a line list, which currently can be done with a single function sim_linelist().

On some the advantages mentioned:

reduce the number of arguments in sim_linelist() and other functions

This statement is true, but the sim_linelist() function in the new pipeline example is no longer simulating a line list but rather simulation a transmission process (i.e. branching process), so would need to be renamed. Also, the number of arguments would be decreased in the sim_linelist() function, but the number of arguments required for the entire pipeline would be the same. So in essence it is just distributing the arguments.

ability to choose columns adds flexibility in that only columns that are needed will be generated

This has been discussed previously (see #41) and it was decided that the output of sim_linelist() should consistently produce the same number of columns. This is so that the function can be predictably built on in pipelines where the dimensionality of the output is known.

This pipeline approach to building a line list was also mentioned in the GitHub discussion that started {simulist}. @jamesmbaazam stated on that discussion that going for single function calls would be preferable to chaining functions together with pipes.

I see three viable development paths:

Keep the current implementation and specify in the design-principles.Rmd vignette why we do not offer the pipeable functions.
Keep the simulation function as they are now, but also export add_cols functions (e.g. add_hospitalisation(), add_outcome()) to allow users to build custom line lists. This has clear benefits on flexibility, but has downsides in package complexity and maintenance burden.
Move completely over the pipeable structure outlined by @Karim-Mane and change sim_*() functions to be the transmission process.

I'm personally in favour of option 1 or 2.

A last point to consider. The example pipeline outlined above is for sim_linelist() to produce a line list, we would also need equivalent pipelines and thus exported functions for sim_outbreak() and sim_contacts().

Depending on what option is selected it may influence the ability to use other transmission models in future development (see #34 & #43).

Tagging @adamkucharski, @sbfnk, @CarmenTamayo, @avallecam, @Degoot-AM to get their opinions on this from an epidemiological user perspective first to see what is desirable functionality and UX/UI. Then if we decide to make a change from the current package structure I will tag developers to discuss how to implement such a change.

Degoot-AM · 2024-06-06T08:22:46Z

I favour option 2 ...!

avallecam · 2024-06-07T22:51:48Z

From a user perspective, I see the most immediate benefit in path number 1, to keep current implementation, make that explicit in the design document, and try not to affect package complexity and maintenance burden.

Additionally, I would say that if a user requires fewer columns, it should be feasible to drop the columns instead. (using dplyr::select() or other). My interpretation is that current arguments in sim_linelist() try to depict a generalizable natural history of a disease, with a minimal set of parameters, which is a great contribution.

Another feature wanted to consider for this evaluation was the possibility of generating simulations in different country settings: different population pyramid features from {epiCo}, population contact matrix from {socialmixr}, and the disease susceptibility profiles as we do in {finalsize}, and similarly with hospitalization. The vignettes in age-structured and age-stratified hospitalization risk and deaths seem to cover these features. I have not explored them yet, but the package may already have some infrastructure to include additional features that describe the population.

Karim-Mane · 2024-06-28T12:31:51Z

My suggestions would have resulted in updating the .sim_internal() function. The changes will include:

limiting the function at line 83 of the sim_internal.R file,
export the rest of the functions that will be called to complete the creation of the simulated data.

After reviewing @joshwlambert comments above, and taking other rounds of look into the function, the following argument seems to have more importance than the function structure:

the new pipeline example is no longer simulating a line list but rather simulation a transmission process (i.e. branching process)

This is because the patient and the disease outcome data are not accounted at this level, hence the function will need to be renamed.

For the output to reflect the definition of a linelist, I agree that it's better to leave the function implementation as is currently.

However, if more data (columns) will be added to the simulated linelist in the future, I believe it is worth restructuring the function by exporting some internal building blocks of the linelist. Otherwise, the complexity that my suggestion would have introduced on the package will reflect on the function complexity.

joshwlambert · 2024-07-01T12:23:50Z

Thanks for the feedback @Karim-Mane. I agree with your points. I will leave this issue open as I think we can explore adding functions to iteratively build line lists with pipeable functions in a future version of {simulist}, but I will not focus on this until other higher priority features have been included.

joshwlambert added the discussion label May 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Package structure #127

Package structure #127

Karim-Mane commented May 31, 2024

joshwlambert commented Jun 3, 2024

Degoot-AM commented Jun 6, 2024

avallecam commented Jun 7, 2024

Karim-Mane commented Jun 28, 2024

joshwlambert commented Jul 1, 2024

Package structure #127

Package structure #127

Comments

Karim-Mane commented May 31, 2024

joshwlambert commented Jun 3, 2024

Degoot-AM commented Jun 6, 2024

avallecam commented Jun 7, 2024

Karim-Mane commented Jun 28, 2024

joshwlambert commented Jul 1, 2024