Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Package structure #127

Open
Karim-Mane opened this issue May 31, 2024 · 5 comments
Open

Package structure #127

Karim-Mane opened this issue May 31, 2024 · 5 comments

Comments

@Karim-Mane
Copy link
Member

Looking at sim_linelist() , I think it might be worth considering the following format:

sim_linelist() |>
  add_date_contact() |>
  add_age() |>
  add_sex() |>
  add_hospitalisation() |>
  add_outcome() |>
  add_names()

add_age() and add_sex() could be combined into the same function like add_socio_demography(). This will make it easy to add more socio-demographics later on.

The advantage of this approach will be:

  1. reduce the number of arguments in sim_linelist() and other functions
  2. ability to choose columns adds flexibility in that only columns that are needed will be generated
  3. easy maintenance for the future when adding new columns and/or more socio_demographic info for example. But also when updating/removing some features .
@joshwlambert
Copy link
Member

Thanks for raising @Karim-Mane and giving your reasoning on why this change would be useful.

I'm personally not in favour of making this change as I believe it will increase the complexity of the package, but not provide extra functionality. The user will be required to carry out more steps in construct a line list, which currently can be done with a single function sim_linelist().

On some the advantages mentioned:

reduce the number of arguments in sim_linelist() and other functions

This statement is true, but the sim_linelist() function in the new pipeline example is no longer simulating a line list but rather simulation a transmission process (i.e. branching process), so would need to be renamed. Also, the number of arguments would be decreased in the sim_linelist() function, but the number of arguments required for the entire pipeline would be the same. So in essence it is just distributing the arguments.

ability to choose columns adds flexibility in that only columns that are needed will be generated

This has been discussed previously (see #41) and it was decided that the output of sim_linelist() should consistently produce the same number of columns. This is so that the function can be predictably built on in pipelines where the dimensionality of the output is known.

This pipeline approach to building a line list was also mentioned in the GitHub discussion that started {simulist}. @jamesmbaazam stated on that discussion that going for single function calls would be preferable to chaining functions together with pipes.

I see three viable development paths:

  1. Keep the current implementation and specify in the design-principles.Rmd vignette why we do not offer the pipeable functions.
  2. Keep the simulation function as they are now, but also export add_cols functions (e.g. add_hospitalisation(), add_outcome()) to allow users to build custom line lists. This has clear benefits on flexibility, but has downsides in package complexity and maintenance burden.
  3. Move completely over the pipeable structure outlined by @Karim-Mane and change sim_*() functions to be the transmission process.

I'm personally in favour of option 1 or 2.

A last point to consider. The example pipeline outlined above is for sim_linelist() to produce a line list, we would also need equivalent pipelines and thus exported functions for sim_outbreak() and sim_contacts().

Depending on what option is selected it may influence the ability to use other transmission models in future development (see #34 & #43).

Tagging @adamkucharski, @sbfnk, @CarmenTamayo, @avallecam, @Degoot-AM to get their opinions on this from an epidemiological user perspective first to see what is desirable functionality and UX/UI. Then if we decide to make a change from the current package structure I will tag developers to discuss how to implement such a change.

@Degoot-AM
Copy link

I favour option 2 ...!

@avallecam
Copy link
Member

From a user perspective, I see the most immediate benefit in path number 1, to keep current implementation, make that explicit in the design document, and try not to affect package complexity and maintenance burden.

Additionally, I would say that if a user requires fewer columns, it should be feasible to drop the columns instead. (using dplyr::select() or other). My interpretation is that current arguments in sim_linelist() try to depict a generalizable natural history of a disease, with a minimal set of parameters, which is a great contribution.

Another feature wanted to consider for this evaluation was the possibility of generating simulations in different country settings: different population pyramid features from {epiCo}, population contact matrix from {socialmixr}, and the disease susceptibility profiles as we do in {finalsize}, and similarly with hospitalization. The vignettes in age-structured and age-stratified hospitalization risk and deaths seem to cover these features. I have not explored them yet, but the package may already have some infrastructure to include additional features that describe the population.

@Karim-Mane
Copy link
Member Author

My suggestions would have resulted in updating the .sim_internal() function. The changes will include:

  1. limiting the function at line 83 of the sim_internal.R file,
  2. export the rest of the functions that will be called to complete the creation of the simulated data.

After reviewing @joshwlambert comments above, and taking other rounds of look into the function, the following argument seems to have more importance than the function structure:

the new pipeline example is no longer simulating a line list but rather simulation a transmission process (i.e. branching process)

This is because the patient and the disease outcome data are not accounted at this level, hence the function will need to be renamed.

For the output to reflect the definition of a linelist, I agree that it's better to leave the function implementation as is currently.

However, if more data (columns) will be added to the simulated linelist in the future, I believe it is worth restructuring the function by exporting some internal building blocks of the linelist. Otherwise, the complexity that my suggestion would have introduced on the package will reflect on the function complexity.

@joshwlambert
Copy link
Member

Thanks for the feedback @Karim-Mane. I agree with your points. I will leave this issue open as I think we can explore adding functions to iteratively build line lists with pipeable functions in a future version of {simulist}, but I will not focus on this until other higher priority features have been included.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants