Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write a flowchart of computation #61

Open
swo opened this issue Nov 25, 2024 · 17 comments
Open

Write a flowchart of computation #61

swo opened this issue Nov 25, 2024 · 17 comments
Assignees

Comments

@swo
Copy link
Collaborator

swo commented Nov 25, 2024

  • Inputs (and their types)
  • Where do validation happen
  • Which scripts are involved, when
  • Add validations where needed (cf Data refactor #65 (comment))

Cf. https://github.com/cdcent/fall-virus-model?tab=readme-ov-file#file-and-script-organization for an example

@swo swo mentioned this issue Dec 4, 2024
@Fuhan-Yang
Copy link
Contributor

Fuhan-Yang commented Dec 17, 2024

flowchart TB

raw_data[raw data.parquet] 
class raw_data dataNode; 

cached_data[.cache/nisapi/clean]
class cached_data dataNode;

clean_data[clean data as lazy data frame]
class clean_data dataNode;

cached_all_datasets.py
class cached_all_datasets.py funcNode;

get_nis.py
class get_nis.py funcNode;

preprocess.py
class preprocess.py funcNode;

ready_data[clean data as data frame]
class ready_data dataNode;

data_validate1[validate schema]
class data_validate1 validateNode;

data_validate2[validate schema]
class data_validate2 validateNode;

pred_validate[validate schema and quantile]
class pred_validate validateNode;

config([config])
class config configNode;

train_data[train data as IncidentUptakeData]
class train_data objectNode;

incident_predict[incident prediction as PointForecast]
class incident_predict objectNode;

test_data[test data as IncidentUptakeData] 
class test_data objectNode;

eval[eval.py: mspe, bias, end-of-season uptake error]
class eval funcNode;

model[model.py:LinearIncidentUptakeModel]
class model funcNode;

subgraph NIS-API
raw_data --> cached_all_datasets.py --> cached_data --> get_nis.py --> clean_data
end

clean_data --> preprocess.py -->  ready_data

subgraph main1.py["main.py 'projection' "]
ready_data --> train_data --> model --> incident_predict
data_validate1 --> train_data
pred_validate --> incident_predict
end

config --> main1.py 

subgraph main2.py["main.py 'evaluation' "]
incident_predict --> eval
ready_data --> test_data --> eval
data_validate2 --> test_data
end

config --> main2.py

classDef dataNode fill:#00cfcc;
classDef funcNode fill:#42b3f5;
classDef configNode fill:#f58742;
classDef objectNode fill:#bac5e8;
classDef validateNode fill:#b00afc;





Loading

@swo
Copy link
Collaborator Author

swo commented Dec 17, 2024

  • The big question is whether you want a single main.py to do everything, or if you want to break this up into different pieces. I guess it's fine to start with everything all in once place, and then start splitting it up, as you find that it slows you down, to need to re-run everything. Do you have a sense of how long it takes main.py to run now?
  • You can drop everything that's inside of nisapi; you can just start from nisapi.get_nis()
  • I find it hard to read some of these colors: image

@eschrom
Copy link
Collaborator

eschrom commented Dec 17, 2024

My vote is to keep the entire pipeline in main.py for the time being. Before PR #82, main.py takes < 1 second. That said, several upcoming changes will cause this to increase:

  • Replacing scikit-learn models that generate point estimates with numpyro models that generate full posteriors
  • Using multiple models rather than just the one
  • Refitting the model each time the forecast date changes, to include all available data, as discussed in PR Add evaluation to configuration #82

If/when these changes make the run-time of main.py intolerably long, that's when I suggest we switch to a Makefile to compartmentalize the steps. But personally, for the sake of scientific progress, I'd prefer to focus on these three items themselves before refactoring our scripts.

@eschrom
Copy link
Collaborator

eschrom commented Dec 17, 2024

As for the flow diagram, I agree with @swo that the granular steps inside NIS_API can be left out, and that the low contrast between font color and compartment color is hard to read. Additionally, removing the "projection" vs. "evaluation" to make [model fitting -> making projections -> evaluation metrics] a single linear process, as discussed in PR #82, will change the structure of the flow diagram a bit.

But I think this is a very good start, and I like your color-coding of data objects vs. key operations vs. validation vs. etc. Maybe you can teach me how to make these flowcharts sometime?

@swo
Copy link
Collaborator Author

swo commented Dec 20, 2024

My vote is to keep the entire pipeline in main.py for the time being. Before PR #82, main.py takes < 1 second

After our discussion, I agree. It's OK to keep everything in one script for now, if it's fast. Once it starts slowing down, think harder about breaking up computational actions into functions, splitting them across scripts, and using a Makefile to manage the pipeline.

@cherz4
Copy link

cherz4 commented Dec 20, 2024

Lurking here.... in R the targets package (and drake before that) helped with pipeline running. For my own understanding is there something similar to that in Python, or is the Makefile approach a make way to go?

@swo
Copy link
Collaborator Author

swo commented Dec 20, 2024

Lurking here.... in R the targets package (and drake before that) helped with pipeline running. For my own understanding is there something similar to that in Python, or is the Makefile approach a make way to go?

target is unusual in that it's R code running R functions, which introduces a number of limitations (e.g., it makes parallelization or cloud computation really challenging).

Make runs command line commands (which, in this case, will be python scripts).

Python has Snakemake, which lets you write makefile-like "snakefiles" that can take advantage of full Python logic (e.g., read in a list of targets from a yaml), but it still ultimately issues command line commands, so you could eg use a Snakefile to run an R pipeline (if you had R scripts that accepted command line arguments).

@cherz4
Copy link

cherz4 commented Dec 20, 2024

@swo Thanks! that is cool to learn about. I haven't heard about Snakemake!

@swo
Copy link
Collaborator Author

swo commented Jan 27, 2025

Can keep this pretty high level: data comes from NIS, gets pre-processed, etc., where the "model" is, etc.

@Fuhan-Yang
Copy link
Contributor

Please see the update!

flowchart TB

nis_data(nis_raw.parquet)
forecast(forecasts.parquet)
score(scores.parquet)
config{{config.yaml}}

subgraph preprocess.py
NIS-API  -->|data clean| nis_data
end

subgraph forecast.py
nis_data -->|fit| iup.models -->|predict| forecast
end

subgraph eval.py
forecast --> iup.eval --> score
end

config --> preprocess.py
config --> forecast.py
config --> eval.py


style nis_data fill:#00cfcc
style forecast fill:#00cfcc
style score fill:#00cfcc
style config fill:#f58742;

Loading

@eschrom
Copy link
Collaborator

eschrom commented Jan 27, 2025

This looks really nice! The only thing I would change is to put fit with predict, since both of those happen during forecast.py using iup.models. And the white text on top of the cyan boxes is tough to read, if your screen defaults to a black background - perhaps a more neutral blue would work better.

But otherwise, I think this is very clear and just the right level of detail!

@Fuhan-Yang
Copy link
Contributor

Thanks for the suggestions! It's hard to adjust the position of fit on the line to put it in forecast, so I added split to train_data and similarly test_data. Also changed the color of iup to show that it is from the package. Let me know if the color still blurs the text!

@swo
Copy link
Collaborator Author

swo commented Jan 28, 2025

  • I think there's some pre-processing missing? Like, if we do do pre-processing, that should output a clean NIS dataset somewhere.
    • But maybe we don't actually need to do any pre-processing, if we can just pull from the NIS API?
    • I figures we would want to add "season", at least
  • I'm confused that the scripts are boxes that have files inside of them.
    • One solution is to have every box be a file, either a script or a data file. That's like what we do in https://github.com/cdcent/fall-virus-model
    • Another solution would be to have the script boxes contain functions, objects, etc. but not files

@Fuhan-Yang
Copy link
Contributor

Sorry I don't know why my previous revision didn't go through, and I didn't receive new comments. Here is the update:

flowchart TB

nis_data(nis_raw.parquet)
forecast(forecasts.parquet)
scores(scores.parquet)
config{{config.yaml}}
proj_plot[projections.png]
score_plot[scores.png]

subgraph raw data
NIS
end

subgraph clean data with seasons
nis_data
end

subgraph prediction by seasons
forecast
end

subgraph evaluation scores
scores
end

subgraph prediction plots
proj_plot
end

subgraph score plots
score_plot
end

NIS -->preprocess.py --> nis_data
nis_data --> forecast.py --> forecast
forecast --> eval.py --> scores
scores --> score_plot.py --> score_plot
nis_data --> proj_plot.py --> proj_plot
forecast --> proj_plot.py

config --> preprocess.py
config --> forecast.py
config --> eval.py
config --> proj_plot.py


style nis_data fill:#7f00ff
style forecast fill:#7f00ff
style scores fill:#7f00ff
style config fill:#f58742
style preprocess.py fill:#0080ff
style forecast.py fill:#0080ff
style eval.py fill:#0080ff
style proj_plot.py fill:#0080ff
style score_plot.py fill:#0080ff
style proj_plot fill:#ff6666
style score_plot fill: #ff6666


Loading

@eschrom
Copy link
Collaborator

eschrom commented Feb 4, 2025

I like this! The flow of information, the level of detail, the color-coding - all of it looks great!

@swo
Copy link
Collaborator Author

swo commented Feb 4, 2025

Seems sensible to me!

Also, re: how to track versions and comments on versions, I think that's a sign that we should have opened a PR about this, and iterated in that forum.

This could go into the README for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants