Write a flowchart of computation #61

swo · 2024-11-25T15:21:49Z

Inputs (and their types)
Where do validation happen
Which scripts are involved, when
Add validations where needed (cf Data refactor #65 (comment))

Cf. https://github.com/cdcent/fall-virus-model?tab=readme-ov-file#file-and-script-organization for an example

Fuhan-Yang · 2024-12-17T21:04:00Z

flowchart TB

raw_data[raw data.parquet] 
class raw_data dataNode; 

cached_data[.cache/nisapi/clean]
class cached_data dataNode;

clean_data[clean data as lazy data frame]
class clean_data dataNode;

cached_all_datasets.py
class cached_all_datasets.py funcNode;

get_nis.py
class get_nis.py funcNode;

preprocess.py
class preprocess.py funcNode;

ready_data[clean data as data frame]
class ready_data dataNode;

data_validate1[validate schema]
class data_validate1 validateNode;

data_validate2[validate schema]
class data_validate2 validateNode;

pred_validate[validate schema and quantile]
class pred_validate validateNode;

config([config])
class config configNode;

train_data[train data as IncidentUptakeData]
class train_data objectNode;

incident_predict[incident prediction as PointForecast]
class incident_predict objectNode;

test_data[test data as IncidentUptakeData] 
class test_data objectNode;

eval[eval.py: mspe, bias, end-of-season uptake error]
class eval funcNode;

model[model.py:LinearIncidentUptakeModel]
class model funcNode;

subgraph NIS-API
raw_data --> cached_all_datasets.py --> cached_data --> get_nis.py --> clean_data
end

clean_data --> preprocess.py -->  ready_data

subgraph main1.py["main.py 'projection' "]
ready_data --> train_data --> model --> incident_predict
data_validate1 --> train_data
pred_validate --> incident_predict
end

config --> main1.py 

subgraph main2.py["main.py 'evaluation' "]
incident_predict --> eval
ready_data --> test_data --> eval
data_validate2 --> test_data
end

config --> main2.py

classDef dataNode fill:#00cfcc;
classDef funcNode fill:#42b3f5;
classDef configNode fill:#f58742;
classDef objectNode fill:#bac5e8;
classDef validateNode fill:#b00afc;

swo · 2024-12-17T21:33:28Z

The big question is whether you want a single main.py to do everything, or if you want to break this up into different pieces. I guess it's fine to start with everything all in once place, and then start splitting it up, as you find that it slows you down, to need to re-run everything. Do you have a sense of how long it takes main.py to run now?
You can drop everything that's inside of nisapi; you can just start from nisapi.get_nis()
I find it hard to read some of these colors:

eschrom · 2024-12-17T23:13:50Z

My vote is to keep the entire pipeline in main.py for the time being. Before PR #82, main.py takes < 1 second. That said, several upcoming changes will cause this to increase:

Replacing scikit-learn models that generate point estimates with numpyro models that generate full posteriors
Using multiple models rather than just the one
Refitting the model each time the forecast date changes, to include all available data, as discussed in PR Add evaluation to configuration #82

If/when these changes make the run-time of main.py intolerably long, that's when I suggest we switch to a Makefile to compartmentalize the steps. But personally, for the sake of scientific progress, I'd prefer to focus on these three items themselves before refactoring our scripts.

eschrom · 2024-12-17T23:18:40Z

As for the flow diagram, I agree with @swo that the granular steps inside NIS_API can be left out, and that the low contrast between font color and compartment color is hard to read. Additionally, removing the "projection" vs. "evaluation" to make [model fitting -> making projections -> evaluation metrics] a single linear process, as discussed in PR #82, will change the structure of the flow diagram a bit.

But I think this is a very good start, and I like your color-coding of data objects vs. key operations vs. validation vs. etc. Maybe you can teach me how to make these flowcharts sometime?

swo · 2024-12-20T14:57:22Z

My vote is to keep the entire pipeline in main.py for the time being. Before PR #82, main.py takes < 1 second

After our discussion, I agree. It's OK to keep everything in one script for now, if it's fast. Once it starts slowing down, think harder about breaking up computational actions into functions, splitting them across scripts, and using a Makefile to manage the pipeline.

cherz4 · 2024-12-20T16:12:16Z

Lurking here.... in R the targets package (and drake before that) helped with pipeline running. For my own understanding is there something similar to that in Python, or is the Makefile approach a make way to go?

swo · 2024-12-20T17:11:56Z

Lurking here.... in R the targets package (and drake before that) helped with pipeline running. For my own understanding is there something similar to that in Python, or is the Makefile approach a make way to go?

target is unusual in that it's R code running R functions, which introduces a number of limitations (e.g., it makes parallelization or cloud computation really challenging).

Make runs command line commands (which, in this case, will be python scripts).

Python has Snakemake, which lets you write makefile-like "snakefiles" that can take advantage of full Python logic (e.g., read in a list of targets from a yaml), but it still ultimately issues command line commands, so you could eg use a Snakefile to run an R pipeline (if you had R scripts that accepted command line arguments).

cherz4 · 2024-12-20T22:17:51Z

@swo Thanks! that is cool to learn about. I haven't heard about Snakemake!

swo · 2024-12-27T17:12:44Z

See also https://teams.microsoft.com/l/message/19:d6a0b1ac75474c15ba1ed525ce05a963@thread.tacv2/1734462023465?tenantId=9ce70869-60db-44fd-abe8-d2767077fc8f&groupId=4f34eba7-8fcb-4a26-9b34-a434ea777f0c&parentMessageId=1734462023465&teamName=CFA-Predict&channelName=Topic%20-%20Dev&createdTime=1734462023465 for a discussion about similar tools; it seems like Nextflow and Metaflow are common alternatives

swo · 2025-01-27T15:32:26Z

Can keep this pretty high level: data comes from NIS, gets pre-processed, etc., where the "model" is, etc.

Fuhan-Yang · 2025-01-27T20:31:49Z

Please see the update!

flowchart TB

nis_data(nis_raw.parquet)
forecast(forecasts.parquet)
score(scores.parquet)
config{{config.yaml}}

subgraph preprocess.py
NIS-API  -->|data clean| nis_data
end

subgraph forecast.py
nis_data -->|fit| iup.models -->|predict| forecast
end

subgraph eval.py
forecast --> iup.eval --> score
end

config --> preprocess.py
config --> forecast.py
config --> eval.py


style nis_data fill:#00cfcc
style forecast fill:#00cfcc
style score fill:#00cfcc
style config fill:#f58742;

eschrom · 2025-01-27T22:03:41Z

This looks really nice! The only thing I would change is to put fit with predict, since both of those happen during forecast.py using iup.models. And the white text on top of the cyan boxes is tough to read, if your screen defaults to a black background - perhaps a more neutral blue would work better.

But otherwise, I think this is very clear and just the right level of detail!

Fuhan-Yang · 2025-01-27T23:04:32Z

Thanks for the suggestions! It's hard to adjust the position of fit on the line to put it in forecast, so I added split to train_data and similarly test_data. Also changed the color of iup to show that it is from the package. Let me know if the color still blurs the text!

swo · 2025-01-28T20:20:22Z

I think there's some pre-processing missing? Like, if we do do pre-processing, that should output a clean NIS dataset somewhere.
- But maybe we don't actually need to do any pre-processing, if we can just pull from the NIS API?
- I figures we would want to add "season", at least
I'm confused that the scripts are boxes that have files inside of them.
- One solution is to have every box be a file, either a script or a data file. That's like what we do in https://github.com/cdcent/fall-virus-model
- Another solution would be to have the script boxes contain functions, objects, etc. but not files

Fuhan-Yang · 2025-02-03T22:17:39Z

Sorry I don't know why my previous revision didn't go through, and I didn't receive new comments. Here is the update:

flowchart TB

nis_data(nis_raw.parquet)
forecast(forecasts.parquet)
scores(scores.parquet)
config{{config.yaml}}
proj_plot[projections.png]
score_plot[scores.png]

subgraph raw data
NIS
end

subgraph clean data with seasons
nis_data
end

subgraph prediction by seasons
forecast
end

subgraph evaluation scores
scores
end

subgraph prediction plots
proj_plot
end

subgraph score plots
score_plot
end

NIS -->preprocess.py --> nis_data
nis_data --> forecast.py --> forecast
forecast --> eval.py --> scores
scores --> score_plot.py --> score_plot
nis_data --> proj_plot.py --> proj_plot
forecast --> proj_plot.py

config --> preprocess.py
config --> forecast.py
config --> eval.py
config --> proj_plot.py


style nis_data fill:#7f00ff
style forecast fill:#7f00ff
style scores fill:#7f00ff
style config fill:#f58742
style preprocess.py fill:#0080ff
style forecast.py fill:#0080ff
style eval.py fill:#0080ff
style proj_plot.py fill:#0080ff
style score_plot.py fill:#0080ff
style proj_plot fill:#ff6666
style score_plot fill: #ff6666

eschrom · 2025-02-04T14:13:49Z

I like this! The flow of information, the level of detail, the color-coding - all of it looks great!

swo · 2025-02-04T15:23:51Z

Seems sensible to me!

Also, re: how to track versions and comments on versions, I think that's a sign that we should have opened a PR about this, and iterated in that forum.

This could go into the README for now

swo mentioned this issue Dec 4, 2024

Data refactor #65

Merged

swo assigned Fuhan-Yang Jan 27, 2025

swo mentioned this issue Feb 5, 2025

Plot projections and eval scores #115

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Write a flowchart of computation #61

Write a flowchart of computation #61

swo commented Nov 25, 2024 •

edited

Loading

Fuhan-Yang commented Dec 17, 2024 •

edited

Loading

swo commented Dec 17, 2024

eschrom commented Dec 17, 2024

eschrom commented Dec 17, 2024

swo commented Dec 20, 2024

cherz4 commented Dec 20, 2024

swo commented Dec 20, 2024

cherz4 commented Dec 20, 2024

swo commented Dec 27, 2024

swo commented Jan 27, 2025

Fuhan-Yang commented Jan 27, 2025

eschrom commented Jan 27, 2025

Fuhan-Yang commented Jan 27, 2025

swo commented Jan 28, 2025

Fuhan-Yang commented Feb 3, 2025

eschrom commented Feb 4, 2025

swo commented Feb 4, 2025

Write a flowchart of computation #61

Write a flowchart of computation #61

Comments

swo commented Nov 25, 2024 • edited Loading

Fuhan-Yang commented Dec 17, 2024 • edited Loading

swo commented Dec 17, 2024

eschrom commented Dec 17, 2024

eschrom commented Dec 17, 2024

swo commented Dec 20, 2024

cherz4 commented Dec 20, 2024

swo commented Dec 20, 2024

cherz4 commented Dec 20, 2024

swo commented Dec 27, 2024

swo commented Jan 27, 2025

Fuhan-Yang commented Jan 27, 2025

eschrom commented Jan 27, 2025

Fuhan-Yang commented Jan 27, 2025

swo commented Jan 28, 2025

Fuhan-Yang commented Feb 3, 2025

eschrom commented Feb 4, 2025

swo commented Feb 4, 2025

swo commented Nov 25, 2024 •

edited

Loading

Fuhan-Yang commented Dec 17, 2024 •

edited

Loading