Data-hotfixing without master branch commits #10
Comments
Not sure what this is about.
When the model runs in production, it often fails for some states because of artifacts in the data (#8). https://github.com/rtcovidlive/covid-model/blob/master/covid/data.py#L27:L81 These hotfixes are commits to master that bypass the code-review process, and they are one reason why production runs the latest master. To decouple production from the latest master (#14), we should find a way to do the manual outlier correction without changing the code on the master branch all the time. One option would be to read the corrections from a GitHub Gist CSV file or a Google Sheet.
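A minimal sketch of the external-corrections idea: fetch a small corrections table (e.g. from a Gist raw URL or a Google Sheet CSV export) and overwrite the affected rows. The column names and values here are illustrative assumptions, not the project's actual schema.

```python
import io

import pandas as pd

# Hypothetical corrections table; in practice this would come from
# pd.read_csv(<gist raw URL or sheet export URL>).
corrections_csv = """region,date,new_cases
US_MT,2020-05-20,12
US_OK,2020-04-01,80
"""

def apply_corrections(df, corrections):
    """Overwrite raw case counts with manually corrected values."""
    for row in corrections.itertuples():
        mask = (df["region"] == row.region) & (df["date"] == row.date)
        df.loc[mask, "new_cases"] = row.new_cases
    return df

raw = pd.DataFrame({
    "region": ["US_MT", "US_MT", "US_OK"],
    "date": ["2020-05-19", "2020-05-20", "2020-04-01"],
    "new_cases": [5, 999, 75],  # 999 is a data artifact to be hotfixed
})
corrections = pd.read_csv(io.StringIO(corrections_csv))
fixed = apply_corrections(raw, corrections)
```

With this shape, adding a correction is a one-row edit to the external file rather than a commit to master.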
I like the gist CSV idea; something like
Looks like edit permissions on gists can't be shared. Including the date each correction was made would be good for realistic backtesting.
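One way a correction date could support realistic backtesting: stamp each row with when it was entered, and have a backtest run only replay the corrections that existed as of its run date. The `corrected_on` column and schema below are assumptions for illustration.

```python
import io

import pandas as pd

# Each correction records when it was made, so a backtest dated `as_of`
# can ignore corrections that were entered later. Hypothetical schema.
corrections_csv = """region,date,new_cases,corrected_on
US_MT,2020-05-20,12,2020-05-21
US_OK,2020-04-01,80,2020-06-15
"""

def corrections_as_of(corrections, as_of):
    """Return only the corrections that had been entered by `as_of`."""
    entered = pd.to_datetime(corrections["corrected_on"])
    return corrections[entered <= pd.Timestamp(as_of)]

corrections = pd.read_csv(io.StringIO(corrections_csv))
visible = corrections_as_of(corrections, "2020-05-31")
```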
I'm going to have to think about this a little, because sometimes the corrections are as simple as a line of code, but not so simple if moved to a CSV. For instance:

I'm wondering if we're trying to fix something we don't actually need to fix. I think we can sort out "which version runs in production" by simply tagging a production release. Hotfixes happen, and I don't know that they're necessarily "bad". Also, if we move this to a CSV, it can become quite brittle: we'd have no history of the changes to the CSV, and if anyone else is using this model then their results could change without warning, which seems problematic. An additional reason to keep it in code is that I think changes to the data going forward may be algorithmic rather than scalar. For instance, I posted the following in the slack channel today:
There's no way changes like this will fit into a CSV :) I'd recommend making data changes transparent, in code, and part of the git history for these reasons. Additionally, we may actually want a per-region sanitizer file (e.g. US_OK.py, US_MT.py) where we can more clearly/cleanly show exactly what we're doing to each state's data.
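A per-region sanitizer file could be as small as a module exposing one `clean` function. The file layout, signature, and the smoothing step below are illustrative assumptions, showing the kind of algorithmic (rather than scalar) fix a CSV couldn't express.

```python
import pandas as pd

# Hypothetical covid/sanitizers/US_OK.py: one file per region, each
# exposing clean(df) -> df for that region's data quirks.
def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Example algorithmic fix: a backlogged dump of cases on one day is
    # smoothed over its neighbors instead of being overwritten by hand.
    df["new_cases"] = (
        df["new_cases"].rolling(3, min_periods=1, center=True).mean()
    )
    return df

raw = pd.DataFrame({"new_cases": [5.0, 5.0, 95.0, 5.0, 5.0]})
smoothed = clean(raw)
```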
Hi! I'd like to contribute to this issue if possible!
Referring to this, perhaps you have one file that contains separate cleaning functions (one per state). In addition, we could use a dictionary mapping state identifiers (e.g. "US_MI") to the appropriate cleaning function, and an overarching function that iteratively calls the clean function for each state. I don't think creating a per-region file is bad (rather, I prefer each region knowing how to clean itself), but I feel it could cause clutter, even if we were to put those region-specific files in their own submodule! EDIT: Perhaps something like this, and we call all of that in
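The dictionary-dispatch idea above could look like this. The function names, registry name, and sample fixes are hypothetical; the point is the mapping from state identifier to cleaner plus one overarching loop.

```python
import pandas as pd

def clean_us_mi(df):
    # Illustrative: drop negative daily counts reported for Michigan.
    return df[df["new_cases"] >= 0]

def clean_us_ok(df):
    # Illustrative: cap an obvious single-day reporting artifact.
    return df.assign(new_cases=df["new_cases"].clip(upper=500))

# Hypothetical registry mapping state identifiers to cleaners.
CLEANERS = {
    "US_MI": clean_us_mi,
    "US_OK": clean_us_ok,
}

def clean_all(frames):
    """Apply each state's cleaner; states without one pass through."""
    return {region: CLEANERS.get(region, lambda d: d)(df)
            for region, df in frames.items()}

frames = {
    "US_MI": pd.DataFrame({"new_cases": [3, -1, 4]}),
    "US_TX": pd.DataFrame({"new_cases": [7]}),
}
cleaned = clean_all(frames)
```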
Not to double post, but I'm noticing that some PRs mentioned adding . If we take this a bit further and put all the data-processing code in its own repository & Python package, then we wouldn't need to update this repo every time we add more cleaning/processing steps. We could also add a function to get the Loaders dict, and then copy key/value pairs here (allowing cloners of the repo to still add their own functions from third-party modules if they choose to). One issue I see with this approach is that cloners have less visibility over the data-processing steps and would have to go to another repo to see what's being done to the data (our documentation would have to be clear about how to find this information). On the other hand, people who are building their own models but want to use our data-cleaning steps could then import that package alone.
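The external-package idea might be wired up as below. The package name, `get_loaders` function, and loader signatures are all assumptions; the point is that this repo copies the shared loaders into its own registry, which cloners can still extend locally.

```python
# --- what a hypothetical shared data package might export ---
def load_us(run_date):
    """Placeholder for the package's real US data loader."""
    return f"US data as of {run_date}"

def get_loaders():
    """Return a mapping of region-group name -> loader function."""
    return {"US": load_us}

# --- in this repo ---
LOADERS = {}
LOADERS.update(get_loaders())  # copy in the shared loaders

def load_local(run_date):
    """A cloner's own loader from a third-party module."""
    return f"local data as of {run_date}"

LOADERS["LOCAL"] = load_local  # cloners can still register their own

result = LOADERS["US"]("2020-06-01")
```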