
Data-hotfixing without master branch commits #10

Open
k-sys opened this issue Jul 3, 2020 · 7 comments

Comments

@k-sys

k-sys commented Jul 3, 2020

No description provided.

@twiecki
Contributor

twiecki commented Jul 3, 2020

Not sure what this is about.

@michaelosthege
Collaborator

When the model runs in production, it often fails for some states because of artifacts in the data (#8).
The approach so far is fixing these data points in the code:

https://github.com/rtcovidlive/covid-model/blob/master/covid/data.py#L27:L81

This hotfixing creates commits on master that bypass the code-review process, and it is one reason why the latest master runs in production.

To uncouple production from latest master (#14) we should find a way to do the manual outlier correction without changing the code on the master branch all the time.

One option would be to read the corrections from a GitHub Gist CSV file or Google Sheet.

@mikeyk
Contributor

mikeyk commented Jul 3, 2020

I like the gist CSV idea; something like

region | start date | end date | override_value for the headers?
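To make the proposal concrete, here is a minimal sketch of what that corrections CSV and an override helper might look like. The file contents, the `US_LA` region code, and the `load_corrections`/`apply_overrides` helpers are assumptions for illustration, not existing code in this repo:

```python
import io

import pandas as pd

# Hypothetical corrections table using the proposed headers.
CSV_TEXT = """region,start_date,end_date,override_value
US_LA,2020-06-19,2020-06-21,1666
"""

def load_corrections(csv_text: str) -> pd.DataFrame:
    """Parse the corrections table, coercing the date columns."""
    return pd.read_csv(io.StringIO(csv_text), parse_dates=["start_date", "end_date"])

def apply_overrides(data: pd.Series, corrections: pd.DataFrame, region: str) -> pd.Series:
    """Overwrite values of a date-indexed series inside each correction window."""
    data = data.copy()
    for _, row in corrections[corrections["region"] == region].iterrows():
        # Label-based slicing on a DatetimeIndex is inclusive of both endpoints.
        data.loc[row["start_date"] : row["end_date"]] = row["override_value"]
    return data
```

The gist (or wiki/sheet) would only need to hold the CSV text; the loading code stays under version control.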

@michaelosthege
Collaborator

> I like the gist CSV idea; something like
>
> region | start date | end date | override_value for the headers?

Looks like edit permissions on gists can't be shared.
A wiki page?

Including the date of making the correction would be good for realistic backtesting.
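For backtesting, that correction date could be an extra column that gets filtered on the simulated "as of" date. The `corrected_on` column name and the helper below are assumptions, sketching the idea rather than existing code:

```python
import io

import pandas as pd

# Hypothetical corrections table extended with the date each fix was made.
CSV_TEXT = """region,start_date,end_date,override_value,corrected_on
US_LA,2020-06-19,2020-06-21,1666,2020-06-22
US_NY,2020-05-01,2020-05-02,900,2020-07-01
"""

def corrections_as_of(corrections: pd.DataFrame, as_of) -> pd.DataFrame:
    """Keep only corrections that had already been made on the given date,
    so a backtest sees the data exactly as it looked at the time."""
    return corrections[corrections["corrected_on"] <= pd.Timestamp(as_of)]
```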

@k-sys
Author

k-sys commented Jul 6, 2020

I'm going to have to think about this a little, because sometimes the corrections are as simple as a line of code, but not so simple once moved to a CSV. For instance:
data.loc[idx["LA", pd.Timestamp("2020-06-19") :], :] += 1666
How would one express that in a CSV? (add 1666 to every value after 2020-06-19?) We could, of course, create our own DSL, but that seems like overkill too.
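For concreteness, here is roughly the kind of mini-DSL a corrections CSV would need to express that `+= 1666` example: an `operation` column distinguishing additive from override fixes, with an open-ended date range. The column names and the `apply_correction` helper are made up, which is exactly the home-grown-DSL territory the comment warns about:

```python
import io

import pandas as pd

# Hypothetical DSL row: add 1666 to every value from 2020-06-19 onward.
CSV_TEXT = """region,start_date,end_date,operation,value
LA,2020-06-19,,add,1666
"""

def apply_correction(data: pd.Series, row: pd.Series) -> pd.Series:
    """Interpret one corrections row against a date-indexed series."""
    data = data.copy()
    # A missing end_date (NaT after parsing) means "until the end of the series".
    end = None if pd.isna(row["end_date"]) else row["end_date"]
    window = slice(row["start_date"], end)
    if row["operation"] == "add":
        data.loc[window] += row["value"]
    elif row["operation"] == "set":
        data.loc[window] = row["value"]
    else:
        raise ValueError(f"unknown operation {row['operation']!r}")
    return data
```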

I'm wondering if we're trying to fix something we don't actually need to fix. I think we can sort out "which version runs in production" simply by tagging a production release. Hotfixes happen, and I don't know that they're necessarily "bad". Also, if we move this to a CSV it can become quite brittle: we'd have no history of the changes to the CSV, and anyone else using this model could see their results change without warning, which seems problematic.

An additional reason to keep this in code is that changes to the data going forward may be algorithmic rather than scalar. For instance, I posted the following in the Slack channel today:

import pandas as pd

def adjust_totals(model_data, threshold=10):
    """Redistribute spikes in the excess of `total` over `positive`
    across the preceding run of low-excess days."""
    excess = model_data.total - model_data.positive
    adjusted = pd.Series(0, index=model_data.index)
    zero_days = []
    for idx, row in model_data.iterrows():
        if excess[idx] <= threshold:
            # Too little excess today; hold this day back until a spike arrives.
            zero_days.append(idx)
        elif len(zero_days):
            # Spread today's excess evenly over the held-back days and today.
            share = excess[idx] / (len(zero_days) + 1)
            for day in zero_days:
                adjusted[day] = share
            adjusted[idx] = share
            zero_days = []
        else:
            adjusted[idx] = row["total"]
    return adjusted

There's no way changes like this will fit into a CSV :)

I'd recommend making data changes transparent, in code, and as part of the git history for these reasons. Additionally, we may actually want a per-region sanitizer file (e.g. US_OK.py, US_MT.py) where we can more clearly/cleanly show exactly what we're doing to the data of each state.
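A per-region sanitizer file might look like the sketch below. The module path `covid/sanitizers/US_OK.py`, the `sanitize` function name, and the correction itself are all hypothetical, invented only to illustrate the shape:

```python
# Hypothetical contents of covid/sanitizers/US_OK.py.
# Every manual correction for Oklahoma lives here, so each fix shows up
# in git history and code review for exactly this region.
import pandas as pd

def sanitize(data: pd.DataFrame) -> pd.DataFrame:
    """Apply all manual corrections for this region."""
    data = data.copy()
    # Illustrative (made-up) correction: a reporting backlog of 1666 cases
    # added to positives from 2020-06-19 onward.
    data.loc[pd.Timestamp("2020-06-19") :, "positive"] += 1666
    return data
```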

@raghavp96

raghavp96 commented Jul 15, 2020

Hi! I'd like to contribute to this issue if possible!

> I'd recommend making data changes transparent, in code, and as part of the git history for these reasons. Additionally, we may actually want a per-region sanitizer file (eg. US_OK.py, US_MT.py) where we can more clearly/cleanly show exactly what we're doing to the data of each state.

Referring to this: perhaps one file could contain separate cleaning functions (one per state). In addition, we could use a dictionary mapping state identifiers (e.g. "US_MI") to the appropriate cleaning function, with an overarching function that iteratively calls the clean function for each state. data_US.py would then call that top-level function.

I don't think creating a per-region file is bad (I actually prefer each region knowing how to clean itself), but I feel it could cause clutter, even if we put those region-specific files in their own submodule!

EDIT: Perhaps something like this, and we call all of that in data_US.py like this?
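The single-file-plus-registry variant described above might be sketched like this. The function names, the `CLEANERS` dict, and `clean_all` are assumptions for illustration, not the repo's actual API:

```python
import pandas as pd

def clean_US_MI(df: pd.DataFrame) -> pd.DataFrame:
    # Michigan-specific corrections would go here.
    return df

def clean_US_OK(df: pd.DataFrame) -> pd.DataFrame:
    # Oklahoma-specific corrections would go here.
    return df

# Map each state identifier to its cleaning function.
CLEANERS = {
    "US_MI": clean_US_MI,
    "US_OK": clean_US_OK,
}

def clean_all(frames: dict) -> dict:
    """Run the region-specific cleaner for every region; regions without
    a registered cleaner pass through unchanged."""
    return {
        region: CLEANERS.get(region, lambda df: df)(df)
        for region, df in frames.items()
    }
```

data_US.py would then call `clean_all` on its dict of per-state frames.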

@raghavp96

raghavp96 commented Jul 23, 2020

Not to double-post, but noticing that some PRs mention adding data_XX.py for other countries (Israel & Germany), it might be a good idea to package all the data-processing files into a submodule and then pre-register those countries' loaders in data.py.

If we take this a bit further and put all the data-processing code in its own repository & Python package, then we wouldn't need to update this repo every time we add more cleaning/processing steps. We could also expose a function that returns the Loaders dict, and then copy its key/value pairs here (still allowing cloners of the repo to add their own functions from third-party modules if they choose). One issue I see with this approach is that cloners have less visibility into the data-processing steps and would have to go to another repo to see what's being done to the data (our documentation would have to be clear about where to find this information). On the other hand, people who are building their own models but want to use our data-cleaning steps could then import that package alone.
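The loader-registry idea might be sketched as follows. The `LOADERS` dict, `register_loader`, `get_loaders`, and the external package name `covid_data` are all hypothetical:

```python
# Registry that both the shared package and third-party modules can add to.
LOADERS = {}

def register_loader(country_code, load_fn):
    """Let any module (ours or a cloner's) register a country loader."""
    LOADERS[country_code] = load_fn

def get_loaders():
    """Return a copy of the registry; this is the function other repos
    would import instead of depending on this repo's code directly."""
    return dict(LOADERS)

# In data.py one would then do something like:
#   LOADERS.update(covid_data.get_loaders())   # pull in the shared package
#   register_loader("XX", my_custom_loader)    # local, cloner-specific addition
```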
