This page is intended to provide teams with all the information they need to submit forecasts. We note that these instructions have been adapted from the COVID-19 Forecast Hub.
All forecasts should be submitted directly to the data-forecasts/ folder. Data in this directory should be added to the repository through a pull request so that automatic data validation checks are run.
These instructions provide detail about the data format as well as validation that you can do prior to this pull request. In addition, we describe metadata that each model should provide.
Table of Contents
- What is a forecast
- Gold standard data
- Data formatting
- Forecast file format
- Forecast data validation
- Weekly ensemble build
- Policy on late submissions
Models are asked to make specific quantitative forecasts about data that will be observed in the future. These forecasts are interpreted as "unconditional" predictions about the future. That is, they are not predictions only for a limited set of possible future scenarios in which a certain set of conditions (e.g. vaccination uptake is strong, or new social-distancing mandates are put in place) hold about the future -- rather, they should characterize uncertainty across all reasonable future scenarios. In practice, all forecasting models make some assumptions about how current trends in data may change and impact the forecasted outcome; some teams select a "most likely" scenario or combine predictions across multiple scenarios that may occur. Forecasts submitted to this repository will be evaluated against observed data.
We note that other modeling efforts, such as the COVID-19 Scenario Modeling Hub, have been launched to collect and aggregate model outputs from "scenario projection" models. These models create longer-term projections under a specific set of assumptions about how the main drivers of the pandemic (such as non-pharmaceutical intervention compliance, or vaccination uptake) may change over time.
This project treats hospitalization data reported from the HHS Protect system at HealthData.gov as "gold standard" data. We create processed versions of these data that are stored in this repository.
Details on how gold standard data are defined can be found in the data-truth folder README file.
The automatic checks in place for forecast files submitted to this repository validate both the filename and the file contents to ensure the file can be used in the visualization and ensemble forecasting.
Each subdirectory within the data-forecasts/ directory has the format
team-model
where team is the team name and model is the name of your model.
Both team and model should be less than 15 characters and not include hyphens or other special characters, with the exception of "_". The model name should be distinct from that of any other model in the project.
Note that teams that submitted forecasts during the 2021-2022 season should add new forecasts to the existing subdirectory, provided that the forecasts were generated using the same model. New teams or modeling groups that submit forecasts generated by a new model will need to add a subdirectory using the above conventions.
Within each subdirectory, there should be a metadata file, a license file (optional), and a set of forecasts.
The metadata file should have the following format
metadata-team-model.txt
and here is the structure of the metadata file. Note that returning teams should update the metadata file provided during the 2021-2022 season to document any changes that have been made to their model.
By default, forecasts are released under a CC-BY 4.0 license. If you would like to release your forecasts under a different license, please specify a standard license in the license field of your metadata file. Alternatively, if you wish to use a license that is not in the list of standard licenses, you may include a LICENSE.txt file in your model directory.
Each forecast file within the subdirectory should have the following format
YYYY-MM-DD-team-model.csv
where
- YYYY is the 4 digit year,
- MM is the 2 digit month,
- DD is the 2 digit day,
- team is the team name, and
- model is the name of your model.
The date YYYY-MM-DD is the forecast_date. For this project, the forecast_date should always be the Monday of the week the submission is due.
The team and model in the filename must match the team and model in the name of the directory that contains the file. Both team and model should be less than 15 characters, alpha-numeric and underscores only, with no spaces or hyphens.
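The naming rules above can be checked locally before opening a pull request. The following is a minimal sketch, not the hub's own validation code: the function name `check_forecast_filename` and the regular expression are assumptions derived from the rules as written here.

```python
import re
from datetime import date

# Pattern assumed from the naming rules above: YYYY-MM-DD-team-model.csv,
# with team and model limited to letters, digits, and underscores.
FILENAME_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2})-([A-Za-z0-9_]+)-([A-Za-z0-9_]+)\.csv$"
)


def check_forecast_filename(filename, directory):
    """Return a list of problems found; an empty list means none were detected."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return ["filename does not match YYYY-MM-DD-team-model.csv"]
    date_text, team, model = match.groups()
    try:
        forecast_date = date.fromisoformat(date_text)
    except ValueError:
        return [f"{date_text} is not a valid calendar date"]
    problems = []
    if forecast_date.weekday() != 0:  # Monday is 0
        problems.append(f"forecast_date {date_text} is not a Monday")
    for label, name in (("team", team), ("model", model)):
        if len(name) >= 15:
            problems.append(f"{label} must be shorter than 15 characters")
    if directory != f"{team}-{model}":
        problems.append("team-model in filename does not match the directory name")
    return problems
```

For example, `check_forecast_filename("2023-01-09-myteam-mymodel.csv", "myteam-mymodel")` returns an empty list, where `myteam` and `mymodel` are placeholder names.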
The file must be a comma-separated value (csv) file with the following columns (in any order):
forecast_date
target
target_end_date
location
type
quantile
value
No additional columns are allowed.
Each row in the file is either a point or quantile forecast for a location on a particular date for a particular target.
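As a rough local check of the column requirement, the header row can be compared against the required set. This sketch uses only the Python standard library; `check_columns` is a hypothetical helper and is not part of the hub's validation suite.

```python
import csv
import io

# Column names taken from the list above; order does not matter.
REQUIRED_COLUMNS = {
    "forecast_date", "target", "target_end_date",
    "location", "type", "quantile", "value",
}


def check_columns(csv_text):
    """Return a list of header problems for CSV text; empty if none were detected."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    problems = []
    missing = sorted(REQUIRED_COLUMNS - set(header))
    extra = sorted(set(header) - REQUIRED_COLUMNS)
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    if extra:
        problems.append("unexpected columns: " + ", ".join(extra))
    if len(header) != len(set(header)):
        problems.append("duplicate column names")
    return problems
```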
Values in the forecast_date
column must be a date in the format
YYYY-MM-DD
This is the Forecast Date for the submission and will always be a
Monday (previously also the forecast due date until 1/6/2023). forecast_date
should correspond and be redundant with the date
in the filename, and is included here by request from some analysts.
Values in the target column must be a character (string) and be one of the following specific targets:
- “N wk ahead inc flu hosp” where N is an integer from 1 to 4
For week-ahead forecasts, we will use the specification of epidemiological weeks (EWs) defined by the US CDC, which run Sunday through Saturday. There are standard software packages to convert from dates to epidemic weeks and vice versa, for example MMWRweek for R, and pymmwr and epiweeks for Python.
For week-ahead forecasts with forecast_date of Monday of EW12, a 1 week ahead forecast corresponds to EW12 and should have target_end_date of the Saturday of EW12.
This target is the number of new weekly hospitalizations predicted by the model during the week that is N weeks after forecast_date.
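Because forecast_date is always a Monday and epidemiological weeks end on Saturday, the EW12 example above implies that the 1 week ahead target_end_date falls 5 days after forecast_date, with each further week adding 7 days. A minimal sketch of that arithmetic follows; the function name is hypothetical.

```python
from datetime import date, timedelta


def target_end_date(forecast_date, weeks_ahead):
    """Saturday that ends the target week for an 'N wk ahead' forecast."""
    if forecast_date.weekday() != 0:  # Monday is 0
        raise ValueError("forecast_date must be a Monday")
    if weeks_ahead not in (1, 2, 3, 4):
        raise ValueError("weeks_ahead must be 1, 2, 3, or 4")
    # Saturday of the forecast week is 5 days after Monday;
    # each additional week ahead adds 7 days.
    return forecast_date + timedelta(days=5 + 7 * (weeks_ahead - 1))
```

With a forecast_date of 2023-01-09, this gives 2023-01-14 for the 1 week ahead target and 2023-02-04 for the 4 week ahead target.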
Values in the target_end_date column must be a date in the format
YYYY-MM-DD
This is the date for the forecast target. For “N wk ahead” targets, target_end_date will be the Saturday at the end of the forecasted week. As a reminder, the target_end_date is the end date of the week during which the admissions occur, not the date the admission is reported (see the data processing section for more details).
Values in the location column must be one of the “locations” in this FIPS numeric code file which includes numeric FIPS codes for U.S. states and selected jurisdictions (Washington DC, Puerto Rico, and the US Virgin Islands) as well as “US” for national forecasts.
Please note that FIPS codes should be written as character strings to preserve any leading zeroes.
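The leading-zero issue typically appears when a tool parses the location column as a number. The sketch below uses a made-up two-row excerpt and a hypothetical helper, `format_location`, to show one way of keeping codes as text with the Python standard library.

```python
import csv
import io

# Hypothetical two-row excerpt of a forecast file, for illustration only.
sample = io.StringIO(
    "location,value\n"
    "06,120\n"
    "US,4500\n"
)

# csv.DictReader returns every field as a string, so "06" keeps its leading zero.
rows = list(csv.DictReader(sample))
locations = [row["location"] for row in rows]


def format_location(code):
    """Return a location code as text, zero-padding integer codes to two digits."""
    if isinstance(code, int):
        return f"{code:02d}"
    return str(code)
```

If codes have already been read as integers, `format_location(6)` restores the two-digit form `"06"`, while `"US"` passes through unchanged.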
Values in the type column are either
- “point” or
- “quantile”.
This value indicates whether that row corresponds to a point forecast or a quantile forecast. Point forecasts may be used in visualization while quantile forecasts are used in visualization and in ensemble construction.
When point forecasts are not included, the median for every location-target pair will be interpreted as the point forecast.
Values in the quantile column are either “NA” (if type is “point”) or a quantile in the format
0.###
For quantile forecasts, this value indicates the quantile for the value in this row.
Teams must provide the following 23 quantiles:
0.010, 0.025, 0.050, 0.100, 0.150, 0.200, 0.250, 0.300, 0.350, 0.400, 0.450, 0.500, 0.550, 0.600, 0.650, 0.700, 0.750, 0.800, 0.850, 0.900, 0.950, 0.975, and 0.990
R: c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
Python: quantiles = np.append(np.append([0.01,0.025],np.arange(0.05,0.95+0.05,0.050)), [0.975,0.99])
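To confirm that a submission contains all 23 required levels, the list can also be built without NumPy and compared after rounding to three decimals. This is a sketch under the assumption that comparing at three decimals is acceptable; `missing_quantiles` is a hypothetical helper.

```python
# The 23 required quantile levels. The middle levels are built from integers
# (5, 10, ..., 95) to avoid accumulating floating-point error.
REQUIRED_QUANTILES = (
    [0.01, 0.025]
    + [round(step / 100, 3) for step in range(5, 100, 5)]
    + [0.975, 0.99]
)


def missing_quantiles(submitted):
    """Return required levels absent from `submitted`, compared at 3 decimals."""
    have = {round(float(q), 3) for q in submitted}
    return [q for q in REQUIRED_QUANTILES if q not in have]
```

Passing the quantile column for one location and target, as strings or floats, returns an empty list when every required level is present.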
Values in the value column are non-negative numbers indicating the “point” or “quantile” prediction for this row. For a “point” prediction, value is simply the value of that point prediction for the target and location associated with that row. For a “quantile” prediction, value is the inverse of the cumulative distribution function, evaluated at the quantile, for the target and location associated with that row. For example, the 0.025 and 0.975 quantiles for a given target and location should bound the central 95% of the forecasted values and correspond to the 95% prediction interval.
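Two properties follow from the description above: values must be non-negative, and, because a cumulative distribution function is non-decreasing, values should not decrease as the quantile level increases. The sketch below checks both and reads off a central prediction interval. The function names are hypothetical, and whether the hub's automatic checks enforce the ordering is not stated here.

```python
def check_quantile_values(pairs):
    """Check (quantile_level, value) pairs for one target and location."""
    problems = []
    ordered = sorted(pairs)
    for level, value in ordered:
        if value < 0:
            problems.append(f"negative value {value} at quantile {level}")
    # Compare each pair with the next-higher quantile level.
    for (lo_q, lo_v), (hi_q, hi_v) in zip(ordered, ordered[1:]):
        if hi_v < lo_v:
            problems.append(f"value decreases between quantiles {lo_q} and {hi_q}")
    return problems


def prediction_interval(pairs, coverage=0.95):
    """Return (lower, upper) values of the central interval with the given coverage."""
    by_level = {round(level, 3): value for level, value in pairs}
    tail = round((1 - coverage) / 2, 3)
    return by_level[tail], by_level[round(1 - tail, 3)]
```

With `coverage=0.95`, `prediction_interval` looks up the 0.025 and 0.975 quantiles, matching the 95% interval described above.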
To ensure proper data formatting, validation checks will be run automatically on pull requests that add new data to data-forecasts/.
When a pull request is submitted, the data are validated through GitHub Actions, which runs the tests present in the validations repository. The intent of these tests is to validate the requirements above. Please let us know if you are facing issues while running the tests.
Every Wednesday morning, we will generate the ensemble forecast using a single valid forecast from each team that submitted in the current week by the Tuesday 11pm ET deadline.
In order to ensure that forecasting is done in real-time, all forecasts are required to be submitted to this repository by 11pm ET on Tuesdays each week. We do not accept late forecasts.