Forecasts SARS-CoV-2

⚠️ WARNING: This is an alpha release. Output file format and address may change at any time

This repo forms the basis of our continually-updated modelling of SARS-CoV-2 variant frequencies. Broadly speaking, the moving pieces in this repo are:

Data ingest, which produces TSV files of sequence counts. See ./ingest/README for more details.
Variant modelling, which is detailed in this README. The models themselves are defined in the evofr repo.
The ./viz/ directory contains a web-app which visualises the latest model outputs. See ./viz/README for more details. Currently this web-app is available at nextstrain.github.io/forecasts-ncov/.

Automated pipeline

The automated pipeline runs daily based on a scheduled jobs and triggers from upstream data ingests. We use GitHub actions to schedule these jobs, often with one job triggering another upon completion.

Case counts are fetched from external data sources daily at 8 AM PST
Raw metadata/sequences are fetched and cleaned via nextstrain/ncov-ingest.
- See GISAID and open data workflows for their daily scheduled times
The nextstrain/ncov-ingest pipelines trigger the clade counts jobs once the latest curated data has been uploaded to S3
- The GISAID and open data ingest pipelines have different run times, so their clade counts jobs are triggered at different times.
Clade counts jobs trigger the model runs once the counts data has been uploaded to S3
Model results are uploaded to S3 as dated files where the date indicates the run date

Inputs

See available counts files for the input case counts and clade counts files.

Outputs

The model results for GISAID data are stored at s3://nextstrain-data/files/workflows/forecasts-ncov/gisaid. The model results for open (GenBank) data are stored at s3://nextstrain-data/files/workflows/forecasts-ncov/open.

The latest results are stored as latest_results.json and previously uploaded results can be found as <YYYY-MM-DD>_results.json.

Summary of Available files

Data Provenance	Variant Classification	Geographic Resolution	Model	Address
GISAID	Nextstrain clades	Global	MLR	`https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/nextstrain_clades/global/mlr/latest_results.json`
	Pango lineages			`https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/pango_lineages/global/mlr/latest_results.json`
open (GenBank)	Nextstrain clades			`https://data.nextstrain.org/files/workflows/forecasts-ncov/open/nextstrain_clades/global/mlr/latest_results.json`
	Pango lineages			`https://data.nextstrain.org/files/workflows/forecasts-ncov/open/pango_lineages/global/mlr/latest_results.json`

Installation

Please follow installation instructions for Nextstrain's software tools.

Usage

To run pipeline for all available data generated by ingest:

nextstrain build .

To run the pipeline for specific data provenance, variant classification and geo resolution (e.g. gisaid, nextstrain_clades and global only):

nextstrain build . --configfile config/config.yaml --config data_provenances=gisaid variant_classification=nextstrain_clades geo_resolutions=global

Optional uploads

To run the pipeline that uploads the model results to S3 and sends Slack notifications:

nextstrain build . --configfile config/config.yaml config/optional.yaml

OR

Run the GitHub Action workflow named "Run models" to run the pipeline on AWS Batch.

Configuration

The data_provenances, variant_classifications and geo_resolutions are required configs for the pipeline.

The current available options for data_provenances are

gisaid
open

The current available options for variant_classifications are

nextstrain_clades
pango_lineages

The current available options for geo_resolutions are

global
usa

Data Prep Configurations

The prepare_data params in config/config.yaml are used to subset the full case counts and clades counts data to specific date range, locations, and clades.

As of 2023-04-04, the config for the automated pipeline is set to only include data from:

the past 150 days
- excluding sequences from the last 12 days since they may be overly enriched for variants
locations that have at least 500 sequences in the last 30 days
- excluding locations specifically listed in defaults/global_excluded_locations.txt
clades that have at least 5000 sequences in the last 150 days

Model configurations

The specific model configurations are housed in separate config YAML files or each model. These separate config files must be provided in the main config as mlr_config and renewal_config in order to run the models. By default, the model config files used are config/mlr-config.yaml and config/renewal-config.yaml. Note the inputs and outputs for the models are overridden in the Snakemake pipeline to conform to the Snakemake input/output framework.

Clade and Lineage colours

Model JSONs are post processed by ./scripts/modify-lineage-colours-and-order.py. For nextstrain_clades this sets the colours and display names. For pango_lineages this orders lineages based on their full (unaliased) pango designation, and sets colours based on the associated nextstrain clade.

When new clades are added please modify the CLADES definitions in the script accordingly.

Environment variables

No environment variables are required for open data. However, the following environment variables are required for the gisaid data:

AWS_DEFAULT_REGION
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY

Uploads

If running pipeline with uploads to S3, the following environment variables are required (regardless of data provenance):

AWS_DEFAULT_REGION
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY

Slack notifications

If running pipeline with Slack notifications, the following environment variables are required:

SLACK_CHANNELS
SLACK_TOKEN

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Forecasts SARS-CoV-2

Automated pipeline

Inputs

Outputs

Summary of Available files

Installation

Usage

Optional uploads

Configuration

Data Prep Configurations

Model configurations

Clade and Lineage colours

Environment variables

Uploads

Slack notifications

Files

README.md

Latest commit

History

README.md

File metadata and controls

Forecasts SARS-CoV-2

Automated pipeline

Inputs

Outputs

Summary of Available files

Installation

Usage

Optional uploads

Configuration

Data Prep Configurations

Model configurations

Clade and Lineage colours

Environment variables

Uploads

Slack notifications