weather-dl – Weather Downloader

Weather Downloader ingests weather data to cloud buckets, such as Google Cloud Storage (beta).

Features

  • Flexible Pipelines: weather-dl offers a high degree of control over what is downloaded via configuration files. Separate scripts need not be written to get new data or add parameters. For more, see the configuration docs.
  • Efficient Parallelization: The tool gives you full control over how downloads are sharded and parallelized (with good defaults). This lets you focus on the data and not the plumbing.
  • Hassle-Free Dev-Ops: weather-dl and Dataflow make it easy to spin up VMs on your behalf with one command. No need to keep your local machine online all night to acquire data.
  • Robust Downloads: If an error occurs when fetching a shard, Dataflow will automatically retry the download for you. Previously downloaded shards will be skipped by default, so you can re-run the tool without having to worry about duplication of work.

Note: Currently, only ECMWF's MARS and CDS clients are supported. If you'd like to use weather-dl to work with other data sources, please file an issue (or consider making a contribution).

Usage

usage: weather-dl [-h] [-f] [-d] [-l] [-m MANIFEST_LOCATION] config

Weather Downloader ingests weather data to cloud storage.

positional arguments:
  config                path/to/config.cfg, containing client and data information. Accepts *.cfg and *.json files.

Common options:

  • -f, --force-download: Force redownload of partitions that were previously downloaded.
  • -d, --dry-run: Run pipeline steps without actually downloading or writing to cloud storage.
  • -l, --local-run: Run locally and download to the local hard drive. The data and manifest directory defaults to '<$CWD>/local_run', and the runner is set to DirectRunner. The only other relevant options are the config and --direct_num_workers.
  • -m, --manifest-location MANIFEST_LOCATION: Location of the manifest. Either a Firestore collection URI ('fs://?projectId='), a GCS bucket URI, or 'noop://' for an in-memory location.
  • -n, --num-requests-per-key: Number of concurrent requests to make per API key. Default: make an educated guess per client & config. Please see the client documentation for more details.

Invoke with -h or --help to see the full range of options.

For further information on how to write config files, please consult this documentation.

Usage Examples:

weather-dl configs/era5_example_config_local_run.cfg --local-run

Preview download with a dry run:

weather-dl configs/mars_example_config.cfg --dry-run

Using DataflowRunner:

weather-dl configs/mars_example_config.cfg \
           --runner DataflowRunner \
           --project $PROJECT \
           --temp_location gs://$BUCKET/tmp \
           --job_name $JOB_NAME

Using DataflowRunner with 3 concurrent requests per API key:

weather-dl configs/mars_example_config.cfg \
           -n 3 \
           --runner DataflowRunner \
           --project $PROJECT \
           --temp_location gs://$BUCKET/tmp \
           --job_name $JOB_NAME

For a full list of how to configure the Dataflow pipeline, please review this table.
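
The Dataflow examples above reference $PROJECT, $BUCKET, and $JOB_NAME without defining them. Here is a minimal sketch of setting them before invoking weather-dl. The project and bucket names are hypothetical placeholders, and the job-name scheme is only one possible choice. Dataflow is generally strict about job names, so the sketch sticks to lowercase letters, digits, and hyphens.

```shell
# Hypothetical values; replace with your own project ID and bucket name.
PROJECT="my-gcp-project"
BUCKET="my-weather-bucket"

# Build a timestamped job name from lowercase letters, digits, and hyphens,
# so that repeated runs do not reuse the same name.
JOB_NAME="weather-dl-$(date +%Y%m%d-%H%M%S)"

# Make the variables visible to the weather-dl commands shown above.
export PROJECT BUCKET JOB_NAME

echo "Job name: $JOB_NAME"
echo "Temp location: gs://$BUCKET/tmp"
```

A timestamp suffix keeps each run distinguishable in the Dataflow console; any other unique suffix would work equally well.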

Monitoring

You can check the status of your ECMWF API jobs by visiting the client-specific job queue.

If you use Google Cloud Storage, we recommend using gsutil to inspect the progress of your downloads. For example:

# Check that the file-sizes of your downloads look alright
gsutil du -h gs://your-cloud-bucket/mars-data/*T00z.nc 
# See how many downloads have finished
gsutil du -h gs://your-cloud-bucket/mars-data/*T00z.nc | wc -l
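
If you know how many files a config should produce, the finished-file count from the command above can be turned into a rough percentage. The helper below is a hypothetical sketch, not part of weather-dl; the counts in the example call are made up for illustration.

```shell
# Print the integer percentage of finished downloads (rounded down).
# Arguments: <finished-count> <expected-count>
percent_done() {
  finished=$1
  expected=$2
  if [ "$expected" -le 0 ]; then
    echo "expected count must be positive" >&2
    return 1
  fi
  echo $(( finished * 100 / expected ))
}

# Example with made-up counts. In practice, the first argument could come from:
#   gsutil du gs://your-cloud-bucket/mars-data/*T00z.nc | wc -l
percent_done 116 366
```

Integer arithmetic keeps the helper portable across POSIX shells, at the cost of rounding down.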

download-status

In addition, we've provided a simple tool for getting a rough measure of download state. Provided a bucket prefix, it will output the counts of the statuses in that prefix.

usage: download-status [-h] [-m MANIFEST_LOCATION] prefix

Check statuses of `weather-dl` downloads.

positional arguments:
  prefix                Prefix of the location string (e.g. a cloud bucket); used to filter which statuses to check.

Options

  • -m, --manifest-location: Location of the manifest, in the same format as for weather-dl. Only Firestore manifests are supported.

Usage Examples:

download-status "gs://ecmwf-downloads/hres/world/"
...
The current download statuses for 'gs://ecmwf-downloads/hres/world/' are: Counter({'scheduled': 245, 'success': 116, 'in-progress': 4, 'failure': 1}).
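
For scripting, the status line can be reduced to a single "successes out of total" figure. The function below is a hypothetical sketch that assumes the output keeps the 'name': count layout shown in the sample above; if the format changes, the pattern will need adjusting.

```shell
# Summarize download-status output read from stdin as "<success>/<total>".
# Assumes statuses appear as 'name': count pairs, as in the sample output.
summarize_statuses() {
  grep -o "'[a-z-]*': [0-9][0-9]*" |
    awk -F"': " '
      # $1 is the status name with its leading quote (\047), $2 is the count.
      { total += $2; if ($1 == "\047success") ok = $2 }
      END { printf "%d/%d\n", ok, total }
    '
}

# Example using the sample line from this README.
sample="The current download statuses for 'gs://ecmwf-downloads/hres/world/' are: Counter({'scheduled': 245, 'success': 116, 'in-progress': 4, 'failure': 1})."
printf '%s\n' "$sample" | summarize_statuses   # prints 116/366
```

In practice, the function would be fed by piping the output of download-status into it instead of the sample string.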