
The package assists with going from a forecast dataset to a collection of valid, independent samples for assessment.

Relevant code and packages

Relevant papers

Parallel processing on NCI

A local Dask cluster can be set up as follows:

from dask.distributed import Client, LocalCluster
cluster = LocalCluster()  # uses the cores/memory available on the current node
client = Client(cluster)

Under the OpenOnDemand environment, you can set up a Dask cluster across multiple nodes using the cloud-based SLURM cluster:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(cores=16, memory="31GB")  # resources per SLURM job
client = Client(cluster)
cluster.scale(cores=32)  # request enough jobs for 32 cores in total

You can also use climtas.nci.GadiClient() following these instructions.
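
For example (a minimal sketch, assuming climtas is installed in your environment):

import climtas.nci
client = climtas.nci.GadiClient()  # starts a Dask client using the resources of the current PBS job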

Observational data

/g/data/xv83/observations/
/g/data/zv2/agcd/v2

CAFE data

Project xv83 on NCI (zarr format).

The complete dataset is at /g/data/xv83/dcfp/CAFE-f6 (don't use May start dates prior to 1987 as they are still running).

The control run that the simulation branched from (c5), the data assimilation method (d60) and the perturbation method (pX) are indicated; these are the defining features of the f5 dataset.

The f5 dataset has 10 ensemble members and was run over a longer period of time, whereas f6 has 96 members over a shorter period.

Issues

The CAFE monthly precipitation has listed units of kg m-2 s-1 but is actually the sum of the daily kg m-2 s-1 values.
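
If so, one possible correction (an assumption based on the note above, not a documented fix) is to divide each monthly value by the number of days in that month to recover a mean rate:

# ds is a hypothetical name for the monthly CAFE dataset
days_in_month = ds["time"].dt.days_in_month
precip_rate = ds["precip"] / days_in_month  # back to a mean rate in kg m-2 s-1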

Processing steps

Step 1: Pre-processing

TODO:

  • I/O with netcdf
  • Temporal aggregation (e.g. annual, seasonal or monthly) is the tricky part of the preprocessing because you need to decide how to handle/modify the lead_time coordinate (see the sketch after this list)
    • Resample frequencies
      • Y (annual, with date label being last day of year)
      • M (monthly, with date label being last day of month)
      • Q-NOV (DJF, MAM, JJA, SON, with date label being last day of season)
      • A-NOV (annual mean for Dec-Nov, with date label being the last day of the Dec-Nov year, i.e. 30 Nov)
      • YS (annual, with date label being first day of the year)
      • MS (monthly, with date label being first day of month)
      • QS-DEC (DJF, MAM, JJA, SON, with date label being first day of season)
      • AS-DEC (annual mean for Dec-Nov, with date label being the first day of the Dec-Nov year, i.e. 1 Dec)
  • xarray pad for NaN filling
  • climpred has functions for (and a good explanation of) verification alignment
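
A rough sketch of the aggregation step, assuming a dataset (here called ds) with a one-dimensional time coordinate; for forecast data you would first need to decide how the init_date / lead_time structure is handled:

def temporal_aggregation(ds, freq="A-NOV", agg="mean"):
    # ds: an xarray Dataset; freq: one of the frequency aliases listed above
    resampled = ds.resample(time=freq)
    return getattr(resampled, agg)()  # e.g. resampled.mean() or resampled.sum()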

Step 2: Fidelity testing

Returns a mask (to remove grid points that fail the test), or one or more test statistics for each grid point corresponding to different alpha levels (e.g. 0.99, 0.95) from which a mask could be created.

Distribution tests

Dougie's paper uses the Kolmogorov–Smirnov test statistic (the 1D function comes from SciPy; the 2D function he wrote himself) to test the similarity between the observational and forecast distributions. Other papers apply the Multiple-moment test (1D only), or we could use the SciPy implementation of the Anderson–Darling test. It may be possible to use xclim for the KS test calculation once this issue is resolved; otherwise xks could be incorporated into this unseen package. For a 1D KS test there's a formula to determine the confidence interval. For the 2D test you need to use bootstrapping to find the confidence interval (i.e. you take many sub-samples of the forecast data, each the same size as the observations, and calculate a KS value for each sub-sample).

Need to calculate test scores for each lead time separately (because model can drift away from initial state).
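
A minimal sketch of the per-lead-time 1D test using SciPy (the array layout and names are assumptions, not the xks implementation):

import numpy as np
from scipy.stats import ks_2samp

def ks_by_lead_time(fcst, obs):
    # fcst: (lead_time, sample) forecast values for one grid point
    # obs: 1D array of observed values for the same grid point
    stats = np.empty(fcst.shape[0])
    pvals = np.empty(fcst.shape[0])
    for i in range(fcst.shape[0]):
        result = ks_2samp(fcst[i, :], obs)
        stats[i], pvals[i] = result.statistic, result.pvalue
    return stats, pvals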

Correlation tests

Because all the ensemble members start from the same point, they are also correlated (i.e. not independent) for the first n lead times. For instance, lead times < 30 months might have a correlation higher than some threshold, so you would exclude all lead times < 30 months from your analysis.
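
One way to estimate that cut-off (a sketch only; the threshold and array layout are assumptions):

import numpy as np

def first_independent_lead(fcst, threshold=0.5):
    # fcst: (ensemble, init_date, lead_time) values for a single grid point
    n_ens, _, n_lead = fcst.shape
    mean_corr = np.empty(n_lead)
    for ilead in range(n_lead):
        corr = np.corrcoef(fcst[:, :, ilead])  # member-vs-member correlations across init dates
        mean_corr[ilead] = corr[~np.eye(n_ens, dtype=bool)].mean()
    return np.argmax(mean_corr < threshold)  # first lead time below the threshold (0 if none)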

Issues

One issue is data selection where you want the same number of samples for each lead time (e.g. for a dataset where 10-year forecasts are initiated every year from 2005-2020, you don't get full overlap until 2014). Dougie has written a function stack_super_ensemble to subset the overlapping data, but it's inefficient.
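
The idea (illustrative only, not the stack_super_ensemble implementation) is to keep just the samples whose verification time falls within the fully overlapping period, so that every lead time ends up with the same number of samples:

import numpy as np

# ds_stacked: a dataset stacked along a sample dimension with a time coordinate
# (see the multi-index example below); the overlap dates are hypothetical
in_overlap = (ds_stacked["time"] >= np.datetime64("2014-01-01")) & (
    ds_stacked["time"] <= np.datetime64("2020-12-31")
)
ds_overlap = ds_stacked.where(in_overlap, drop=True)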

Step 3: The easier stuff

i.e. counting to determine the likelihood of a particular event.
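
For example (a sketch, assuming a stacked dataset of valid samples like the hypothetical ds_overlap above and an arbitrary threshold):

threshold = 100.0  # hypothetical extreme-event threshold
likelihood = (ds_overlap["precip"] < threshold).mean(dim="sample")  # fraction of samples below the threshold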

Development notes

Design choices

Define functions that can be used to construct workflows in a Jupyter notebook or linked together in the main() function of Python scripts.

Patch extraction

A common operation involves taking observational datasets and converting the time axis to a dual initial date / lead time coordinate. This basically involves duplicating the data via overlapping windows, which is analogous to a process in image processing called patch extraction (used, for example, to blur images). There are implementations available as part of skimage (skimage.util.view_as_windows), numpy (numpy.lib.stride_tricks.sliding_window_view), dask (dask.array.overlap.sliding_window_view) and xarray (xarray.core.rolling).
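
A rough numpy sketch of the idea (the array and window sizes are placeholders):

import numpy as np

obs_precip = np.arange(360, dtype=float)  # e.g. 30 years of monthly observations
n_lead = 120                              # forecast length in months
init_spacing = 12                         # one pseudo-initialisation per year
windows = np.lib.stride_tricks.sliding_window_view(obs_precip, n_lead)
obs_stacked = windows[::init_spacing]     # shape: (init_date, lead_time)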

Data selection using multi-indexes

For many analyses we ultimately want to reduce all the non-spatial dimensions down to one single dimension. For example, this array,

ds_cafe

Dimensions:    (ensemble: 96, init_date: 4, lat: 17, lead_time: 3650, lon: 17)
Coordinates:
  * ensemble   (ensemble) int64 1 2 3 4 5 6 7 8 9 ... 88 89 90 91 92 93 94 95 96
  * init_date  (init_date) datetime64[ns] 1990-11-01 1991-11-01 ... 1993-11-01
  * lat        (lat) float64 -43.48 -41.46 -39.44 ... -15.17 -13.15 -11.12
  * lead_time  (lead_time) int64 0 1 2 3 4 5 6 ... 3644 3645 3646 3647 3648 3649
  * lon        (lon) float64 113.8 116.2 118.8 121.2 ... 146.2 148.8 151.2 153.8
    time       (lead_time, init_date) datetime64[ns] dask.array<chunksize=(3650, 4), meta=np.ndarray>
Data variables:
    precip     (init_date, lead_time, ensemble, lat, lon) float64 dask.array<chunksize=(1, 50, 96, 17, 17), meta=np.ndarray>

would simply become (lat, lon, sample), where sample is a MultiIndex with related ensemble, init_date and lead_time coordinate values.

ds_stacked = ds_cafe.stack({'sample': ['ensemble', 'init_date', 'lead_time']})
ds_stacked

Dimensions:    (lat: 17, lon: 17, sample: 1401600)
Coordinates:
  * lat        (lat) float64 -43.48 -41.46 -39.44 ... -15.17 -13.15 -11.12
  * lon        (lon) float64 113.8 116.2 118.8 121.2 ... 146.2 148.8 151.2 153.8
    time       (sample) datetime64[ns] dask.array<chunksize=(1401600,), meta=np.ndarray>
  * sample     (sample) MultiIndex
  - ensemble   (sample) int64 1 1 1 1 1 1 1 1 1 1 ... 96 96 96 96 96 96 96 96 96
  - init_date  (sample) datetime64[ns] 1990-11-01 1990-11-01 ... 1993-11-01
  - lead_time  (sample) int64 0 1 2 3 4 5 6 ... 3644 3645 3646 3647 3648 3649
Data variables:
    precip     (lat, lon, sample) float64 dask.array<chunksize=(17, 17, 14600), meta=np.ndarray>

We can index (i.e. select and slice) the data using references to ensemble, init_date and lead_time, but not time, which is a bit of a limitation because the workaround for time selection involves creating a boolean array the same size as the data array.
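
For example (a sketch of one possible form of that workaround):

import numpy as np

# selection on the MultiIndex levels works directly
ds_member1 = ds_stacked.sel(ensemble=1)

# ...but selecting on time means building a boolean mask over the sample dimension
mask = (ds_stacked["time"] >= np.datetime64("1991-01-01")).compute()
ds_period = ds_stacked.isel(sample=mask.values)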

I think this limitation will be overcome with the implementation of flexible indexes. A proposal has been developed for adding this feature and work has started as part of the xarray CZI grant.

Making functions dask-aware

Use xarray's apply_ufunc.
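
A minimal sketch (function and dimension names are assumptions) of wrapping a 1D statistic so that it maps over all other dimensions and runs lazily on dask arrays:

import xarray as xr
from scipy.stats import ks_2samp

def ks_statistic(fcst, obs):
    return ks_2samp(fcst, obs).statistic

def ks_test(da_fcst, da_obs, fcst_dim="sample", obs_dim="time"):
    return xr.apply_ufunc(
        ks_statistic,
        da_fcst,
        da_obs,
        input_core_dims=[[fcst_dim], [obs_dim]],
        vectorize=True,  # loop the 1D function over the remaining (e.g. spatial) dims
        dask="parallelized",
        output_dtypes=[float],
    )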

Paper ideas

Apply UNSEEN to various large ensembles (and synthetic distributions) and try to figure out what size ensemble is needed for what problems. This will help isolate the particular usefulness of CAFE (it has the relative advantage of large sample size but poor resolution compared to Weather@Home).
