Data Cube Statistics is a an application used to calculate large scale temporal statistics on data stored using an Open Data Cube (ODC) installation. It provides a command line application which uses a YAML configuration file to specify the data range and statistics to calculate.
Contents
- Main features
- Installation
- Usage
- Configuration format
- Available statistics
- Custom statistics
- Running with PBS job scheduler
- Release Notes
- Calculate continental scale statistics over a 30+ year time range
- Simple yet powerful Configuration format
- Scripts supplied for running in a HPC PBS based environment
- Flexible tiling and chunking for efficient execution of 100+Tb jobs
- Track and store full provenance record of operations
- Round-trip workflow from ODC back into ODC
- Supports saving to NetCDF, GeoTIFF or other GDAL supported format
- Optional per-pixel metadata tracking
- Out of the box support for most common statistics - see Available statistics
- Able to create user defined Custom statistics
- Able to handle any CRS and resolution combination (through the power of the ODC)
$ pip install git+https://github.com/opendatacube/datacube-stats/
At it's simplest, Data Cube Statistics only requires specifying a configuration file:
$ datacube-stats example-configuration.yaml
If a configuration file is not provided and a file named config.yaml
is found in the
current directory, then it will be used automatically.
More detailed usage information is also available:
$ datacube-stats --help
Parallel operation is provided by executors in ODC. For example to run across 4 cores:
$ datacube-stats --parallel 4 example-configuration.yaml
For tiled jobs, you can specify a single tile as a test run:
$ datacube-stats --tile-index [X] [Y] example-configuration.yaml
Also useful when testing stats configurations, you can override the output directory:
$ datacube-stats --output-location /home/user/example_folder/ example-configuration.yaml
$ datacube-stats --list-statistics
Specify the product/s of interest, measurements of interest, and any masks to be applied.
A simple example loading a single measurement from a single product:
sources:
- product: old_wofs
measurements: [water]
group_by: solar_day
A (much) more complicated example which combines Landsat 5 and Landsat 7 data, with filtering based on particular flags in a Pixel Quality layer, as well as eliminating data which doesn't meet the minimum required spatial accuracy:
sources:
- product: ls5_nbar_albers
measurements: [blue, green, red, nir, swir1, swir2]
group_by: solar_day
source_filter:
product: ls5_level1_scene
gqa_iterative_mean_xy: [0, 1]
masks:
- product: ls5_pq_albers
measurement: pixelquality
group_by: solar_day
fuse_func: datacube.helpers.ga_pq_fuser
flags:
contiguous: True
cloud_acca: no_cloud
cloud_fmask: no_cloud
cloud_shadow_acca: no_cloud_shadow
cloud_shadow_fmask: no_cloud_shadow
blue_saturated: False
green_saturated: False
red_saturated: False
nir_saturated: False
swir1_saturated: False
swir2_saturated: False
- product: ls7_nbar_albers
measurements: [blue, green, red, nir, swir1, swir2]
group_by: solar_day
source_filter:
product: ls7_level1_scene
gqa_iterative_mean_xy: [0, 1]
masks:
- product: ls7_pq_albers
measurement: pixelquality
group_by: solar_day
fuse_func: datacube.helpers.ga_pq_fuser
flags:
contiguous: True
cloud_acca: no_cloud
cloud_fmask: no_cloud
cloud_shadow_acca: no_cloud_shadow
cloud_shadow_fmask: no_cloud_shadow
blue_saturated: False
green_saturated: False
red_saturated: False
nir_saturated: False
swir1_saturated: False
swir2_saturated: False
By default, null or no-data values are automatically masked out, according to their definition in the Data Cube Product they are loaded from. In some cases this doesn't make sense, and can be disabled by specifying mask_nodata: False
. For example bitfield data like PQ and WOfS Extents that have a more complicated representation of no-data, which will be handled by the statistic being run over them.
sources:
- product: ls5_pq_albers
group_by: solar_day
mask_nodata: False
fuse_func: datacube.helpers.ga_pq_fuser
group_by: solar_day
Perform statistics over a single time range. The first date is inclusive and the last date is exclusive.
date_ranges:
start_date: 2000-01-01
end_date: 2016-01-01
Or over a sequence of time steps, for example, an output for each year over a 15 year period:
date_ranges:
start_date: 2000-01-01
end_date: 2016-01-01
stats_duration: 1y
step_size: 1y
Winter seasons in the southern hemisphere over the same 15 year period:
date_ranges:
start_date: 2000-06-01
end_date: 2016-09-01
stats_duration: 3m
step_size: 1y
Specify the base output directory where files will be written:
location: /home/user/mystats_outputs/
Able to write fully compliant NetCDF-CF, either projected or unprojected spatially, with optional `Extra metadata attributes`_.
For example, to output 100Ă—100km tiles, with 25m per pixel:
storage:
driver: NetCDF CF
crs: EPSG:3577
tile_size:
x: 100000.0
y: 100000.0
resolution:
x: 25
y: -25
chunking:
x: 200
y: 200
time: 1
dimension_order: [time, y, x]
Write GeoTIFF files for each defined output. Side car dataset metadata documents in YAML format will be written which include the provenance information and allow re-indexing into the Data Cube.
Output 1°×1° tiles, with 4000×4000 pixels per tile:
storage:
driver: GeoTIFF
crs: EPSG:4326
tile_size:
longitude: 1.0
latitude: 1.0
resolution:
longitude: 0.00025
latitude: -0.00025
chunking:
longitude: 400
latitude: 400
time: 1
dimension_order: [time, latitude, longitude]
Adjust the size of the spatial chunks that are loaded into memory. This setting can be adjusted depending on the time depth being processed, the available memory on the processing machine, and how many simultaneous tasks are being run on the machine.
computation:
chunking:
longitude: 1000
latitude: 1000
An easy way to create a Shapefile is to use GeoJSON.io, draw your region of interest,
then from the top menu Save -> Shapefile
to download the zipped Shapefile.
input_region:
from_file: /home/user/mdb_floodplan/mdb_floodplain.shp
Whether the output will be gridded (tile-based, default) or not (feature-based) may be specified by
setting gridded: True
or gridded: False
respectively. The features to generate output for
may also be specified (in which case the output is feature-based),
input_region:
from_file: /home/user/mdb_floodplan/mdb_floodplain.shp
feature_id: [39]
The tiling regime is determined by the tile_size
parameter of the Output storage format section.
A list of tiles can be passed on to tiles
parameter.
input_region:
tiles:
- [16, -39]
- [17, -39]
The time period can be specified for individual sensors to include only datasets for this period. This can be sometime useful to exclude datasets for Landsat 7 due to SLC failure.
sources:
product: ls7_nbar_albers
name: intertidal_low
measurements: [blue, green, red, nir, swir1, swir2]
group_by: solar_day
time: [1986-01-01, 2003-05-01]
Specify the maximum and minimum spatial range. You must also specify the CRS to use, normally with an EPSG code, this alters whether you are specifying x/y or latitude/longitude.
input_region:
crs: EPSG:4326
longitude: [147.1, 147.9]
latitude: [-33, -34]
Note
This method of specifying extents will output a single large file, not a set of tiles.
Don't specify an input_region
to process all available data.
Use http://geojson.io/ to draw out a region of interest. Copy the geometry
portion of the GeoJSON and paste
it into your configuration file under input_region
. An example for Australia:
input_region:
"geometry": {
"type": "Polygon",
"coordinates": [ [
[ 143.26171875, -9.88227549342994 ],
[ 129.7265625, -9.96885060854611 ],
[ 125.859375, -12.554563528593656 ],
[ 119.35546875000001, -18.22935133838667 ],
[ 111.005859375, -22.350075806124853 ],
[ 113.818359375, -36.17335693522159 ],
[ 117.94921874999999, -36.52729481454623 ],
[ 130.78125, -33.06392419812064 ],
[ 135.966796875, -37.43997405227057 ],
[ 147.041015625, -44.59046718130883 ],
[ 154.248046875, -34.234512362369856 ],
[ 154.775390625, -24.5271348225978 ],
[ 143.26171875, -9.88227549342994 ]
] ]
}
This section of the configuration file specifies both which statistics to calculate, and which files to write them out to.
For many statistics workflows, it takes longer to load the data into memory than it does to compute the result. For these cases
it makes sense to perform multiple computations on the same set of data, and so output_products
is a list of outputs, but at
a minimum it only needs one definition.
Specify which statistic to use, and optionally any arguments. For example, a simple mean:
statistic: simple
statistic_args:
reduction_function: mean
Define the name of the output product. eg:
name: landsat_yearly_mean
Optional field allows to specify product_type
field of the output product.
Defaults to !!NOTSET!!
. This is needed when output is to be indexed into the
datacube.
product_type: seasonal_stats
Any extra arguments to pass to the output driver for an individual output band:
output_params:
zlib: True
fletcher32: True
Specify arbitrary metadata to attach. The format
section is required by datacube
.
metadata:
format:
name: NetCDF
platform:
code: LANDSAT-8
NOTE: This feature is being deprecated. We expect to remove it in the next release after release 0.9b1 and replace it with something more general.
To filter out sources that correspond to any derived products. It currently supports two methods to filter out list of dates. Filtering in hydrological months ('by_hydrological_months'), can be used to filter months from July to November for the year after the dry or wet years collected from the polygon. Specific month range can also be specified. Second method of filtering is 'by_tide_height', which uses OTPS model to get tide_height:
filter_product:
method: by_tide_height
args:
tide_range: 10
tide_percent: 20
filter_product:
method: by_hydrological_months
args:
type: dry
months: ['07', '11']
Specify a template string used to name the output files. Uses the python format()
string syntax, with the following placeholders available:
Placeholder | Description |
---|---|
x | X Tile Index |
y | Y Tile Index |
epoch_start | Start date of the epoch, format using strftime syntax |
epoch_end | End date of the epoch, format using strftime syntax |
name | The product name given to this output product |
stat_name | The name of the statistic used to compute this product |
For example:
file_path_template: '{y}_{x}/LS_PQ_COUNT_3577_{y}_{x}_{epoch_start:%Y-%m-%d}_{epoch_end:%Y-%m-%d}.nc'
Will output filenames similar to:
10_15/LS_PQ_COUNT_3577_10_15_2010-01-01_2011-01-01.nc
output_products:
- name: landsat_seasonal_mean
product_type: seasonal_stats
metadata:
format:
name: NetCDF
statistic: simple
statistic_args:
reduction_function: mean
output_params:
zlib: True
fletcher32: True
file_path_template: 'SR_N_MEAN/SR_N_MEAN_3577_{x:02d}_{y:02d}_{epoch_start:%Y%m%d}.nc'
- name: landsat_seasonal_medoid
product_type: seasonal_stats
statistic: medoid
output_params:
zlib: True
fletcher32: True
file_path_template: 'SR_N_MEDOID/SR_N_MEDOID_3577_{x:02d}_{y:02d}_{epoch_start:%Y%m%d}.nc'
- name: landsat_seasonal_percentile_10
product_type: seasonal_stats
statistic: percentile
statistic_args:
q: 10
output_params:
zlib: True
fletcher32: True
file_path_template: 'SR_N_PCT_10/SR_N_PCT_10_3577_{x:02d}_{y:02d}_{epoch_start:%Y%m%d}.nc'
Additional metadata can be specified which will be written as
global attributes
into the output NetCDF file. For example:
global_attributes:
institution: Commonwealth of Australia (Geoscience Australia)
instrument: OLI
keywords: AU/GA,NASA/GSFC/SED/ESD/LANDSAT,ETM+,TM,OLI,EARTH SCIENCE
keywords_vocabulary: GCMD
platform: LANDSAT-8
publisher_email: earth.observation@ga.gov.au
publisher_name: Section Leader, Operations Section, NEMO, Geoscience Australia
publisher_url: http://www.ga.gov.au
license: CC BY Attribution 4.0 International License
coverage_content_type: physicalMeasurement
cdm_data_type: Grid
product_suite: Pixel Quality 25m
Any reduction operation supported by xarray. eg:
- mean
- median
- percentile
High-dimensional medians implemented by the hdmedians python package
- Medoid
- Geometric median
Normalised difference statistics. eg. NDVI + statistic
Statistics operations in Data Cube Statistics are implemented as Python Classes, which extends datacube_stats.statistics.Statistic
. Two
methods should be implemented, measurements()
and compute()
.
- measurements()
- Takes a list of measurements provided by the input product type, and returns a list of measurements that this class will produce when asked to compute a statistic over some data.
- compute()
- Takes a
xarray.Dataset
containing some data that has been loaded, and returns anotherxarray.Dataset
after doing some computation. The variables on the returned dataset must match the types specified bymeasurements()
.
For example, the following implementation requires its input data to contain a
variable named water
, and outputs data with a single variable named
count_wet
of type int16
. In the configuration file, we will need to pass
a list of values for water
that indicate "wetness" as an argument named
wet_values
to the statistic.
import xarray
from datacube_stats.statistics import Statistic
from datacube.model import Measurement
class CountWet(Statistic):
def __init__(self, wet_values):
# list of values of 'water' that we count as "wet"
assert len(wet_values) > 0, 'no wet values provided'
self.wet_values = wet_values
def compute(self, data):
wet = xarray.zeros_like(data.water)
for val in self.wet_values:
wet += data.water == val
return xarray.Dataset({'count_wet': wet.sum(dim='time')},
attrs={'crs': data.crs})
def measurements(self, input_measurements):
assert 'water' in [m['name'] for m in input_measurements]
wet = Measurement(name='count_wet', dtype='int16', nodata= -1, units='1')
return [wet]
Suppose the package that contains this implementation is called pseudo.example
,
and it is available in the Python path (with the current directory added). Then
the configuration file could look like (eliding location
, computation
,
and storage
specifications)
sources:
- product: wofs_albers
name: wofs_dry
measurements: [water]
group_by: solar_day
date_ranges:
start_date: 2014-01-01
end_date: 2014-02-01
stats_duration: 1m
step_size: 1m
output_products:
- name: wet_count_summary
product_type: wofs_statistical_summary
metadata:
format:
name: NetCDF
statistic: external
statistic_args:
impl: pseudo.example.CountWet
# ignoring PQ sea mask that excludes a lot of useful data
wet_values:
- 128 # clear and wet
- 132 # clear and wet and masked for sea
output_params:
zlib: True
fletcher32: True
file_path_template: 'WOFS_COUNT/{x}_{y}/WOFS_COUNT_3577_{x}_{y}_{epoch_start:%Y%m%d}_{epoch_end:%Y%m%d}.nc'
To submit a job to PBS, run datacube-stats
like
$ datacube-stats --qsub="project=u46,nodes=100,walltime=5h,mem=large,queue=normal" example.yaml
The mem
specification can be small
, medium
, or large
, for 2GB, 4GB, or 8GB
memory per core respectively. For more details, run
$ datacube-stats --qsub=help
Note: This is the last release before our planned move to make use
of virtual products now available in ODC core. This version should be
compatible datacube
1.805b or 06b.
- Fix incompatibilities with ODC 1.8
- Fix NetCDF CF driver issue with ODC 1.8
- Minor update to the percentile algorithm
- Some PyLint fixes
Note: This is another (minor) maintenance release targeting datacube 1.6.1.
- Refactor that should not affect output
- Add FC percentile configuration
.yaml
files
Note: This is another maintenance release targeting datacube 1.6.1.
Percentile
now supports specifyingminimum_valid_observations
- One GeoTIFF file per band option is re-enabled
(specify
var_name
in thefile_path_template
) - Polygon mask is now compatible with
mask_nodata
- Fix some tests
Note: This is a maintenance release compatible with datacube 1.6.1.
- Add mangrove statistics
- Minor changes to the Python API to support multi-processing
- The GeoTIFF output driver name has changed to
GeoTIFF
fromGeoTiff
Note: We expect several backwards incompatible changes to the datacube-stats
package
in the near future. Release 0.9b1 is intended to be the last release fully supporting
configurations from earlier releases.
- Add tasseled cap indices statistic
- Fix GeoTiff output driver
- Preliminary Python API and an
xarray
output driver to produce in-momery results - Support default configurations (
config.yaml
in the current working folder) - Support discoverable external plugins (in the current working folder)
- Fix
xarray
sorting bug - Add ability to specify
num_threads
to thenew_geomedian
statistic - Add ability to attach custom metadata to generated datasets
- Add ability for feature-based task generation from a shapefile
- Fix issue with
hdmedian
GeoMedian statistics
- Move task execution code to the
digitalearthau
repository - ITEM and low/high tide composite, FC percentile
- Schema-validated configuration
- Time filters on individual source products
- User documentation!
- List available statistics from the command line
datacube-stats --list-statistics