This is a set of tools for working with the "kerchunk" library:
https://fsspec.github.io/kerchunk/
Kerchunk provides cloud-friendly indexing of data files without needing to move the data itself.
The tools included here allow:
- indexing of existing NetCDF files to kerchunk files
- aggregation of existing NetCDF files to a single kerchunk file
- tools to write to either POSIX file systems or S3-compatible object-store
- a wrapper around `xarray` to ensure that the data can be read by Python
- integration with access control to limit read/write operations as desired
An example notebook can be run using binder:
https://mybinder.org/v2/gh/cedadev/kerchunk-tools.git/main?filepath=notebooks
From scratch, you can install with Conda:

```
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -p ~/miniconda -b
source ~/miniconda/bin/activate
conda create --name kerchunk-tools --file spec-file.txt
conda activate kerchunk-tools
pip install -e . --no-deps
```
Assuming you have Python 3 installed, you can also install with Pip:
```
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -e . --no-deps
```
NOTE: this installation method generated a lot of HDF5 library warnings when reading data, which were not seen with the Conda install.
Here is an example of using `kerchunk_tools` with authentication to the S3 service:
```python
import kerchunk_tools.xarray_wrapper as wrap_xr

s3_config = {
    "token": "TOKEN",
    "secret": "SECRET",
    "endpoint_url": "ENDPOINT_URL"
}

# Load a Kerchunk file
index_uri = "s3://kc-indexes-cci-cloud-v2/BICEP-OC-L3S-PP-MERGED-1M_MONTHLY_9km_mapped-1998-2020-fv4.2.zstd"
ds = wrap_xr.wrap_xr_open(index_uri, s3_config=s3_config)

# Look at the metadata
print(ds)
pp = ds.pp
print(pp.shape, pp.dims)

# Look at the data
mx = ds.pp.sel(time=slice("1998-03-01", "2000-02-01"), lat=slice(34, 40), lon=slice(20, 23)).max()
mx = float(mx)
print(mx)
assert 2137 < mx < 2139
```
If you are connecting to a secured endpoint, then you will need three items for your S3 configuration:
- `S3_TOKEN`
- `S3_SECRET`
- `S3_ENDPOINT_URL`
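For example, the configuration dictionary shown above can be built from those environment variables (a minimal sketch; the variable names match those used in the workflow command below):

```python
import os

# Build the S3 configuration from environment variables
# (the same names are used in the pytest command below)
s3_config = {
    "token": os.environ["S3_TOKEN"],
    "secret": os.environ["S3_SECRET"],
    "endpoint_url": os.environ["S3_ENDPOINT_URL"],
}
```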
Then you can run a full workflow that:
- creates a bucket in S3
- uploads some NetCDF files to S3
- creates a kerchunk file in S3 (for a single NetCDF file)
- creates a kerchunk file in S3 (for an aggregation of multiple NetCDF files)
- reads from the kerchunk files and extracts/processes a subset of data
```
S3_TOKEN=s3_token S3_SECRET=s3_secret S3_ENDPOINT_URL=s3_endpoint pytest tests/test_workflows/test_workflow_s3_quobyte_single.py -v
```
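For reference, the indexing steps in this workflow are the kind of thing the kerchunk library does directly. Here is a minimal sketch of single-file indexing and aggregation; the file names, output paths and the `time` concat dimension are assumptions, and kerchunk_tools adds its own S3 handling on top of this:

```python
import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

nc_files = ["data_199709.nc", "data_199710.nc"]  # hypothetical NetCDF inputs

# Index each NetCDF file to a set of kerchunk references
single_refs = []
for path in nc_files:
    with fsspec.open(path, "rb") as f:
        single_refs.append(SingleHdf5ToZarr(f, path).translate())

# Write a kerchunk file for a single NetCDF file
with open("single_index.json", "w") as f:
    json.dump(single_refs[0], f)

# Combine the per-file references into a single aggregated kerchunk file
combined = MultiZarrToZarr(single_refs, concat_dims=["time"]).translate()
with open("aggregated_index.json", "w") as f:
    json.dump(combined, f)
```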
Our initial tests, which have only been run a handful of times, came out as follows:
Table of test timings (in seconds). Where multiple values appear, the test was run multiple times.
| Test type | Read/process small subset (s) | Read/process larger subset (s) |
|---|---|---|
| POSIX Kerchunk | 1.0, 0.7 | 15.2, 37.9 |
| S3-Quobyte Kerchunk | 1.1, 4.7, 1.3 | 8.5, 9.1, 5.7 |
| S3-DataCore Zarr | 3.9, 3.8 | 99.8, 99.2 |
| POSIX Xarray | 0.6, 0.9 | 86.0, 91.4 |
We need to run these repeatedly to validate them.
The test types are:
- POSIX Kerchunk:
- This uses a Kerchunk index file on the POSIX file system
- It references NetCDF files on the POSIX file system
- There is no use of object-store
- This test depends on having pre-generated the Kerchunk index file
- S3-Quobyte Kerchunk:
- This uses a Kerchunk index file in the JASMIN S3-Quobyte object-store
- It references NetCDF files in the S3-Quobyte object-store
- The files are actually part of the CEDA Archive and are exposed via an S3 interface
- There is no use of the POSIX file systems
- This test depends on having pre-generated the Kerchunk index file
- S3-DataCore Zarr:
- This reads a Zarr file that we have copied into the JASMIN DataCore (formerly Caringo) object-store
- The data is the same content as used for the other tests, converted from NetCDF to Zarr
- There is no use of Kerchunk
- This test depends on having pre-generated the Zarr file from NetCDF
- POSIX Xarray:
- This reads all the NetCDF files directly into Xarray (as a list of files)
- The files are read directly from the POSIX file system
- There is no pre-generation step for this test
- This is slower because the aggregation of the NetCDF content is done on-the-fly
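As an illustration of the POSIX Kerchunk case described above, a pre-generated index file can be opened with xarray's zarr engine over an fsspec "reference" filesystem (the index path here is hypothetical; kerchunk_tools wraps this pattern for you):

```python
import xarray as xr

# Open a pre-generated Kerchunk index on the POSIX file system
# (hypothetical path) via fsspec's "reference" filesystem
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "/path/to/kerchunk_index.json"},
    },
)
print(ds)
```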
The test data being used is a list of 279 data files from the CCI archive, under the directory:
/neodc/esacci/ocean_colour/data/v5.0-release/geographic/netcdf/chlor_a/monthly/v5.0/
The first and last files are:
First: .../1997/ESACCI-OC-L3S-CHLOR_A-MERGED-1M_MONTHLY_4km_GEO_PML_OCx-199709-fv5.0.nc
Last: .../2020/ESACCI-OC-L3S-CHLOR_A-MERGED-1M_MONTHLY_4km_GEO_PML_OCx-202011-fv5.0.nc
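By contrast, the POSIX Xarray test type corresponds roughly to aggregating these files on-the-fly, e.g. as below (the glob pattern over year sub-directories is an assumption):

```python
import glob
import xarray as xr

# Gather the monthly NetCDF files directly from the POSIX file system
nc_files = sorted(glob.glob(
    "/neodc/esacci/ocean_colour/data/v5.0-release/geographic/netcdf/"
    "chlor_a/monthly/v5.0/*/*.nc"
))

# Aggregate on-the-fly along the time dimension - this is the slow step
ds = xr.open_mfdataset(nc_files, combine="by_coords")
print(ds)
```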
In all cases the test is run as follows.
Test 1 - Read/process small subset:
- Load the data as an `xarray.Dataset` object.
- Create a small time/lat/lon slice of shape `(2, 144, 72)` (only 2 time steps == 2 files).
- Calculate the maximum value and assert it equals the expected value.
Test 2 - Read/process larger subset:
- Load the data as an `xarray.Dataset` object.
- Create a larger time/lat/lon slice of shape `(279, 12, 24)` (279 time steps == 279 files).
- Calculate the maximum value and assert it equals the expected value.
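A rough sketch of how one of these timed reads might look, assuming `ds` is an `xarray.Dataset` opened by one of the methods above and that the variable is `chlor_a`; the slice bounds and expected maximum are placeholders, not the values used in the real tests:

```python
import time

start = time.time()

# Select a small time/lat/lon subset (2 time steps == 2 files) and take its maximum
subset = ds.chlor_a.sel(
    time=slice("1998-01-01", "1998-02-28"),  # placeholder time range
    lat=slice(40, 46),                       # placeholder latitude range
    lon=slice(0, 3),                         # placeholder longitude range
)
mx = float(subset.max())

expected_max = 99.9  # placeholder expected value
assert abs(mx - expected_max) < 0.1

print(f"max = {mx}, elapsed = {time.time() - start:.1f}s")
```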
These resources may be useful for understanding why we wanted to look at Kerchunk and how it fits into our bigger picture plans at CEDA:
- JASMIN Notebook service intro: https://www.youtube.com/watch?v=nle9teGLAb0&list=PLhF74YhqhjqmZgbQLu_PXZmA27q7vHygg
- JASMIN Notebooks workshop tutorial: https://www.youtube.com/watch?v=7UWjhIKq2x0&list=PLhF74Yhqhjqn8NDgU7xfKGLGP8h-FQ1lt&index=16
- Notebook that I showed (demonstrating intake access to CMIP6): https://github.com/cedadev/cmip6-object-store/blob/master/notebooks/cmip6-zarr-jasmin.ipynb
- Intake library documentation: https://intake.readthedocs.io/en/latest/?badge=latest
- Intake ESM (for Earth System Model data) docs: https://intake-esm.readthedocs.io/en/stable/
- Kerchunk docs: https://fsspec.github.io/kerchunk/
- Useful intro talk on Kerchunk (from when it was still called ReferenceFileSystem, I think): https://www.youtube.com/watch?v=AWJzDk6M6NM&t=628s