✨ XpySTACAssetReader for reading COG, NetCDF & Zarr STAC assets #87

Merged 11 commits on Apr 2, 2023

@weiji14 (Owner) commented Mar 14, 2023

An iterable-style DataPipe for STAC asset data like Cloud-Optimized GeoTIFFs, NetCDF and Zarr!

Preview at https://zen3geo--87.org.readthedocs.build/en/87/api.html#zen3geo.datapipes.XpySTACAssetReader

Usage

import pystac
from torchdata.datapipes.iter import IterableWrapper
from zen3geo.datapipes import XpySTACAssetReader  # importing registers .read_from_xpystac()

# Read in STAC Asset using DataPipe
collection_url: str = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/nasa-nex-gddp-cmip6"
asset: pystac.Asset = pystac.Collection.from_file(href=collection_url).assets[
    "ACCESS-CM2.historical"
]
dp = IterableWrapper(iterable=[asset])
dp_xpystac = dp.read_from_xpystac()

# Loop or iterate over the DataPipe stream
it = iter(dp_xpystac)
dataset = next(it)

print(dataset.sizes)
# Frozen({'time': 23741, 'lat': 600, 'lon': 1440})

print(dataset.data_vars)
# Data variables:
#     hurs     (time, lat, lon) float32 ...
#     huss     (time, lat, lon) float32 ...
#     pr       (time, lat, lon) float32 ...
#     rlds     (time, lat, lon) float32 ...
#     rsds     (time, lat, lon) float32 ...
#     sfcWind  (time, lat, lon) float32 ...
#     tas      (time, lat, lon) float32 ...
#     tasmax   (time, lat, lon) float32 ...
#     tasmin   (time, lat, lon) float32 ...

print(dataset.attrs)
# {'Conventions': 'CF-1.7',
#  'activity': 'NEX-GDDP-CMIP6',
#  'cmip6_institution_id': 'CSIRO-ARCCSS',
#  'cmip6_license': 'CC-BY-SA 4.0',
#  'cmip6_source_id': 'ACCESS-CM2',
#  ...
#  'history': '2021-10-04T13:59:21.654137+00:00: install global attributes',
#  'institution': 'NASA Earth Exchange, NASA Ames Research Center, ...
#  'product': 'output',
#  'realm': 'atmos',
#  'references': 'BCSD method: Thrasher et al., 2012, ...
#  'resolution_id': '0.25 degree',
#  'scenario': 'historical',
#  'source': 'BCSD',
#  'title': 'ACCESS-CM2, r1i1p1f1, historical, global downscaled CMIP6 ...
#  'tracking_id': '16d27564-470f-41ea-8077-f4cc3efa5bfe',
#  'variant_label': 'r1i1p1f1',
#  'version': '1.0'}

TODO:

  • Initial implementation with a doctest
  • Add unit tests for both COG and Zarr STAC assets
  • Think about the upper/lower casing of letters in XpySTACAssetReader
  • etc

Personally, I've been debating whether to add xr.open_dataset and/or xr.open_zarr to zen3geo for months because:

  1. While xarray is widely used by 'geo' folks, it is not just for geo, so it didn't seem proper to wrap a torchdata DataPipe for xarray here.
  2. The Zarr format is also not only for geo.
  3. There are discussions on deprecating open_zarr, see pydata/xarray#7495 ("Deprecate open_zarr in favor of open_dataset(..., engine='zarr')"). I didn't want to wrap open_zarr only to have to drop it later.

These concerns become moot with the availability of xr.open_dataset(..., engine="stac") enabled by xpystac! It provides a single entrypoint to Zarr, Cloud-Optimized GeoTIFFs, and potentially more STAC Asset based datasets, and since STAC is spatiotemporal (read: geo), this fits naturally within zen3geo 😄
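
To make the single-entrypoint idea concrete, here is a minimal sketch of an iterable-style reader that applies one opener function to every asset in a source iterable. It is not zen3geo's actual implementation: `AssetReaderSketch` and `fake_open` are hypothetical names used only for illustration, and the stand-in opener just returns a dict so the snippet runs without xarray or xpystac installed. In real use the opener would presumably be `xr.open_dataset` called with `engine="stac"`.

```python
from typing import Any, Callable, Iterable, Iterator


class AssetReaderSketch:
    """Hypothetical stand-in for an iterable-style DataPipe (not zen3geo's code)."""

    def __init__(
        self, source: Iterable[Any], opener: Callable[..., Any], **kwargs: Any
    ) -> None:
        self.source = source  # upstream iterable yielding STAC assets
        self.opener = opener  # function that turns one asset into a dataset
        self.kwargs = kwargs  # extra keyword arguments forwarded to the opener

    def __iter__(self) -> Iterator[Any]:
        # Lazily open one asset at a time, in the order they arrive.
        for asset in self.source:
            yield self.opener(asset, **self.kwargs)


def fake_open(asset: Any, engine: str = "stac") -> dict:
    """Hypothetical opener; real code would call xr.open_dataset(asset, engine="stac")."""
    return {"asset": asset, "engine": engine}


reader = AssetReaderSketch(["asset-a", "asset-b"], opener=fake_open, engine="stac")
opened = list(reader)
print(opened)
```

Because the opener is the only format-aware piece, the same loop would cover COG, NetCDF, and Zarr assets as long as the engine can read them.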

References:

Extend xarray.open_dataset to accept pystac objects!
An implementation of chunked, compressed, N-dimensional arrays for Python!
An iterable-style DataPipe for STAC asset data like Cloud-Optimized GeoTIFFs and Zarr! Uses xpystac for the I/O. Included a doctest and unit test, added a new section in the API docs and some more intersphinx mappings.
@weiji14 weiji14 added the feature New feature or request label Mar 14, 2023
@weiji14 weiji14 added this to the 0.6.0 milestone Mar 14, 2023
@weiji14 weiji14 self-assigned this Mar 14, 2023
Include xpystac and zarr in the 'docs' extras dependencies to fix ReadtheDocs build.
Async http client/server framework (asyncio)!

Needed to fix `ModuleNotFoundError: No module named 'aiohttp'` when accessing NetCDF files from https://nasagddp.blob.core.windows.net/nex-gddp-cmip6-references/ACCESS-CM2_historical.json
Mention on the main index.md page that xpystac and zarr are installed with the 'raster' extras. Also mention that NetCDF files can be read, and add some blank lines in the XpySTACAssetReader doctest example.
Bump torchdata from 0.4.0 to 0.6.0, Run CI on Python 3.11 and Publish to TestPyPI/PyPI via OIDC.
Ensure that a STAC asset pointing to a Zarr file can be loaded using XpySTACAssetReader. Using the Daymet Annual Hawaii STAC Collection at https://planetarycomputer.microsoft.com/dataset/daymet-annual-hi for this unit test. Also edited previous unit test to specify that it is for a Cloud-Optimized GeoTIFF STAC Asset.
docs/index.md Outdated
Comment on lines 10 to 12
| `pip install zen3geo[raster]` | rioxarray, torchdata, xbatcher, xpystac, zarr |
| `pip install zen3geo[spatial]` | rioxarray, torchdata, datashader, spatialpandas |
| `pip install zen3geo[stac]` | rioxarray, torchdata, pystac, pystac-client, stackstac |

Debating on whether to put xpystac under the 'raster' or 'stac' extras 🤔

Decided to go with xpystac under the 'stac' extras for now (done at commit 49c29b1), since xpystac depends on pystac and xarray, and somewhat optionally on stackstac. Its dependencies overlap more with the STAC ecosystem than with raster-oriented tools like xarray and Zarr.

Decided that xpystac fits better under the 'stac' extras, because it depends on just pystac and xarray, and has a somewhat optional dependency on stackstac. This enables a more streamlined I/O option for reading STAC Assets into an xarray.Dataset. Note that Zarr is kept under the 'raster' extras.
@weiji14 weiji14 marked this pull request as ready for review April 2, 2023 23:16
Need to use trailing underscores for RST-style URLs, and remove the `pystac.Asset` type hint so that zen3geo works without `pystac` being installed.
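
The commit above removes the type hint outright. As an alternative illustration only, the sketch below shows one common way to keep a hint for an optional dependency without importing it at runtime: a string annotation combined with `typing.TYPE_CHECKING`. The function `asset_href` and the `SimpleNamespace` stand-in are hypothetical and not part of zen3geo.

```python
from types import SimpleNamespace
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers; never executed at runtime,
    # so the module still imports when pystac is not installed.
    import pystac


def asset_href(asset: "pystac.Asset") -> str:
    """Hypothetical helper: return the href of a STAC asset-like object."""
    # The quoted annotation is stored as a plain string and never evaluated here.
    return asset.href


# Stand-in object with an `href` attribute, used so the example runs without pystac.
stand_in = SimpleNamespace(href="https://example.com/data.zarr")
result = asset_href(stand_in)
print(result)
```

Dropping the hint, as done in this PR, is the simpler choice; the pattern here trades a few extra lines for keeping the type information visible to editors and type checkers.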
@weiji14 weiji14 changed the title ✨ XpySTACAssetReader for reading COG and Zarr STAC assets ✨ XpySTACAssetReader for reading COG, NetCDF & Zarr STAC assets Apr 2, 2023
@weiji14 weiji14 merged commit f5e1704 into main Apr 2, 2023
@weiji14 weiji14 deleted the xpystac branch April 2, 2023 23:48