It would be nice to take advantage of known raster tiling (`rasterio`'s `block_shapes`) to assign dask chunks when opening tiled datasets. It seems straightforward to implement via xarray's backend preferred chunk sizes.

Setting this would also avoid common errors: exploding task graphs when users rely on the default (`chunks=True`), or reading far more data than necessary (`chunks={}`).
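If I'm reading the backend API correctly, xarray consults a `preferred_chunks` mapping in each variable's `encoding` when `chunks={}` is passed, so the backend would just need to populate it from `block_shapes`. A minimal sketch (the `preferred_chunks` helper name is hypothetical):

```python
import rasterio

def preferred_chunks(src):
    """Map an open rasterio dataset's internal tiling to xarray dim names.

    block_shapes gives one (height, width) tuple per band.
    """
    blockysize, blockxsize = src.block_shapes[0]
    return {'band': 1, 'y': blockysize, 'x': blockxsize}

# The backend would then attach this to each variable it creates, e.g.
#   var.encoding['preferred_chunks'] = preferred_chunks(src)
# which is the mapping xarray consults when chunks={} is passed.
```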
According to the xarray docs for `open_dataset`/`open_dataarray`, "chunks={} loads the dataset with dask using engine preferred chunks if exposed by the backend, otherwise with a single chunk for all arrays. chunks='auto' will use dask auto chunking". Below is an example with some public data (18287x18460 with 256x256 tiling):
```python
import xarray as xr
import rasterio

url = 'https://capella-open-data.s3.amazonaws.com/data/2022/4/21/CAPELLA_C06_SP_GEO_HH_20220421064500_20220421064519/CAPELLA_C06_SP_GEO_HH_20220421064500_20220421064519_preview.tif'

with rasterio.open(url) as src:
    print(src.block_shapes)  # [(256, 256)]

# This is okay, but not optimal due to misalignment with the 256 block_shapes
da = xr.open_dataarray(url, chunks='auto', engine='rasterio')
# da.chunksizes  # Frozen({'band': (1,), 'y': (9087, 9087, 113), 'x': (3692, 3692, 3692, 3692, 3692)})

# Warning! This is the same as chunks=1, which explodes the number of tasks
da = xr.open_dataarray(url, chunks=True, engine='rasterio')

# Warning! You'll be reading all pixel values even if you just want 1
xr.open_dataarray(url, chunks={}, engine='rasterio')
```
There are many considerations for optimal chunk sizes (see the dask blog), but could a good default target roughly 100 MB per chunk, rounded to a multiple of the 256x256 block shape? A sketch of that calculation follows.
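As a rough illustration of what such a default might look like, here's a hedged sketch that picks chunk sizes as whole-block multiples near a ~100 MB target; `aligned_chunks` is a hypothetical helper, and `url` refers to the Capella GeoTIFF above:

```python
import math

import numpy as np
import rasterio
import xarray as xr

def aligned_chunks(src, target_bytes=100e6):
    """Hypothetical helper: chunk sizes that are whole multiples of the
    file's internal tiling, sized near target_bytes per 2D chunk."""
    blocky, blockx = src.block_shapes[0]
    itemsize = np.dtype(src.dtypes[0]).itemsize
    # Number of blocks per side so a square chunk lands near target_bytes.
    n = max(1, int(math.sqrt(target_bytes / itemsize) // max(blocky, blockx)))
    return {'band': 1, 'y': blocky * n, 'x': blockx * n}

with rasterio.open(url) as src:
    chunks = aligned_chunks(src)  # multiples of the (256, 256) tiling

da = xr.open_dataarray(url, chunks=chunks, engine='rasterio')
```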