Default chunking when reading and writing with dask #253
For reference, current location of chunking code: Lines 650 to 688 in ce1194a
It could be nice to take advantage of known raster tiling (as exposed by rasterio). Setting this could also avoid common errors with exploding task graphs when users rely on defaults (chunks=True) or reading way more data than necessary (chunks={}).

According to the xarray docs for open_dataset/open_dataarray, "chunks={} loads the dataset with dask using engine preferred chunks if exposed by the backend, otherwise with a single chunk for all arrays. chunks='auto' will use dask auto chunking".

Below is an example with some public data (18287x18460 with 256x256 tiling):

import xarray as xr
import rasterio
url = 'https://capella-open-data.s3.amazonaws.com/data/2022/4/21/CAPELLA_C06_SP_GEO_HH_20220421064500_20220421064519/CAPELLA_C06_SP_GEO_HH_20220421064500_20220421064519_preview.tif'
with rasterio.open(url) as src:
print(src.block_shapes) # [(256, 256)]
# This is okay, but not optimal due to misalignment with 256 block_shapes
da = xr.open_dataarray(url, chunks='auto', engine='rasterio')
# da.chunksizes
# Frozen({'band': (1,), 'y': (9087, 9087, 113), 'x': (3692, 3692, 3692, 3692, 3692)})

# Warning! This is the same as chunks=1, which explodes the number of tasks
da = xr.open_dataarray(url, chunks=True, engine='rasterio')

# Warning! You'll be reading all pixel values even if you just want 1
xr.open_dataarray(url, chunks={}, engine='rasterio')

There are many considerations for optimal chunks (see dask blog), but a good default size could be determined by ~100 MB and a multiple of 256x256?
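To make the "~100 MB and a multiple of 256x256" idea concrete, here is a rough sketch of how such a default could be computed. This is not rioxarray's implementation; `aligned_chunks` and `n_chunks` are hypothetical helper names, and the 4-byte item size in the usage note is an assumption, since the dtype of the example file is not stated above.

```python
import math


def aligned_chunks(shape, block_shape, itemsize, target_bytes=100 * 2**20):
    """Pick a (y, x) chunk shape that is a whole multiple of the raster's
    internal block shape and holds at most roughly target_bytes.

    Hypothetical helper for illustration only.
    """
    height, width = shape
    block_y, block_x = block_shape

    # Pixels that fit in the target size, then the side of a square chunk.
    target_pixels = target_bytes // itemsize
    side = math.isqrt(target_pixels)

    # Round down to a multiple of the block size, but keep at least one block.
    chunk_y = max(block_y, (side // block_y) * block_y)
    chunk_x = max(block_x, (side // block_x) * block_x)

    # No point in chunks larger than the raster rounded up to whole blocks.
    max_y = math.ceil(height / block_y) * block_y
    max_x = math.ceil(width / block_x) * block_x
    return min(chunk_y, max_y), min(chunk_x, max_x)


def n_chunks(shape, chunks):
    """Number of chunks needed to cover a 2D array of the given shape."""
    return math.ceil(shape[0] / chunks[0]) * math.ceil(shape[1] / chunks[1])


shape = (18287, 18460)   # (y, x) of the example raster
blocks = (256, 256)      # src.block_shapes[0]
chunks = aligned_chunks(shape, blocks, itemsize=4)  # itemsize=4 is an assumption
print(chunks, n_chunks(shape, chunks), n_chunks(shape, blocks))
```

Under those assumptions this gives 5120x5120 chunks (20x20 blocks, 100 MiB each) and 16 chunks in total, compared with 5256 chunks if every 256x256 block became its own chunk. The result could then be passed explicitly, e.g. chunks={'y': 5120, 'x': 5120}.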
@scottyhq, that sounds like a great place to start for improving the default chunking. Thanks for sharing your thoughts on this 👍
Related to: pangeo-data/cog-best-practices#2
https://github.com/pangeo-data/cog-best-practices/blob/main/4-threads-vs-async.ipynb
Sounds like it would be worth digging into and seeing what could be improved.