Set xarray "preferred_chunks" based on rasterio.block_shapes #567

Closed
scottyhq opened this issue Aug 25, 2022 · 3 comments
Labels
proposal Idea for a new feature.

Comments

@scottyhq
Contributor

It could be nice to take advantage of known raster tiling (rasterio.block_shapes) to assign dask chunks when opening tiled datasets. It seems straightforward to implement with xarray's backend preferred chunk sizes.
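As a rough illustration of the idea, the sketch below maps rasterio-style block shapes to a chunk dictionary of the kind an xarray backend can expose through a variable's `encoding["preferred_chunks"]`. The helper name `preferred_chunks_from_block_shapes` and the choice to express no preference when bands disagree are assumptions for illustration, not rioxarray's actual implementation.

```python
def preferred_chunks_from_block_shapes(block_shapes):
    """Map rasterio-style block shapes [(rows, cols), ...] to a chunk dict.

    Hypothetical helper: assumes dimensions are named band/y/x and that
    block shapes are given as (rows, cols), i.e. (y, x).
    """
    unique = {tuple(shape) for shape in block_shapes}
    if len(unique) != 1:
        # Assumption: if bands use different block shapes, express no preference.
        return {}
    rows, cols = unique.pop()
    return {"band": 1, "y": rows, "x": cols}


print(preferred_chunks_from_block_shapes([(256, 256)]))
```

A backend could then store the returned dictionary in each variable's encoding so that `chunks={}` picks it up.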

Setting this could also help users avoid two common problems: task graphs that explode when they rely on defaults (chunks=True), and reads that pull far more data than necessary (chunks={}).

According to xarray docs for open_dataset/open_dataarray, "chunks={} loads the dataset with dask using engine preferred chunks if exposed by the backend, otherwise with a single chunk for all arrays. chunks='auto' will use dask auto chunking". Below is an example with some public data (18287x18460 with 256x256 tiling):

import xarray as xr
import rasterio
url = 'https://capella-open-data.s3.amazonaws.com/data/2022/4/21/CAPELLA_C06_SP_GEO_HH_20220421064500_20220421064519/CAPELLA_C06_SP_GEO_HH_20220421064500_20220421064519_preview.tif'

with rasterio.open(url) as src:
    print(src.block_shapes) # [(256, 256)]

# This is okay, but not optimal due to misalignment with 256 block_shapes
da = xr.open_dataarray(url, chunks='auto', engine='rasterio')
# da.chunksizes 
# Frozen({'band': (1,), 'y': (9087, 9087, 113), 'x': (3692, 3692, 3692, 3692, 3692)})
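The misalignment mentioned above can be checked with plain arithmetic. This is a small sketch (the helper name `misaligned_boundaries` is made up for illustration) that lists the interior chunk boundaries that do not fall on a multiple of the 256-pixel block size, using the `chunksizes` reported for `chunks='auto'`:

```python
def misaligned_boundaries(chunk_sizes, block):
    """Return interior chunk boundaries that are not multiples of `block`."""
    boundaries = []
    offset = 0
    for size in chunk_sizes[:-1]:  # the final boundary is the array edge
        offset += size
        if offset % block:
            boundaries.append(offset)
    return boundaries


y_chunks = (9087, 9087, 113)
x_chunks = (3692, 3692, 3692, 3692, 3692)
print(misaligned_boundaries(y_chunks, 256))
print(misaligned_boundaries(x_chunks, 256))
```

Every interior boundary cuts through a 256x256 block, so blocks that straddle a boundary are likely to be read by more than one task.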


# Warning! This is the same as chunks=1, which explodes the number of tasks
da = xr.open_dataarray(url, chunks=True, engine='rasterio')


# Warning! You'll be reading all pixel values even if you just want 1
xr.open_dataarray(url, chunks={}, engine='rasterio')
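To put the three cases side by side, here is a back-of-the-envelope sketch that counts how many chunks each choice produces for this array. It assumes the shape is (band, y, x) = (1, 18287, 18460), as implied by the `chunksizes` output above, and also shows the count for chunks that match the 256x256 blocks; the helper `n_chunks` is hypothetical.

```python
import math


def n_chunks(shape, chunks):
    """Number of chunks when `shape` is split into pieces of size `chunks`."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


shape = (1, 18287, 18460)  # assumed (band, y, x)

print("chunks=1     :", n_chunks(shape, (1, 1, 1)))
print("chunks='auto':", n_chunks(shape, (1, 9087, 3692)))
print("chunks={}    :", n_chunks(shape, shape))
print("256x256      :", n_chunks(shape, (1, 256, 256)))
```

One chunk per pixel gives hundreds of millions of chunks, a single chunk forces a full read, and block-sized chunks sit in between at a few thousand.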


There are many considerations for optimal chunks (see dask blog), but a good default could be a chunk of roughly 100 MB whose dimensions are multiples of the 256x256 block shape?
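One possible way to pick such a default is sketched below: grow the chunk in whole blocks until it is as large as possible without exceeding a byte budget. The function name `block_aligned_chunk`, the 100 MiB default, and the 2-byte item size in the example are assumptions for illustration; the actual dtype of the file above is not stated here.

```python
import math


def block_aligned_chunk(block_shape, itemsize, target_bytes=100 * 2**20):
    """Return a (rows, cols) chunk made of k x k blocks, at or under target_bytes.

    Hypothetical helper: does not clamp to the array shape, and falls back
    to a single block if one block already exceeds the target.
    """
    rows, cols = block_shape
    block_bytes = rows * cols * itemsize
    # Largest k such that k * k blocks fit within the byte budget.
    k = max(1, math.isqrt(target_bytes // block_bytes))
    return (k * rows, k * cols)


# Example with an assumed 2-byte dtype (e.g. uint16).
print(block_aligned_chunk((256, 256), itemsize=2))
```

The result could be passed as `chunks={'y': ..., 'x': ...}` so that every chunk boundary coincides with a block boundary.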

@scottyhq scottyhq added the proposal Idea for a new feature. label Aug 25, 2022
@snowman2
Member

Mind moving this to #253 ?

@scottyhq
Contributor Author

scottyhq commented Aug 25, 2022

> Mind moving this to #253 ?

🤦 sorry i overlooked that! Should I just copy/paste and close this one?

@snowman2
Member

> sorry i overlooked that! Should I just copy/paste and close this one?

No worries. I overlook things all the time 😄. Yes, please.
