Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default chunking when reading and writing with dask #253

Open
snowman2 opened this issue Feb 23, 2021 · 3 comments
Open

Default chunking when reading and writing with dask #253

snowman2 opened this issue Feb 23, 2021 · 3 comments
Labels
dask Dask related issue. proposal Idea for a new feature.

Comments

@snowman2
Copy link
Member

Related to: pangeo-data/cog-best-practices#2

https://github.com/pangeo-data/cog-best-practices/blob/main/4-threads-vs-async.ipynb

Sounds like it would be worth digging into and seeing what could be improved.

@snowman2
Copy link
Member Author

For reference, current location of chunking code:

rioxarray/rioxarray/_io.py

Lines 650 to 688 in ce1194a

def _prepare_dask(
result: DataArray,
riods: RasterioReader,
filename: Union[str, os.PathLike],
chunks: Union[int, Tuple, Dict],
) -> DataArray:
"""
Prepare the data for dask computations
"""
# pylint: disable=import-outside-toplevel
from dask.base import tokenize
# augment the token with the file modification time
try:
mtime = os.path.getmtime(filename)
except (TypeError, OSError):
# the filename is probably an s3 bucket rather than a regular file
mtime = None
if chunks in (True, "auto"):
import dask
from dask.array.core import normalize_chunks
if version.parse(dask.__version__) < version.parse("0.18.0"):
msg = (
"Automatic chunking requires dask.__version__ >= 0.18.0 . "
f"You currently have version {dask.__version__}"
)
raise NotImplementedError(msg)
block_shape = (1,) + riods.block_shapes[0]
chunks = normalize_chunks(
chunks=(1, "auto", "auto"),
shape=(riods.count, riods.height, riods.width),
dtype=_rasterio_to_numpy_dtype(riods.dtypes),
previous_chunks=tuple((c,) for c in block_shape),
)
token = tokenize(filename, mtime, chunks)
name_prefix = f"open_rasterio-{token}"
return result.chunk(chunks, name_prefix=name_prefix, token=token)

@snowman2 snowman2 added the proposal Idea for a new feature. label Aug 25, 2022
@scottyhq
Copy link
Contributor

It could be nice to take advantage of known raster tiling (rasterio.block_shapes) to assign dask chunks when opening tiled datasets. It seems straightforward to implement with xarray's backend preferred chunk sizes.

Setting this could also avoid common errors with exploding task graphs when users rely on defaults (chunks=True) or reading way more data than necessary (chunks={}).

According to xarray docs for open_dataset/open_dataarray, "chunks={} loads the dataset with dask using engine preferred chunks if exposed by the backend, otherwise with a single chunk for all arrays. chunks='auto' will use dask auto chunking". Below is an example with some public data (18287x18460 with 256x256 tiling):

import xarray as xr
import rasterio
url = 'https://capella-open-data.s3.amazonaws.com/data/2022/4/21/CAPELLA_C06_SP_GEO_HH_20220421064500_20220421064519/CAPELLA_C06_SP_GEO_HH_20220421064500_20220421064519_preview.tif'

with rasterio.open(url) as src:
    print(src.block_shapes) # [(256, 256)]

# This is okay, but not optimal due to misalignment with 256 block_shapes
da = xr.open_dataarray(url, chunks='auto', engine='rasterio')
# da.chunksizes 
# Frozen({'band': (1,), 'y': (9087, 9087, 113), 'x': (3692, 3692, 3692, 3692, 3692)})

Screen Shot 2022-08-24 at 5 20 59 PM

# Warning! This is the same as chunks=1, which explodes the number of tasks
da = xr.open_dataarray(url, chunks=True, engine='rasterio')

Screen Shot 2022-08-24 at 5 21 44 PM

# Warning! You'll be reading all pixel values even if you just want 1
xr.open_dataarray(url, chunks={}, engine='rasterio')

Screen Shot 2022-08-24 at 5 21 25 PM

There are many considerations for optimal chunks (see dask blog), but a good default size could be determined by ~100Mb and a multiple of 256x256?

@snowman2
Copy link
Member Author

@scottyhq. that sounds like a great place to start for improving the default chunking. Thanks for sharing your thoughts on this 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask Dask related issue. proposal Idea for a new feature.
Projects
None yet
Development

No branches or pull requests

2 participants