Default chunking when reading and writing with dask #253

snowman2 · 2021-02-23T19:04:38Z

Related to: pangeo-data/cog-best-practices#2

https://github.com/pangeo-data/cog-best-practices/blob/main/4-threads-vs-async.ipynb

Sounds like it would be worth digging into and seeing what could be improved.

snowman2 · 2022-08-25T00:46:28Z

For reference, current location of chunking code:

Lines 650 to 688 in ce1194a

    
           def _prepare_dask( 
        
               result: DataArray, 
        
               riods: RasterioReader, 
        
               filename: Union[str, os.PathLike], 
        
               chunks: Union[int, Tuple, Dict], 
        
           ) -> DataArray: 
        
               """ 
        
               Prepare the data for dask computations 
        
               """ 
        
               # pylint: disable=import-outside-toplevel 
        
               from dask.base import tokenize 
        
               # augment the token with the file modification time 
        
               try: 
        
                   mtime = os.path.getmtime(filename) 
        
               except (TypeError, OSError): 
        
                   # the filename is probably an s3 bucket rather than a regular file 
        
                   mtime = None 
        
               if chunks in (True, "auto"): 
        
                   import dask 
        
                   from dask.array.core import normalize_chunks 
        
                   if version.parse(dask.__version__) < version.parse("0.18.0"): 
        
                       msg = ( 
        
                           "Automatic chunking requires dask.__version__ >= 0.18.0 . " 
        
                           f"You currently have version {dask.__version__}" 
        
                       ) 
        
                       raise NotImplementedError(msg) 
        
                   block_shape = (1,) + riods.block_shapes[0] 
        
                   chunks = normalize_chunks( 
        
                       chunks=(1, "auto", "auto"), 
        
                       shape=(riods.count, riods.height, riods.width), 
        
                       dtype=_rasterio_to_numpy_dtype(riods.dtypes), 
        
                       previous_chunks=tuple((c,) for c in block_shape), 
        
                   ) 
        
               token = tokenize(filename, mtime, chunks) 
        
               name_prefix = f"open_rasterio-{token}" 
        
               return result.chunk(chunks, name_prefix=name_prefix, token=token)

scottyhq · 2022-08-25T02:53:15Z

It could be nice to take advantage of known raster tiling (rasterio.block_shapes) to assign dask chunks when opening tiled datasets. It seems straightforward to implement with xarray's backend preferred chunk sizes.

Setting this could also avoid common errors with exploding task graphs when users rely on defaults (chunks=True) or reading way more data than necessary (chunks={}).

According to xarray docs for open_dataset/open_dataarray, "chunks={} loads the dataset with dask using engine preferred chunks if exposed by the backend, otherwise with a single chunk for all arrays. chunks='auto' will use dask auto chunking". Below is an example with some public data (18287x18460 with 256x256 tiling):

import xarray as xr
import rasterio
url = 'https://capella-open-data.s3.amazonaws.com/data/2022/4/21/CAPELLA_C06_SP_GEO_HH_20220421064500_20220421064519/CAPELLA_C06_SP_GEO_HH_20220421064500_20220421064519_preview.tif'

with rasterio.open(url) as src:
    print(src.block_shapes) # [(256, 256)]

# This is okay, but not optimal due to misalignment with 256 block_shapes
da = xr.open_dataarray(url, chunks='auto', engine='rasterio')
# da.chunksizes 
# Frozen({'band': (1,), 'y': (9087, 9087, 113), 'x': (3692, 3692, 3692, 3692, 3692)})

# Warning! This is the same as chunks=1, which explodes the number of tasks
da = xr.open_dataarray(url, chunks=True, engine='rasterio')

# Warning! You'll be reading all pixel values even if you just want 1
xr.open_dataarray(url, chunks={}, engine='rasterio')

There are many considerations for optimal chunks (see dask blog), but a good default size could be determined by ~100Mb and a multiple of 256x256?

snowman2 · 2022-08-25T13:17:19Z

@scottyhq. that sounds like a great place to start for improving the default chunking. Thanks for sharing your thoughts on this 👍

snowman2 mentioned this issue Aug 25, 2022

Set xarray "preferred_chunks" based on rasterio.block_shapes #567

Closed

snowman2 added the proposal Idea for a new feature. label Aug 25, 2022

snowman2 added the dask Dask related issue. label Sep 20, 2022

scottyhq mentioned this issue Jan 31, 2023

dask chunking tutorial outline xarray-contrib/xarray-tutorial#157

Open

snowman2 mentioned this issue Jul 10, 2023

ENH: Pass on on-disk chunk sizes as preferred chunk sizes to the xarray backend #678

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Default chunking when reading and writing with dask #253

Default chunking when reading and writing with dask #253

snowman2 commented Feb 23, 2021

snowman2 commented Aug 25, 2022

scottyhq commented Aug 25, 2022

snowman2 commented Aug 25, 2022

Default chunking when reading and writing with dask #253

Default chunking when reading and writing with dask #253

Comments

snowman2 commented Feb 23, 2021

snowman2 commented Aug 25, 2022

scottyhq commented Aug 25, 2022

snowman2 commented Aug 25, 2022