direct-to-GPU decoding? #316

Open
martindurant opened this issue Apr 22, 2022 · 15 comments

Comments

@martindurant
Member

What would it take to make codecs which can interact with cupy arrays (or TF tensors, etc.) as the origin or output of zarr? I assume it would be a simple change in zarr, but would all the codecs need to be rewritten in CUDA (or numba, or ...)?

@jakirkham
Member

Maybe something like this is what you are looking for? 🙂

cc @madsbk @thomcom (who may be interested in this)

@martindurant
Member Author

Exactly! I am not surprised that these exist, but numcodecs doesn't know about them, and there's no way to say "lz4" codec, but "gpu" version. We should think about this.

@jakirkham
Member

jakirkham commented May 4, 2022

We are thinking about this ( zarr-developers/community#19 (comment) ) 😀

We should probably leverage the entrypoint hooks that you added ( #300 ) in KvikIO ( rapidsai/kvikio#66 )

@martindurant
Member Author

Perhaps another set of entrypoints? Perhaps a global config or context that says "make GPU (cupy?) arrays" or something like that?
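Roughly, something like this? Everything in the sketch below is made up for illustration; neither the config dict nor device_context exists in numcodecs or zarr today:

```python
# Purely hypothetical sketch of a global "target device" context for decoding;
# none of these names are existing numcodecs/zarr API.
from contextlib import contextmanager

_CONFIG = {"device": "cpu"}


@contextmanager
def device_context(device: str):
    """Temporarily switch which kind of array (numpy vs. cupy) decoding produces."""
    previous = _CONFIG["device"]
    _CONFIG["device"] = device
    try:
        yield
    finally:
        _CONFIG["device"] = previous


# with device_context("gpu"):
#     block = zarr_array[:]   # decoding would hand back cupy arrays here
```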

@jakirkham
Member

Ok, going to disentangle a few things here.

First, there is a big question: how do we use Zarr with other arrays (like CuPy)? There are changes in Numcodecs ( #305 ) and pending changes for Zarr ( zarr-developers/zarr-python#934 ) to address this use case.

Second, what does the end-user workflow look like for users working with Zarr on GPUs? This would involve using GDSStore and some compressor (if needed). This would give us something rough that users can run with.

Third, how can we make this process smooth/seamless for the end user? Maybe some additional flags are needed in APIs and/or a global config to select between different backends for opening files. This could also be an iterative process with users.
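For the second point, a rough sketch of what that end-user workflow might look like, assuming KvikIO's GDSStore can be constructed from a directory path (see rapidsai/kvikio#66) and that the array's compressor grows a GPU decode path (both assumptions, not settled API):

```python
# Sketch only; the GDSStore usage and the on-device codec are assumptions.
import zarr
from kvikio.zarr import GDSStore

store = GDSStore("data.zarr")   # GPUDirect Storage: chunk bytes land in device memory
z = zarr.open(store, mode="r")
block = z[:256, :256]           # with a GPU-aware codec this could stay on-device end to end
```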

@martindurant
Member Author

I can see how making a new GPU implementation of zarr might be easier than putting options throughout the existing code. Just a thought. Or you might say that making zarr agnostic to the buffer and array implementations is essential, but I don't yet know how hard that is.

@jakirkham
Member

We've been going with the latter approach. It's been ok so far.

@jakirkham
Member

cc @akshaysubr

@akshaysubr

akshaysubr commented Jul 25, 2023

say "lz4" codec, but "gpu" version

Agreed! How can we achieve this, though? Maybe a new numcodecs.codecs.gpu entrypoint and a device entry in numcodecs.registry.get_codec?
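As a strawman, the lookup could branch on a device key in the codec config; nothing below is current numcodecs API, the group name and logic are invented:

```python
# Hypothetical: a second entrypoint group for GPU implementations, selected by a
# "device" key in the codec config.
from importlib.metadata import entry_points


def get_codec(config):
    config = dict(config)
    codec_id = config.pop("id")
    device = config.pop("device", "cpu")
    group = "numcodecs.codecs.gpu" if device == "gpu" else "numcodecs.codecs"
    matches = [ep for ep in entry_points(group=group) if ep.name == codec_id]
    if not matches:
        raise ValueError(f"no {device} implementation registered for {codec_id!r}")
    return matches[0].load()(**config)


# get_codec({"id": "lz4", "device": "gpu"}) -> a CUDA LZ4 codec, if one is registered
```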

@akshaysubr

There's another issue with GPU support and numcodecs that can potentially be solved with an interface change that I'd like to bring up. Typically, we would want to schedule work to the GPU asynchronously, but because of the single-buffer-in, single-buffer-out interface of numcodecs, if you get compressed data on the GPU, you still need to know how much space to allocate for the decompressed buffer before you can schedule the decompression work. This makes the API synchronous, since you would need to pull the header of the compressed buffer onto the CPU, synchronize, use that data to figure out how much space to allocate for the decompressed buffer, and then schedule the decompression work. And this is made worse for compression formats that do not have the decompressed size in the header, like LZ4. Numcodecs currently gets around this issue for LZ4 on the CPU by adding a 4-byte header with the decompressed size, essentially making the compressed buffer incompatible with other LZ4 implementations: https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/lz4.pyx#L91.
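For reference, that framing is easy to see from Python; this is just an illustration of the current behaviour, not a proposed interface:

```python
import struct

import numpy as np
from numcodecs import LZ4

data = np.arange(1024, dtype="uint32").tobytes()
enc = LZ4().encode(data)

# numcodecs prepends the uncompressed size as a little-endian uint32, which is
# why the buffer is not interchangeable with other LZ4 block implementations.
(nbytes,) = struct.unpack("<I", enc[:4])
assert nbytes == len(data)
```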

For the zarr use case though, the decode call site knows what size to expect from the store-level metadata. So ideally, that size would get passed down to decode, letting every format stay fully standards compliant, enabling a fully asynchronous API, and improving performance. Is this a feasible change to the numcodecs interface?
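One channel that size can already flow through today is the out= argument that numcodecs decode calls accept; whether that is enough for a fully asynchronous GPU path is the open question, but a minimal CPU-side sketch of the idea:

```python
import numpy as np
from numcodecs import LZ4

codec = LZ4()
chunk_shape, dtype = (256, 256), np.dtype("float32")   # known from zarr metadata

encoded = codec.encode(np.zeros(chunk_shape, dtype=dtype))

# The caller knows the decoded size from dtype and chunk shape, so it can
# pre-allocate the destination and pass it in, rather than having the codec
# discover the size from a header inside the compressed buffer.
out = np.empty(chunk_shape, dtype=dtype)
codec.decode(encoded, out=out)
```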

@martindurant
Member Author

Note that the algorithms within blosc always have blosc's framing around them, so the size should be known; and other algorithms like zstd and snappy include the decompressed size in the standard (not necessarily guaranteed, but usually).

For the zarr use case though, the decode call site knows what size to expect from the store level metadata

This is only true if the compression is the only codec - but there could in principle be more in a chain.

@jakirkham
Member

Wonder if we could just store these size(s) somehow when doing compression so they can be more easily retrieved prior to decompression. How might we encode this size metadata effectively?

@martindurant
Member Author

How might we encode this size metadata effectively?

Since this is a per-chunk thing, you'd either have to have a separate source of information for each chunk (like kerchunk might), or store it as bytes at the start/end of the chunk, thus making it a non-standard implementation. The former possibility is of course something I have been thinking about because of kerchunk, and other per-chunk information is conceivable such as scale factor for scale/offset encoding. For sharded chunks, the shard index could play this role maybe.
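To make the separate-source-of-information option concrete, it could be as small as a mapping from chunk key to decoded byte count; the filename and layout below are invented for illustration, this is not a kerchunk or zarr format:

```python
# Hypothetical sidecar with per-chunk decoded sizes, kept alongside the array
# metadata so a GPU reader can allocate output buffers without touching chunk headers.
import json

chunk_sizes = {
    "0.0": 262144,   # decompressed bytes for chunk (0, 0)
    "0.1": 262144,
    "1.0": 262144,
    "1.1": 262144,
}

with open("chunk_sizes.json", "w") as f:   # made-up filename
    json.dump(chunk_sizes, f)
```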

Or of course, insist on using codecs that do know the size, as I mentioned above.

@akshaysubr

There are two issues with having the size be part of the chunk header (either like in the LZ4 codec or like in zstd):

  1. Not all codecs currently support either of these. For example, the zlib codec doesn't carry this size information; the decompression implementation assumes a default size and, if that turns out to be insufficient, allocates a bigger chunk of memory and moves the data over before continuing. This kind of approach can become fairly expensive.
  2. For GPU decoding, if we do use GDSStore, data is read directly into GPU memory and is never on the CPU. To decode then, you'd have to pull the header from the GPU to the CPU, synchronize the device, allocate memory, and then call the decompression kernel, making the pipeline partially synchronous.

The kerchunk approach is a nice solution to these issues, since that second stream of relatively lightweight information can be read onto the CPU and is mainly used for orchestration/control. Do you see a way this can be generalized? Maybe this information could be stored at encode time at the array level, after a reduction for the max size at each codec stage? That would allow decoders to allocate memory once and reuse it for multiple chunks.
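As a strawman for that reduction: track, over all chunks, the maximum intermediate size produced at each codec stage while encoding, and stash the result in the array attributes. The attribute name and the loop below are invented for illustration:

```python
import numpy as np
from numcodecs import LZ4, Shuffle

# Sketch: measure the largest intermediate buffer each codec stage produces while
# encoding, so a decoder can later pre-allocate one reusable scratch buffer per stage.
stages = [Shuffle(elementsize=4), LZ4()]
chunks = [np.random.rand(64, 64).astype("float32") for _ in range(4)]

max_stage_bytes = [0] * len(stages)
for chunk in chunks:
    buf = chunk
    for i, codec in enumerate(stages):
        buf = codec.encode(buf)
        max_stage_bytes[i] = max(max_stage_bytes[i], memoryview(buf).nbytes)

# e.g. zarr_array.attrs["codec_stage_max_bytes"] = max_stage_bytes  # invented key
print(max_stage_bytes)
```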

@jakirkham
Member

Yeah, agreed on keeping the size information separate. Maybe in the metadata, or perhaps in some small binary file adjacent to the metadata (cc @joshmoore, as we discussed something like this a while back).
