direct-to-GPU decoding? #316

Open
martindurant opened this issue Apr 22, 2022 · 15 comments

Comments

@martindurant
Member

What would it take to make codecs which can interact with cupy arrays (or TF tensors, etc.) as the origin or output of zarr? I assume it would be a simple change in zarr, but would all the codecs need to be rewritten in CUDA (or numba, or ...)?

@jakirkham
Member

Maybe something like this is what you are looking for? 🙂

cc @madsbk @thomcom (who may be interested in this)

@martindurant
Member Author

Exactly! I am not surprised that these exist, but numcodecs doesn't know about them, and there's no way to say "lz4" codec, but "gpu" version. We should think about this.

@jakirkham
Member

jakirkham commented May 4, 2022

We are thinking about this ( zarr-developers/community#19 (comment) ) 😀

We should probably leverage the entrypoint hooks that you added ( #300 ) in KvikIO ( rapidsai/kvikio#66 )

@martindurant
Member Author

Perhaps another set of entrypoints? Perhaps a global config or context that says "make GPU (cupy?) arrays" or something like that?
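Roughly, something like this? Everything in the sketch below is made up for illustration; neither the config dict nor device_context exists in numcodecs or zarr today:

```python
# Purely hypothetical sketch of a global "target device" context for decoding;
# none of these names are existing numcodecs/zarr API.
from contextlib import contextmanager

_CONFIG = {"device": "cpu"}


@contextmanager
def device_context(device: str):
    """Temporarily switch which kind of array (numpy vs. cupy) decoding produces."""
    previous = _CONFIG["device"]
    _CONFIG["device"] = device
    try:
        yield
    finally:
        _CONFIG["device"] = previous


# with device_context("gpu"):
#     block = zarr_array[:]   # decoding would hand back cupy arrays here
```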

@jakirkham
Member

Ok, going to disentangle a few things here.

First, there is a big question: how do we use Zarr with other arrays (like CuPy)? There are changes in Numcodecs ( #305 ) and pending changes for Zarr ( zarr-developers/zarr-python#934 ) to address this use case.

Second, what does the end-user workflow look like for users working with Zarr on GPUs? This would involve using GDSStore and some compressor (if needed). This would give us something rough that users can run with.

Third, how can we make this process smooth/seamless for the end user? Maybe some additional flags are needed in APIs and/or a global config to select between different backends for opening files. This could also be an iterative process with users.
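For the second point, a rough sketch of what that end-user workflow might look like, assuming KvikIO's GDSStore can be constructed from a directory path (see rapidsai/kvikio#66) and that the array's compressor grows a GPU decode path (both assumptions, not settled API):

```python
# Sketch only; the GDSStore usage and the on-device codec are assumptions.
import zarr
from kvikio.zarr import GDSStore

store = GDSStore("data.zarr")   # GPUDirect Storage: chunk bytes land in device memory
z = zarr.open(store, mode="r")
block = z[:256, :256]           # with a GPU-aware codec this could stay on-device end to end
```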

@martindurant
Member Author

I can see how making a new GPU implementation of zarr might be easier than putting options throughout the existing code. Just a thought. Or you might say that making zarr agnostic to the buffer and array implementations is essential, but I don't yet know how hard that is.

@jakirkham
Member

We've been going with the latter approach. It's been ok so far.

@jakirkham
Member

cc @akshaysubr

@akshaysubr

akshaysubr commented Jul 25, 2023

say "lz4" codec, but "gpu" version

Agreed! How can we achieve this, though? Maybe a new numcodecs.codecs.gpu entrypoint and a device entry in numcodecs.registry.get_codec?
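As a strawman, the lookup could branch on a device key in the codec config; nothing below is current numcodecs API, the group name and logic are invented:

```python
# Hypothetical: a second entrypoint group for GPU implementations, selected by a
# "device" key in the codec config.
from importlib.metadata import entry_points


def get_codec(config):
    config = dict(config)
    codec_id = config.pop("id")
    device = config.pop("device", "cpu")
    group = "numcodecs.codecs.gpu" if device == "gpu" else "numcodecs.codecs"
    matches = [ep for ep in entry_points(group=group) if ep.name == codec_id]
    if not matches:
        raise ValueError(f"no {device} implementation registered for {codec_id!r}")
    return matches[0].load()(**config)


# get_codec({"id": "lz4", "device": "gpu"}) -> a CUDA LZ4 codec, if one is registered
```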

@akshaysubr

There's another issue with GPU support and numcodecs that can potentially be solved with an interface change that I'd like to bring up. Typically, we would want to schedule work to the GPU asynchronously, but because of the single-buffer-in, single-buffer-out interface of numcodecs, if you get compressed data on the GPU, you still need to know how much space to allocate for the decompressed buffer before you can schedule the decompression work. This makes the API synchronous, since you would need to pull the header of the compressed buffer onto the CPU, synchronize, use that data to figure out how much space to allocate for the decompressed buffer, and then schedule the decompression work. And this is made worse for compression formats that do not have the decompressed size in the header, like LZ4. Numcodecs currently gets around this issue for LZ4 on the CPU by adding a 4-byte header with the decompressed size, essentially making the compressed buffer incompatible with other LZ4 implementations: https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/lz4.pyx#L91.
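For reference, that framing is easy to see from Python; this is just an illustration of the current behaviour, not a proposed interface:

```python
import struct

import numpy as np
from numcodecs import LZ4

data = np.arange(1024, dtype="uint32").tobytes()
enc = LZ4().encode(data)

# numcodecs prepends the uncompressed size as a little-endian uint32, which is
# why the buffer is not interchangeable with other LZ4 block implementations.
(nbytes,) = struct.unpack("<I", enc[:4])
assert nbytes == len(data)
```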

For the zarr use case though, the decode call site knows what size to expect from the store-level metadata. So ideally, that size would get passed down to decode, letting every format stay fully standards compliant, enabling a fully asynchronous API, and improving performance. Is this a feasible change to the numcodecs interface?
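One channel that size can already flow through today is the out= argument that numcodecs decode calls accept; whether that is enough for a fully asynchronous GPU path is the open question, but a minimal CPU-side sketch of the idea:

```python
import numpy as np
from numcodecs import LZ4

codec = LZ4()
chunk_shape, dtype = (256, 256), np.dtype("float32")   # known from zarr metadata

encoded = codec.encode(np.zeros(chunk_shape, dtype=dtype))

# The caller knows the decoded size from dtype and chunk shape, so it can
# pre-allocate the destination and pass it in, rather than having the codec
# discover the size from a header inside the compressed buffer.
out = np.empty(chunk_shape, dtype=dtype)
codec.decode(encoded, out=out)
```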

@martindurant
Member Author

Note that the algorithms within blosc always have blosc's framing around them, so the size should be known; and other algorithms like zstd and snappy include the decompressed size in the standard (not necessarily guaranteed, but usually).

For the zarr use case though, the decode call site knows what size to expect from the store level metadata

This is only true if the compression is the only codec - but there could in principle be more in a chain.

@jakirkham
Member

Wonder if we could just store these size(s) somehow when doing compression so they can be more easily retrieved prior to decompression. How might we encode this size metadata effectively?

@martindurant
Member Author

How might we encode this size metadata effectively?

Since this is a per-chunk thing, you'd either have to have a separate source of information for each chunk (like kerchunk might), or store it as bytes at the start/end of the chunk, thus making it a non-standard implementation. The former possibility is of course something I have been thinking about because of kerchunk, and other per-chunk information is conceivable such as scale factor for scale/offset encoding. For sharded chunks, the shard index could play this role maybe.
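To make the separate-source-of-information option concrete, it could be as small as a mapping from chunk key to decoded byte count; the filename and layout below are invented for illustration, this is not a kerchunk or zarr format:

```python
# Hypothetical sidecar with per-chunk decoded sizes, kept alongside the array
# metadata so a GPU reader can allocate output buffers without touching chunk headers.
import json

chunk_sizes = {
    "0.0": 262144,   # decompressed bytes for chunk (0, 0)
    "0.1": 262144,
    "1.0": 262144,
    "1.1": 262144,
}

with open("chunk_sizes.json", "w") as f:   # made-up filename
    json.dump(chunk_sizes, f)
```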

Or of course, insist on using codecs that do know the size, as I mentioned above.

@akshaysubr

There are two issues with having the size be part of the chunk header (either like in the LZ4 codec or like in zstd):

  1. Not all codecs currently support either of these. For example, the zlib codec doesn't carry this size information; the decompression implementation assumes a default size and, if that turns out to be insufficient, allocates a bigger chunk of memory and moves the data over before continuing. This kind of approach can become fairly expensive.
  2. For GPU decoding, if we do use GDSStore, data is read directly into GPU memory and is never on the CPU. To decode then, you'd have to pull the header from the GPU to the CPU, synchronize the device, allocate memory, and then call the decompression kernel, making the pipeline partially synchronous.

The kerchunk approach is a nice solution to these issues, since that second stream of relatively lightweight information can be read onto the CPU and is mainly used for orchestration/control. Do you see a way this can be generalized? Maybe this information could be stored at encode time at the array level, after a reduction for the max size at each codec stage? That would allow decoders to allocate memory once and reuse it for multiple chunks.
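As a strawman for that reduction: track, over all chunks, the maximum intermediate size produced at each codec stage while encoding, and stash the result in the array attributes. The attribute name and the loop below are invented for illustration:

```python
import numpy as np
from numcodecs import LZ4, Shuffle

# Sketch: measure the largest intermediate buffer each codec stage produces while
# encoding, so a decoder can later pre-allocate one reusable scratch buffer per stage.
stages = [Shuffle(elementsize=4), LZ4()]
chunks = [np.random.rand(64, 64).astype("float32") for _ in range(4)]

max_stage_bytes = [0] * len(stages)
for chunk in chunks:
    buf = chunk
    for i, codec in enumerate(stages):
        buf = codec.encode(buf)
        max_stage_bytes[i] = max(max_stage_bytes[i], memoryview(buf).nbytes)

# e.g. zarr_array.attrs["codec_stage_max_bytes"] = max_stage_bytes  # invented key
print(max_stage_bytes)
```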

@jakirkham
Member

Yeah, agreed on keeping the size information separate. Maybe in the metadata, or perhaps in some small binary file adjacent to the metadata (cc @joshmoore, as we discussed something like this a while back).
