POC implementation of ZEP003 #1483
base: main
Conversation
Thanks for getting started on this!
What would you like to do here, finish up for v2, then do v3? Or try to just go for v3?
Btw, I noticed a bug in the indexing and have opened a PR to fix it here: martindurant#21.
zarr/indexing.py (outdated)
nelem = (slice_end - slice_start) // step
self.projections.append(
    ChunkDimProjection(
        i, slice(slice_start, slice_end, step), slice(nfilled, nfilled + nelem)
I believe this will only work if the chunk boundaries are multiples of the step size.
import numpy as np, zarr
z = zarr.array(np.arange(10), chunks=[[1, 2, 2, 5]])
z[::2]
# array([6, 8], dtype=int32)
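# expected: array([0, 2, 4, 6, 8])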
I've opened a PR into your branch fixing this plus adding some tests: martindurant#21
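For reference, a minimal sketch of the kind of per-chunk alignment needed here (a hypothetical helper, not the code from martindurant#21): given the global selection start/stop/step and a chunk's global extent, find the first selected index that falls inside the chunk and express the result as a chunk-local slice.

def chunk_step_slice(sel_start, sel_stop, step, chunk_start, chunk_stop):
    # First selected global index at or after chunk_start.
    if chunk_start <= sel_start:
        first = sel_start
    else:
        first = sel_start + -(-(chunk_start - sel_start) // step) * step  # ceil-div
    last_excl = min(sel_stop, chunk_stop)
    if first >= last_excl:
        return None  # this chunk contributes no elements
    # Convert to the chunk's local coordinates; the step stays the same.
    return slice(first - chunk_start, last_excl - chunk_start, step)

# For chunks [1, 2, 2, 5] and the selection ::2 over 10 elements, this yields
# local slices covering global indices 0, 2, 4, 6 and 8, as expected.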
The point was to get feedback and show that what was proposed in ZEP0003 is very achievable if we can get some buy-in. I would be happy to get this working for v2, since that alone solves the kerchunk case. However, it should be part of v3, and the implementation would presumably be identical except for how the chunks are written to/from metadata, and the correct API for creating such arrays. As things stand here, bool indexing and fancy (int) indexing haven't been done yet, where I expect the former to be very easy. Also, plenty of things that access […]
So, broadly, this code is relevant in any path forward, and it would be good to flesh it out so we can show off use cases? I would be up for collaborating some more here, either via PRs to your branch or whatever.
I think I just got this working. It was basically replacing […]. Also boolean indexing, which was indeed quite easy. Essentially […]
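For anyone following along, here is a rough sketch (my own illustration under stated assumptions, not the branch's code) of why boolean indexing is straightforward once the cumulative chunk offsets are known: the mask is split at the chunk boundaries and each piece is applied to its own chunk.

import numpy as np

def split_bool_selection(mask, chunk_sizes):
    # Chunk boundaries in global coordinates, e.g. [1, 2, 2, 5] -> [0, 1, 3, 5, 10].
    offsets = np.concatenate(([0], np.cumsum(chunk_sizes)))
    out_start = 0
    parts = []
    for i, (start, stop) in enumerate(zip(offsets[:-1], offsets[1:])):
        sub = mask[start:stop]          # the mask restricted to chunk i
        n = int(sub.sum())              # number of selected elements in chunk i
        parts.append((i, sub, slice(out_start, out_start + n)))
        out_start += n
    return parts                        # (chunk index, chunk-local mask, output slice)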
Great to see this happening, @martindurant! @alex-s-gardner and I are very interested in seeing the progress of this (zarr-developers/zarr-specs#138).
Fix indexing with steps, add tests
@ivirshup, the current code does break a significant number of tests with standard indexing; it seems to always happen in the last chunk. Probably happening in […]
Just a quick note before I dive in more. In ZEP0003 there's the line "It would be reasonable to wish to backport the feature to v2." and this is on v2. I'll just point out we really don't have the mechanisms to introduce something like this in v2: no process for updating the spec, and no way for implementations to know they're getting something they can't support.
@joshmoore - the index code should be exactly the same. It's in v2 exactly because we didn't have to update the spec/metadata handling code to get it to work. Actually, it would be useful to kerchunk as-is, given that it would be for datasets that simply cannot otherwise be represented by zarr. But yes, the aim is to make this an implementation for v3. I still think it would be useful for v2 to be able to read such datasets, however.
Definitely no objections that it would be useful. But I'm concerned about the cost to all the other implementations. Correct me if I'm wrong, but they'd fall over spectacularly, no?
They would fall over, yes. It would be an early error before loading any data. However, if this were a kerchunk thing, they probably can't open the store object anyway.
Codecov Report
Attention: patch coverage is incomplete; 5 of the new lines are not covered by tests.
Additional details and impacted files:
@@            Coverage Diff             @@
##              main     #1483    +/-   ##
===========================================
- Coverage   100.00%    99.96%   -0.04%
===========================================
  Files           37        37
  Lines        14729     14889     +160
===========================================
+ Hits         14729     14884     +155
- Misses           0         5       +5
Just to point out: this works transparently with V3; we apparently DO NOT VALIDATE the chunks property.
metadata: […]
Does this implementation already work with kerchunk?
In principle, the implementation works, but there is no code currently in kerchunk to produce zarr metadata that would use it.
+1 to this PR, would be great if this worked in V2
It does work for V2! Some things will break as given here, e.g., array.info(), but dask.array.from_zarr should work as is.
This feature would be very interesting to have also from the Julia side, and I would be very much in favor of even having this as some kind of patch for v2, to have it in a usable state earlier. I drafted an implementation for Zarr.jl (JuliaIO/Zarr.jl#126), but I think it is not compatible with this implementation. The main detail I stumbled across was that without ZEP0003 there was a guarantee that every chunk in a store would represent a chunk of exactly the same shape when decompressed. Even for incomplete chunks at the array boundaries this was achieved through padding with fill values that are ignored during reading, but allow zero-cost resizing. This allowed an easy re-use of compression/decompression buffers when reading/writing multiple chunks sequentially. My current Julia implementation keeps this behavior, in that it always compresses chunks of the maximum chunk size by padding with fill values, so that the invariant mentioned above is maintained. Maybe it would be a good idea to clarify in the ZEP text the consequences this has on the uniformity of an uncompressed chunk. To illustrate this in a small example:

import zarr
z1 = zarr.create((10,), chunks=(3,), dtype='i4', fill_value=1, store="./chunktests.zarr", compressor=None)
z2 = zarr.create((10,), chunks=([3, 3, 3, 1],), dtype='i4', fill_value=1, store="./chunktests.zarr", compressor=None)

The question is whether these two arrays should be equivalent and store the same binary information. I think with this current implementation they would not, because in the last chunk z1 stores a full-size padded chunk while z2 would store only the single remaining element.
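To make the byte-level consequence concrete, here is a small hedged check (not part of this PR). It recreates the z1 case in memory and inspects the raw bytes of its last chunk; the z2 side is left as a comment because it depends on this POC's non-padding write path.

import numpy as np, zarr

padded = zarr.create((10,), chunks=(3,), dtype='i4', fill_value=1, compressor=None)  # like z1, but in memory
padded[:] = np.arange(10)
# Standard v2 pads the final chunk to the full chunk shape before encoding:
print(len(padded.store['3']))  # 12 bytes: element 9 plus two fill values
# Under the non-padding scheme in this PR, the last chunk of z2 ([3, 3, 3, 1])
# would hold a single int32 (4 bytes), so the stored bytes of z1 and z2 differ
# even though both arrays contain the same values.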
Thinking more about this I realized that your non-padding implementation is the only one that would work well together with kerchunk, so this is definitely the way to go. We might still want to mention this point somewhere in the ZEP draft.
Indeed, the kerchunk workflow is very important to me, if not to everyone. Furthermore, we read multiple chunks concurrently and in the future will decompress in parallel too. That means you can't easily reuse buffers. In Python's memory model, the buffer will not actually be released to the OS for a while anyway, so maybe it's no win for Python at all. In the final, best implementation, we would even like to read or decompress directly into the target array memory buffer for contiguous cases.
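As a rough illustration of that pattern (my own sketch with hypothetical fetch/decode callables, not zarr's or kerchunk's actual code): each chunk is fetched and decoded on its own worker and written straight into its region of a preallocated output, so no shared decompression buffer needs to be reused.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def read_concurrently(fetch, decode, chunk_slices, shape, dtype):
    # fetch(i) -> raw bytes of chunk i; decode(buf) -> ndarray shaped like its slice;
    # chunk_slices is a mapping of chunk index -> output slice.
    out = np.empty(shape, dtype=dtype)
    def work(i):
        out[chunk_slices[i]] = decode(fetch(i))  # each chunk owns a disjoint region
    with ThreadPoolExecutor() as pool:
        list(pool.map(work, chunk_slices))       # process chunk indices in parallel
    return out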
I've started a POC on top of this POC for […]. I'm also pretty sure I can only get this working without padding.
Hi @martindurant, thanks for sending the PR. I have requested reviews from the Zarr-Python devs. Additionally, if anyone from @zarr-developers/python-core-devs can review the PR, I'd be thankful to them. Also, @normanrz, if you could review this PR, that'd be great.
Meta comment - we are working on a fresh Array implementation that covers both V2 and V3 arrays (see #1583 for details). We are actively seeking input/participation from those invested in ZEP003. Variable chunking is likely to require some changes to the codec API and we want to get your input as we roll out the new design. cc @d-v-b and @normanrz. xref: #1595
@martindurant and others - now that v3 has come together, I'd be very interested to see this move to that branch. Who is interested in trying out an implementation on top of the new array api?
I would be - this is something that @abarciauskas-bgse and I want to get funding to work on... Don't let that stop anyone else having a go in the meantime though!
It's nice to see people wanting to see this move ahead. I'm not familiar enough with the v3 code to know how easy it is to port the partial implementation here. Before progressing, do we need action on the original ZEP? It has not been accepted, and has a specific prescription on how to store the chunk sizes.
Implemented on v2, since the array metadata is not validated (!).
Example:
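A minimal sketch of the intended usage, based on the chunks=[[...]] form shown in the review comments above (the exact API is this POC's, not final):

import numpy as np, zarr

# One dimension split into explicit chunk sizes 1 + 2 + 2 + 5 = 10.
z = zarr.array(np.arange(10), chunks=[[1, 2, 2, 5]])
z[:]     # reads across the variable-sized chunks
z[3:8]   # slices spanning chunk boundaries work as usual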