POC implementation of ZEP003 #1483

Draft: wants to merge 16 commits into main

Conversation

martindurant (Member)

Implemented on v2, since the array metadata is not validated (!).

Example:

In [1]: import zarr

In [2]: import numpy as np

In [3]: store = {}

In [4]: z = zarr.open_array(store, mode="w", dtype="i4", shape=(10, 10), chunks=[[1, 2, 2, 5], 5], compression=None)

In [5]: z[:] = (np.arange(100)).reshape((10, 10))

In [6]: store  # notice the different numbers of data bytes
Out[6]:
{'.zarray': b'{\n    "chunks": [\n        [\n            1,\n            2,\n            2,\n            5\n        ],\n        5\n    ],\n    "compressor": null,\n    "dimension_separator": ".",\n    "dtype": "<i4",\n    "fill_value": 0,\n    "filters": null,\n    "order": "C",\n    "shape": [\n        10,\n        10\n    ],\n    "zarr_format": 2\n}',
 '0.0': b'\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00',
 '0.1': b'\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00',
 '1.0': b'\n\x00\x00\x00\x0b\x00\x00\x00\x0c\x00\x00\x00\r\x00\x00\x00\x0e\x00\x00\x00\x14\x00\x00\x00\x15\x00\x00\x00\x16\x00\x00\x00\x17\x00\x00\x00\x18\x00\x00\x00',
 '1.1': b'\x0f\x00\x00\x00\x10\x00\x00\x00\x11\x00\x00\x00\x12\x00\x00\x00\x13\x00\x00\x00\x19\x00\x00\x00\x1a\x00\x00\x00\x1b\x00\x00\x00\x1c\x00\x00\x00\x1d\x00\x00\x00',
 '2.0': b'\x1e\x00\x00\x00\x1f\x00\x00\x00 \x00\x00\x00!\x00\x00\x00"\x00\x00\x00(\x00\x00\x00)\x00\x00\x00*\x00\x00\x00+\x00\x00\x00,\x00\x00\x00',
 '2.1': b"#\x00\x00\x00$\x00\x00\x00%\x00\x00\x00&\x00\x00\x00'\x00\x00\x00-\x00\x00\x00.\x00\x00\x00/\x00\x00\x000\x00\x00\x001\x00\x00\x00",
 '3.0': b'2\x00\x00\x003\x00\x00\x004\x00\x00\x005\x00\x00\x006\x00\x00\x00<\x00\x00\x00=\x00\x00\x00>\x00\x00\x00?\x00\x00\x00@\x00\x00\x00F\x00\x00\x00G\x00\x00\x00H\x00\x00\x00I\x00\x00\x00J\x00\x00\x00P\x00\x00\x00Q\x00\x00\x00R\x00\x00\x00S\x00\x00\x00T\x00\x00\x00Z\x00\x00\x00[\x00\x00\x00\\\x00\x00\x00]\x00\x00\x00^\x00\x00\x00',
 '3.1': b'7\x00\x00\x008\x00\x00\x009\x00\x00\x00:\x00\x00\x00;\x00\x00\x00A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00D\x00\x00\x00E\x00\x00\x00K\x00\x00\x00L\x00\x00\x00M\x00\x00\x00N\x00\x00\x00O\x00\x00\x00U\x00\x00\x00V\x00\x00\x00W\x00\x00\x00X\x00\x00\x00Y\x00\x00\x00_\x00\x00\x00`\x00\x00\x00a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'}

In [7]: z[:]
Out[7]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]], dtype=int32)
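
For reference (this is not part of the session above), the differing byte counts in the store dump follow directly from the variable chunk grid: with no compressor and no padding, each chunk holds exactly rows * cols * itemsize bytes. A quick sanity check:

row_chunks, col_chunks = [1, 2, 2, 5], [5, 5]   # chunk lengths along each dimension
itemsize = 4                                     # dtype "<i4"
expected = {
    f"{i}.{j}": r * c * itemsize
    for i, r in enumerate(row_chunks)
    for j, c in enumerate(col_chunks)
}
print(expected)
# {'0.0': 20, '0.1': 20, '1.0': 40, '1.1': 40, '2.0': 40, '2.1': 40, '3.0': 100, '3.1': 100}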

@github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Aug 1, 2023
@ivirshup left a comment

Thanks for getting started on this!

What would you like to do here, finish up for v2, then do v3? Or try to just go for v3?

Btw, I noticed a bug in the indexing and have opened a PR to fix it: martindurant#21.

zarr/indexing.py Outdated
nelem = (slice_end - slice_start) // step
self.projections.append(
    ChunkDimProjection(
        i, slice(slice_start, slice_end, step), slice(nfilled, nfilled + nelem)
    )
)

I believe this will only work if the chunk boundaries are multiples of the step size.

import numpy as np, zarr

z = zarr.array(np.arange(10), chunks=[[1, 2, 2, 5]])
z[::2]
# array([6, 8], dtype=int32)  <- wrong; should be array([0, 2, 4, 6, 8], dtype=int32)

I've opened a PR into your branch fixing this plus adding some tests: martindurant#21

@martindurant (Member, Author)

The point was to get feedback and show that what was proposed in ZEP0003 is very achievable if we can get some buy-in.

I would be happy to get this working for v2, since that alone solves the kerchunk case. However, it should be part of v3, and the implementation would presumably be identical except for how the chunks are written to/from metadata; oh, and the correct API for creating such arrays.

As things stand here, bool indexing and fancy (int) indexing haven't been done yet, though I expect the former to be very easy. Also, plenty of things that access .chunks are broken if there are variable chunks, such as arr.info.

@ivirshup commented Aug 7, 2023

So, broadly, this code is relevant to any path forward, and it would be good to flesh it out so we can show off use cases?

I would be up for collaborating some more here, either via PRs to your branch or whatever.

fancy (int) indexing haven't been done yet

I think I just got this working. It was basically replacing dim_sel_chunk = dim_sel // dim_chunk_len with dim_sel_chunk = np.digitize(dim_sel, self.offsets[1:]).
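
For illustration only (a sketch, not the PR's exact code): with variable chunk lengths [1, 2, 2, 5] along a dimension, the cumulative offsets are [0, 1, 3, 5, 10], and np.digitize against offsets[1:] maps each requested index to the chunk containing it:

import numpy as np

offsets = np.cumsum([0, 1, 2, 2, 5])           # array([ 0,  1,  3,  5, 10])
dim_sel = np.array([0, 2, 4, 9])               # fancy (integer) selection
dim_sel_chunk = np.digitize(dim_sel, offsets[1:])
print(dim_sel_chunk)                           # [0 1 2 3]: the chunk holding each index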

Also boolean indexing, which was indeed quite easy: essentially np.add.reduceat. Both cases did end up with a fair amount of copied code, though. However, I did run into some poor support for boolean masks (#1490), so I'm not sure how commonly used this is.
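
A similarly hedged sketch of the boolean case: np.add.reduceat over the chunk start offsets counts how many selected elements fall into each variable-sized chunk, which gives the per-chunk output extents:

import numpy as np

mask = np.array([True, False, True, True, False, True, False, False, True, True])
starts = [0, 1, 3, 5]                          # start offsets of chunks [1, 2, 2, 5]
counts = np.add.reduceat(mask.astype(np.intp), starts)
print(counts)                                  # [1 1 1 3]: selected elements per chunk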

@agoodm (Contributor) commented Aug 7, 2023

Great to see this happening, @martindurant! @alex-s-gardner and I are very interested in seeing the progress of this (zarr-developers/zarr-specs#138).

@martindurant (Member, Author)

@ivirshup, the current code does break a significant number of tests with standard indexing; it always seems to happen in the last chunk, probably in _process_chunk.

@joshmoore (Member)

Just a quick note before I dive in more. In ZEP0003 there's the line "It would be reasonable to wish to backport the feature to v2.", and this is on v2. I'll just point out that we really don't have the mechanisms to introduce something like this in v2: no process for updating the spec and no way for implementations to know they're getting something they can't support.

@martindurant (Member, Author)

@joshmoore - the index code should be exactly the same. It's in v2 exactly because we didn't have to update the spec/metadata handling code to get it to work. Actually, it would be useful to kerchunk as-is, given that it would be for datasets that simply cannot otherwise be represented by zarr. But yes, the aim is to make this an implementation for v3. I still think it would be useful for v2 to be able to read such datasets, however.

@joshmoore (Member)

I still think it would be useful for v2 to be able to read such datasets, however.

Definitely no objections that it would be useful. But I'm concerned about the cost to all the other implementations. Correct me if I'm wrong, but they'd fall over spectacularly, no?

@martindurant (Member, Author)

Correct me if I'm wrong, but they'd fall over spectacularly, no?

They would fall over, yes. It would be an early error before loading any data. However, if this were a kerchunk thing, they probably can't open the store object anyway.

codecov bot commented Aug 14, 2023

Codecov Report

Attention: Patch coverage is 97.09302% with 5 lines in your changes missing coverage. Please review.

Project coverage is 99.96%. Comparing base (f542fca) to head (f8d9bf2).
Report is 163 commits behind head on main.

Files with missing lines   Patch %   Lines
zarr/indexing.py           96.42%    5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##              main    #1483      +/-   ##
===========================================
- Coverage   100.00%   99.96%   -0.04%     
===========================================
  Files           37       37              
  Lines        14729    14889     +160     
===========================================
+ Hits         14729    14884     +155     
- Misses           0        5       +5     
Files with missing lines      Coverage Δ
zarr/core.py                  100.00% <100.00%> (ø)
zarr/tests/test_indexing.py   100.00% <100.00%> (ø)
zarr/util.py                  100.00% <100.00%> (ø)
zarr/indexing.py              99.25% <96.42%> (-0.75%) ⬇️

... and 1 file with indirect coverage changes

@martindurant (Member, Author) commented Aug 14, 2023

Just to point out: this works transparently with V3; we apparently DO NOT VALIDATE the chunks property.

import numpy as np
import zarr

store_ = {}
store = zarr._storage.v3.KVStoreV3(store_)
z = zarr.open_array(store, mode="w", dtype="i4", shape=(10, 10), chunks=([1, 2, 2, 5], 5), compression=None)
# z = zarr.open_array(store_, mode="w", dtype="i4", shape=(10, 10), chunks=([1, 2, 2, 5], 5), compression=None, zarr_version=3)  # same as above
z[:] = (np.arange(100)).reshape((10, 10))
assert (z[:] == np.arange(100).reshape((10, 10))).all()

metadata:

>>> print(store_["meta/root/array.array.json"].decode())
{
    "attributes": {},
    "chunk_grid": {
        "chunk_shape": [
            [
                1,
                2,
                2,
                5
            ],
            5
        ],
        "separator": "/",
        "type": "regular"
    },
    "chunk_memory_layout": "C",
    "data_type": "<i4",
    "dimension_separator": "/",
    "extensions": [],
    "fill_value": 0,
    "shape": [
        10,
        10
    ]
}
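
As a hedged sketch (not the PR's actual code), the mixed chunk_shape entries above, where an int means regular chunking along that dimension and a list gives explicit chunk lengths, can be normalized into explicit per-dimension chunk lists:

import json

meta = json.loads(store_["meta/root/array.array.json"])

def explicit_chunks(spec, dim_len):
    # spec is either an int (uniform chunking) or a list of chunk lengths
    if isinstance(spec, int):
        n_full, rem = divmod(dim_len, spec)
        return [spec] * n_full + ([rem] if rem else [])
    return list(spec)

print([explicit_chunks(c, s)
       for c, s in zip(meta["chunk_grid"]["chunk_shape"], meta["shape"])])
# [[1, 2, 2, 5], [5, 5]]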

@okz commented Sep 15, 2023

Does this implementation already work with kerchunk ?

@martindurant (Member, Author)

Does this implementation already work with kerchunk ?

In principle, the implementation works, but there is no code currently in kerchunk to produce zarr metadata that would use it.

@pbranson commented Oct 2, 2023

+1 to this PR, would be great if this worked in V2

@martindurant (Member, Author)

would be great if this worked in V2

It does work for V2! Some things will break, as noted above (e.g., arr.info), but dask.array.from_zarr should work as is.

@meggart (Member) commented Oct 6, 2023

This feature would also be very interesting to have from the Julia side, and I would very much be in favor of having this as some kind of patch for v2, to get it into a usable state earlier. I drafted an implementation for Zarr.jl (JuliaIO/Zarr.jl#126), but I think it is not compatible with this implementation. The main detail I stumbled over was that, without ZEP0003, there was a guarantee that every chunk in a store would represent a chunk of exactly the same shape when decompressed. Even for incomplete chunks at the array boundaries this was achieved by padding with fill values, which are ignored during reading but allow zero-cost resizing.

This allowed easy re-use of compression/decompression buffers when reading/writing multiple chunks sequentially. My current Julia implementation keeps this behavior, in that it always compresses chunks of the maximum chunk size by padding with fill values, so that the invariant mentioned above is maintained.

Maybe it would be a good idea to clarify in the ZEP text the consequences this has for the uniformity of an uncompressed chunk. To illustrate with a small example:

import zarr

# use two separate stores so both arrays can be created side by side
z1 = zarr.create((10,), chunks=(3,), dtype='i4', fill_value=1, store="./chunktests1.zarr", compressor=None)
z2 = zarr.create((10,), chunks=([3, 3, 3, 1],), dtype='i4', fill_value=1, store="./chunktests2.zarr", compressor=None)

The question is whether these two arrays should be equivalent and store the same binary information. I think with the current implementation they would not, because in the last chunk z1 would pad with fill values while z2 would not.
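
A hedged sketch of that difference, assuming the two separate stores created above and the POC branch for z2: after writing data, z1's final chunk is padded to the full chunk size, while z2's final chunk holds exactly one element.

import os
import numpy as np

z1[:] = np.arange(10)
z2[:] = np.arange(10)
print(os.path.getsize("./chunktests1.zarr/3"))   # 12 bytes: 3 x int32, padded with fill values
print(os.path.getsize("./chunktests2.zarr/3"))   # 4 bytes: 1 x int32, no padding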

@meggart (Member) commented Oct 6, 2023

Thinking more about this, I realized that your non-padding implementation is the only one that would work well together with kerchunk, so this is definitely the way to go. We might still want to mention this point somewhere in the ZEP draft.

@martindurant (Member, Author)

Indeed, the kerchunk workflow is very important to me, if not to everyone. Furthermore, we read multiple chunks concurrently and in the future will decompress in parallel too, which means you can't easily reuse buffers. In Python's memory model, the buffer will not actually be released to the OS for a while anyway, so maybe it's no win for Python at all. In the final, best implementation, we would even like to read or decompress directly into the target array's memory buffer for contiguous cases.

@ivirshup commented Oct 16, 2023

I've started a POC on top of this POC for kerchunk.combine.concatenate_arrays at fsspec/kerchunk#374

I'm also pretty sure I can only get this working without padding.

@MSanKeys963 (Member) commented Oct 24, 2023

Hi @martindurant, thanks for sending the PR.

I have requested reviews from the Zarr-Python devs. Additionally, if anyone from @zarr-developers/python-core-devs can review the PR, I'd be thankful to them.

Also, @normanrz, if you could review this PR, that'd be great.

@jhamman (Member) commented Dec 7, 2023

Meta comment - we are working on a fresh Array implementation that covers both V2 and V3 arrays (see #1583 for details). We are actively seeking input/participation from those invested in ZEP003. Variable chunking is likely to require some changes to the codec API, and we want to get your input as we roll out the new design. cc @d-v-b and @normanrz.

xref: #1595

@jhamman (Member) commented Oct 11, 2024

@martindurant and others - now that v3 has come together, I'd be very interested to see this move to that branch. Who is interested in trying out an implementation on top of the new array API?

@jhamman added the "V2" label (Affects the v2 branch) on Oct 11, 2024
@TomNicholas (Member)

I would be - this is something that @abarciauskas-bgse and I want to get funding to work on... Don't let that stop anyone else having a go in the meantime though!

@martindurant (Member, Author)

It's nice to see people wanting to see this move ahead. I'm not familiar enough with the v3 code to know how easy it is to port the partial implementation here.

Before progressing, do we need action on the original ZEP? It has not been accepted, and has a specific prescription on how to store the chunk sizes.
