POC implementation of ZEP003 #1483

Draft: wants to merge 16 commits into main

Conversation

martindurant (Member)

Implemented on v2, since the array metadata is not validated (!).

Example:

In [1]: import zarr

In [2]: import numpy as np

In [3]: store = {}

In [4]: z = zarr.open_array(store, mode="w", dtype="i4", shape=(10, 10), chunks=[[1, 2, 2, 5], 5], compression=None)

In [5]: z[:] = (np.arange(100)).reshape((10, 10))

In [6]: store  # notice the different numbers of data bytes
Out[6]:
{'.zarray': b'{\n    "chunks": [\n        [\n            1,\n            2,\n            2,\n            5\n        ],\n        5\n    ],\n    "compressor": null,\n    "dimension_separator": ".",\n    "dtype": "<i4",\n    "fill_value": 0,\n    "filters": null,\n    "order": "C",\n    "shape": [\n        10,\n        10\n    ],\n    "zarr_format": 2\n}',
 '0.0': b'\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00',
 '0.1': b'\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00',
 '1.0': b'\n\x00\x00\x00\x0b\x00\x00\x00\x0c\x00\x00\x00\r\x00\x00\x00\x0e\x00\x00\x00\x14\x00\x00\x00\x15\x00\x00\x00\x16\x00\x00\x00\x17\x00\x00\x00\x18\x00\x00\x00',
 '1.1': b'\x0f\x00\x00\x00\x10\x00\x00\x00\x11\x00\x00\x00\x12\x00\x00\x00\x13\x00\x00\x00\x19\x00\x00\x00\x1a\x00\x00\x00\x1b\x00\x00\x00\x1c\x00\x00\x00\x1d\x00\x00\x00',
 '2.0': b'\x1e\x00\x00\x00\x1f\x00\x00\x00 \x00\x00\x00!\x00\x00\x00"\x00\x00\x00(\x00\x00\x00)\x00\x00\x00*\x00\x00\x00+\x00\x00\x00,\x00\x00\x00',
 '2.1': b"#\x00\x00\x00$\x00\x00\x00%\x00\x00\x00&\x00\x00\x00'\x00\x00\x00-\x00\x00\x00.\x00\x00\x00/\x00\x00\x000\x00\x00\x001\x00\x00\x00",
 '3.0': b'2\x00\x00\x003\x00\x00\x004\x00\x00\x005\x00\x00\x006\x00\x00\x00<\x00\x00\x00=\x00\x00\x00>\x00\x00\x00?\x00\x00\x00@\x00\x00\x00F\x00\x00\x00G\x00\x00\x00H\x00\x00\x00I\x00\x00\x00J\x00\x00\x00P\x00\x00\x00Q\x00\x00\x00R\x00\x00\x00S\x00\x00\x00T\x00\x00\x00Z\x00\x00\x00[\x00\x00\x00\\\x00\x00\x00]\x00\x00\x00^\x00\x00\x00',
 '3.1': b'7\x00\x00\x008\x00\x00\x009\x00\x00\x00:\x00\x00\x00;\x00\x00\x00A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00D\x00\x00\x00E\x00\x00\x00K\x00\x00\x00L\x00\x00\x00M\x00\x00\x00N\x00\x00\x00O\x00\x00\x00U\x00\x00\x00V\x00\x00\x00W\x00\x00\x00X\x00\x00\x00Y\x00\x00\x00_\x00\x00\x00`\x00\x00\x00a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'}

In [7]: z[:]
Out[7]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]], dtype=int32)
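
For reference (this is not part of the session above), the differing byte counts in the store dump follow directly from the variable chunk grid: with no compressor and no padding, each chunk holds exactly rows * cols * itemsize bytes. A quick sanity check:

row_chunks, col_chunks = [1, 2, 2, 5], [5, 5]   # chunk lengths along each dimension
itemsize = 4                                     # dtype "<i4"
expected = {
    f"{i}.{j}": r * c * itemsize
    for i, r in enumerate(row_chunks)
    for j, c in enumerate(col_chunks)
}
print(expected)
# {'0.0': 20, '0.1': 20, '1.0': 40, '1.1': 40, '2.0': 40, '2.1': 40, '3.0': 100, '3.1': 100}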

@github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Aug 1, 2023
@ivirshup left a comment

Thanks for getting started on this!

What would you like to do here, finish up for v2, then do v3? Or try to just go for v3?

Btw, I noticed a bug in the indexing and have opened a PR to fix it: martindurant#21.

zarr/indexing.py Outdated
nelem = (slice_end - slice_start) // step
self.projections.append(
    ChunkDimProjection(
        i, slice(slice_start, slice_end, step), slice(nfilled, nfilled + nelem)
    )
)

I believe this will only work if the chunk boundaries are multiples of the step size.

import numpy as np, zarr

z = zarr.array(np.arange(10), chunks=[[1, 2, 2, 5]])
z[::2]
# array([6, 8], dtype=int32)  <- wrong; should be array([0, 2, 4, 6, 8], dtype=int32)

I've opened a PR into your branch fixing this plus adding some tests: martindurant#21

@martindurant (Member, Author)

The point was to get feedback and show that what was proposed in ZEP0003 is very achievable if we can get some buy-in.

I would be happy to get this working for v2, since that alone solves the kerchunk case. However, it should be part of v3, and the implementation would presumably be identical except for how the chunks are written to/from metadata; oh, and the correct API for creating such arrays.

As things stand here, bool indexing and fancy (int) indexing haven't been done yet, though I expect the former to be very easy. Also, plenty of things that access .chunks are broken if there are variable chunks, such as arr.info.

@ivirshup commented Aug 7, 2023

So, broadly, this code is relevant to any path forward, and it would be good to flesh it out so we can show off use cases?

I would be up for collaborating some more here, either via PRs to your branch or whatever.

fancy (int) indexing haven't been done yet

I think I just got this working. It was basically replacing dim_sel_chunk = dim_sel // dim_chunk_len with dim_sel_chunk = np.digitize(dim_sel, self.offsets[1:]).
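
For illustration only (a sketch, not the PR's exact code): with variable chunk lengths [1, 2, 2, 5] along a dimension, the cumulative offsets are [0, 1, 3, 5, 10], and np.digitize against offsets[1:] maps each requested index to the chunk containing it:

import numpy as np

offsets = np.cumsum([0, 1, 2, 2, 5])           # array([ 0,  1,  3,  5, 10])
dim_sel = np.array([0, 2, 4, 9])               # fancy (integer) selection
dim_sel_chunk = np.digitize(dim_sel, offsets[1:])
print(dim_sel_chunk)                           # [0 1 2 3]: the chunk holding each index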

Also boolean indexing, which was indeed quite easy: essentially np.add.reduceat. Both cases did end up with a fair amount of copied code, though. However, I did run into some poor support for boolean masks (#1490), so I'm not sure how commonly used this is.
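
A similarly hedged sketch of the boolean case: np.add.reduceat over the chunk start offsets counts how many selected elements fall into each variable-sized chunk, which gives the per-chunk output extents:

import numpy as np

mask = np.array([True, False, True, True, False, True, False, False, True, True])
starts = [0, 1, 3, 5]                          # start offsets of chunks [1, 2, 2, 5]
counts = np.add.reduceat(mask.astype(np.intp), starts)
print(counts)                                  # [1 1 1 3]: selected elements per chunk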

@agoodm (Contributor) commented Aug 7, 2023

Great to see this happening, @martindurant! @alex-s-gardner and I are very interested in seeing the progress of this (zarr-developers/zarr-specs#138).

@martindurant (Member, Author)

@ivirshup, the current code does break a significant number of tests with standard indexing; it always seems to happen in the last chunk, probably in _process_chunk.

@joshmoore (Member)

Just a quick note before I dive in more. In ZEP0003 there's the line "It would be reasonable to wish to backport the feature to v2.", and this is on v2. I'll just point out that we really don't have the mechanisms to introduce something like this in v2: no process for updating the spec and no way for implementations to know they're getting something they can't support.

@martindurant (Member, Author)

@joshmoore - the index code should be exactly the same. It's in v2 exactly because we didn't have to update the spec/metadata handling code to get it to work. Actually, it would be useful to kerchunk as-is, given that it would be for datasets that simply cannot otherwise be represented by zarr. But yes, the aim is to make this an implementation for v3. I still think it would be useful for v2 to be able to read such datasets, however.

@joshmoore (Member)

I still think it would be useful for v2 to be able to read such datasets, however.

Definitely no objections that it would be useful. But I'm concerned about the cost to all the other implementations. Correct me if I'm wrong, but they'd fall over spectacularly, no?

@martindurant (Member, Author)

Correct me if I'm wrong, but they'd fall over spectacularly, no?

They would fall over, yes. It would be an early error before loading any data. However, if this were a kerchunk thing, they probably can't open the store object anyway.

codecov bot commented Aug 14, 2023

Codecov Report

Attention: Patch coverage is 97.09302% with 5 lines in your changes missing coverage. Please review.

Project coverage is 99.96%. Comparing base (f542fca) to head (f8d9bf2).
Report is 163 commits behind head on main.

Files with missing lines   Patch %   Lines
zarr/indexing.py           96.42%    5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##              main    #1483      +/-   ##
===========================================
- Coverage   100.00%   99.96%   -0.04%     
===========================================
  Files           37       37              
  Lines        14729    14889     +160     
===========================================
+ Hits         14729    14884     +155     
- Misses           0        5       +5     
Files with missing lines      Coverage Δ
zarr/core.py                  100.00% <100.00%> (ø)
zarr/tests/test_indexing.py   100.00% <100.00%> (ø)
zarr/util.py                  100.00% <100.00%> (ø)
zarr/indexing.py              99.25% <96.42%> (-0.75%) ⬇️

... and 1 file with indirect coverage changes

@martindurant (Member, Author) commented Aug 14, 2023

Just to point out: this works transparently with V3; we apparently DO NOT VALIDATE the chunks property.

import numpy as np
import zarr

store_ = {}
store = zarr._storage.v3.KVStoreV3(store_)
z = zarr.open_array(store, mode="w", dtype="i4", shape=(10, 10), chunks=([1, 2, 2, 5], 5), compression=None)
# z = zarr.open_array(store_, mode="w", dtype="i4", shape=(10, 10), chunks=([1, 2, 2, 5], 5), compression=None, zarr_version=3)  # same as above
z[:] = (np.arange(100)).reshape((10, 10))
assert (z[:] == np.arange(100).reshape((10, 10))).all()

metadata:

>>> print(store_["meta/root/array.array.json"].decode())
{
    "attributes": {},
    "chunk_grid": {
        "chunk_shape": [
            [
                1,
                2,
                2,
                5
            ],
            5
        ],
        "separator": "/",
        "type": "regular"
    },
    "chunk_memory_layout": "C",
    "data_type": "<i4",
    "dimension_separator": "/",
    "extensions": [],
    "fill_value": 0,
    "shape": [
        10,
        10
    ]
}
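
As a hedged sketch (not the PR's actual code), the mixed chunk_shape entries above, where an int means regular chunking along that dimension and a list gives explicit chunk lengths, can be normalized into explicit per-dimension chunk lists:

import json

meta = json.loads(store_["meta/root/array.array.json"])

def explicit_chunks(spec, dim_len):
    # spec is either an int (uniform chunking) or a list of chunk lengths
    if isinstance(spec, int):
        n_full, rem = divmod(dim_len, spec)
        return [spec] * n_full + ([rem] if rem else [])
    return list(spec)

print([explicit_chunks(c, s)
       for c, s in zip(meta["chunk_grid"]["chunk_shape"], meta["shape"])])
# [[1, 2, 2, 5], [5, 5]]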

@okz commented Sep 15, 2023

Does this implementation already work with kerchunk ?

@martindurant (Member, Author)

Does this implementation already work with kerchunk ?

In principle, the implementation works, but there is no code currently in kerchunk to produce zarr metadata that would use it.

@pbranson commented Oct 2, 2023

+1 to this PR, would be great if this worked in V2

@martindurant (Member, Author)

would be great if this worked in V2

It does work for V2! Some things will break, as noted above (e.g., arr.info), but dask.array.from_zarr should work as is.

@meggart (Member) commented Oct 6, 2023

This feature would also be very interesting to have from the Julia side, and I would very much be in favor of having this as some kind of patch for v2, to get it into a usable state earlier. I drafted an implementation for Zarr.jl (JuliaIO/Zarr.jl#126), but I think it is not compatible with this implementation. The main detail I stumbled over was that, without ZEP0003, there was a guarantee that every chunk in a store would represent a chunk of exactly the same shape when decompressed. Even for incomplete chunks at the array boundaries this was achieved by padding with fill values, which are ignored during reading but allow zero-cost resizing.

This allowed easy re-use of compression/decompression buffers when reading/writing multiple chunks sequentially. My current Julia implementation keeps this behavior, in that it always compresses chunks of the maximum chunk size by padding with fill values, so that the invariant mentioned above is maintained.

Maybe it would be a good idea to clarify in the ZEP text the consequences this has for the uniformity of an uncompressed chunk. To illustrate with a small example:

import zarr

# use two separate stores so both arrays can be created side by side
z1 = zarr.create((10,), chunks=(3,), dtype='i4', fill_value=1, store="./chunktests1.zarr", compressor=None)
z2 = zarr.create((10,), chunks=([3, 3, 3, 1],), dtype='i4', fill_value=1, store="./chunktests2.zarr", compressor=None)

The question is whether these two arrays should be equivalent and store the same binary information. I think with the current implementation they would not, because in the last chunk z1 would pad with fill values while z2 would not.
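
A hedged sketch of that difference, assuming the two separate stores created above and the POC branch for z2: after writing data, z1's final chunk is padded to the full chunk size, while z2's final chunk holds exactly one element.

import os
import numpy as np

z1[:] = np.arange(10)
z2[:] = np.arange(10)
print(os.path.getsize("./chunktests1.zarr/3"))   # 12 bytes: 3 x int32, padded with fill values
print(os.path.getsize("./chunktests2.zarr/3"))   # 4 bytes: 1 x int32, no padding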

@meggart (Member) commented Oct 6, 2023

Thinking more about this, I realized that your non-padding implementation is the only one that would work well together with kerchunk, so this is definitely the way to go. We might still want to mention this point somewhere in the ZEP draft.

@martindurant (Member, Author)

Indeed, the kerchunk workflow is very important to me, if not to everyone. Furthermore, we read multiple chunks concurrently and in the future will decompress in parallel too, which means you can't easily reuse buffers. In Python's memory model, the buffer will not actually be released to the OS for a while anyway, so maybe it's no win for Python at all. In the final, best implementation, we would even like to read or decompress directly into the target array's memory buffer for contiguous cases.

@ivirshup commented Oct 16, 2023

I've started a POC on top of this POC for kerchunk.combine.concatenate_arrays at fsspec/kerchunk#374

I'm also pretty sure I can only get this working without padding.

@MSanKeys963 (Member) commented Oct 24, 2023

Hi @martindurant, thanks for sending the PR.

I have requested reviews from the Zarr-Python devs. Additionally, if anyone from @zarr-developers/python-core-devs can review the PR, I'd be thankful to them.

Also, @normanrz, if you could review this PR, that'd be great.

@jhamman (Member) commented Dec 7, 2023

Meta comment - we are working on a fresh Array implementation that covers both V2 and V3 arrays (see #1583 for details). We are actively seeking input/participation from those invested in ZEP003. Variable chunking is likely to require some changes to the codec API, and we want to get your input as we roll out the new design. cc @d-v-b and @normanrz.

xref: #1595

@jhamman (Member) commented Oct 11, 2024

@martindurant and others - now that v3 has come together, I'd be very interested to see this move to that branch. Who is interested in trying out an implementation on top of the new array API?

@jhamman added the "V2" label (Affects the v2 branch) on Oct 11, 2024
@TomNicholas (Member)

I would be - this is something that @abarciauskas-bgse and I want to get funding to work on... Don't let that stop anyone else having a go in the meantime though!

@martindurant (Member, Author)

It's nice to see people wanting to see this move ahead. I'm not familiar enough with the v3 code to know how easy it is to port the partial implementation here.

Before progressing, do we need action on the original ZEP? It has not been accepted, and has a specific prescription on how to store the chunk sizes.
