Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Refactor arrays in v3 #1589

Closed
wants to merge 32 commits into from
Closed

[WIP] Refactor arrays in v3 #1589

wants to merge 32 commits into from

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Dec 5, 2023

The goal of this PR is to create a user-facing array class that implements the basic attributes of a numpy array and which abstracts over the structural differences between v2 and v3 arrays. This is very heavily based on the zarrita approach, but I'm modifying things in a lot of places.

Design goals:

  • An array class with .shape, .ndim, .size, etc attributes that are consistent with the array api attributes. I don't plan on adding .device or .mT unless there's an acute need for that, but that's not a deeply considered perspective. The array class will support __getitem__ and __setitem__ for getting and setting numeric data like a normal numpy array, and for attributes as well.
  • The array should have a .metadata attribute that expresses array metadata according to a zarr array specification. The data in .metadata should correspond to the contents of the stored array metadata document. The .metadata class should not perform any behavior besides JSON ser/deserialization.
  • IO routines will execute blocking calls to async IO routines.
    • The IO behavior of the array should be configurable, e.g. w.r.t caching, writing empty chunks, concurrency limits, locking, etc.
  • All of the above should work similarly for arrays based on zarr specs v2 and v3.
  • All of the above should be handled by separate APIs, that together support the array interface.

@pep8speaks
Copy link

pep8speaks commented Dec 5, 2023

Hello @d-v-b! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 11:1: E302 expected 2 blank lines, found 1

Line 73:20: E203 whitespace before ':'
Line 73:46: E203 whitespace before ':'
Line 73:72: E203 whitespace before ':'
Line 205:20: E203 whitespace before ':'
Line 205:46: E203 whitespace before ':'
Line 205:72: E203 whitespace before ':'

Line 320:51: E203 whitespace before ':'

Line 90:19: W291 trailing whitespace

Line 47:101: E501 line too long (114 > 100 characters)

Line 48:101: E501 line too long (142 > 100 characters)

Line 173:4: W291 trailing whitespace

Line 302:44: E203 whitespace before ':'
Line 314:30: E203 whitespace before ':'

Comment last updated at 2024-01-07 18:00:00 UTC

d-v-b and others added 2 commits December 6, 2023 00:02
Co-authored-by: Joe Hamman <jhamman1@gmail.com>
@joshmoore
Copy link
Member

Is the base branch here intended to be main?

@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 6, 2023

Is the base branch here intended to be main?

nope! It should be against https://github.com/zarr-developers/zarr-python/tree/v3. I will see how fixable that is.

@d-v-b d-v-b changed the base branch from main to v3 December 6, 2023 10:46
@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 6, 2023

Is the base branch here intended to be main?

nope! It should be against https://github.com/zarr-developers/zarr-python/tree/v3. I will see how fixable that is.

very fixable, it turns out. thanks for spotting this @joshmoore!

@joshmoore
Copy link
Member

👍 np

zarr/v3/array/v3.py Outdated Show resolved Hide resolved
zarr/v3/array/v3.py Outdated Show resolved Hide resolved
zarr/v3/array/base.py Outdated Show resolved Hide resolved
…ing metadata for v2 and v3; rename zarray to array
@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 6, 2023

@normanrz if you have the patience for it, I wonder if you could check out my branch and help me understand some tests that are failing due to .attrs deserialization issues? Here's an example traceback. It's certainly due to something I broke, but I want to understand how to un-break it :)

Here's an example traceback:
python
store = MemoryStore('memory://4904285632')

    @pytest.mark.asyncio
    async def test_resize(store: Store):
        data = np.zeros((16, 18), dtype="uint16")
    
        a = await AsyncArray.create(
            store / "resize",
            shape=data.shape,
            chunk_shape=(10, 10),
            dtype=data.dtype,
            chunk_key_encoding=("v2", "."),
            fill_value=1,
        )
    
        await _AsyncArrayProxy(a)[:16, :18].set(data)
>       assert await store.get_async("resize/0.0") is not None
E       assert None is not None

zarr/tests/test_codecs_v3.py:935: AssertionError
__________________________________________________________________ test_update_attributes_array __________________________________________________________________
  + Exception Group Traceback (most recent call last):
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/_pytest/runner.py", line 341, in from_call
  |     result: Optional[TResult] = func()
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/_pytest/runner.py", line 262, in <lambda>
  |     lambda: ihook(item=item, **kwds), when=when, reraise=reraise
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 493, in __call__
  |     return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/pluggy/_manager.py", line 115, in _hookexec
  |     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/pluggy/_callers.py", line 152, in _multicall
  |     return outcome.get_result()
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/pluggy/_result.py", line 114, in get_result
  |     raise exc.with_traceback(exc.__traceback__)
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/pluggy/_callers.py", line 77, in _multicall
  |     res = hook_impl.function(*args)
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/_pytest/runner.py", line 177, in pytest_runtest_call
  |     raise e
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/_pytest/runner.py", line 169, in pytest_runtest_call
  |     item.runtest()
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/_pytest/python.py", line 1792, in runtest
  |     self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 493, in __call__
  |     return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/pluggy/_manager.py", line 115, in _hookexec
  |     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/pluggy/_callers.py", line 152, in _multicall
  |     return outcome.get_result()
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/pluggy/_result.py", line 114, in get_result
  |     raise exc.with_traceback(exc.__traceback__)
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/pluggy/_callers.py", line 77, in _multicall
  |     res = hook_impl.function(*args)
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/_pytest/python.py", line 194, in pytest_pyfunc_call
  |     result = testfunction(**testargs)
  |   File "/Users/bennettd/dev/zarr-python/zarr/tests/test_codecs_v3.py", line 983, in test_update_attributes_array
  |     a = Array.open(store / "update_attributes")
  |   File "/Users/bennettd/dev/zarr-python/zarr/v3/array/v3.py", line 455, in open
  |     async_array = sync(
  |   File "/Users/bennettd/dev/zarr-python/zarr/v3/sync.py", line 67, in sync
  |     raise return_result
  |   File "/Users/bennettd/dev/zarr-python/zarr/v3/sync.py", line 30, in _runner
  |     result_box[0] = await coro
  |   File "/Users/bennettd/dev/zarr-python/zarr/v3/array/v3.py", line 218, in open
  |     return cls.from_json(
  |   File "/Users/bennettd/dev/zarr-python/zarr/v3/array/v3.py", line 197, in from_json
  |     metadata = ArrayMetadata.from_json(zarr_json)
  |   File "/Users/bennettd/dev/zarr-python/zarr/v3/array/v3.py", line 101, in from_json
  |     return make_cattr().structure(zarr_json, cls)
  |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/cattrs/converters.py", line 334, in structure
  |     return self._structure_func.dispatch(cl)(obj, cl)
  |   File "<cattrs generated structure zarr.v3.array.v3.ArrayMetadata-68>", line 58, in structure_ArrayMetadata
  |     if errors: raise __c_cve('While structuring ' + 'ArrayMetadata', errors, __cl)
  | cattrs.errors.ClassValidationError: While structuring ArrayMetadata (1 sub-exception)
  +-+---------------- 1 ----------------
    | Exception Group Traceback (most recent call last):
    |   File "<cattrs generated structure zarr.v3.array.v3.ArrayMetadata-68>", line 30, in structure_ArrayMetadata
    |     res['codecs'] = __c_structure_codecs(o['codecs'], __c_type_codecs)
    |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/cattrs/converters.py", line 545, in _structure_list
    |     raise IterableValidationError(
    | cattrs.errors.IterableValidationError: While structuring list[zarr.v3.metadata.CodecMetadata] (1 sub-exception)
    | Structuring class ArrayMetadata @ attribute codecs
    +-+---------------- 1 ----------------
      | Traceback (most recent call last):
      |   File "<cattrs generated structure zarr.v3.codecs.bytes.BytesCodecMetadata-68>", line 17, in structure_BytesCodecMetadata
      |     return __cl(
      | TypeError: __init__() got an unexpected keyword argument 'name'
      | 
      | During handling of the above exception, another exception occurred:
      | 
      | Exception Group Traceback (most recent call last):
      |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/cattrs/converters.py", line 535, in _structure_list
      |     res.append(handler(e, elem_type))
      |   File "/Users/bennettd/dev/zarr-python/zarr/v3/common.py", line 57, in _structure_codec_metadata
      |     return converter.structure(d, codec_metadata_cls)
      |   File "/Users/bennettd/dev/zarr-python/.venv/lib/python3.9/site-packages/cattrs/converters.py", line 334, in structure
      |     return self._structure_func.dispatch(cl)(obj, cl)
      |   File "<cattrs generated structure zarr.v3.codecs.bytes.BytesCodecMetadata-68>", line 20, in structure_BytesCodecMetadata
      |     except Exception as exc: raise __c_cve('While structuring ' + 'BytesCodecMetadata', [exc], __cl)
      | cattrs.errors.ClassValidationError: While structuring BytesCodecMetadata (1 sub-exception)
      | Structuring list[zarr.v3.metadata.CodecMetadata] @ index 0
      +-+---------------- 1 ----------------
        | Traceback (most recent call last):
        |   File "<cattrs generated structure zarr.v3.codecs.bytes.BytesCodecMetadata-68>", line 17, in structure_BytesCodecMetadata
        |     return __cl(
        | TypeError: __init__() got an unexpected keyword argument 'name'

d-v-b and others added 2 commits December 6, 2023 14:32
@d-v-b
Copy link
Contributor Author

d-v-b commented Dec 9, 2023

an update about this effort:

  • I have unified the chunk read / write APIs for v2 and v3 arrays. The strategy here was to remove the _write_chunk, _read_chunk methods from both v2 and v3 classes and combine the logic into free-standing functions, which the v2 and v3 array classes can call. See the write_chunk / read_chunk functions, which, in this branch, are defined in chunk.py. Ripping methods off of the array classes ends up swelling the function signature of those routines, but it also exposes the "true" signature of the function, which at least for me is useful for understanding the code and ultimately refactoring.
  • I am starting to apply this same surgical technique to the codecs. In v3, a CodecPipeline is initialized with a RuntimeConfiguration, but this pipeline will be used by an array that already has a RuntimeConfiguration. See here. The same piece of data appearing in two places in a function call is a request for simplification. So I removed the RuntimeConfiguration from the construction of CodecPipeline, and as a consequence all of the codec methods like encode and decode require a config argument, which is an instance of RuntimeConfiguration. As with the chunk io story above, this makes the function signatures of these routines bigger, but also more explicit about their true parameterization. In particular, I think only the sharding routines use the RuntimeConfiguration at all, so it might make sense to make the RuntimeConfiguration a keyword-only argument that defaults to None.

I think it's going well, but I don't have any outlook on when this will be done, so don't consider this effort a blocker to immediate v3 efforts.

@jhamman
Copy link
Member

jhamman commented May 17, 2024

@d-v-b - I think @normanrz's PR (#1857) superseded this effort. Feel free to re-open if I have that wrong.

@jhamman jhamman closed this May 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants