Merge branch 'doc/3.0-updates' of github.com:jhamman/zarr-python into doc/3.0-updates
jhamman · Dec 20, 2024 · dcb2e39
2 parents 61b4477 + a829fbb
Showing 19 changed files with 540 additions and 111 deletions.
diff --git a/docs/user-guide/config.rst b/docs/user-guide/config.rst
@@ -0,0 +1,41 @@
+Runtime configuration
+=====================
+
+The :mod:`zarr.core.config` module is responsible for managing the configuration of zarr
+and is based on the `donfig <https://github.com/pytroll/donfig>`_ Python library.
+
+Configuration values can be set using code like the following:
+
+.. code-block:: python
+
+    import zarr
+    zarr.config.set({"array.order": "F"})
+
+Alternatively, configuration values can be set using environment variables, e.g.
+``ZARR_ARRAY__ORDER=F``.
+
+The configuration can also be read from a YAML file in standard locations.
+For more information, see the
+`donfig documentation <https://donfig.readthedocs.io/en/latest/>`_.
+
+Configuration options include the following:
+
+- Default Zarr format ``default_zarr_version``
+- Default array order in memory ``array.order``
+- Default codecs ``array.v3_default_codecs`` and ``array.v2_default_compressor``
+- Whether empty chunks are written to storage ``array.write_empty_chunks``
+- Async and threading options, e.g. ``async.concurrency`` and ``threading.max_workers``
+- Selections of implementations of codecs, codec pipelines and buffers
+
+To select custom implementations of codecs, pipelines, buffers and ndbuffers,
+first register the implementations in the registry and then select them in the config.
+For example, to use an implementation of the bytes codec in a class ``custompackage.NewBytesCodec``,
+set the value of ``codecs.bytes.name`` to ``custompackage.NewBytesCodec``.
+
+This is the current default configuration:
+
+.. ipython:: python
+
+    import zarr
+
+    zarr.config.pprint()
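
Reviewer note on the environment-variable form shown in config.rst: the example ``ZARR_ARRAY__ORDER=F`` suggests a naming rule in which the ``ZARR_`` prefix is dropped, the rest is lowercased, and a double underscore marks a nested key. The sketch below illustrates that rule with the standard library only. It is an assumption-based illustration, not donfig's or zarr's code, and ``env_to_config`` is a hypothetical helper name.

```python
import ast


def env_to_config(environ, prefix="ZARR_"):
    """Collect prefixed environment variables into a {dotted.key: value} dict.

    Hypothetical helper: it mimics the naming rule implied by
    ZARR_ARRAY__ORDER=F and is not part of zarr or donfig.
    """
    collected = {}
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated variable, skip it
        # "ZARR_ARRAY__ORDER" -> "array.order"
        key = name[len(prefix):].lower().replace("__", ".")
        try:
            # Interpret numbers, booleans, lists and so on when possible
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything that is not a Python literal stays a plain string
            value = raw
        collected[key] = value
    return collected


example = env_to_config(
    {"ZARR_ARRAY__ORDER": "F", "ZARR_ASYNC__CONCURRENCY": "10", "HOME": "/root"}
)
print(example)
```

Under these assumptions the two ``ZARR_`` variables become the dotted keys ``array.order`` and ``async.concurrency``, matching the option names listed in the new page, while unrelated variables are skipped.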
diff --git a/docs/user-guide/extending.rst b/docs/user-guide/extending.rst
@@ -6,8 +6,78 @@ Zarr-Python 3 was designed to be extensible. This means that you can extend
 the library by writing custom classes and plugins. Currently, Zarr can be extended
 in the following ways:
 
-1. Writing custom stores
-2. Writing custom codecs
+Custom stores
+-------------
+
+
+Custom codecs
+-------------
+
+There are three types of codecs in Zarr: array-to-array, array-to-bytes, and bytes-to-bytes.
+Array-to-array codecs are used to transform the n-dimensional array data before serializing
+to bytes. Examples include delta encoding or scaling codecs. Array-to-bytes codecs are used
+for serializing the array data to bytes. In Zarr, the main codec to use for numeric arrays
+is the :class:`zarr.codecs.BytesCodec`. Bytes-to-bytes codecs transform the serialized bytestreams
+of the array data. Examples include compression codecs, such as
+:class:`zarr.codecs.GzipCodec`, :class:`zarr.codecs.BloscCodec` or
+:class:`zarr.codecs.ZstdCodec`, and codecs that add a checksum to the bytestream, such as
+:class:`zarr.codecs.Crc32cCodec`.
+
+Custom codecs for Zarr are implemented by subclassing the relevant base class, see
+:class:`zarr.abc.codec.ArrayArrayCodec`, :class:`zarr.abc.codec.ArrayBytesCodec` and
+:class:`zarr.abc.codec.BytesBytesCodec`. Most custom codecs should implement the
+``_encode_single`` and ``_decode_single`` methods. These methods operate on single chunks
+of the array data. Alternatively, custom codecs can implement the ``encode`` and ``decode``
+methods, which operate on batches of chunks, in case the codec is intended to implement
+its own batch processing.
+
+Custom codecs should also implement the following methods:
+
+- ``compute_encoded_size``, which returns the byte size of the encoded data given the byte
+  size of the original data. It should raise ``NotImplementedError`` for codecs with
+  variable-sized outputs, such as compression codecs.
+- ``validate``, which can be used to check that the codec metadata is compatible with the
+  array metadata. It should raise errors if not.
+- ``resolve_metadata`` (optional), which is important for codecs that change the shape,
+  dtype or fill value of a chunk.
+- ``evolve_from_array_spec`` (optional), which can be useful for automatically filling in
+  codec configuration metadata from the array metadata.
+
+To use custom codecs in Zarr, they need to be registered using the
+`entrypoint mechanism <https://packaging.python.org/en/latest/specifications/entry-points/>`_.
+Commonly, entrypoints are declared in the ``pyproject.toml`` of your package under the
+``[project.entry-points."zarr.codecs"]`` section. Zarr will automatically discover and
+load all codecs registered with the entrypoint mechanism from imported modules.
+
+.. code-block:: toml
+
+    [project.entry-points."zarr.codecs"]
+    "custompackage.fancy_codec" = "custompackage:FancyCodec"
+
+New codecs need to have their own unique identifier. To avoid naming collisions, it is
+strongly recommended to prefix the codec identifier with a unique name. For example,
+the codecs from ``numcodecs`` are prefixed with ``numcodecs.``, e.g. ``numcodecs.delta``.
+
+.. note::
+    The extension mechanism for Zarr version 3 is still under development.
+    Requirements for custom codecs including the choice of codec identifiers might
+    change in the future.
+
+It is also possible to register codecs as replacements for existing codecs. This might be
+useful for providing specialized implementations, such as GPU-based codecs. When multiple
+implementations of a codec are registered, the :mod:`zarr.core.config` mechanism can be used
+to select the preferred implementation.
+
+.. note::
+    This section explains how custom codecs can be created for Zarr version 3. For Zarr
+    version 2, codecs should subclass the
+    `numcodecs.abc.Codec <https://numcodecs.readthedocs.io/en/stable/abc.html#numcodecs.abc.Codec>`_
+    base class and register through
+    `numcodecs.registry.register_codec <https://numcodecs.readthedocs.io/en/stable/registry.html#numcodecs.registry.register_codec>`_.
+
+
+Other extensions
+----------------
 
 In the future, Zarr will support writing custom data types and chunk grids.
 

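Reviewer note on the three codec types described in extending.rst: the sketch below chains one stage of each kind using only the standard library, so the order of operations is visible without any zarr classes. It is an illustration under stated assumptions, not the zarr codec API. All function names are hypothetical, the "array" is a plain list of integers, and ``zlib`` stands in for a compression codec.

```python
import struct
import zlib


def delta_encode(values):
    """Array-to-array stage: keep the first value, then store differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the original values."""
    restored, total = [], 0
    for delta in deltas:
        total += delta
        restored.append(total)
    return restored


def to_bytes(values):
    """Array-to-bytes stage: serialize as little-endian 64-bit integers."""
    return struct.pack(f"<{len(values)}q", *values)


def from_bytes(raw):
    """Inverse of to_bytes."""
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))


def encode_chunk(values):
    """Apply the stages in order; the last one is bytes-to-bytes compression."""
    return zlib.compress(to_bytes(delta_encode(values)))


def decode_chunk(blob):
    """Undo the stages in reverse order."""
    return delta_decode(from_bytes(zlib.decompress(blob)))


chunk = [10, 12, 15, 15, 20]
blob = encode_chunk(chunk)
print(decode_chunk(blob) == chunk)
```

The serialization stage has a predictable output size (8 bytes per value here) while the compression stage does not, which is the distinction the ``compute_encoded_size`` bullet draws between fixed-size and variable-size codecs.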
diff --git a/docs/user-guide/index.rst b/docs/user-guide/index.rst
@@ -10,6 +10,7 @@ User Guide
     arrays
     groups
     storage
+    config
     v3_migration
     todo
 

diff --git a/docs/user-guide/v3_migration.rst b/docs/user-guide/v3_migration.rst
@@ -93,6 +93,9 @@ The Array class
 1. Disallow direct construction - use :func:`zarr.open_array` or :func:`zarr.create_array`
    instead of directly constructing the :class:`zarr.Array` class.
 
+2. Defaulting to ``zarr_format=3`` - newly created arrays will use version 3 of the
+   Zarr specification. To continue using version 2, set ``zarr_format=2`` when creating arrays.
+
 The Group class
 ~~~~~~~~~~~~~~~
 
@@ -131,6 +134,30 @@ Dependencies Changes
 - The ``jupyter`` optional dependency group has been removed, since v3 contains no
   jupyter specific functionality.
 
+Configuration
+~~~~~~~~~~~~~
+
+There is a new configuration system based on `donfig <https://github.com/pytroll/donfig>`_,
+which can be accessed via :mod:`zarr.core.config`.
+Configuration values can be set using code like the following:
+
+.. code-block:: python
+
+   import zarr
+   zarr.config.set({"array.order": "F"})
+
+Alternatively, configuration values can be set using environment variables,
+e.g. ``ZARR_ARRAY__ORDER=F``.
+
+Configuration options include the following:
+
+- Default Zarr format ``default_zarr_version``
+- Default array order in memory ``array.order``
+- Default codecs ``array.v3_default_codecs`` and ``array.v2_default_compressor``
+- Whether empty chunks are written to storage ``array.write_empty_chunks``
+- Async and threading options, e.g. ``async.concurrency`` and ``threading.max_workers``
+- Selections of implementations of codecs, codec pipelines and buffers
+
 Miscellaneous
 ~~~~~~~~~~~~~
 

diff --git a/src/zarr/api/asynchronous.py b/src/zarr/api/asynchronous.py
@@ -10,13 +10,16 @@
 from typing_extensions import deprecated
 
 from zarr.core.array import Array, AsyncArray, get_array_metadata
+from zarr.core.array_spec import ArrayConfig, ArrayConfigParams
 from zarr.core.buffer import NDArrayLike
 from zarr.core.common import (
     JSON,
     AccessModeLiteral,
     ChunkCoords,
     MemoryOrder,
     ZarrFormat,
+    _warn_order_kwarg,
+    _warn_write_empty_chunks_kwarg,
     parse_dtype,
 )
 from zarr.core.config import config
@@ -794,7 +797,7 @@ async def create(
     read_only: bool | None = None,
     object_codec: Codec | None = None,  # TODO: type has changed
     dimension_separator: Literal[".", "/"] | None = None,
-    write_empty_chunks: bool = False,  # TODO: default has changed
+    write_empty_chunks: bool | None = None,
     zarr_version: ZarrFormat | None = None,  # deprecated
     zarr_format: ZarrFormat | None = None,
     meta_array: Any | None = None,  # TODO: need type
@@ -810,6 +813,7 @@ async def create(
     codecs: Iterable[Codec | dict[str, JSON]] | None = None,
     dimension_names: Iterable[str] | None = None,
     storage_options: dict[str, Any] | None = None,
+    config: ArrayConfig | ArrayConfigParams | None = None,
     **kwargs: Any,
 ) -> AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]:
     """Create an array.
@@ -856,8 +860,10 @@ async def create(
         These defaults can be changed by modifying the value of ``array.v2_default_compressor`` in :mod:`zarr.core.config`.    fill_value : object
         Default value to use for uninitialized portions of the array.
     order : {'C', 'F'}, optional
+        Deprecated in favor of the ``config`` keyword argument.
+        Pass ``{'order': <value>}`` to ``create`` instead of using this parameter.
         Memory layout to be used within each chunk.
-        If not specified, default is taken from the Zarr config ```array.order```.
+        If not specified, the ``array.order`` parameter in the global config will be used.
     store : Store or str
         Store or path to directory in file system or name of zip file.
     synchronizer : object, optional
@@ -891,30 +897,26 @@ async def create(
         Separator placed between the dimensions of a chunk.
         V2 only. V3 arrays should use ``chunk_key_encoding`` instead.
         Default is ".".
-        .. versionadded:: 2.8
-
     write_empty_chunks : bool, optional
-        If True (default), all chunks will be stored regardless of their
+        Deprecated in favor of the ``config`` keyword argument.
+        Pass ``{'write_empty_chunks': <value>}`` to ``create`` instead of using this parameter.
+        If True, all chunks will be stored regardless of their
         contents. If False, each chunk is compared to the array's fill value
         prior to storing. If a chunk is uniformly equal to the fill value, then
         that chunk is not stored, and the store entry for that chunk's key
-        is deleted. This setting enables sparser storage, as only chunks with
-        non-fill-value data are stored, at the expense of overhead associated
-        with checking the data of each chunk.
-
-        .. versionadded:: 2.11
-
+        is deleted.
     zarr_format : {2, 3, None}, optional
         The zarr format to use when saving.
         Default is 3.
     meta_array : array-like, optional
         An array instance to use for determining arrays to create and return
         to users. Use `numpy.empty(())` by default.
-
-        .. versionadded:: 2.13
     storage_options : dict
         If using an fsspec URL to create the store, these will be passed to
         the backend implementation. Ignored otherwise.
+    config : ArrayConfig or ArrayConfigParams, optional
+        Runtime configuration of the array. If provided, will override the
+        default values from `zarr.config.array`.
 
     Returns
     -------
@@ -951,26 +953,47 @@ async def create(
         warnings.warn("object_codec is not yet implemented", RuntimeWarning, stacklevel=2)
     if read_only is not None:
         warnings.warn("read_only is not yet implemented", RuntimeWarning, stacklevel=2)
-    if dimension_separator is not None:
-        if zarr_format == 3:
-            raise ValueError(
-                "dimension_separator is not supported for zarr format 3, use chunk_key_encoding instead"
-            )
-        else:
-            warnings.warn(
-                "dimension_separator is not yet implemented",
-                RuntimeWarning,
-                stacklevel=2,
-            )
-    if write_empty_chunks:
-        warnings.warn("write_empty_chunks is not yet implemented", RuntimeWarning, stacklevel=2)
+    if dimension_separator is not None and zarr_format == 3:
+        raise ValueError(
+            "dimension_separator is not supported for zarr format 3, use chunk_key_encoding instead"
+        )
+
+    if order is not None:
+        _warn_order_kwarg()
+    if write_empty_chunks is not None:
+        _warn_write_empty_chunks_kwarg()
+
     if meta_array is not None:
         warnings.warn("meta_array is not yet implemented", RuntimeWarning, stacklevel=2)
 
     mode = kwargs.pop("mode", None)
     if mode is None:
         mode = "a"
     store_path = await make_store_path(store, path=path, mode=mode, storage_options=storage_options)
+
+    config_dict: ArrayConfigParams = {}
+
+    if write_empty_chunks is not None:
+        if config is not None:
+            msg = (
+                "Both write_empty_chunks and config keyword arguments are set. "
+                "This is redundant. When both are set, write_empty_chunks will be ignored and "
+                "config will be used."
+            )
+            warnings.warn(UserWarning(msg), stacklevel=1)
+        config_dict["write_empty_chunks"] = write_empty_chunks
+    if order is not None:
+        if config is not None:
+            msg = (
+                "Both order and config keyword arguments are set. "
+                "This is redundant. When both are set, order will be ignored and "
+                "config will be used."
+            )
+            warnings.warn(UserWarning(msg), stacklevel=1)
+        config_dict["order"] = order
+
+    config_parsed = ArrayConfig.from_dict(config_dict)
+
     return await AsyncArray.create(
         store_path,
         shape=shape,
@@ -987,7 +1010,7 @@ async def create(
         codecs=codecs,
         dimension_names=dimension_names,
         attributes=attributes,
-        order=order,
+        config=config_parsed,
         **kwargs,
     )
 
@@ -1163,6 +1186,11 @@ async def open_array(
 
     zarr_format = _handle_zarr_version_or_format(zarr_version=zarr_version, zarr_format=zarr_format)
 
+    if "order" in kwargs:
+        _warn_order_kwarg()
+    if "write_empty_chunks" in kwargs:
+        _warn_write_empty_chunks_kwarg()
+
     try:
         return await AsyncArray.open(store_path, zarr_format=zarr_format)
     except FileNotFoundError:
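
Reviewer note on the deprecated keywords in ``create``: the warning messages added in this file state that when both a deprecated keyword (``order`` or ``write_empty_chunks``) and ``config`` are passed, the keyword is ignored and ``config`` is used. The stand-alone sketch below shows one way to express that precedence rule. It is an assumption-based illustration: plain dicts stand in for ``ArrayConfig`` and ``ArrayConfigParams``, and ``resolve_array_config`` is a hypothetical name, not a zarr function.

```python
import warnings


def resolve_array_config(order=None, write_empty_chunks=None, config=None):
    """Combine deprecated keywords with an explicit config dict.

    Hypothetical helper: follows the rule stated in the warning text,
    where an explicit config wins over the deprecated keywords.
    """
    deprecated = {"order": order, "write_empty_chunks": write_empty_chunks}
    resolved = {}
    for name, value in deprecated.items():
        if value is None:
            continue  # keyword not passed, nothing to resolve
        if config is not None:
            warnings.warn(
                f"Both {name} and config keyword arguments are set. "
                f"This is redundant. When both are set, {name} will be ignored "
                "and config will be used.",
                UserWarning,
                stacklevel=2,
            )
            continue  # ignore the deprecated keyword, as the message says
        resolved[name] = value
    if config is not None:
        resolved.update(config)  # the explicit config takes precedence
    return resolved


print(resolve_array_config(order="F"))
```

With only a deprecated keyword set, its value is carried into the result. With both set, a ``UserWarning`` is emitted and only the values from ``config`` survive.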