Merge branch 'doc/3.0-updates' of github.com:jhamman/zarr-python into doc/3.0-updates
jhamman committed Dec 20, 2024
2 parents 61b4477 + a829fbb commit dcb2e39
Showing 19 changed files with 540 additions and 111 deletions.
41 changes: 41 additions & 0 deletions docs/user-guide/config.rst
@@ -0,0 +1,41 @@
Runtime configuration
=====================

The :mod:`zarr.core.config` module is responsible for managing the configuration of zarr
and is based on the `donfig <https://github.com/pytroll/donfig>`_ Python library.

Configuration values can be set using code like the following:

.. code-block:: python

    import zarr
    zarr.config.set({"array.order": "F"})

Alternatively, configuration values can be set using environment variables, e.g.
``ZARR_ARRAY__ORDER=F``.
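
The environment-variable form mirrors the dotted config key: the ``ZARR_`` prefix is dropped, the name is lowercased, and a double underscore stands for a dot. The sketch below illustrates that naming convention using only the standard library. The helper ``env_to_config`` is hypothetical and is not donfig's actual code; the literal parsing of values is an assumption about how such variables are typically interpreted.

```python
import ast


def env_to_config(env: dict[str, str], prefix: str = "ZARR_") -> dict[str, object]:
    """Map prefixed environment variables to dotted config keys (illustration only)."""
    result: dict[str, object] = {}
    for name, value in env.items():
        if not name.startswith(prefix):
            continue  # unrelated variables are ignored
        # ZARR_ARRAY__ORDER -> array.order
        key = name[len(prefix):].lower().replace("__", ".")
        try:
            # Assumption: values that look like Python literals are parsed as such.
            result[key] = ast.literal_eval(value)
        except (SyntaxError, ValueError):
            result[key] = value  # anything else stays a plain string
    return result


settings = env_to_config(
    {"ZARR_ARRAY__ORDER": "F", "ZARR_ASYNC__CONCURRENCY": "8", "HOME": "/root"}
)
print(settings)
```

Here ``ZARR_ARRAY__ORDER`` corresponds to ``array.order`` and ``ZARR_ASYNC__CONCURRENCY`` to ``async.concurrency``.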

The configuration can also be read from a YAML file in standard locations.
For more information, see the
`donfig documentation <https://donfig.readthedocs.io/en/latest/>`_.

Configuration options include the following:

- Default Zarr format ``default_zarr_version``
- Default array order in memory ``array.order``
- Default codecs ``array.v3_default_codecs`` and ``array.v2_default_compressor``
- Whether empty chunks are written to storage ``array.write_empty_chunks``
- Async and threading options, e.g. ``async.concurrency`` and ``threading.max_workers``
- Selections of implementations of codecs, codec pipelines and buffers

For selecting custom implementations of codecs, pipelines, buffers and ndbuffers,
first register the implementations in the registry and then select them in the config.
For example, to use an implementation of the bytes codec provided by the class
``custompackage.NewBytesCodec``, set ``codecs.bytes.name`` to ``custompackage.NewBytesCodec``.
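
The two steps (register, then select) can be pictured with a small standard-library model. Everything in this sketch is hypothetical: the ``registry`` dict, the ``register`` and ``select`` helpers, and both codec classes are stand-ins meant to show the idea, not Zarr's registry API.

```python
from __future__ import annotations

# Hypothetical registry: codec identifier -> {fully qualified class name: class}.
registry: dict[str, dict[str, type]] = {}


def register(identifier: str, cls: type) -> str:
    """Step 1: make an implementation known under its fully qualified name."""
    qualified = f"{cls.__module__}.{cls.__qualname__}"
    registry.setdefault(identifier, {})[qualified] = cls
    return qualified


def select(identifier: str, config: dict[str, str]) -> type:
    """Step 2: look up the implementation that the config names."""
    wanted = config[f"codecs.{identifier}.name"]
    return registry[identifier][wanted]


class DefaultBytesCodec:  # stand-in for a built-in implementation
    pass


class NewBytesCodec:  # stand-in for a custom implementation
    pass


register("bytes", DefaultBytesCodec)
custom_name = register("bytes", NewBytesCodec)

# The config value decides which of the registered classes is used.
config = {"codecs.bytes.name": custom_name}
chosen = select("bytes", config)
print(chosen.__name__)
```

Both implementations stay registered; only the config entry changes which one is picked.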

This is the current default configuration:

.. ipython:: python

    import zarr
    zarr.config.pprint()

74 changes: 72 additions & 2 deletions docs/user-guide/extending.rst
@@ -6,8 +6,78 @@ Zarr-Python 3 was designed to be extensible. This means that you can extend
the library by writing custom classes and plugins. Currently, Zarr can be extended
in the following ways:

1. Writing custom stores
2. Writing custom codecs
Custom stores
-------------


Custom codecs
-------------

There are three types of codecs in Zarr: array-to-array, array-to-bytes, and bytes-to-bytes.
Array-to-array codecs are used to transform the n-dimensional array data before serializing
to bytes. Examples include delta encoding or scaling codecs. Array-to-bytes codecs are used
for serializing the array data to bytes. In Zarr, the main codec to use for numeric arrays
is the :class:`zarr.codecs.BytesCodec`. Bytes-to-bytes codecs transform the serialized bytestreams
of the array data. Examples include compression codecs, such as
:class:`zarr.codecs.GzipCodec`, :class:`zarr.codecs.BloscCodec` or
:class:`zarr.codecs.ZstdCodec`, and codecs that add a checksum to the bytestream, such as
:class:`zarr.codecs.Crc32cCodec`.

Custom codecs for Zarr are implemented by subclassing the relevant base class, see
:class:`zarr.abc.codec.ArrayArrayCodec`, :class:`zarr.abc.codec.ArrayBytesCodec` and
:class:`zarr.abc.codec.BytesBytesCodec`. Most custom codecs should implement the
``_encode_single`` and ``_decode_single`` methods. These methods operate on single chunks
of the array data. Alternatively, custom codecs can implement the ``encode`` and ``decode``
methods, which operate on batches of chunks, in case the codec is intended to implement
its own batch processing.

Custom codecs should also implement the following methods:

- ``compute_encoded_size``, which returns the byte size of the encoded data given the byte
  size of the original data. It should raise ``NotImplementedError`` for codecs with
  variable-sized outputs, such as compression codecs.
- ``validate``, which can be used to check that the codec metadata is compatible with the
  array metadata. It should raise errors if not.
- ``resolve_metadata`` (optional), which is important for codecs that change the shape,
  dtype or fill value of a chunk.
- ``evolve_from_array_spec`` (optional), which can be useful for automatically filling in
  codec configuration metadata from the array metadata.
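
To make the shape of these methods concrete, here is a minimal standalone sketch of a bytes-to-bytes style codec that appends a 4-byte checksum. It is an illustration under assumptions: the class ``Crc32Suffix`` is hypothetical, it does not subclass :class:`zarr.abc.codec.BytesBytesCodec`, it works on plain ``bytes``, and its simplified signatures omit the extra arguments the real base classes expect. It uses plain CRC-32 from ``zlib``, not CRC-32C.

```python
import asyncio
import zlib


class Crc32Suffix:
    """Hypothetical stand-in for a bytes-to-bytes codec (not a zarr subclass)."""

    async def _encode_single(self, chunk_bytes: bytes) -> bytes:
        # Operates on one chunk: append a little-endian CRC-32 of the payload.
        checksum = zlib.crc32(chunk_bytes).to_bytes(4, "little")
        return chunk_bytes + checksum

    async def _decode_single(self, chunk_bytes: bytes) -> bytes:
        # Split off the trailing checksum and verify it before returning the payload.
        payload, stored = chunk_bytes[:-4], chunk_bytes[-4:]
        if zlib.crc32(payload).to_bytes(4, "little") != stored:
            raise ValueError("checksum mismatch")
        return payload

    def compute_encoded_size(self, input_byte_length: int) -> int:
        # Fixed overhead, so the size is predictable. A compression codec
        # would raise NotImplementedError here instead.
        return input_byte_length + 4


async def roundtrip(data: bytes) -> bytes:
    codec = Crc32Suffix()
    encoded = await codec._encode_single(data)
    assert len(encoded) == codec.compute_encoded_size(len(data))
    return await codec._decode_single(encoded)


result = asyncio.run(roundtrip(b"chunk data"))
print(result)
```

Because the overhead is constant, ``compute_encoded_size`` can return an exact value, which is the case the first bullet above describes.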

To use custom codecs in Zarr, they need to be registered using the
`entrypoint mechanism <https://packaging.python.org/en/latest/specifications/entry-points/>`_.
Commonly, entrypoints are declared in the ``pyproject.toml`` of your package under the
``[project.entry-points."zarr.codecs"]`` section. Zarr will automatically discover and
load all codecs registered with the entrypoint mechanism from imported modules.

.. code-block:: toml

    [project.entry-points."zarr.codecs"]
    "custompackage.fancy_codec" = "custompackage:FancyCodec"

New codecs need to have their own unique identifier. To avoid naming collisions, it is
strongly recommended to prefix the codec identifier with a unique name. For example,
the codecs from ``numcodecs`` are prefixed with ``numcodecs.``, e.g. ``numcodecs.delta``.

.. note::
    The extension mechanism for Zarr version 3 is still under development.
    Requirements for custom codecs, including the choice of codec identifiers, might
    change in the future.

It is also possible to register codecs as replacements for existing codecs. This might be
useful for providing specialized implementations, such as GPU-based codecs. When several
implementations of the same codec are registered, the :mod:`zarr.core.config` mechanism
can be used to select the preferred one.

.. note::
    This section explains how custom codecs can be created for Zarr version 3. For Zarr
    version 2, codecs should subclass the
    `numcodecs.abc.Codec <https://numcodecs.readthedocs.io/en/stable/abc.html#numcodecs.abc.Codec>`_
    base class and register through
    `numcodecs.registry.register_codec <https://numcodecs.readthedocs.io/en/stable/registry.html#numcodecs.registry.register_codec>`_.


Other extensions
----------------

In the future, Zarr will support writing custom data types and chunk grids.

1 change: 1 addition & 0 deletions docs/user-guide/index.rst
@@ -10,6 +10,7 @@ User Guide
arrays
groups
storage
config
v3_migration
todo

27 changes: 27 additions & 0 deletions docs/user-guide/v3_migration.rst
@@ -93,6 +93,9 @@ The Array class
1. Disallow direct construction - use :func:`zarr.open_array` or :func:`zarr.create_array`
instead of directly constructing the :class:`zarr.Array` class.

2. Defaulting to ``zarr_format=3`` - newly created arrays will use version 3 of the
Zarr specification. To continue using version 2, set ``zarr_format=2`` when creating arrays.

The Group class
~~~~~~~~~~~~~~~

@@ -131,6 +134,30 @@ Dependencies Changes
- The ``jupyter`` optional dependency group has been removed, since v3 contains no
jupyter specific functionality.

Configuration
~~~~~~~~~~~~~

There is a new configuration system based on `donfig <https://github.com/pytroll/donfig>`_,
which can be accessed via :mod:`zarr.core.config`.
Configuration values can be set using code like the following:

.. code-block:: python

    import zarr
    zarr.config.set({"array.order": "F"})

Alternatively, configuration values can be set using environment variables,
e.g. ``ZARR_ARRAY__ORDER=F``.

Configuration options include the following:

- Default Zarr format ``default_zarr_version``
- Default array order in memory ``array.order``
- Default codecs ``array.v3_default_codecs`` and ``array.v2_default_compressor``
- Whether empty chunks are written to storage ``array.write_empty_chunks``
- Async and threading options, e.g. ``async.concurrency`` and ``threading.max_workers``
- Selections of implementations of codecs, codec pipelines and buffers

Miscellaneous
~~~~~~~~~~~~~

82 changes: 55 additions & 27 deletions src/zarr/api/asynchronous.py
@@ -10,13 +10,16 @@
from typing_extensions import deprecated

from zarr.core.array import Array, AsyncArray, get_array_metadata
from zarr.core.array_spec import ArrayConfig, ArrayConfigParams
from zarr.core.buffer import NDArrayLike
from zarr.core.common import (
JSON,
AccessModeLiteral,
ChunkCoords,
MemoryOrder,
ZarrFormat,
_warn_order_kwarg,
_warn_write_empty_chunks_kwarg,
parse_dtype,
)
from zarr.core.config import config
@@ -794,7 +797,7 @@ async def create(
read_only: bool | None = None,
object_codec: Codec | None = None, # TODO: type has changed
dimension_separator: Literal[".", "/"] | None = None,
write_empty_chunks: bool = False, # TODO: default has changed
write_empty_chunks: bool | None = None,
zarr_version: ZarrFormat | None = None, # deprecated
zarr_format: ZarrFormat | None = None,
meta_array: Any | None = None, # TODO: need type
@@ -810,6 +813,7 @@ async def create(
codecs: Iterable[Codec | dict[str, JSON]] | None = None,
dimension_names: Iterable[str] | None = None,
storage_options: dict[str, Any] | None = None,
config: ArrayConfig | ArrayConfigParams | None = None,
**kwargs: Any,
) -> AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]:
"""Create an array.
@@ -856,8 +860,10 @@ async def create(
These defaults can be changed by modifying the value of ``array.v2_default_compressor`` in :mod:`zarr.core.config`.
fill_value : object
Default value to use for uninitialized portions of the array.
order : {'C', 'F'}, optional
Deprecated in favor of the ``config`` keyword argument.
Pass ``{'order': <value>}`` to ``create`` instead of using this parameter.
Memory layout to be used within each chunk.
If not specified, default is taken from the Zarr config ```array.order```.
If not specified, the ``array.order`` parameter in the global config will be used.
store : Store or str
Store or path to directory in file system or name of zip file.
synchronizer : object, optional
@@ -891,30 +897,26 @@ async def create(
Separator placed between the dimensions of a chunk.
V2 only. V3 arrays should use ``chunk_key_encoding`` instead.
Default is ".".
.. versionadded:: 2.8
write_empty_chunks : bool, optional
If True (default), all chunks will be stored regardless of their
Deprecated in favor of the ``config`` keyword argument.
Pass ``{'write_empty_chunks': <value>}`` to ``create`` instead of using this parameter.
If True, all chunks will be stored regardless of their
contents. If False, each chunk is compared to the array's fill value
prior to storing. If a chunk is uniformly equal to the fill value, then
that chunk is not stored, and the store entry for that chunk's key
is deleted. This setting enables sparser storage, as only chunks with
non-fill-value data are stored, at the expense of overhead associated
with checking the data of each chunk.
.. versionadded:: 2.11
is deleted.
zarr_format : {2, 3, None}, optional
The zarr format to use when saving.
Default is 3.
meta_array : array-like, optional
An array instance to use for determining arrays to create and return
to users. Use `numpy.empty(())` by default.
.. versionadded:: 2.13
storage_options : dict
If using an fsspec URL to create the store, these will be passed to
the backend implementation. Ignored otherwise.
config : ArrayConfig or ArrayConfigParams, optional
Runtime configuration of the array. If provided, will override the
default values from `zarr.config.array`.
Returns
-------
@@ -951,26 +953,47 @@ async def create(
warnings.warn("object_codec is not yet implemented", RuntimeWarning, stacklevel=2)
if read_only is not None:
warnings.warn("read_only is not yet implemented", RuntimeWarning, stacklevel=2)
if dimension_separator is not None:
if zarr_format == 3:
raise ValueError(
"dimension_separator is not supported for zarr format 3, use chunk_key_encoding instead"
)
else:
warnings.warn(
"dimension_separator is not yet implemented",
RuntimeWarning,
stacklevel=2,
)
if write_empty_chunks:
warnings.warn("write_empty_chunks is not yet implemented", RuntimeWarning, stacklevel=2)
if dimension_separator is not None and zarr_format == 3:
raise ValueError(
"dimension_separator is not supported for zarr format 3, use chunk_key_encoding instead"
)

if order is not None:
_warn_order_kwarg()
if write_empty_chunks is not None:
_warn_write_empty_chunks_kwarg()

if meta_array is not None:
warnings.warn("meta_array is not yet implemented", RuntimeWarning, stacklevel=2)

mode = kwargs.pop("mode", None)
if mode is None:
mode = "a"
store_path = await make_store_path(store, path=path, mode=mode, storage_options=storage_options)

config_dict: ArrayConfigParams = {}

if write_empty_chunks is not None:
if config is not None:
msg = (
"Both write_empty_chunks and config keyword arguments are set. "
"This is redundant. When both are set, write_empty_chunks will be ignored and "
"config will be used."
)
warnings.warn(UserWarning(msg), stacklevel=1)
config_dict["write_empty_chunks"] = write_empty_chunks
if order is not None:
if config is not None:
msg = (
"Both order and config keyword arguments are set. "
"This is redundant. When both are set, order will be ignored and "
"config will be used."
)
warnings.warn(UserWarning(msg), stacklevel=1)
config_dict["order"] = order

config_parsed = ArrayConfig.from_dict(config_dict)

return await AsyncArray.create(
store_path,
shape=shape,
@@ -987,7 +1010,7 @@ async def create(
codecs=codecs,
dimension_names=dimension_names,
attributes=attributes,
order=order,
config=config_parsed,
**kwargs,
)

@@ -1163,6 +1186,11 @@ async def open_array(

zarr_format = _handle_zarr_version_or_format(zarr_version=zarr_version, zarr_format=zarr_format)

if "order" in kwargs:
_warn_order_kwarg()
if "write_empty_chunks" in kwargs:
_warn_write_empty_chunks_kwarg()

try:
return await AsyncArray.open(store_path, zarr_format=zarr_format)
except FileNotFoundError: