
WIP: Zarr backend #1528

Merged: 85 commits, Dec 14, 2017
Conversation

@rabernat (Contributor) commented Aug 27, 2017

I think that a zarr backend could be the ideal storage format for xarray datasets, overcoming many of the frustrations associated with netcdf and enabling optimal performance on cloud platforms.

This is a very basic start to implementing a zarr backend (as proposed in #1223); however, I am taking a somewhat different approach. I store the whole dataset in a single zarr group. I encode the extra metadata needed by xarray (so far just dimension information) as attributes within the zarr group and child arrays. I hide these special attributes from the user by wrapping the attribute dictionaries in a "HiddenKeyDict", so that they can't be viewed or modified.

I have no tests yet (:flushed:), but the following code works.

from xarray.backends.zarr import ZarrStore
import xarray as xr
import numpy as np

ds = xr.Dataset(
    {'foo': (('y', 'x'), np.ones((100, 200)), {'myattr1': 1, 'myattr2': 2}),
     'bar': (('x',), np.zeros(200))},
    {'y': (('y',), np.arange(100)),
     'x': (('x',), np.arange(200))},
    {'some_attr': 'copana'}
).chunk({'y': 50, 'x': 40})

zs = ZarrStore(store='zarr_test')
ds.dump_to_store(zs)
ds2 = xr.Dataset.load_store(zs)
assert ds2.equals(ds)

There is a very long way to go here, but I thought I would just get a PR started. Some questions that would help me move forward.

  1. What is "encoding" at the variable level? (I have never understood this part of xarray.) How should encoding be handled with zarr?
  2. Should we encode / decode CF for zarr stores?
  3. Do we want to always automatically align dask chunks with the underlying zarr chunks?
  4. What sort of public API should the zarr backend have? Should you be able to load zarr stores via open_dataset? Or do we need a new method? I think .to_zarr() would be quite useful.
  5. zarr arrays are extensible along all axes. What does this imply for unlimited dimensions?
  6. Is any autoclose logic needed? As far as I can tell, zarr objects don't need to be closed.

@rabernat (Contributor, Author)

cc @martindurant, @mrocklin, @alimanfoo

@martindurant (Contributor)

Sorry that I let this slide - there was not a huge upswell of interest around what I had done, and I was not ready to dive into xarray internals.
Could you comment more on the difference between your approach and mine? Is the aim to reduce the number of metadata files hanging around? zarr has made an effort with the groups interface to parallel netCDF, which is, after all, what xarray essentially expects of all its data sources.

As in this comment, I have come to the realisation that although nice to/from-zarr methods can be written relatively easily, they will not get traction unless they are put behind a class that mimics the existing xarray infrastructure, i.e., the user would never know the difference, except that magically they have extra encoding/compression options, the file path can be an S3 URL (say), and dask parallel computation suddenly works on a cluster and/or out of core.
That would raise some eyebrows!

@rabernat (Contributor, Author)

Could you comment more on the difference between your approach and mine?

Your functions are a great proof of concept for the relative ease of interoperability between xarray and zarr. What I have done here is to implement an xarray "backend" (i.e. DataStore) that uses zarr as its storage medium. This puts zarr on the same level as netCDF and HDF5 as a "first class" storage format for xarray data, as suggested by @shoyer in the comment on that thread. My hope is that this will enable the magical performance benefits that you have anticipated.

Digging deeper into that thread, I see @shoyer makes the following proposition:

So we could either directly write a DataStore or write a separate "znetcdf" or "netzdf" module that implements an interface similar to h5netcdf (which itself is a thin wrapper on top of h5py).

With this PR, I have started to do the former (write a DataStore). However, I can already see the wisdom of what he says next:

All things being equal, I would prefer the later approach, because people seem to find these intermediate interfaces useful, and it would help clarify the specification of the file format vs. details of how xarray uses it.

I have already implemented my own custom DataStore for a different project, so I felt comfortable diving into this. But I might end up reinventing the wheel several times over if I continue down this road. In particular, I can see that my HiddenKeyDict is very similar to h5netcdf's treatment of attributes. (I had never looked at the h5netcdf code until just now!)

On the other hand, zarr is so simple to use that a separate wrapper package might be overkill.

So I am still not sure whether the approach I am taking here is worth pursuing further. I consider this a highly experimental PR, and I'm really looking for feedback.

@rabernat (Contributor, Author)

Is the aim to reduce the number of metadata files hanging around?

This is also part of my goal. I think all the metadata can be stored internally to zarr via attributes. There just have to be some "special" attributes that xarray hides from the user. This is the same as h5netcdf.

@alimanfoo suggested this should be possible in that earlier thread:

Specifically I'm wondering if this could all be stored as attributes on the
Zarr array, with some conventions for special xarray attribute names?
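
For illustration, a rough sketch of what the raw zarr metadata looks like under such a convention, using the example dataset from the top of this PR (the attribute key shown is the '_ARRAY_DIMENSIONS' spelling the discussion below converges on; the current commits spell it '_XARRAY_DIMENSIONS'):

import zarr

g = zarr.open_group('zarr_test')
# dimension names sit alongside the user's own attributes on each array ...
print(dict(g['foo'].attrs))
# e.g. {'_ARRAY_DIMENSIONS': ['y', 'x'], 'myattr1': 1, 'myattr2': 2}
# ... and the store wraps .attrs in a HiddenKeyDict so the special key is
# filtered out before the user ever sees the attributes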

@martindurant (Contributor)

@rabernat: on actually looking through your code :) Happy to see you doing exactly what I felt I was not knowledgeable enough to do and poking into xarray's guts. If I can help in any way, please let me know, although I don't have a lot of spare hours right now.

@shoyer (Member) left a comment

This is slightly easier than I thought, to be honest :). Given the simplicity of the spec on top of zarr (just adding dimensions), we probably don't need the separate wrapper -- we should just describe it well in the docs.

# the first question is whether it should be based on BaseNetCDF4Array or
# NdimSizeLenMixing?

# or maybe we don't need wrappers at all? probably not true
Member:

I actually think we probably don't need a wrapper at all -- zarr already defines all these attributes!

Contributor Author:

This time around I did add the wrapper.

first_chunk = all_chunks.next()
for this_chunk in all_chunks:
    if not (this_chunk == first_chunk):
        raise ValueError("zarr requires uniform chunk sizes, found %s" %
Member:

Calling rechunk() to make chunks uniform might be more user friendly here.

Note that zarr does allow chunks that overlap the edge of the array (i.e., the last chunk of a dask array). This use case might be important when storing arrays with unusual dimension sizes (e.g., prime numbers).
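
For example, a minimal sketch of the rechunking idea (shapes and chunk sizes here are arbitrary):

import dask.array as da

# a dask array with irregular chunks along the first axis
arr = da.ones((100, 200), chunks=((50, 30, 20), (100, 100)))

# coerce to uniform chunks before handing the data to zarr; the final chunk
# along an axis may be smaller, which matches zarr's own chunking rules
uniform = arr.rechunk({0: 50, 1: 100})
print(uniform.chunks)  # ((50, 50), (100, 100))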

        yield k

    def __len__(self):
        return len(list(self.__iter__()))
Member:

I would certainly try to use len(self._data) here rather than iteration so this is still constant time (in practice it probably doesn't matter, though).
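
A minimal sketch of the constant-time version, assuming the wrapper keeps the underlying mapping in self._data and the hidden key names in self._hidden_keys:

def __len__(self):
    # len() on the wrapped mapping is O(1); subtract any hidden keys actually present
    num_hidden = sum(1 for k in self._hidden_keys if k in self._data)
    return len(self._data) - num_hidden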

    Acts like a normal dictionary, but hides certain keys.
    '''
    # ``__init__`` method required to create instance from class.
    def __init__(self, data, *hidden_keys):
Member:

nit: I prefer avoiding *args -- it gives more freedom to adjust APIs later (e.g., by adding keyword arguments)

Contributor Author:

Can you suggest the best way to tell whether an argument is a string or list of strings? This is something I always need to do but don't know the "correct" pythonic way to do it.

Member:

if you know it is an iterable, isinstance(var, basestring) should do.
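
For example, a common idiom (the helper name is purely illustrative; on Python 2, basestring would replace str):

def ensure_list_of_strings(value):
    # a single string is itself iterable, so check for it explicitly first
    if isinstance(value, str):
        return [value]
    return list(value)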

"""

# need some special secret attributes to tell us the dimensions
_dimension_key = '_XARRAY_DIMENSIONS'
Member:

nit: should be _DIMENSION_KEY since it's a constant.

Also: maybe better to pick something more generic for the constant value, perhaps '_ARRAY_DIMENSIONS'?


    def __init__(self, store=None, overwrite=False, chunk_store=None,
                 synchronizer=None, path=None, writer=None, autoclose=False):
        opener = functools.partial(_open_zarr_group, store, overwrite,
Member:

Let's try to follow something closer to the model for NetCDFDataStore that I suggest over in #1508:

  • open classmethod constructs the backend object for typical use cases (e.g., from a file)
  • __init__ just wraps an existing zarr group.

This preserves a little bit more flexibility for downstream users.
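
A rough sketch of that split, with illustrative names rather than a final API:

import zarr
from xarray.backends.common import AbstractWritableDataStore


class ZarrStore(AbstractWritableDataStore):

    def __init__(self, zarr_group, writer=None):
        # just wrap an existing, already-open zarr group
        self.ds = zarr_group

    @classmethod
    def open_group(cls, store, mode='a', synchronizer=None, path=None):
        # typical entry point: open (or create) a group from a store path / mapping
        zarr_group = zarr.open_group(store, mode=mode,
                                     synchronizer=synchronizer, path=path)
        return cls(zarr_group)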

@shoyer (Member) commented Aug 29, 2017

What is "encoding" at the variable level? (I have never understood this part of xarray.) How should encoding be handled with zarr?

encoding keeps track of how variables are represented in a file (e.g., chunking schemes, _FillValue/add_offset/scale_factor, compression, time units), so we can reconstruct a netCDF file that looks almost exactly like the file we've read from disk. In the case of zarr, I guess we might include chunking, fill values, compressor options....

Should we encode / decode CF for zarr stores?

Yes, probably, if we want to handle netcdf conventions for times, fill values and scaling.

Do we want to always automatically align dask chunks with the underlying zarr chunks?

This would be nice! But it's also a bigger issue (will look for the number, I think it's already been opened).

What sort of public API should the zarr backend have? Should you be able to load zarr stores via open_dataset? Or do we need a new method? I think .to_zarr() would be quite useful.

Still need to think about this one.

zarr arrays are extensible along all axes. What does this imply for unlimited dimensions?

I guess we can ignore them (maybe add a warning?) -- they're not part of the zarr data model.

Is any autoclose logic needed? As far as I can tell, zarr objects don't need to be closed.

I don't think we need any autoclose logic at all -- zarr doesn't leave open files hanging around anyway.

@rabernat (Contributor, Author)

encoding keeps track of how variables are represented in a file (e.g., chunking schemes, _FillValue/add_offset/scale_factor, compression, time units), so we can reconstruct a netCDF file that looks almost exactly like the file we've read from disk.

Is the goal here to be able to round-trip the file, such that calling .to_netcdf() produces an identical file to the original source file? For zarr, I think this would mean having the ability to read from one zarr store into xarray, and then write back to a different store, and have these two stores be identical. That makes sense to me.

I don't understand how encoding interacts with attributes. When is something an attribute vs. an encoding (add_offset, for example)? How does xarray know whether the store automatically encodes / decodes the encodings vs. when it has to be done by xarray, e.g. by calling mask_and_scale?

Should we encode / decode CF for zarr stores?

Yes, probably, if we want to handle netcdf conventions for times, fill values and scaling.

Does this mean that my ZarrStore should inherit from WritableCFDataStore instead of AbstractWritableDataStore?

Regarding encoding, zarr has its own internal mechanism for encoding, called "filters", which closely resembles some of the CF encoding options. For example, the FixedScaleOffset filter does something similar to xarray's mask_and_scale function.

I don't yet understand how to make these elements work together properly, for example, to avoid applying the scale / offset function twice, as I mentioned above.

@rabernat (Contributor, Author)

I am now trying to understand the backend test suite structure.

Can someone explain to me why so many tests are skipped? For example, if I run

 py.test -v xarray/tests/test_backends.py  -rsx -k GenericNetCDFDataTest

I get

================================================== test session starts ==================================================
platform darwin -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /Users/rpa/anaconda/bin/python
cachedir: .cache
rootdir: /Users/rpa/RND/Public/xarray, inifile: setup.cfg
plugins: cov-2.5.1
collected 683 items 

xarray/tests/test_backends.py::GenericNetCDFDataTest::test_coordinates_encoding SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_cross_engine_read_write_netcdf3 PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_dataset_caching SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_dataset_compute SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_default_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_encoding_kwarg SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_encoding_same_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_encoding_unlimited_dims PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_engine PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_invalid_dataarray_names_raise SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_load SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_orthogonal_indexing PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_pickle SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_pickle_dataarray SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_None_variable SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_boolean_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_coordinates SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_datetime_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_endian SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_example_1_netcdf SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_float64_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_mask_and_scale SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_object_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_string_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_strings_with_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_test_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_timedelta_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_write_store PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_zero_dimensional_variable SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_coordinates_encoding SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_cross_engine_read_write_netcdf3 PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_dataset_caching SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_dataset_compute SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_default_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_encoding_kwarg SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_encoding_same_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_encoding_unlimited_dims PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_engine PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_invalid_dataarray_names_raise SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_load SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_orthogonal_indexing PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_pickle SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_pickle_dataarray SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_None_variable SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_boolean_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_coordinates SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_datetime_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_endian SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_example_1_netcdf SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_float64_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_mask_and_scale SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_object_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_string_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_strings_with_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_test_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_timedelta_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_write_store PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_zero_dimensional_variable SKIPPED
================================================ short test summary info ================================================
SKIP [2] xarray/tests/test_backends.py:382: requires pynio
SKIP [2] xarray/tests/test_backends.py:214: requires pynio
SKIP [2] xarray/tests/test_backends.py:178: requires pynio
SKIP [2] xarray/tests/test_backends.py:468: requires pynio
SKIP [2] xarray/tests/test_backends.py:439: requires pynio
SKIP [2] xarray/tests/test_backends.py:490: requires pynio
SKIP [2] xarray/tests/test_backends.py:428: requires pynio
SKIP [2] xarray/tests/test_backends.py:145: requires pynio
SKIP [2] xarray/tests/test_backends.py:197: requires pynio
SKIP [2] xarray/tests/test_backends.py:207: requires pynio
SKIP [2] xarray/tests/test_backends.py:230: requires pynio
SKIP [2] xarray/tests/test_backends.py:311: requires pynio
SKIP [2] xarray/tests/test_backends.py:300: requires pynio
SKIP [2] xarray/tests/test_backends.py:271: requires pynio
SKIP [2] xarray/tests/test_backends.py:409: requires pynio
SKIP [2] xarray/tests/test_backends.py:291: requires pynio
SKIP [2] xarray/tests/test_backends.py:286: requires pynio
SKIP [2] xarray/tests/test_backends.py:362: requires pynio
SKIP [2] xarray/tests/test_backends.py:235: requires pynio
SKIP [2] xarray/tests/test_backends.py:264: requires pynio
SKIP [2] xarray/tests/test_backends.py:334: requires pynio
SKIP [2] xarray/tests/test_backends.py:139: requires pynio
SKIP [2] xarray/tests/test_backends.py:280: requires pynio
SKIP [2] xarray/tests/test_backends.py:109: requires pynio

Those line numbers refer to all of the skipped methods. Why should I need pynio to run those tests?

It looks like the same thing is happening on travis: https://travis-ci.org/pydata/xarray/jobs/268805771#L1527

Maybe @pwolfram understands this stuff?

@shoyer (Member) commented Aug 29, 2017

@rabernat I think this is #1531 -- require_pynio seems to have infected all our other requirements!

@shoyer (Member) commented Aug 29, 2017

Is the goal here to be able to round-trip the file, such that calling .to_netcdf() produces an identical file to the original source file?

Yes, exactly.

I don't understand how encoding interacts with attributes. When is something an attribute vs. an encoding (add_offset, for example)?

Typically, we store things in encoding that are attributes on the underlying NetCDF file but that no longer make sense for describing the decoded data. For example:

  • On the file, add_offset is an attribute.
  • If loaded with open_dataset(..., mask_and_scale=True), add_offset can be found in encoding, not attrs, because the data has already been offset.
  • If loaded with open_dataset(..., mask_and_scale=False), add_offset will still be on attrs (the data has not been offset).
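
Illustratively, the difference looks something like this (the file and variable names are hypothetical):

import xarray as xr

decoded = xr.open_dataset('example.nc', mask_and_scale=True)
# scaling metadata has moved out of attrs and into encoding
'add_offset' in decoded['temp'].attrs      # False
'add_offset' in decoded['temp'].encoding   # True

raw = xr.open_dataset('example.nc', mask_and_scale=False)
# the data is untouched, so add_offset stays in attrs
'add_offset' in raw['temp'].attrs          # True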

How does xarray know whether the store automatically encodes / decodes the encodings vs. when it has to be done by xarray, e.g. by calling mask_and_scale?

Currently, we assume that stores never do this, and always handle it ourselves. We might need a special exception for zarr and scale/offset encoding.

Does this mean that my ZarrStore should inherit from WritableCFDataStore instead of AbstractWritableDataStore?

Maybe, though again it will probably need slightly customized conventions for writing data (if we let zarr handle scale/offset encoding).

I don't yet understand how to make these elements work together properly, for example, to avoid applying the scale / offset function twice, as I mentioned above.

We have two options:

  1. Handle it all in xarray via the machinery in conventions.py. Never pass the arguments to do scale/offset encoding to zarr (just save them as attributes).
  2. Handle it all in zarr. We'll need special case logic to skip this part of encoding.

I think (2) would be the preferred way to do this.

@alimanfoo (Contributor)

Following this with interest.

Regarding autoclose, just to confirm that zarr doesn't really have any notion of whether something is open or closed. When using the DirectoryStore storage class (most common use case I imagine), all files are automatically closed, nothing is kept open. There are some storage classes (e.g., ZipStore) that do require an explicit close call to finalise the file on disk if you have been writing data, but I think you can ignore this in xarray and leave it up to the user to manage this themselves.
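
For reference, a minimal sketch of the ZipStore case just described (the file name is arbitrary):

import zarr

store = zarr.ZipStore('example.zip', mode='w')
root = zarr.group(store=store)
root.zeros('foo', shape=(100,), chunks=(10,))
# ZipStore is the exception: it must be closed to finalise the zip file on disk
store.close()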

Out of interest, @shoyer do you still think there would be value in writing a wrapper for zarr analogous to h5netcdf? Or does this PR provide all the necessary functionality?

@martindurant (Contributor)

Worth pointing out here that the zarr filter set is extensible (I suppose hdf5's is too, but I don't think this is ever done in practice), but I don't think it makes any particular claims to performance.

I think both of the options above are reasonable, and there is no particular reason to exclude either: a zarr variable could look to xarray like floats but actually be stored as ints (i.e., arguments are passed to zarr), or it could look like ints which xarray expects to inflate to floats (i.e., stored as an attribute). I mean, if a user stores a float variable, but includes kwargs to zarr for scale/filter (or any other filter arguments), we should make no attempt to interrupt that.

The only question is: if the user wishes to apply scale/offset, which behaviour is most likely their intention? I would guess the latter (compute in xarray and use attributes), since xarray users probably don't know about zarr and its filters.

@martindurant (Contributor)

A further rather big advantage of zarr that I'm not aware of in cdf/hdf (I may be wrong) is not just null values, but not having a given block written to disc at all if it only contains null data. This probably meshes perfectly well with most users' understanding of missing data / fill values.

@alimanfoo (Contributor)

FWIW all filter (codec) classes have been migrated from zarr to a separate package called numcodecs and will be imported from there in the next (2.2) zarr release. Here is FixedScaleOffset. The implementation is basic numpy, so there is probably some room for optimization.
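
For example, a quick sketch of what that codec does (the values are made up):

import numpy as np
from numcodecs import FixedScaleOffset

# store values near 1000 as small integers: round((x - offset) * scale)
codec = FixedScaleOffset(offset=1000, scale=10, dtype='f8', astype='i2')
data = np.linspace(999.0, 1001.0, 5)
packed = codec.encode(data)      # int16 on disk
restored = codec.decode(packed)  # back to float64, to 1/scale precision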

@rabernat (Contributor, Author)

One path forward for now would be to ignore the filters like FixedScaleOffset that are not present in netCDF, let xarray handle the CF encoding / decoding, and just put the compressors (e.g. Blosc, Zlib) and their parameters in the xarray variable encoding.

If we think there is an advantage to using the zarr native filters, that could be added via a future PR once we have the basic backend working.
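
To make that concrete, a hypothetical sketch of what the user-facing side might look like (the to_zarr method and its encoding keys are not settled yet):

import numpy as np
import xarray as xr
import zarr

ds = xr.Dataset({'foo': (('y', 'x'), np.ones((100, 200)))})

# CF-style encoding (scale_factor, _FillValue, time units) handled by xarray;
# only the compressor and its parameters are passed through to zarr
encoding = {'foo': {'compressor': zarr.Blosc(cname='zstd', clevel=5)}}
ds.to_zarr('output.zarr', encoding=encoding)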

@alimanfoo: when do you anticipate the 2.2 zarr release to happen? Will the API change significantly? If so, I will wait for that to move forward here.

@shoyer (Member) commented Aug 29, 2017

If we think there is an advantage to using the zarr native filters, that could be added via a future PR once we have the basic backend working.

The only advantage here would be for non-xarray users, who could use zarr to do this decoding/encoding automatically.

For what it's worth, the implementation of scale offsets in xarray looks basically equivalent to what's done in zarr. I don't think there's a performance difference either way.

A further rather big advantage in zarr that I'm not aware of in cdf/hdf (I may be wrong) is not just null values, but not having a given block be written to disc at all if it only contains null data.

If you use chunks, I believe HDF5/NetCDF4 do the same thing, e.g.,

In [10]: with h5py.File('one-chunk.h5') as f: f.create_dataset('foo', (100, 100), chunks=(100, 100))

In [11]: with h5py.File('many-chunk.h5') as f: f.create_dataset('foo', (100000, 100000), chunks=(100, 100))

In [12]: ls -l | grep chunk.h5
-rw-r--r--   1 shoyer  eng   1400 Aug 29 10:48 many-chunk.h5
-rw-r--r--   1 shoyer  eng   1400 Aug 29 10:48 one-chunk.h5

(Note the same file-size)

@alimanfoo (Contributor) commented Aug 29, 2017 via email

@martindurant (Contributor)

@rabernat , is there anything I can do to help push this along?

@rabernat (Contributor, Author) commented Sep 7, 2017

I am stuck on figuring out how to develop a new test case for this. (It doesn't help that #1531 is messing up the backend tests.)

If @shoyer can give us a few hints about how to best implement a test class (i.e. what to subclass, etc.), I think that could jumpstart testing and move the PR forward.

I welcome contributions from others such as @martindurant on this. I won't have much time in the near future, since a new semester just dropped on me like a load of bricks.

@shoyer (Member) commented Sep 7, 2017

@rabernat indeed, the backend tests are not terribly well organized right now. Probably the place to start is to inherit from DatasetIOTestCases and TestCase and then implement create_store and roundtrip. DaskTest abuses the "backend" notation a little bit, but these lines cover the essentials:

class DaskTest(TestCase, DatasetIOTestCases):
    @contextlib.contextmanager
    def create_store(self):
        yield Dataset()

    @contextlib.contextmanager
    def roundtrip(self, data, save_kwargs={}, open_kwargs={},
                  allow_cleanup_failure=False):
        yield data.chunk()

@martindurant (Contributor)

@shoyer, is martindurant@6c1fb6b a reasonable start?

@rabernat (Contributor, Author)

@martindurant: I may have some time to get back to working on this next week. (Especially if @jhamman can help me sort out the backend testing.) What is the status of your branch?

@@ -184,7 +185,7 @@ def sync(self):
         import dask.array as da
         import dask
         if LooseVersion(dask.__version__) > LooseVersion('0.8.1'):
-            da.store(self.sources, self.targets, lock=GLOBAL_LOCK)
+            da.store(self.sources, self.targets, lock=self.lock)
Member:

Yes, this looks good.

@jhamman (Member) left a comment

I just gave this another spin. I think it's more than ready to be merged as an experimental feature. We should be able to quickly iterate on some of the smaller issues/features via follow-on PRs.

@jakirkham

Just to confirm, if writes are aligned with chunk boundaries in the destination array then no locking is required.

As a minor point to complement what Matthew and Alistair have already said, one can pretty easily rechunk beforehand so that the chunks have a nice 1-to-1, non-overlapping mapping on disk. I am not sure whether this strategy is good enough to make the default, but I have had no issues doing this myself, and I would expect it to be better than holding one lock over the whole Zarr array, though there may be some strange edge cases that I have not encountered.
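
A small sketch of that strategy, with made-up shapes and chunk sizes:

import dask.array as da
import zarr

# target zarr array with a fixed chunk layout
z = zarr.open('aligned.zarr', mode='w', shape=(1000, 1000),
              chunks=(100, 100), dtype='f8')

# rechunk so each dask chunk maps onto exactly one zarr chunk; writes then
# never touch the same chunk on disk and no lock is needed
arr = da.random.random((1000, 1000), chunks=(250, 300)).rechunk((100, 100))
da.store(arr, z, lock=False)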

doc/io.rst Outdated
~~~~~~~~~~~~~~~~~~~~~

It is possible to read and write xarray datasets directly from / to cloud
storage buckets using zarr. This example uses the `gcsfs`_ pacakge to provide
Member:

pacakge -> package

@fmaussion (Member)

Thanks for the tremendous work @rabernat , looking forward to testing this!

In the future it would be nice to briefly describe the advantages of zarr over netCDF for new users. A speed benchmark could help, too! This can be done once the backend has more maturity and when we refactor the I/O docs.

@rabernat (Contributor, Author)

Will merge later today if no further comments.

@rabernat rabernat merged commit 8fe7eb0 into pydata:master Dec 14, 2017
@shoyer (Member) commented Dec 14, 2017

woohoo, thank you Ryan!

@martindurant (Contributor)

Question: how would one build a zarr-xarray dataset?

With zarr you can open an array that contains no data, and use set-slice notation to fill in the values (which is what dask's store essentially does).

If I have some pre-known coordinates and bigger-than-memory data arrays, how would I go about getting the values into the zarr structure? If this can't be done directly with the xarray interface, is there a way to call zarr's open/create/zeros such that the corresponding array will appear as a variable when the same dataset is opened with xarray?

@mrocklin (Contributor) commented Feb 11, 2018 via email

@jhamman (Member) commented Feb 11, 2018

@martindurant - If I understand your question correctly, I think you should be able to follow a pretty standard xarray workflow:

ds = xr.Dataset()
ds['your_varname'] = xr.DataArray(some_dask_array,
                                  dims=['dimname0', 'dimname1', ...],
                                  coords=dict_of_preknown_coords)
# repeat for each variable you want in your dataset

ds.to_zarr(some_zarr_store)


# then to open
ds2 = xr.open_zarr(some_zarr_store)

Two things to note:

  1. if you are looking for decent performance when writing to a remote store, make sure you're working off xarray@master, as #1800 (WIP: Performance improvements for zarr backend) fixed a number of choke points in the to_zarr implementation
  2. if you are pushing to GCS, some_zarr_store can be a GCSMap.

@martindurant (Contributor)

@jhamman, that partially solves what I mean: I can probably turn my data into dask arrays with some difficulty, but really I was hoping for something like the following:

    ds = xr.Dataset(coords={'b': np.arange(-4, 6, 0.005),
                            'l': np.arange(150, 72, -0.005),
                            'v': np.arange(58722.24288, -164706.4225401, -8.2446e2)})
    # hypothetical method, not part of the current API
    arr = ds.create_new_zero_array(dims=['l', 'b', 'v'])
    arr[0:10, :, :] = 1

and expect to be able to set the values of the new variable in the same way that you can with the equivalent zarr array. I can probably get around this by setting the values with da.zeros, finding the zarr array in the dataset, and then setting its values.

@shoyer (Member) commented Feb 12, 2018

@martindurant that could probably be addressed most cleanly by improving __setitem__ support for dask.array.

@shoyer (Member) commented Feb 12, 2018

See dask/dask#2000 for the dask issue. Once this works in dask it should be quite easy to implement in xarray, too.

@martindurant (Contributor)

It might be enough, in this case, to provide some helper function in zarr to create and fetch arrays that will show up as variables in xarray - this need not be specific to being used via dask. I am assuming, with the work done in this PR, that there is an unambiguous way to determine whether a zarr group can be interpreted as an xarray dataset, and that zarr then knows how to add things that look like variables (which in the zarr case generally doesn't involve writing any actual data until parts of the array are filled in).

@jakirkham

So Zarr supports storing structured arrays. Maybe that’s what you are looking for, @martindurant? Would suggest using the latest 2.2.0 RC though as it fixed a few issues in this regard (particularly with NumPy 1.14).

@martindurant (Contributor) commented Feb 12, 2018

OK, so the way to do this in pure zarr appears to be to simply create the appropriate zarr array and set its dimensions attribute:

ds = xr.Dataset(coords={'b': np.arange(-4, 6, 0.005),
                        'l': np.arange(150, 72, -0.005),
                        'v': np.arange(58722.24288, -164706.4225401, -8.2446e2)})
ds.to_zarr(mapping)
g = zarr.open_group(mapping)
arr = g.zeros(..., shape like l, b, v)
arr.attrs['_ARRAY_DIMENSIONS'] = ['l', 'b', 'v']

xr.open_zarr(mapping) now shows the new array, without having to materialize any data into it, and arr can be written to piecemeal - without the convenience of the coordinate mapping, of course.

@rabernat (Contributor, Author)

I'm enjoying this discussion. Zarr offers lots of new possibilities for appending / updating datasets that we should try to support. I personally would really like to be able to append / extend existing arrays from within xarray.

@martindurant (Contributor)

Yeah, ideally when adding a variable like

ds['myvar'] = xr.DataArray(data=da.zeros(..., chunks=(..)), dims=['l', 'b', 'v'])
ds.to_zarr(mapping)

we should be able to apply an optimization strategy in which the zarr array is created without filling in all those unnecessary zeros. This seems doable.

On the other hand, implementing

ds.myvar[slice, slice, slice] = some data
ds.to_zarr(mapping)

(which cannot be done currently with dask-arrays at all), in such a way that only partitions with data get updated - this seems really hard.

Successfully merging this pull request may close these issues: zarr as persistent store for xarray.