
WIP: Zarr backend #1528

Merged: 85 commits, Dec 14, 2017
Conversation

@rabernat (Contributor) commented Aug 27, 2017

I think that a zarr backend could be the ideal storage format for xarray datasets, overcoming many of the frustrations associated with netcdf and enabling optimal performance on cloud platforms.

This is a very basic start to implementing a zarr backend (as proposed in #1223); however, I am taking a somewhat different approach. I store the whole dataset in a single zarr group. I encode the extra metadata needed by xarray (so far just dimension information) as attributes within the zarr group and child arrays. I hide these special attributes from the user by wrapping the attribute dictionaries in a "HiddenKeyDict", so that they can't be viewed or modified.

I have no tests yet (:flushed:), but the following code works.

from xarray.backends.zarr import ZarrStore
import xarray as xr
import numpy as np

ds = xr.Dataset(
    {'foo': (('y', 'x'), np.ones((100, 200)), {'myattr1': 1, 'myattr2': 2}),
     'bar': (('x',), np.zeros(200))},
    {'y': (('y',), np.arange(100)),
     'x': (('x',), np.arange(200))},
    {'some_attr': 'copana'}
).chunk({'y': 50, 'x': 40})

zs = ZarrStore(store='zarr_test')
ds.dump_to_store(zs)
ds2 = xr.Dataset.load_store(zs)
assert ds2.equals(ds)

There is a very long way to go here, but I thought I would just get a PR started. Some questions that would help me move forward.

  1. What is "encoding" at the variable level? (I have never understood this part of xarray.) How should encoding be handled with zarr?
  2. Should we encode / decode CF for zarr stores?
  3. Do we want to always automatically align dask chunks with the underlying zarr chunks?
  4. What sort of public API should the zarr backend have? Should you be able to load zarr stores via open_dataset? Or do we need a new method? I think .to_zarr() would be quite useful.
  5. zarr arrays are extensible along all axes. What does this imply for unlimited dimensions?
  6. Is any autoclose logic needed? As far as I can tell, zarr objects don't need to be closed.

@rabernat (Contributor, Author)

cc @martindurant, @mrocklin, @alimanfoo

@martindurant (Contributor)

Sorry that I let this slide - there was not a huge upswell of interest around what I had done, and I was not ready to dive into xarray internals.
Could you comment more on the difference between your approach and mine? Is the aim to reduce the number of metadata files hanging around? zarr has made an effort with the groups interface to parallel netCDF, which is, after all, what xarray essentially expects of all its data sources.

As in this comment, I have come to the realisation that although nice to/from-zarr methods can be written relatively easily, they will not get traction unless they are put behind a class that mimics the existing xarray infrastructure, i.e., the user would never know the difference, except that magically they have extra encoding/compression options, the file path can be an S3 URL (say), and dask parallel computation suddenly works on a cluster and/or out of core.
That would raise some eyebrows!

@rabernat (Contributor, Author)

Could you comment more on the difference between your approach and mine?

Your functions are a great proof of concept for the relative ease of interoperability between xarray and zarr. What I have done here is to implement an xarray "backend" (i.e. DataStore) that uses zarr as its storage medium. This puts zarr on the same level as netCDF and HDF5 as a "first class" storage format for xarray data, as suggested by @shoyer in the comment on that thread. My hope is that this will enable the magical performance benefits that you have anticipated.

Digging deeper into that thread, I see @shoyer makes the following proposition:

So we could either directly write a DataStore or write a separate "znetcdf" or "netzdf" module that implements an interface similar to h5netcdf (which itself is a thin wrapper on top of h5py).

With this PR, I have started to do the former (write a DataStore). However, I can already see the wisdom of what he says next:

All things being equal, I would prefer the later approach, because people seem to find these intermediate interfaces useful, and it would help clarify the specification of the file format vs. details of how xarray uses it.

I have already implemented my own custom DataStore for a different project, so I felt comfortable diving into this. But I might end up reinventing the wheel several times over if I continue down this road. In particular, I can see that my HiddenKeyDict is very similar to h5netcdf's treatment of attributes. (I had never looked at the h5netcdf code until just now!)

On the other hand, zarr is so simple to use that a separate wrapper package might be overkill.

So I am still not sure whether the approach I am taking here is worth pursuing further. I consider this a highly experimental PR, and I'm really looking for feedback.

@rabernat (Contributor, Author)

Is the aim to reduce the number of metadata files hanging around?

This is also part of my goal. I think all the metadata can be stored internally to zarr via attributes. There just have to be some "special" attributes that xarray hides from the user. This is the same as h5netcdf.

@alimanfoo suggested this should be possible in that earlier thread:

Specifically I'm wondering if this could all be stored as attributes on the
Zarr array, with some conventions for special xarray attribute names?
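
For illustration, a rough sketch of what the raw zarr metadata looks like under such a convention, using the example dataset from the top of this PR (the attribute key shown is the '_ARRAY_DIMENSIONS' spelling the discussion below converges on; the current commits spell it '_XARRAY_DIMENSIONS'):

import zarr

g = zarr.open_group('zarr_test')
# dimension names sit alongside the user's own attributes on each array ...
print(dict(g['foo'].attrs))
# e.g. {'_ARRAY_DIMENSIONS': ['y', 'x'], 'myattr1': 1, 'myattr2': 2}
# ... and the store wraps .attrs in a HiddenKeyDict so the special key is
# filtered out before the user ever sees the attributes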

@martindurant (Contributor)

@rabernat: on actually looking through your code :) Happy to see you doing exactly what I felt I was not knowledgeable enough to do and poking into xarray's guts. If I can help in any way, please let me know, although I don't have a lot of spare hours right now.

@shoyer (Member) left a comment

This is slightly easier than I thought, to be honest :). Given the simplicity of the spec on top of zarr (just adding dimensions), we probably don't need the separate wrapper -- we should just describe it well in the docs.

# the first question is whether it should be based on BaseNetCDF4Array or
# NdimSizeLenMixing?

# or maybe we don't need wrappers at all? probably not true
Member:

I actually think we probably don't need a wrapper at all -- zarr already defines all these attributes!

Contributor Author:

This time around I did add the wrapper.

first_chunk = all_chunks.next()
for this_chunk in all_chunks:
    if not (this_chunk == first_chunk):
        raise ValueError("zarr requires uniform chunk sizes, found %s" %
Member:

Calling rechunk() to make chunks uniform might be more user friendly here.

Note that zarr does allow chunks that overlap the edge of the array (i.e., the last chunk of a dask array). This use case might be important when storing arrays with unusual dimension sizes (e.g., prime numbers).
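
For example, a minimal sketch of the rechunking idea (shapes and chunk sizes here are arbitrary):

import dask.array as da

# a dask array with irregular chunks along the first axis
arr = da.ones((100, 200), chunks=((50, 30, 20), (100, 100)))

# coerce to uniform chunks before handing the data to zarr; the final chunk
# along an axis may be smaller, which matches zarr's own chunking rules
uniform = arr.rechunk({0: 50, 1: 100})
print(uniform.chunks)  # ((50, 50), (100, 100))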

        yield k

    def __len__(self):
        return len(list(self.__iter__()))
Member:

I would certainly try to use len(self._data) here rather than iteration so this is still constant time (in practice it probably doesn't matter, though).
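
A minimal sketch of the constant-time version, assuming the wrapper keeps the underlying mapping in self._data and the hidden key names in self._hidden_keys:

def __len__(self):
    # len() on the wrapped mapping is O(1); subtract any hidden keys actually present
    num_hidden = sum(1 for k in self._hidden_keys if k in self._data)
    return len(self._data) - num_hidden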

    Acts like a normal dictionary, but hides certain keys.
    '''
    # ``__init__`` method required to create instance from class.
    def __init__(self, data, *hidden_keys):
Member:

nit: I prefer avoiding *args -- it gives more freedom to adjust APIs later (e.g., by adding keyword arguments)

Contributor Author:

Can you suggest the best way to tell whether an argument is a string or list of strings? This is something I always need to do but don't know the "correct" pythonic way to do it.

Member:

if you know it is an iterable, isinstance(var, basestring) should do.
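
For example, a common idiom (the helper name is purely illustrative; on Python 2, basestring would replace str):

def ensure_list_of_strings(value):
    # a single string is itself iterable, so check for it explicitly first
    if isinstance(value, str):
        return [value]
    return list(value)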

"""

# need some special secret attributes to tell us the dimensions
_dimension_key = '_XARRAY_DIMENSIONS'
Member:

nit: should be _DIMENSION_KEY since it's a constant.

Also: maybe better to pick something more generic for the constant value, perhaps '_ARRAY_DIMENSIONS'?


    def __init__(self, store=None, overwrite=False, chunk_store=None,
                 synchronizer=None, path=None, writer=None, autoclose=False):
        opener = functools.partial(_open_zarr_group, store, overwrite,
Member:

Let's try to follow something closer to the model for NetCDFDataStore that I suggest over in #1508:

  • open classmethod constructs the backend object for typical use cases (e.g., from a file)
  • __init__ just wraps an existing zarr group.

This preserves a little bit more flexibility for downstream users.
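
A rough sketch of that split, with illustrative names rather than a final API:

import zarr
from xarray.backends.common import AbstractWritableDataStore


class ZarrStore(AbstractWritableDataStore):

    def __init__(self, zarr_group, writer=None):
        # just wrap an existing, already-open zarr group
        self.ds = zarr_group

    @classmethod
    def open_group(cls, store, mode='a', synchronizer=None, path=None):
        # typical entry point: open (or create) a group from a store path / mapping
        zarr_group = zarr.open_group(store, mode=mode,
                                     synchronizer=synchronizer, path=path)
        return cls(zarr_group)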

@shoyer (Member) commented Aug 29, 2017

What is "encoding" at the variable level? (I have never understood this part of xarray.) How should encoding be handled with zarr?

encoding keeps track of how variables are represented in a file (e.g., chunking schemes, _FillValue/add_offset/scale_factor, compression, time units), so we can reconstruct a netCDF file that looks almost exactly like the file we've read from disk. In the case of zarr, I guess we might include chunking, fill values, compressor options....

Should we encode / decode CF for zarr stores?

Yes, probably, if we want to handle netcdf conventions for times, fill values and scaling.

Do we want to always automatically align dask chunks with the underlying zarr chunks?

This would be nice! But it's also a bigger issue (will look for the number, I think it's already been opened).

What sort of public API should the zarr backend have? Should you be able to load zarr stores via open_dataset? Or do we need a new method? I think .to_zarr() would be quite useful.

Still need to think about this one.

zarr arrays are extensible along all axes. What does this imply for unlimited dimensions?

I guess we can ignore them (maybe add a warning?) -- they're not part of the zarr data model.

Is any autoclose logic needed? As far as I can tell, zarr objects don't need to be closed.

I don't think we need any autoclose logic at all -- zarr doesn't leave open files hanging around anyway.

@rabernat (Contributor, Author)

encoding keeps track of how variables are represented in a file (e.g., chunking schemes, _FillValue/add_offset/scale_factor, compression, time units), so we can reconstruct a netCDF file that looks almost exactly like the file we've read from disk.

Is the goal here to be able to round-trip the file, such that calling .to_netcdf() produces an identical file to the original source file? For zarr, I think this would mean having the ability to read from one zarr store into xarray, and then write back to a different store, and have these two stores be identical. That makes sense to me.

I don't understand how encoding interacts with attributes. When is something an attribute vs. an encoding (add_offset, for example)? How does xarray know whether the store automatically encodes / decodes the encodings vs. when it has to be done by xarray, e.g. by calling mask_and_scale?

Should we encode / decode CF for zarr stores?

Yes, probably, if we want to handle netcdf conventions for times, fill values and scaling.

Does this mean that my ZarrStore should inherit from WritableCFDataStore instead of AbstractWritableDataStore?

Regarding encoding, zarr has its own internal mechanism for encoding, called "filters", which closely resembles some of the CF encoding options. For example, the FixedScaleOffset filter does something similar to xarray's mask_and_scale function.

I don't yet understand how to make these elements work together properly, for example, to avoid applying the scale / offset function twice, as I mentioned above.

@rabernat (Contributor, Author)

I am now trying to understand the backend test suite structure.

Can someone explain to me why so many tests are skipped? For example, if I run

 py.test -v xarray/tests/test_backends.py  -rsx -k GenericNetCDFDataTest

I get

================================================== test session starts ==================================================
platform darwin -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /Users/rpa/anaconda/bin/python
cachedir: .cache
rootdir: /Users/rpa/RND/Public/xarray, inifile: setup.cfg
plugins: cov-2.5.1
collected 683 items 

xarray/tests/test_backends.py::GenericNetCDFDataTest::test_coordinates_encoding SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_cross_engine_read_write_netcdf3 PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_dataset_caching SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_dataset_compute SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_default_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_encoding_kwarg SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_encoding_same_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_encoding_unlimited_dims PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_engine PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_invalid_dataarray_names_raise SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_load SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_orthogonal_indexing PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_pickle SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_pickle_dataarray SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_None_variable SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_boolean_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_coordinates SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_datetime_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_endian SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_example_1_netcdf SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_float64_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_mask_and_scale SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_object_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_string_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_strings_with_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_test_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_timedelta_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_write_store PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_zero_dimensional_variable SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_coordinates_encoding SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_cross_engine_read_write_netcdf3 PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_dataset_caching SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_dataset_compute SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_default_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_encoding_kwarg SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_encoding_same_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_encoding_unlimited_dims PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_engine PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_invalid_dataarray_names_raise SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_load SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_orthogonal_indexing PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_pickle SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_pickle_dataarray SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_None_variable SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_boolean_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_coordinates SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_datetime_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_endian SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_example_1_netcdf SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_float64_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_mask_and_scale SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_object_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_string_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_strings_with_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_test_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_timedelta_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_write_store PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_zero_dimensional_variable SKIPPED
================================================ short test summary info ================================================
SKIP [2] xarray/tests/test_backends.py:382: requires pynio
SKIP [2] xarray/tests/test_backends.py:214: requires pynio
SKIP [2] xarray/tests/test_backends.py:178: requires pynio
SKIP [2] xarray/tests/test_backends.py:468: requires pynio
SKIP [2] xarray/tests/test_backends.py:439: requires pynio
SKIP [2] xarray/tests/test_backends.py:490: requires pynio
SKIP [2] xarray/tests/test_backends.py:428: requires pynio
SKIP [2] xarray/tests/test_backends.py:145: requires pynio
SKIP [2] xarray/tests/test_backends.py:197: requires pynio
SKIP [2] xarray/tests/test_backends.py:207: requires pynio
SKIP [2] xarray/tests/test_backends.py:230: requires pynio
SKIP [2] xarray/tests/test_backends.py:311: requires pynio
SKIP [2] xarray/tests/test_backends.py:300: requires pynio
SKIP [2] xarray/tests/test_backends.py:271: requires pynio
SKIP [2] xarray/tests/test_backends.py:409: requires pynio
SKIP [2] xarray/tests/test_backends.py:291: requires pynio
SKIP [2] xarray/tests/test_backends.py:286: requires pynio
SKIP [2] xarray/tests/test_backends.py:362: requires pynio
SKIP [2] xarray/tests/test_backends.py:235: requires pynio
SKIP [2] xarray/tests/test_backends.py:264: requires pynio
SKIP [2] xarray/tests/test_backends.py:334: requires pynio
SKIP [2] xarray/tests/test_backends.py:139: requires pynio
SKIP [2] xarray/tests/test_backends.py:280: requires pynio
SKIP [2] xarray/tests/test_backends.py:109: requires pynio

Those line numbers refer to all of the skipped methods. Why should I need pynio to run those tests?

It looks like the same thing is happening on travis: https://travis-ci.org/pydata/xarray/jobs/268805771#L1527

Maybe @pwolfram understands this stuff?

@shoyer (Member) commented Aug 29, 2017

@rabernat I think this is #1531 -- require_pynio seems to have infected all our other requirements!

@shoyer (Member) commented Aug 29, 2017

Is the goal here to be able to round-trip the file, such that calling .to_netcdf() produces an identical file to the original source file?

Yes, exactly.

I don't understand how encoding interacts with attributes. When is something an attribute vs. an encoding (add_offset, for example)?

Typically, we store things in encoding that are attributes on the underlying NetCDF file but that no longer make sense for describing the decoded data. For example:

  • On the file, add_offset is an attribute.
  • If loaded with open_dataset(..., mask_and_scale=True), add_offset can be found in encoding, not attrs, because the data has already been offset.
  • If loaded with open_dataset(..., mask_and_scale=False), add_offset will still be on attrs (the data has not been offset).
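
Illustratively, the difference looks something like this (the file and variable names are hypothetical):

import xarray as xr

decoded = xr.open_dataset('example.nc', mask_and_scale=True)
# scaling metadata has moved out of attrs and into encoding
'add_offset' in decoded['temp'].attrs      # False
'add_offset' in decoded['temp'].encoding   # True

raw = xr.open_dataset('example.nc', mask_and_scale=False)
# the data is untouched, so add_offset stays in attrs
'add_offset' in raw['temp'].attrs          # True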

How does xarray know whether the store automatically encodes / decodes the encodings vs. when it has to be done by xarray, e.g. by calling mask_and_scale?

Currently, we assume that stores never do this, and always handle it ourselves. We might need a special exception for zarr and scale/offset encoding.

Does this mean that my ZarrStore should inherit from WritableCFDataStore instead of AbstractWritableDataStore?

Maybe, though again it will probably need slightly customized conventions for writing data (if we let zarr handle scale/offset encoding).

I don't yet understand how to make these elements work together properly, for example, to avoid applying the scale / offset function twice, as I mentioned above.

We have two options:

  1. Handle it all in xarray via the machinery in conventions.py. Never pass the arguments to do scale/offset encoding to zarr (just save them as attributes).
  2. Handle it all in zarr. We'll need special case logic to skip this part of encoding.

I think (2) would be the preferred way to do this.

@alimanfoo (Contributor)

Following this with interest.

Regarding autoclose, just to confirm that zarr doesn't really have any notion of whether something is open or closed. When using the DirectoryStore storage class (most common use case I imagine), all files are automatically closed, nothing is kept open. There are some storage classes (e.g., ZipStore) that do require an explicit close call to finalise the file on disk if you have been writing data, but I think you can ignore this in xarray and leave it up to the user to manage this themselves.
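
For reference, a minimal sketch of the ZipStore case just described (the file name is arbitrary):

import zarr

store = zarr.ZipStore('example.zip', mode='w')
root = zarr.group(store=store)
root.zeros('foo', shape=(100,), chunks=(10,))
# ZipStore is the exception: it must be closed to finalise the zip file on disk
store.close()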

Out of interest, @shoyer do you still think there would be value in writing a wrapper for zarr analogous to h5netcdf? Or does this PR provide all the necessary functionality?

@martindurant (Contributor)

Worth pointing out here that the zarr filter set is extensible (I suppose hdf5's is too, but I don't think this is ever done in practice), but I don't think it makes any particular claims to performance.

I think both of the options above are reasonable, and there is no particular reason to exclude either: a zarr variable could look to xarray like floats but actually be stored as ints (i.e., arguments are passed to zarr), or it could look like ints which xarray expects to inflate to floats (i.e., stored as an attribute). I mean, if a user stores a float variable, but includes kwargs to zarr for scale/filter (or any other filter arguments), we should make no attempt to interrupt that.

The only question is: if the user wishes to apply scale/offset, which behaviour is most likely their intention? I would guess the latter (compute in xarray and use attributes), since xarray users probably don't know about zarr and its filters.

@martindurant (Contributor)

A further rather big advantage of zarr that I'm not aware of in cdf/hdf (I may be wrong) is not just null values, but not having a given block written to disc at all if it only contains null data. This probably meshes perfectly well with most users' understanding of missing data / fill values.

@alimanfoo (Contributor)

FWIW all filter (codec) classes have been migrated from zarr to a separate package called numcodecs and will be imported from there in the next (2.2) zarr release. Here is FixedScaleOffset. The implementation is basic numpy, so there is probably some room for optimization.
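
For example, a quick sketch of what that codec does (the values are made up):

import numpy as np
from numcodecs import FixedScaleOffset

# store values near 1000 as small integers: round((x - offset) * scale)
codec = FixedScaleOffset(offset=1000, scale=10, dtype='f8', astype='i2')
data = np.linspace(999.0, 1001.0, 5)
packed = codec.encode(data)      # int16 on disk
restored = codec.decode(packed)  # back to float64, to 1/scale precision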

@rabernat (Contributor, Author)

One path forward for now would be to ignore the filters like FixedScaleOffset that are not present in netCDF, let xarray handle the CF encoding / decoding, and just put the compressors (e.g. Blosc, Zlib) and their parameters in the xarray variable encoding.

If we think there is an advantage to using the zarr native filters, that could be added via a future PR once we have the basic backend working.
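
To make that concrete, a hypothetical sketch of what the user-facing side might look like (the to_zarr method and its encoding keys are not settled yet):

import numpy as np
import xarray as xr
import zarr

ds = xr.Dataset({'foo': (('y', 'x'), np.ones((100, 200)))})

# CF-style encoding (scale_factor, _FillValue, time units) handled by xarray;
# only the compressor and its parameters are passed through to zarr
encoding = {'foo': {'compressor': zarr.Blosc(cname='zstd', clevel=5)}}
ds.to_zarr('output.zarr', encoding=encoding)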

@alimanfoo: when do you anticipate the 2.2 zarr release to happen? Will the API change significantly? If so, I will wait for that to move forward here.

@shoyer (Member) commented Aug 29, 2017

If we think there is an advantage to using the zarr native filters, that could be added via a future PR once we have the basic backend working.

The only advantage here would be for non-xarray users, who could use zarr to do this decoding/encoding automatically.

For what it's worth, the implementation of scale offsets in xarray looks basically equivalent to what's done in zarr. I don't think there's a performance difference either way.

A further rather big advantage in zarr that I'm not aware of in cdf/hdf (I may be wrong) is not just null values, but not having a given block be written to disc at all if it only contains null data.

If you use chunks, I believe HDF5/NetCDF4 do the same thing, e.g.,

In [10]: with h5py.File('one-chunk.h5') as f: f.create_dataset('foo', (100, 100), chunks=(100, 100))

In [11]: with h5py.File('many-chunk.h5') as f: f.create_dataset('foo', (100000, 100000), chunks=(100, 100))

In [12]: ls -l | grep chunk.h5
-rw-r--r--   1 shoyer  eng   1400 Aug 29 10:48 many-chunk.h5
-rw-r--r--   1 shoyer  eng   1400 Aug 29 10:48 one-chunk.h5

(Note the same file-size)

@alimanfoo (Contributor) commented Aug 29, 2017 via email

@martindurant (Contributor)

@rabernat , is there anything I can do to help push this along?

@rabernat (Contributor, Author) commented Sep 7, 2017

I am stuck on figuring out how to develop a new test case for this. (It doesn't help that #1531 is messing up the backend tests.)

If @shoyer can give us a few hints about how to best implement a test class (i.e. what to subclass, etc.), I think that could jumpstart testing and move the PR forward.

I welcome contributions from others such as @martindurant on this. I won't have much time in the near future, since a new semester just dropped on me like a load of bricks.

@shoyer (Member) commented Sep 7, 2017

@rabernat indeed, the backend tests are not terribly well organized right now. Probably the place to start is to inherit from DatasetIOTestCases and TestCase and then implement create_store and roundtrip. DaskTest abuses the "backend" notation a little bit, but these lines cover the essentials:

class DaskTest(TestCase, DatasetIOTestCases):
    @contextlib.contextmanager
    def create_store(self):
        yield Dataset()

    @contextlib.contextmanager
    def roundtrip(self, data, save_kwargs={}, open_kwargs={},
                  allow_cleanup_failure=False):
        yield data.chunk()

@martindurant (Contributor)

@shoyer, is martindurant@6c1fb6b a reasonable start?

@rabernat (Contributor, Author)

@martindurant: I may have some time to get back to working on this next week. (Especially if @jhamman can help me sort out the backend testing.) What is the status of your branch?

@@ -184,7 +185,7 @@ def sync(self):
         import dask.array as da
         import dask
         if LooseVersion(dask.__version__) > LooseVersion('0.8.1'):
-            da.store(self.sources, self.targets, lock=GLOBAL_LOCK)
+            da.store(self.sources, self.targets, lock=self.lock)
Member:

Yes, this looks good.

@jhamman (Member) left a comment

I just gave this another spin. I think it's more than ready to be merged as an experimental feature. We should be able to quickly iterate on some of the smaller issues/features via follow-on PRs.

@jakirkham

Just to confirm, if writes are aligned with chunk boundaries in the destination array then no locking is required.

As a minor point to complement what Matthew and Alistair have already said, one can pretty easily rechunk beforehand so that the chunks have a nice 1-to-1, non-overlapping mapping on disk. I am not sure whether this strategy is good enough to make the default, but I have had no issues doing this myself, and I would expect it to be better than holding one lock over the whole Zarr array, though there may be some strange edge cases that I have not encountered.
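
A small sketch of that strategy, with made-up shapes and chunk sizes:

import dask.array as da
import zarr

# target zarr array with a fixed chunk layout
z = zarr.open('aligned.zarr', mode='w', shape=(1000, 1000),
              chunks=(100, 100), dtype='f8')

# rechunk so each dask chunk maps onto exactly one zarr chunk; writes then
# never touch the same chunk on disk and no lock is needed
arr = da.random.random((1000, 1000), chunks=(250, 300)).rechunk((100, 100))
da.store(arr, z, lock=False)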

doc/io.rst Outdated
~~~~~~~~~~~~~~~~~~~~~

It is possible to read and write xarray datasets directly from / to cloud
storage buckets using zarr. This example uses the `gcsfs`_ pacakge to provide
Member:

pacakge -> package

@fmaussion (Member)

Thanks for the tremendous work @rabernat , looking forward to testing this!

In the future it would be nice to briefly describe the advantages of zarr over netCDF for new users. A speed benchmark could help, too! This can be done once the backend has more maturity and when we refactor the I/O docs.

@rabernat (Contributor, Author)

Will merge later today if no further comments.

@rabernat rabernat merged commit 8fe7eb0 into pydata:master Dec 14, 2017
@shoyer (Member) commented Dec 14, 2017

woohoo, thank you Ryan!

@martindurant (Contributor)

Question: how would one build a zarr-xarray dataset?

With zarr you can open an array that contains no data, and use set-slice notation to fill in the values (which is what dask's store essentially does).

If I have some pre-known coordinates and bigger-than-memory data arrays, how would I go about getting the values into the zarr structure? If this can't be done directly with the xarray interface, is there a way to call zarr's open/create/zeros such that the corresponding array will appear as a variable when the same dataset is opened with xarray?

@mrocklin (Contributor) commented Feb 11, 2018 via email

@jhamman (Member) commented Feb 11, 2018

@martindurant - If I understand your question correctly, I think you should be able to follow a pretty standard xarray workflow:

ds = xr.Dataset()
ds['your_varname'] = xr.DataArray(some_dask_array,
                                  dims=['dimname0', 'dimname1', ...],
                                  coords=dict_of_preknown_coords)
# repeat for each variable you want in your dataset

ds.to_zarr(some_zarr_store)


# then to open
ds2 = xr.open_zarr(some_zarr_store)

Two things to note:

  1. if you are looking for decent performance when writing to a remote store, make sure you're working off xarray@master, as #1800 (WIP: Performance improvements for zarr backend) fixed a number of choke points in the to_zarr implementation
  2. if you are pushing to GCS, some_zarr_store can be a GCSMap.

@martindurant (Contributor)

@jhamman, that partially solves what I mean: I can probably turn my data into dask arrays with some difficulty, but really I was hoping for something like the following:

    ds = xr.Dataset(coords={'b': np.arange(-4, 6, 0.005),
                            'l': np.arange(150, 72, -0.005),
                            'v': np.arange(58722.24288, -164706.4225401, -8.2446e2)})
    # hypothetical method, not part of the current API
    arr = ds.create_new_zero_array(dims=['l', 'b', 'v'])
    arr[0:10, :, :] = 1

and expect to be able to set the values of the new variable in the same way that you can with the equivalent zarr array. I can probably get around this by setting the values with da.zeros, finding the zarr array in the dataset, and then setting its values.

@shoyer (Member) commented Feb 12, 2018

@martindurant that could probably be addressed most cleanly by improving __setitem__ support for dask.array.

@shoyer (Member) commented Feb 12, 2018

See dask/dask#2000 for the dask issue. Once this works in dask it should be quite easy to implement in xarray, too.

@martindurant (Contributor)

It might be enough, in this case, to provide some helper function in zarr to create and fetch arrays that will show up as variables in xarray - this need not be specific to being used via dask. I am assuming, with the work done in this PR, that there is an unambiguous way to determine whether a zarr group can be interpreted as an xarray dataset, and that zarr then knows how to add things that look like variables (which in the zarr case generally doesn't involve writing any actual data until parts of the array are filled in).

@jakirkham

So Zarr supports storing structured arrays. Maybe that’s what you are looking for, @martindurant? Would suggest using the latest 2.2.0 RC though as it fixed a few issues in this regard (particularly with NumPy 1.14).

@martindurant (Contributor) commented Feb 12, 2018

OK, so the way to do this in pure zarr appears to be to simply create the appropriate zarr array and set its dimensions attribute:

ds = xr.Dataset(coords={'b': np.arange(-4, 6, 0.005),
                        'l': np.arange(150, 72, -0.005),
                        'v': np.arange(58722.24288, -164706.4225401, -8.2446e2)})
ds.to_zarr(mapping)
g = zarr.open_group(mapping)
arr = g.zeros(..., shape like l, b, v)
arr.attrs['_ARRAY_DIMENSIONS'] = ['l', 'b', 'v']

xr.open_zarr(mapping) now shows the new array, without having to materialize any data into it, and arr can be written to piecemeal - without the convenience of the coordinate mapping, of course.

@rabernat (Contributor, Author)

I'm enjoying this discussion. Zarr offers lots of new possibilities for appending / updating datasets that we should try to support. I personally would really like to be able to append / extend existing arrays from within xarray.

@martindurant (Contributor)

Yeah, ideally when adding a variable like

ds['myvar'] = xr.DataArray(data=da.zeros(..., chunks=(..)), dims=['l', 'b', 'v'])
ds.to_zarr(mapping)

we should be able to apply an optimization strategy in which the zarr array is created without filling in all those unnecessary zeros. This seems doable.

On the other hand, implementing

ds.myvar[slice, slice, slice] = some data
ds.to_zarr(mapping)

(which cannot be done currently with dask-arrays at all), in such a way that only partitions with data get updated - this seems really hard.

Successfully merging this pull request may close these issues: zarr as persistent store for xarray.