WIP: Zarr backend #1528

Merged
merged 85 commits into from
Dec 14, 2017
Changes from 2 commits
5cdf6c8
added HiddenKeyDict class
rabernat Aug 27, 2017
f305c25
new zarr backend
rabernat Aug 27, 2017
2ea21c5
added HiddenKeyDict class
rabernat Aug 27, 2017
d92bf2f
new zarr backend
rabernat Aug 27, 2017
79da971
add zarr to ci reqs
Oct 5, 2017
31e4409
add zarr api to docs
Oct 5, 2017
2ec5ee5
some zarr tests passing
rabernat Oct 6, 2017
bd21720
Merge pull request #1 from jhamman/zarr_backend
rabernat Oct 6, 2017
7e898fc
merged stuff from joe
rabernat Oct 6, 2017
af5ff6c
Merge branch 'master' of github.com:pydata/xarray into zarr_backend
Oct 6, 2017
9e7cc09
Merge branch 'zarr_backend' of github.com:rabernat/xray into zarr_bac…
Oct 6, 2017
3f01365
requires zarr decorator
Oct 6, 2017
41cf706
Merge pull request #2 from jhamman/zarr_backend
rabernat Oct 6, 2017
fd9fd0f
wip
rabernat Oct 7, 2017
9f16e8f
added chunking test
rabernat Oct 8, 2017
fe9ebe7
remove debuggin statements
rabernat Oct 8, 2017
c01cd09
fixed HiddenKeyDict
rabernat Oct 8, 2017
b3e5d76
added HiddenKeyDict class
rabernat Aug 27, 2017
45375b2
new zarr backend
rabernat Aug 27, 2017
0e79718
add zarr to ci reqs
Oct 5, 2017
3d39ade
add zarr api to docs
Oct 5, 2017
3d09c67
some zarr tests passing
rabernat Oct 6, 2017
0b4a27a
requires zarr decorator
Oct 6, 2017
f39035c
wip
rabernat Oct 7, 2017
6446ea2
added chunking test
rabernat Oct 8, 2017
9136064
remove debuggin statements
rabernat Oct 8, 2017
2966100
fixed HiddenKeyDict
rabernat Oct 8, 2017
6bedf22
wip
rabernat Oct 14, 2017
ced8267
finished merge
rabernat Oct 16, 2017
e461cdb
finished merge
rabernat Oct 16, 2017
049bf9e
create opener object
rabernat Oct 16, 2017
c169128
trying to get caching working
rabernat Oct 16, 2017
82ef456
caching still not working
rabernat Oct 16, 2017
3ee243e
merge conflicts
rabernat Nov 13, 2017
e20c29f
updating zarr backend with new indexing mixins
rabernat Nov 13, 2017
f82c8c1
added new zarr dev test env
rabernat Nov 13, 2017
43e539f
update travis
rabernat Nov 13, 2017
66299f0
move zarr-dev to travis allowed failures
rabernat Nov 13, 2017
2fce362
fix typo in env file
rabernat Nov 13, 2017
c19b81a
wip
rabernat Nov 17, 2017
68b8f07
fixed zarr auto_chunk
rabernat Nov 17, 2017
0ea0dad
refactored zarr tests
rabernat Nov 17, 2017
58b3bf0
new encoding test
rabernat Nov 17, 2017
9da22da
Merge branch 'master' of github.com:pydata/xarray into zarr_backend_jjh
Nov 17, 2017
a8b4785
cleanup and buildout ZarrArrayWrapper, vectorized indexing
Nov 17, 2017
2a6a776
Merge pull request #4 from jhamman/zarr_backend_jjh
rabernat Nov 17, 2017
021d3ba
more wip
rabernat Nov 27, 2017
5ef10d2
removed chaching test
rabernat Nov 17, 2017
e47d936
Merge remote-tracking branch 'origin/zarr_backend' into zarr_backend
rabernat Nov 27, 2017
a4b024e
very close to passing all tests
rabernat Nov 27, 2017
d8842a6
Merge remote-tracking branch 'upstream/master' into zarr_backend
rabernat Nov 28, 2017
54d116d
modified inheritance
rabernat Nov 29, 2017
94678f4
subclass AbstractWriteableDataStore
rabernat Nov 29, 2017
64942e5
Merge remote-tracking branch 'origin/zarr_backend' into zarr_backend
rabernat Dec 1, 2017
f584456
xfailed certain tests
rabernat Dec 1, 2017
c43284e
pr comments wip
rabernat Dec 4, 2017
9df6e50
removed autoclose
rabernat Dec 4, 2017
012e858
new test for chunk encoding
rabernat Dec 4, 2017
b1819f4
added another test
rabernat Dec 5, 2017
8eb98c9
tests for HiddenKeyDict
rabernat Dec 6, 2017
64bd76c
flake8
rabernat Dec 6, 2017
cffa158
Merge remote-tracking branch 'upstream/master' into zarr_backend
rabernat Dec 6, 2017
3b4a941
zarr version update
rabernat Dec 6, 2017
688f415
added more tests
rabernat Dec 6, 2017
c115a2b
added compressor test
rabernat Dec 6, 2017
4c92531
docs
rabernat Dec 6, 2017
61027eb
weird ascii character issue
rabernat Dec 6, 2017
bbaa776
doc fixes
rabernat Dec 6, 2017
c8f23a5
what's new
rabernat Dec 6, 2017
f0c76f7
more file encoding nightmares
rabernat Dec 6, 2017
a84e388
Tests for backends.zarr._replace_slices_with_arrays
shoyer Dec 6, 2017
37bc2f0
respond to @shoyer's review
rabernat Dec 6, 2017
8cd1707
final fixes
rabernat Dec 7, 2017
ac27411
put back @shoyer's original max function
rabernat Dec 7, 2017
618bf81
another try with 2.7-safe max function
rabernat Dec 7, 2017
e942130
put back @shoyer's original max function
rabernat Dec 7, 2017
b1fa690
bypass lock on ArrayWriter
rabernat Dec 8, 2017
4089d13
Merge branch 'zarr_backend' of github.com:rabernat/xarray into zarr_b…
rabernat Dec 8, 2017
ba200c1
eliminate read mode
rabernat Dec 8, 2017
8dafaf7
added zarr distributed integration test
rabernat Dec 8, 2017
85174cd
fixed max bug
rabernat Dec 8, 2017
c76a01b
change lock to False
rabernat Dec 11, 2017
c011c2d
fix doc typos
rabernat Dec 11, 2017
054ffeb
Merge branch 'master' into zarr_backend
rabernat Dec 12, 2017
f5633ca
Merge branch 'master' into zarr_backend
rabernat Dec 12, 2017
172 changes: 172 additions & 0 deletions xarray/backends/zarr.py
@@ -0,0 +1,172 @@
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import functools
import warnings
from itertools import product
from collections import MutableMapping

from .. import Variable
from ..core import indexing
from ..core.utils import FrozenOrderedDict, close_on_error, HiddenKeyDict
from ..core.pycompat import iteritems, bytes_type, unicode_type, OrderedDict

from .common import (WritableCFDataStore, AbstractWritableDataStore,
                     DataStorePickleMixin)




# most of the other stores have some kind of wrapper class like
# class BaseNetCDF4Array(NdimSizeLenMixin, DunderArrayMixin):
# class H5NetCDFArrayWrapper(BaseNetCDF4Array):
# class NioArrayWrapper(NdimSizeLenMixin, DunderArrayMixin):
# we probably need something like this

# the first question is whether it should be based on BaseNetCDF4Array or
# NdimSizeLenMixin?

# or maybe we don't need wrappers at all? probably not true
Member
I actually think we probably don't need a wrapper at all -- zarr already defines all these attributes!

Contributor Author
This time around I did add the wrapper.



# also most have a custom opener

# keyword args for zarr.group
# store=None, overwrite=False, chunk_store=None, synchronizer=None, path=None
# the group name is called "path" in the zarr lexicon

def _open_zarr_group(store, overwrite, chunk_store, synchronizer, path):
    import zarr
    zarr_group = zarr.group(store=store, overwrite=overwrite,
        chunk_store=chunk_store, synchronizer=synchronizer, path=path)
    return zarr_group


def _dask_chunks_to_zarr_chunks(chunks):
    # zarr chunks need to be uniform for each array
    # http://zarr.readthedocs.io/en/latest/spec/v1.html#chunks
    # dask chunks can be variable sized
    # http://dask.pydata.org/en/latest/array-design.html#chunks
    # this function converts dask chunks syntax to zarr chunks
    if chunks is None:
        return chunks

    all_chunks = product(*chunks)
    first_chunk = all_chunks.next()
    for this_chunk in all_chunks:
        if not (this_chunk == first_chunk):
Member
Use !=

            raise ValueError("zarr requires uniform chunk sizes, found %s" %
Member
Calling rechunk() to make chunks uniform might be more user friendly here.

Note that zarr does allow chunks that overlap the edge of the array (i.e., the last chunk of a dask array). This use case might be important when storing arrays with unusual dimension sizes (e.g., prime numbers).

                             repr(chunks))
    return first_chunk
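
The review note on this function observes that zarr allows the final chunk along a dimension to overlap the array edge. A minimal sketch of a conversion that accepts that case follows. The function name `dask_chunks_to_zarr_chunks` and its error messages are illustrative assumptions, not code from the PR. It checks each dimension independently instead of comparing whole blocks.

```python
def dask_chunks_to_zarr_chunks(chunks):
    """Collapse dask-style chunks into one uniform zarr chunk shape.

    `chunks` is a tuple of tuples, one inner tuple of block lengths per
    dimension, e.g. ((3, 3, 1), (4, 4)). Hypothetical helper, not the
    PR's implementation.
    """
    if chunks is None:
        return None

    zarr_chunks = []
    for dim, sizes in enumerate(chunks):
        first = sizes[0]
        # Every block except the last must match the first block.
        if any(size != first for size in sizes[:-1]):
            raise ValueError(
                "non-uniform chunks along dimension %d: %r" % (dim, sizes))
        # The last block may be shorter (edge chunk) but never longer.
        if sizes[-1] > first:
            raise ValueError(
                "final chunk along dimension %d is larger than the "
                "others: %r" % (dim, sizes))
        zarr_chunks.append(first)
    return tuple(zarr_chunks)


edge_ok = dask_chunks_to_zarr_chunks(((3, 3, 1), (4, 4)))
```

Here `edge_ok` is the single chunk shape that would be handed to zarr. Non-uniform input still raises `ValueError`, so a caller could catch that and rechunk, as the reviewer suggests.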


def _get_zarr_dims_and_attrs(zarr_obj, dimension_key):
    # Zarr arrays do not have dimensions. To get around this problem, we add
    # an attribute that specifies the dimension. We have to hide this attribute
    # when we send the attributes to the user.
    # zarr_obj can be either a zarr group or zarr array
    dimensions = zarr_obj.attrs.get(dimension_key)
    attributes = HiddenKeyDict(zarr_obj.attrs, dimension_key)
    return dimensions, attributes
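
To illustrate the idea in the comments above, here is a small sketch that stores dimension names under a reserved attribute key and keeps that key out of the user-facing attributes. It uses a plain `dict` as a stand-in for zarr's attribute mapping, and `split_dims_and_attrs` is a hypothetical name. Unlike `HiddenKeyDict`, it returns a filtered copy and not a live view.

```python
# Key value taken from the diff; everything else here is illustrative.
DIMENSION_KEY = '_XARRAY_DIMENSIONS'


def split_dims_and_attrs(raw_attrs, dimension_key=DIMENSION_KEY):
    """Return (dimensions, visible_attrs) from a raw attribute mapping."""
    dimensions = raw_attrs.get(dimension_key)
    # Copy every attribute except the reserved dimension key.
    visible = {k: v for k, v in raw_attrs.items() if k != dimension_key}
    return dimensions, visible


# A plain dict stands in for zarr_array.attrs in this sketch.
raw = {DIMENSION_KEY: ['time', 'x'], 'units': 'K'}
dims, attrs = split_dims_and_attrs(raw)
```

The raw mapping still holds the reserved key, so the dimension names survive a round trip through the store while staying out of what the user sees.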


class ZarrStore(AbstractWritableDataStore, DataStorePickleMixin):
    """Store for reading and writing data via zarr
    """

    # need some special secret attributes to tell us the dimensions
    _dimension_key = '_XARRAY_DIMENSIONS'
Member
nit: should be _DIMENSION_KEY since it's a constant.

Also: maybe better to pick something more generic for the constant value, perhaps '_ARRAY_DIMENSIONS'?


    def __init__(self, store=None, overwrite=False, chunk_store=None,
                 synchronizer=None, path=None, writer=None, autoclose=False):
        opener = functools.partial(_open_zarr_group, store, overwrite,
Member
Let's try to follow something closer to the model for NetCDFDataStore that I suggest over in #1508:

  • open classmethod constructs the backend object for typical use cases (e.g., from a file)
  • __init__ just wraps an existing zarr group.

This preserves a little bit more flexibility for downstream users.

                                   chunk_store, synchronizer, path)
        self.ds = opener()
        if autoclose:
            raise NotImplementedError('autoclose=True is not implemented '
                                      'for the zarr backend')
        self._autoclose = False
        self._isopen = True
        self._opener = opener

        # initialize hidden dimension attribute
        self.ds.attrs[self._dimension_key] = {}

        # do we need to define attributes for all of the opener keyword args?
        super(ZarrStore, self).__init__(writer)
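
The review comment on `__init__` proposes splitting construction in two: an `open` classmethod for typical use and an `__init__` that only wraps an existing group. A rough sketch of that shape is below. `GroupStore`, the `opener` parameter, and `fake_open_group` are hypothetical names chosen so the example runs without zarr installed. In the PR the opener would presumably be a thin wrapper around `zarr.group`.

```python
class GroupStore(object):
    """Hypothetical store: __init__ only wraps an already-open group."""

    def __init__(self, group, writer=None):
        self.ds = group
        self.writer = writer

    @classmethod
    def open(cls, store, opener, writer=None, **open_kwargs):
        """Build a group with `opener`, then wrap it.

        `opener` is any callable that returns a group-like object.
        """
        group = opener(store, **open_kwargs)
        return cls(group, writer=writer)


def fake_open_group(store, path=None):
    # Stand-in for zarr.group so the sketch is self-contained.
    return {'store': store, 'path': path}


# Typical use: construct from a location via the classmethod.
from_path = GroupStore.open('example.zarr', fake_open_group, path='group1')

# Downstream use: wrap a group that was opened some other way.
wrapped = GroupStore({'store': 'already-open', 'path': None})
```

Keeping `__init__` free of opening logic means a caller who already holds a group, however it was created, can still build a store around it.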

    def open_store_variable(self, name, zarr_array):
        # I don't see why it is necessary to wrap self.ds[name]
        # zarr seems to implement the required ndarray interface
        # TODO: possibly wrap zarr array in dask with aligned chunks
        data = indexing.LazilyIndexedArray(zarr_array)
        dimensions, attributes = _get_zarr_dims_and_attrs(
            zarr_array, self._dimension_key)
        return Variable(dimensions, data, attributes)

    def get_variables(self):
        with self.ensure_open(autoclose=False):
            return FrozenOrderedDict((k, self.open_store_variable(k, v))
                                     for k, v in self.ds.arrays())

    def get_attrs(self):
        with self.ensure_open(autoclose=True):
            _, attributes = _get_zarr_dims_and_attrs(self.ds,
                                                     self._dimension_key)
            attrs = FrozenOrderedDict(attributes)
        return attrs

    def get_dimensions(self):
        with self.ensure_open(autoclose=True):
            dimensions, _ = _get_zarr_dims_and_attrs(self.ds,
                                                     self._dimension_key)
        return dimensions

    def set_dimension(self, name, length):
        with self.ensure_open(autoclose=False):
            self.ds.attrs[self._dimension_key][name] = length

    def set_attribute(self, key, value):
        with self.ensure_open(autoclose=False):
            _, attributes = _get_zarr_dims_and_attrs(self.ds,
                                                     self._dimension_key)
            attributes[key] = value

    def prepare_variable(self, name, variable, check_encoding=False,
                         unlimited_dims=None):

        attrs = variable.attrs.copy()
        dims = variable.dims
        dtype = variable.dtype
        shape = variable.shape
        chunks = _dask_chunks_to_zarr_chunks(variable.chunks)

        # TODO: figure out how zarr should deal with unlimited dimensions
        self.set_necessary_dimensions(variable, unlimited_dims=unlimited_dims)

        # let's try keeping this fill value stuff
        fill_value = attrs.pop('_FillValue', None)
        if fill_value in ['\x00']:
            fill_value = None
Contributor Author
As I explain in my comment, I am a bit confused about how to handle fill_value. zarr has a built-in fill_value mechanism, and I don't know the best way to plug it into xarray.


        # TODO: figure out what encoding is needed for zarr

        ### arguments for zarr.create
        # zarr.creation.create(shape, chunks=None, dtype=None, compressor='default',
        # fill_value=0, order='C', store=None, synchronizer=None, overwrite=False,
        # path=None, chunk_store=None, filters=None, cache_metadata=True, **kwargs)

        # TODO: figure out how to pass along all those other arguments

        zarr_array = self.ds.create(name, shape=shape, dtype=dtype,
                                    chunks=chunks, fill_value=fill_value)
        zarr_array.attrs[self._dimension_key] = dims
        _, attributes = _get_zarr_dims_and_attrs(zarr_array,
                                                 self._dimension_key)

        for k, v in iteritems(attrs):
            attributes[k] = v

        return zarr_array, variable.data

    # sync() and close() methods should not be needed with zarr
35 changes: 35 additions & 0 deletions xarray/core/utils.py
@@ -489,3 +489,38 @@ def ensure_us_time_resolution(val):
    elif np.issubdtype(val.dtype, np.timedelta64):
        val = val.astype('timedelta64[us]')
    return val


class HiddenKeyDict(MutableMapping):
Member
This needs test coverage.

    '''
    Acts like a normal dictionary, but hides certain keys.
    '''
    # ``__init__`` method required to create instance from class.
    def __init__(self, data, *hidden_keys):
Member
nit: I prefer avoiding *args -- it gives more freedom to adjust APIs later (e.g., by adding keyword arguments)

Contributor Author
Can you suggest the best way to tell whether an argument is a string or list of strings? This is something I always need to do but don't know the "correct" pythonic way to do it.

Member
if you know it is an iterable, isinstance(var, basestring) should do.

        self._data = data
        self._hidden_keys = hidden_keys
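
The exchange above asks how to accept either a single string or a list of strings once `*hidden_keys` is dropped. One way to do it, sketched under the assumption of Python 3 (where `str` takes the place of Python 2's `basestring`), is to normalize the argument up front. `normalize_hidden_keys` is a hypothetical helper, not part of the PR.

```python
def normalize_hidden_keys(hidden_keys):
    """Return a list of keys from either one string or an iterable of strings."""
    # A lone string is itself iterable, so it has to be checked first;
    # otherwise 'abc' would be split into ['a', 'b', 'c'].
    if isinstance(hidden_keys, str):
        return [hidden_keys]
    return list(hidden_keys)


single = normalize_hidden_keys('_XARRAY_DIMENSIONS')
several = normalize_hidden_keys(('_a', '_b'))
```

Code that must also run on Python 2 would test against `basestring`, as the reviewer notes.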

    def _raise_if_hidden(self, key):
        if key in self._hidden_keys:
            raise KeyError('Key is hidden.')
Member
Print the offending key in the error.


    # The next five methods are requirements of the ABC.
    def __setitem__(self, key, value):
        self._raise_if_hidden(key)
        self._data[key] = value

    def __getitem__(self, key):
        self._raise_if_hidden(key)
        return self._data[key]

    def __delitem__(self, key):
        self._raise_if_hidden(key)
        del self._data[key]

    def __iter__(self):
        for k in self._data:
            if k not in self._hidden_keys:
                yield k

    def __len__(self):
        return len(list(self.__iter__()))
Member
I would certainly try to use len(self._data) here rather than iteration so this is still constant time (in practice it probably doesn't matter, though).
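
Pulling the review notes on `HiddenKeyDict` together, the class might look roughly like the sketch below. It is an illustration and not the version that was merged. It assumes Python 3 (`collections.abc`), takes `hidden_keys` as an explicit iterable in place of `*args`, and includes the offending key in the error. Note that `len(self._data)` on its own would also count hidden keys, so this sketch subtracts the hidden keys that are actually present. The cost then scales with the number of hidden keys and not with the size of the mapping.

```python
from collections.abc import MutableMapping


class HiddenKeyDict(MutableMapping):
    """Acts like a normal dictionary, but hides certain keys (sketch)."""

    def __init__(self, data, hidden_keys):
        self._data = data
        # Expects an iterable of keys, not a single bare string.
        self._hidden_keys = frozenset(hidden_keys)

    def _raise_if_hidden(self, key):
        if key in self._hidden_keys:
            raise KeyError('Key {!r} is hidden.'.format(key))

    def __setitem__(self, key, value):
        self._raise_if_hidden(key)
        self._data[key] = value

    def __getitem__(self, key):
        self._raise_if_hidden(key)
        return self._data[key]

    def __delitem__(self, key):
        self._raise_if_hidden(key)
        del self._data[key]

    def __iter__(self):
        for k in self._data:
            if k not in self._hidden_keys:
                yield k

    def __len__(self):
        # Subtract only the hidden keys that exist in the wrapped mapping.
        num_hidden = sum(1 for k in self._hidden_keys if k in self._data)
        return len(self._data) - num_hidden


backing = {'a': 1, 'b': 2, '_hidden': 3}
view = HiddenKeyDict(backing, ['_hidden', '_absent'])
```

Writes through `view` land in `backing`, so the wrapper stays a live view of the underlying attributes. Checks like these could also seed the test coverage the reviewer asks for.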