
WIP: Zarr backend #1528

Merged: 85 commits, Dec 14, 2017 (changes shown from 81 commits)

Commits
5cdf6c8
added HiddenKeyDict class
rabernat Aug 27, 2017
f305c25
new zarr backend
rabernat Aug 27, 2017
2ea21c5
added HiddenKeyDict class
rabernat Aug 27, 2017
d92bf2f
new zarr backend
rabernat Aug 27, 2017
79da971
add zarr to ci reqs
Oct 5, 2017
31e4409
add zarr api to docs
Oct 5, 2017
2ec5ee5
some zarr tests passing
rabernat Oct 6, 2017
bd21720
Merge pull request #1 from jhamman/zarr_backend
rabernat Oct 6, 2017
7e898fc
merged stuff from joe
rabernat Oct 6, 2017
af5ff6c
Merge branch 'master' of github.com:pydata/xarray into zarr_backend
Oct 6, 2017
9e7cc09
Merge branch 'zarr_backend' of github.com:rabernat/xray into zarr_bac…
Oct 6, 2017
3f01365
requires zarr decorator
Oct 6, 2017
41cf706
Merge pull request #2 from jhamman/zarr_backend
rabernat Oct 6, 2017
fd9fd0f
wip
rabernat Oct 7, 2017
9f16e8f
added chunking test
rabernat Oct 8, 2017
fe9ebe7
remove debuggin statements
rabernat Oct 8, 2017
c01cd09
fixed HiddenKeyDict
rabernat Oct 8, 2017
b3e5d76
added HiddenKeyDict class
rabernat Aug 27, 2017
45375b2
new zarr backend
rabernat Aug 27, 2017
0e79718
add zarr to ci reqs
Oct 5, 2017
3d39ade
add zarr api to docs
Oct 5, 2017
3d09c67
some zarr tests passing
rabernat Oct 6, 2017
0b4a27a
requires zarr decorator
Oct 6, 2017
f39035c
wip
rabernat Oct 7, 2017
6446ea2
added chunking test
rabernat Oct 8, 2017
9136064
remove debuggin statements
rabernat Oct 8, 2017
2966100
fixed HiddenKeyDict
rabernat Oct 8, 2017
6bedf22
wip
rabernat Oct 14, 2017
ced8267
finished merge
rabernat Oct 16, 2017
e461cdb
finished merge
rabernat Oct 16, 2017
049bf9e
create opener object
rabernat Oct 16, 2017
c169128
trying to get caching working
rabernat Oct 16, 2017
82ef456
caching still not working
rabernat Oct 16, 2017
3ee243e
merge conflicts
rabernat Nov 13, 2017
e20c29f
updating zarr backend with new indexing mixins
rabernat Nov 13, 2017
f82c8c1
added new zarr dev test env
rabernat Nov 13, 2017
43e539f
update travis
rabernat Nov 13, 2017
66299f0
move zarr-dev to travis allowed failures
rabernat Nov 13, 2017
2fce362
fix typo in env file
rabernat Nov 13, 2017
c19b81a
wip
rabernat Nov 17, 2017
68b8f07
fixed zarr auto_chunk
rabernat Nov 17, 2017
0ea0dad
refactored zarr tests
rabernat Nov 17, 2017
58b3bf0
new encoding test
rabernat Nov 17, 2017
9da22da
Merge branch 'master' of github.com:pydata/xarray into zarr_backend_jjh
Nov 17, 2017
a8b4785
cleanup and buildout ZarrArrayWrapper, vectorized indexing
Nov 17, 2017
2a6a776
Merge pull request #4 from jhamman/zarr_backend_jjh
rabernat Nov 17, 2017
021d3ba
more wip
rabernat Nov 27, 2017
5ef10d2
removed chaching test
rabernat Nov 17, 2017
e47d936
Merge remote-tracking branch 'origin/zarr_backend' into zarr_backend
rabernat Nov 27, 2017
a4b024e
very close to passing all tests
rabernat Nov 27, 2017
d8842a6
Merge remote-tracking branch 'upstream/master' into zarr_backend
rabernat Nov 28, 2017
54d116d
modified inheritance
rabernat Nov 29, 2017
94678f4
subclass AbstractWriteableDataStore
rabernat Nov 29, 2017
64942e5
Merge remote-tracking branch 'origin/zarr_backend' into zarr_backend
rabernat Dec 1, 2017
f584456
xfailed certain tests
rabernat Dec 1, 2017
c43284e
pr comments wip
rabernat Dec 4, 2017
9df6e50
removed autoclose
rabernat Dec 4, 2017
012e858
new test for chunk encoding
rabernat Dec 4, 2017
b1819f4
added another test
rabernat Dec 5, 2017
8eb98c9
tests for HiddenKeyDict
rabernat Dec 6, 2017
64bd76c
flake8
rabernat Dec 6, 2017
cffa158
Merge remote-tracking branch 'upstream/master' into zarr_backend
rabernat Dec 6, 2017
3b4a941
zarr version update
rabernat Dec 6, 2017
688f415
added more tests
rabernat Dec 6, 2017
c115a2b
added compressor test
rabernat Dec 6, 2017
4c92531
docs
rabernat Dec 6, 2017
61027eb
weird ascii character issue
rabernat Dec 6, 2017
bbaa776
doc fixes
rabernat Dec 6, 2017
c8f23a5
what's new
rabernat Dec 6, 2017
f0c76f7
more file encoding nightmares
rabernat Dec 6, 2017
a84e388
Tests for backends.zarr._replace_slices_with_arrays
shoyer Dec 6, 2017
37bc2f0
respond to @shoyer's review
rabernat Dec 6, 2017
8cd1707
final fixes
rabernat Dec 7, 2017
ac27411
put back @shoyer's original max function
rabernat Dec 7, 2017
618bf81
another try with 2.7-safe max function
rabernat Dec 7, 2017
e942130
put back @shoyer's original max function
rabernat Dec 7, 2017
b1fa690
bypass lock on ArrayWriter
rabernat Dec 8, 2017
4089d13
Merge branch 'zarr_backend' of github.com:rabernat/xarray into zarr_b…
rabernat Dec 8, 2017
ba200c1
eliminate read mode
rabernat Dec 8, 2017
8dafaf7
added zarr distributed integration test
rabernat Dec 8, 2017
85174cd
fixed max bug
rabernat Dec 8, 2017
c76a01b
change lock to False
rabernat Dec 11, 2017
c011c2d
fix doc typos
rabernat Dec 11, 2017
054ffeb
Merge branch 'master' into zarr_backend
rabernat Dec 12, 2017
f5633ca
Merge branch 'master' into zarr_backend
rabernat Dec 12, 2017
4 changes: 4 additions & 0 deletions .travis.yml
@@ -43,6 +43,8 @@ matrix:
env: CONDA_ENV=py36-pynio-dev
- python: 3.6
env: CONDA_ENV=py36-rasterio1.0alpha
- python: 3.6
env: CONDA_ENV=py36-zarr-dev
allow_failures:
- python: 3.6
env:
@@ -67,6 +69,8 @@ matrix:
env: CONDA_ENV=py36-pynio-dev
- python: 3.6
env: CONDA_ENV=py36-rasterio1.0alpha
- python: 3.6
env: CONDA_ENV=py36-zarr-dev

before_install:
- if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then
1 change: 1 addition & 0 deletions ci/requirements-py27-cdat+pynio.yml
@@ -22,6 +22,7 @@ dependencies:
- seaborn
- toolz
- rasterio
- zarr
- pip:
- coveralls
- pytest-cov
1 change: 1 addition & 0 deletions ci/requirements-py27-windows.yml
@@ -19,3 +19,4 @@ dependencies:
- seaborn
- toolz
- rasterio
- zarr
1 change: 1 addition & 0 deletions ci/requirements-py35.yml
@@ -17,6 +17,7 @@ dependencies:
- seaborn
- toolz
- rasterio
- zarr
- pip:
- coveralls
- pytest-cov
1 change: 1 addition & 0 deletions ci/requirements-py36-windows.yml
@@ -16,3 +16,4 @@ dependencies:
- seaborn
- toolz
- rasterio
- zarr
20 changes: 20 additions & 0 deletions ci/requirements-py36-zarr-dev.yml
@@ -0,0 +1,20 @@
name: test_env
channels:
- conda-forge
dependencies:
- python=3.6
- dask
- distributed
- matplotlib
- pytest
- flake8
- numpy
- pandas
- scipy
- seaborn
- toolz
- bottleneck
- pip:
- coveralls
- pytest-cov
- git+https://github.com/alimanfoo/zarr.git
1 change: 1 addition & 0 deletions ci/requirements-py36.yml
@@ -18,6 +18,7 @@ dependencies:
- toolz
- rasterio
- bottleneck
- zarr
- pip:
- coveralls
- pytest-cov
2 changes: 2 additions & 0 deletions doc/api.rst
@@ -415,7 +415,9 @@ Dataset methods
open_dataset
open_mfdataset
open_rasterio
open_zarr
Dataset.to_netcdf
Dataset.to_zarr
save_mfdataset
Dataset.to_array
Dataset.to_dataframe
2 changes: 1 addition & 1 deletion doc/conf.py
@@ -23,7 +23,7 @@
print("python exec:", sys.executable)
print("sys.path:", sys.path)
for name in ('numpy scipy pandas matplotlib dask IPython seaborn '
             'cartopy netCDF4 rasterio').split():
             'cartopy netCDF4 rasterio zarr').split():
    try:
        module = importlib.import_module(name)
        if name == 'matplotlib':
1 change: 1 addition & 0 deletions doc/environment.yml
@@ -16,3 +16,4 @@ dependencies:
- cartopy=0.15.1
- rasterio=0.36.0
- sphinx-gallery
- zarr
1 change: 1 addition & 0 deletions doc/installing.rst
@@ -24,6 +24,7 @@ For netCDF and IO
  reading and writing netCDF4 files that does not use the netCDF-C libraries
- `pynio <https://www.pyngl.ucar.edu/Nio.shtml>`__: for reading GRIB and other
  geoscience specific file formats
- `zarr <http://zarr.readthedocs.io/>`__: for chunked, compressed, N-dimensional arrays.

For accelerating xarray
~~~~~~~~~~~~~~~~~~~~~~~
100 changes: 99 additions & 1 deletion doc/io.rst
@@ -295,7 +295,7 @@ string encoding for character arrays in netCDF files was
Technically, you can use
`any string encoding recognized by Python <https://docs.python.org/3/library/codecs.html#standard-encodings>`_ if you feel the need to deviate from UTF-8,
by setting the ``_Encoding`` field in ``encoding``. But
`we don't recommend it<http://utf8everywhere.org/>`_.
`we don't recommend it <http://utf8everywhere.org/>`_.

.. warning::

@@ -502,6 +502,103 @@ longitudes and latitudes.
.. _test files: https://github.com/mapbox/rasterio/blob/master/tests/data/RGB.byte.tif
.. _pyproj: https://github.com/jswhit/pyproj

.. _io.zarr:

Zarr
----

`Zarr`_ is a Python package providing an implementation of chunked, compressed,
N-dimensional arrays.
Zarr has the ability to store arrays in a range of ways, including in memory,
in files, and in cloud-based object storage such as `Amazon S3`_ and
`Google Cloud Storage`_.
Xarray's Zarr backend allows xarray to leverage these capabilities.

.. warning::

    Zarr support is still an experimental feature. Please report any bugs or
    unexpected behavior via GitHub issues.

Xarray can't open just any zarr dataset, because xarray requires special
metadata (attributes) describing the dataset dimensions and coordinates.
At this time, xarray can only open zarr datasets that have been written by
xarray. To write a dataset to zarr using files, we use the
:py:meth:`Dataset.to_zarr <xarray.Dataset.to_zarr>` method.
To write to a local directory, we pass a path to a directory:

.. ipython:: python
    :suppress:

    ! rm -rf path/to/directory.zarr

.. ipython:: python

    ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
                    coords={'x': [10, 20, 30, 40],
                            'y': pd.date_range('2000-01-01', periods=5),
                            'z': ('x', list('abcd'))})
    ds.to_zarr('path/to/directory.zarr')

(The suffix ``.zarr`` is optional--just a reminder that a zarr store lives
there.) If the directory does not exist, it will be created. If a zarr
store is already present at that path, an error will be raised, preventing it
from being overwritten. To override this behavior and overwrite an existing
store, add ``mode='w'`` when invoking ``to_zarr``.
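
For example, to deliberately replace a store created by an earlier run (a
minimal sketch, assuming the store from the example above already exists)::

    # overwrite the existing store at this path instead of raising an error
    ds.to_zarr('path/to/directory.zarr', mode='w')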

To read back a zarr dataset that has been created this way, we use the
:py:func:`~xarray.open_zarr` function:

.. ipython:: python

    ds_zarr = xr.open_zarr('path/to/directory.zarr')
    ds_zarr

Cloud Storage Buckets
~~~~~~~~~~~~~~~~~~~~~

It is possible to read and write xarray datasets directly from / to cloud
storage buckets using zarr. This example uses the `gcsfs`_ package to provide
a ``MutableMapping`` interface to `Google Cloud Storage`_, which we can then
pass to xarray::

    import gcsfs
    fs = gcsfs.GCSFileSystem(project='<project-name>', token=None)
    gcsmap = gcsfs.mapping.GCSMap('<bucket-name>', gcs=fs, check=True, create=False)
    # write to the bucket
    ds.to_zarr(store=gcsmap)
    # read it back
    ds_gcs = xr.open_zarr(gcsmap, mode='r')

.. _Zarr: http://zarr.readthedocs.io/
.. _Amazon S3: https://aws.amazon.com/s3/
.. _Google Cloud Storage: https://cloud.google.com/storage/
.. _gcsfs: https://github.com/dask/gcsfs

Zarr Compressors and Filters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are many different compression and filtering options available in
zarr. These are described in the
`zarr documentation <http://zarr.readthedocs.io/en/stable/tutorial.html#compressors>`_.
These options can be passed to the ``to_zarr`` method as variable encoding.
For example:

.. ipython:: python
    :suppress:

    ! rm -rf foo.zarr

.. ipython:: python

    import zarr
    compressor = zarr.Blosc(cname='zstd', clevel=3, shuffle=2)
    ds.to_zarr('foo.zarr', encoding={'foo': {'compressor': compressor}})

.. note::

    Not all native zarr compression and filtering options have been tested with
    xarray.
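
As a quick sanity check (a hedged sketch, not part of the docs above): the
compressor chosen at write time should be visible in the variable's encoding
when the store is read back::

    ds_foo = xr.open_zarr('foo.zarr')
    # expect 'compressor' here to reflect the Blosc settings chosen above
    print(ds_foo['foo'].encoding)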

.. _io.pynio:

Formats supported by PyNIO
@@ -529,6 +626,7 @@ exporting your objects to pandas and using its broad range of `IO tools`_.
.. _IO tools: http://pandas.pydata.org/pandas-docs/stable/io.html



Combining multiple files
------------------------

4 changes: 4 additions & 0 deletions doc/whats-new.rst
@@ -24,6 +24,10 @@ Enhancements
- :py:func:`~plot.contourf()` learned to contour 2D variables that have both a 1D co-ordinate (e.g. time) and a 2D co-ordinate (e.g. depth as a function of time).
  By `Deepak Cherian <https://github.com/dcherian>`_.

- Support for using `Zarr`_ as a storage layer for xarray.
  By `Ryan Abernathey <https://github.com/rabernat>`_.

.. _Zarr: http://zarr.readthedocs.io/

Bug fixes
~~~~~~~~~
1 change: 1 addition & 0 deletions xarray/__init__.py
@@ -18,6 +18,7 @@
from .backends.api import (open_dataset, open_dataarray, open_mfdataset,
                           save_mfdataset)
from .backends.rasterio_ import open_rasterio
from .backends.zarr import open_zarr

from .conventions import decode_cf, SerializationWarning

1 change: 1 addition & 0 deletions xarray/backends/__init__.py
@@ -10,3 +10,4 @@
from .pynio_ import NioDataStore
from .scipy_ import ScipyDataStore
from .h5netcdf_ import H5NetCDFStore
from .zarr import ZarrStore
26 changes: 26 additions & 0 deletions xarray/backends/api.py
@@ -713,3 +713,29 @@ def save_mfdataset(datasets, paths, mode='w', format=None, groups=None,
    finally:
        for store in stores:
            store.close()


def to_zarr(dataset, store=None, mode='w-', synchronizer=None, group=None,
            encoding=None):
    """This function creates an appropriate datastore for writing a dataset to
    a zarr store.

    See `Dataset.to_zarr` for full API docs.
    """
    if isinstance(store, path_type):
        store = str(store)
    if encoding is None:
        encoding = {}

    # validate Dataset keys, DataArray names, and attr keys/values
    _validate_dataset_names(dataset)
    _validate_attrs(dataset)

    store = backends.ZarrStore.open_group(store=store, mode=mode,
                                          synchronizer=synchronizer,
                                          group=group, writer=None)

    # I think zarr stores should always be sync'd immediately
    # TODO: figure out how to properly handle unlimited_dims
    dataset.dump_to_store(store, sync=True, encoding=encoding)
    return store
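
A short usage sketch (the delegation is an assumption about the call path, and
example.zarr is a hypothetical store): the public Dataset.to_zarr method added
in this PR would reach this helper via something like:

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({'foo': (('x',), np.arange(4))})
    # mode='w-' (the default above) raises if the store already exists
    store = ds.to_zarr('example.zarr', mode='w-')
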
5 changes: 3 additions & 2 deletions xarray/backends/common.py
@@ -164,9 +164,10 @@ def __exit__(self, exception_type, exception_value, traceback):


class ArrayWriter(object):
    def __init__(self):
    def __init__(self, lock=GLOBAL_LOCK):
        self.sources = []
        self.targets = []
        self.lock = lock

    def add(self, source, target):
        if isinstance(source, dask_array_type):
@@ -184,7 +185,7 @@ def sync(self):
        import dask.array as da
        import dask
        if LooseVersion(dask.__version__) > LooseVersion('0.8.1'):
            da.store(self.sources, self.targets, lock=GLOBAL_LOCK)
            da.store(self.sources, self.targets, lock=self.lock)
@rabernat (Contributor Author), Dec 8, 2017:

FYI, I made this modification to the ArrayWriter class to allow us to bypass the use of a lock when writing, since (the way we have implemented it) the writes to the zarr store should all be thread safe and not require any lock. Later on, when we initialize the ZarrStore, we create a writer for it as ArrayWriter(lock=None). @mrocklin, does this seem like the right approach?

Contributor:

If you can guarantee that no two tasks will write to the same block in Zarr then yes, I think that it is appropriate to avoid locking. This is based on old information though.

Contributor Author:

Another observation: this call to dask.array.store does not show up on my distributed dashboard task / status pages. The writes are obviously happening, but I can't follow their progress. Since we are going to be using this to move some very large datasets, interactive monitoring via the dashboard is quite important. Do you have any idea why this is not showing up?

Contributor Author:

> If you can guarantee that no two tasks will write to the same block in Zarr then yes, I think that it is appropriate to avoid locking.

As of this PR, we do not allow multiple dask chunks per zarr chunk. That scenario is covered by the test suite. It may change in the future, but that's how it is for now.

Once we cross that bridge, we will also have to deal with the fact that both zarr and dask have their own locking (aka "synchronization" in zarr parlance) enforcement mechanisms. We will presumably have to pick one.

Contributor:

> Another observation: this call to dask.array.store does not show up on my distributed dashboard task / status pages. The writes are obviously happening, but I can't follow their progress. Since we are going to be using this to move some very large datasets, interactive monitoring via the dashboard is quite important. Do you have any idea why this is not showing up?

There is no reason that a task run on the distributed system will not show up on the dashboard. My first guess is that somehow you're using a local scheduler.

Contributor:

> As of this PR, we do not allow multiple dask chunks per zarr chunk. That scenario is covered by the test suite. It may change in the future, but that's how it is for now.
>
> Once we cross that bridge, we will also have to deal with the fact that both zarr and dask have their own locking (aka "synchronization" in zarr parlance) enforcement mechanisms. We will presumably have to pick one.

I suspect that we will never change this behavior. I don't think we should ever have multiple dask chunks write to one zarr chunk. Any to_zarr storage function should rechunk explicitly.

If for some reason we did need to synchronize, Dask provides a distributed locking mechanism that could be keyed by chunk label.
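
A hypothetical sketch of that idea (the function, key scheme, and raw write
are invented here for illustration, not part of this PR): dask's distributed
Lock can be named per zarr chunk, so writers to the same chunk serialize while
writers to different chunks proceed in parallel:

    from distributed import Client, Lock

    client = Client()  # the locks below coordinate through this scheduler

    def write_chunk(block, store, chunk_key):
        # one named distributed lock per target zarr chunk
        with Lock('zarr-chunk-' + chunk_key):
            store[chunk_key] = block.tobytes()  # hypothetical raw write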

Contributor Author:

> There is no reason that a task run on the distributed system will not show up on the dashboard. My first guess is that somehow you're using a local scheduler.

I was not using a local scheduler. After digging further, I can see the tasks on the distributed dashboard using a regular zarr.DirectoryStore, but not when I pass a gcsfs.mapping.GCSMap to to_zarr. Is there any reason these two should behave differently?

Contributor Author:

Update: it does eventually show up, it just takes a really long time.

Member:

Yes, this looks good.

        else:
            da.store(self.sources, self.targets)
        self.sources = []
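
For context, a minimal sketch (not from this diff) of what the lock argument
changes at the dask level, assuming dask chunks map one-to-one onto zarr
chunks; the path scratch.zarr is hypothetical:

    import dask.array as da
    import zarr

    # dask chunks align one-to-one with zarr chunks, so no two tasks
    # ever write to the same zarr chunk
    source = da.random.random((4, 5), chunks=(2, 5))
    target = zarr.open('scratch.zarr', mode='w', shape=(4, 5), chunks=(2, 5))
    # lock=False mirrors ArrayWriter(lock=None): no lock is taken on write
    da.store([source], [target], lock=False)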