-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sparse option to reindex and unstack #3542
Conversation
# Conflicts: # xarray/core/dataarray.py # xarray/core/dataset.py # xarray/tests/test_dataset.py
Hello @fujiisoup! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-11-18 09:02:36 UTC |
@@ -993,6 +993,32 @@ def chunk(self, chunks=None, name=None, lock=False): | |||
|
|||
return type(self)(self.dims, data, self._attrs, self._encoding, fastpath=True) | |||
|
|||
def _as_sparse(self, sparse_format=_default, fill_value=dtypes.NA): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, this is a private method.
Probably we can expose it to the public and add the same method to DataArray and Dataset as well in the future.
data = as_sparse(self.data.astype(dtype), fill_value=fill_value) | ||
return self._replace(data=data) | ||
|
||
def _to_dense(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also private, as is _as_sparse
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make these public in DataArray
and Dataset
. See discussion here: #3245 (comment)
Can be left for a future PR though :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xarray/tests/test_dataset.py
Outdated
actual = data.reindex(dim3=dim3, sparse=True) | ||
expected = data.reindex(dim3=dim3, sparse=False) | ||
for k, v in data.data_vars.items(): | ||
np.testing.assert_equal(actual[k].data.todense(), expected[k].data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, assert_equal
cannot be used as we need to explicitly densify the array for the comparison.
@@ -1862,6 +1863,17 @@ def test_getitem_with_mask_nd_indexer(self): | |||
) | |||
|
|||
|
|||
@requires_sparse | |||
class TestVariableWithSparse: | |||
# TODO inherit VariableSubclassobjects to cover more tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
actual = ds["var"].unstack("index", sparse=True) | ||
expected = ds["var"].unstack("index") | ||
assert actual.variable._to_dense().equals(expected.variable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we test whether actual.variable
is actually sparse?
xarray/tests/test_dataset.py
Outdated
actual = data.reindex(dim3=dim3, sparse=True, fill_value=-10) | ||
expected = data.reindex(dim3=dim3, sparse=False, fill_value=-10) | ||
for k, v in data.data_vars.items(): | ||
np.testing.assert_equal(actual[k].data.todense(), expected[k].data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we generally use assert_array_equal
for numpy arrays (but I can't immediately recall the difference...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these actually end up doing the exact same checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values
property does not work for the sparse-backed Variable, resulting in the failure of assert_array_equal
.
I'll add TODO comment for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fujiisoup it's great to have you back!
xarray/tests/test_dataset.py
Outdated
actual = data.reindex(dim3=dim3, sparse=True, fill_value=-10) | ||
expected = data.reindex(dim3=dim3, sparse=False, fill_value=-10) | ||
for k, v in data.data_vars.items(): | ||
np.testing.assert_equal(actual[k].data.todense(), expected[k].data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these actually end up doing the exact same checks.
xarray/core/variable.py
Outdated
|
||
if sparse_format is _default: | ||
sparse_format = "coo" | ||
as_sparse = getattr(sparse, "as_{}".format(sparse_format.lower())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of not hard-coding supported sparse formats, but I wonder if we could be a little more careful here if AttributeError is raised. We should probably catch and re-raise Attribute error with a more informative message if this fails.
Otherwise, I expect we might see bug reports from confused users, e.g., when sparse_format='csr'
raises a confusing message.
xarray/core/duck_array_ops.py
Outdated
@@ -251,6 +253,9 @@ def count(data, axis=None): | |||
|
|||
def where(condition, x, y): | |||
"""Three argument where() with better dtype promotion rules.""" | |||
# sparse support | |||
if isinstance(x, sparse_array_type) or isinstance(y, sparse_array_type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little surprised this is necessary. Does sparse
not support __array_function__
for np.where
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, yes. sparse looks not working with np.result_type
and astype(copy=False)
.
I'll add a TODO here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have the latest version of sparse installed?
when I test this on my machine, it works:
In [13]: import sparse
In [14]: import numpy as np
In [15]: import xarray
In [16]: x = sparse.COO(np.arange(3))
In [17]: xarray.core.duck_array_ops.where(x > 1, x, x)
Out[17]: <COO: shape=(3,), dtype=int64, nnz=2, fill_value=0>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. You are right.
I was running with sparse 0.7.0. With 0.8.0, it is running.
""" | ||
import sparse | ||
|
||
# TODO what to do if dask-backended? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully sparse will raise an error if you try to convert a dask array into a sparse array! If not, we should do that ourselves.
Long term, the best solution would be to convert a dask array from dense chunks to sparse chunks.
Thank you @fujiisoup ! I think it could be a little cleaner to entirely avoid |
* master: (24 commits) Tweaks to release instructions (pydata#3555) Clarify conda environments for new contributors (pydata#3551) Revert to dev version 0.14.1 whatsnew (pydata#3547) sparse option to reindex and unstack (pydata#3542) Silence sphinx warnings (pydata#3516) Numpy 1.18 support (pydata#3537) tweak whats-new. (pydata#3540) small simplification of rename from pydata#3532 (pydata#3539) Added fill_value for unstack (pydata#3541) Add DatasetGroupBy.quantile (pydata#3527) ensure rename does not change index type (pydata#3532) Leave empty slot when not using accessors interpolate_na: Add max_gap support. (pydata#3302) units & deprecation merge (pydata#3530) Fix set_index when an existing dimension becomes a level (pydata#3520) add Variable._replace (pydata#3528) Tests for module-level functions with units (pydata#3493) Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502) FUNDING.yml (pydata#3523) ...
* upstream/master: (22 commits) Resolve the version issues on RTD (pydata#3589) Add bottleneck & rasterio git tip to upstream-dev CI (pydata#3585) update whats-new.rst (pydata#3581) Examples for quantile (pydata#3576) add cftime intersphinx entries (pydata#3577) Add pyXpcm to Related Projects doc page (pydata#3578) Reimplement quantile with apply_ufunc (pydata#3559) add environment file for binderized examples (pydata#3568) Add drop to api.rst under pending deprecations (pydata#3561) replace duplicate method _from_vars_and_coord_names (pydata#3565) propagate indexes in to_dataset, from_dataset (pydata#3519) Switch examples to notebooks + scipy19 docs improvements (pydata#3557) fix whats-new.rst (pydata#3554) Tweaks to release instructions (pydata#3555) Clarify conda environments for new contributors (pydata#3551) Revert to dev version 0.14.1 whatsnew (pydata#3547) sparse option to reindex and unstack (pydata#3542) Silence sphinx warnings (pydata#3516) Numpy 1.18 support (pydata#3537) ...
* upstream/master: (35 commits) fix plotting with transposed nondim coords. (pydata#3441) make coarsen reductions consistent with reductions on other classes (pydata#3500) Resolve the version issues on RTD (pydata#3589) Add bottleneck & rasterio git tip to upstream-dev CI (pydata#3585) update whats-new.rst (pydata#3581) Examples for quantile (pydata#3576) add cftime intersphinx entries (pydata#3577) Add pyXpcm to Related Projects doc page (pydata#3578) Reimplement quantile with apply_ufunc (pydata#3559) add environment file for binderized examples (pydata#3568) Add drop to api.rst under pending deprecations (pydata#3561) replace duplicate method _from_vars_and_coord_names (pydata#3565) propagate indexes in to_dataset, from_dataset (pydata#3519) Switch examples to notebooks + scipy19 docs improvements (pydata#3557) fix whats-new.rst (pydata#3554) Tweaks to release instructions (pydata#3555) Clarify conda environments for new contributors (pydata#3551) Revert to dev version 0.14.1 whatsnew (pydata#3547) sparse option to reindex and unstack (pydata#3542) ...
black . && mypy . && flake8
whats-new.rst
for all changes andapi.rst
for new APIAdded
sparse
option toreindex
andunstack
.I just added a minimal set of codes necessary to
unstack
andreindex
.There is still a lot of space to complete the sparse support as discussed in #3245.