Faster unstacking to sparse #5577

Merged: 17 commits into pydata:main from the sparse-unstack branch, Dec 3, 2021

dcherian (Contributor) commented Jul 5, 2021

  • Tests added
  • Passes pre-commit run --all-files
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

Unstacking to sparse goes from 7 s to 25 ms, and memory usage from 3.5 GB to 850 MB =), by passing the coordinate locations directly to the sparse constructor.
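
The idea described above can be sketched without any libraries. This is an illustration only, not the code from this PR: the function name `unstack_to_coo`, the dict layout, and the sample data are all hypothetical. It assumes each stacked value comes with one integer code per index level, so those codes can be used directly as coordinate-format (COO) locations instead of first filling a dense grid. With the `sparse` library this would presumably map onto something like `sparse.COO(coords, data, shape=...)`, but that mapping is an assumption here.

```python
from math import prod


def unstack_to_coo(level_codes, values, level_sizes):
    """Describe the unstacked array in coordinate (COO) form.

    level_codes: one sequence of integer codes per index level; the k-th
        entry of each sequence locates the k-th stacked value along that
        new dimension.
    values: the stacked 1-D data, same length as each code sequence.
    level_sizes: number of labels in each level (the unstacked shape).
    """
    n = len(values)
    if any(len(codes) != n for codes in level_codes):
        raise ValueError("every level needs one code per value")
    for codes, size in zip(level_codes, level_sizes):
        if any(not 0 <= c < size for c in codes):
            raise ValueError("code out of range for its level")
    # The codes already are the coordinates: no dense grid is allocated,
    # so the cost scales with the number of stored values, not the shape.
    coords = [list(codes) for codes in level_codes]
    return {"coords": coords, "data": list(values), "shape": tuple(level_sizes)}


# Hypothetical stacked data: 3 observed points on a 4 x 5 grid.
x_codes = [0, 2, 3]
y_codes = [1, 1, 4]
coo = unstack_to_coo([x_codes, y_codes], [10.0, 20.0, 30.0], [4, 5])

stored = len(coo["data"])
dense_cells = prod(coo["shape"])
print(f"stored {stored} values instead of {dense_cells} dense cells")
```

The gap between stored values and dense cells grows with every extra dimension, which is consistent with the 3D benchmark below showing the largest memory difference.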

asv run -e --bench unstacking.UnstackingSparse.time_unstack_to_sparse  --cpu-affinity=3 HEAD
[  0.00%] · For xarray commit c9251e1c <sparse-unstack>:
[  0.00%] ·· Building for conda-py3.8-bottleneck-dask-distributed-netcdf4-numpy-pandas-scipy-sparse
[  0.00%] ·· Benchmarking conda-py3.8-bottleneck-dask-distributed-netcdf4-numpy-pandas-scipy-sparse
[  0.01%] ··· Running (unstacking.UnstackingSparse.time_unstack_to_sparse_2d--)..
[  0.02%] ··· unstacking.UnstackingSparse.time_unstack_to_sparse_2d    623±30μs
[  0.02%] ··· unstacking.UnstackingSparse.time_unstack_to_sparse_3d    22.8±2ms
[  0.06%] ··· unstacking.UnstackingSparse.peakmem_unstack_to_sparse_2d    793M
[  0.06%] ··· unstacking.UnstackingSparse.peakmem_unstack_to_sparse_3d    794M


[  0.04%] · For xarray commit 80905135 <main>:
[  0.04%] ·· Building for conda-py3.8-bottleneck-dask-distributed-netcdf4-numpy-pandas-scipy-sparse..
[  0.04%] ·· Benchmarking conda-py3.8-bottleneck-dask-distributed-netcdf4-numpy-pandas-scipy-sparse
[  0.05%] ··· Running (unstacking.UnstackingSparse.time_unstack_to_sparse_2d--)..
[  0.06%] ··· unstacking.UnstackingSparse.time_unstack_to_sparse_2d    596±30ms
[  0.06%] ··· unstacking.UnstackingSparse.time_unstack_to_sparse_3d    7.72±0.1s
[  0.02%] ··· unstacking.UnstackingSparse.peakmem_unstack_to_sparse_2d    867M
[  0.02%] ··· unstacking.UnstackingSparse.peakmem_unstack_to_sparse_3d    3.56G

cc @bonnland

github-actions bot (Contributor) commented Jul 5, 2021

Unit Test Results

6 files, 6 suites, 53m 48s ⏱️
16 281 tests: 14 545 ✔️ passed, 1 736 💤 skipped, 0 failed
90 882 runs: 82 702 ✔️ passed, 8 180 💤 skipped, 0 failed

Results for commit 267a14f.

♻️ This comment has been updated with latest results.

@dcherian added the topic-arrays label (related to flexible array support) Jul 5, 2021
max-sixty (Collaborator) commented:

"From 7s to 25 ms"

Casual!

dcherian and others added 5 commits July 7, 2021 09:21
* upstream/main: (34 commits)
  Use same bool validator as other inputs (pydata#5703)
  conditionally disable bottleneck (pydata#5560)
  Refactor index vs. coordinate variable(s) (pydata#5636)
  pre-commit: autoupdate hook versions (pydata#5685)
  Flexible Indexes: Avoid len(index) in map_blocks (pydata#5670)
  Speed up _mapping_repr (pydata#5661)
  update the link to `scipy`'s intersphinx file (pydata#5665)
  Bump styfle/cancel-workflow-action from 0.9.0 to 0.9.1 (pydata#5663)
  pre-commit: autoupdate hook versions (pydata#5660)
  fix the binder environment (pydata#5650)
  Update api.rst (pydata#5639)
  Kwargs to rasterio open (pydata#5609)
  Bump codecov/codecov-action from 1 to 2.0.2 (pydata#5633)
  new blank whats-new for v0.19.1
  v0.19.0 release notes (pydata#5632)
  remove deprecations scheduled for 0.19 (pydata#5630)
  Make typing-extensions optional (pydata#5624)
  Plots get labels from pint arrays (pydata#5561)
  Add to_numpy() and as_numpy() methods (pydata#5568)
  pin fsspec (pydata#5627)
  ...
@dcherian added the run-benchmark label (Run the ASV benchmark workflow) Oct 28, 2021
Illviljan (Contributor) commented:

       before           after         ratio
     [36f05d70]       [0310ebec]
-           2.98G             204M     0.07  unstacking.UnstackingSparse.peakmem_unstack_to_sparse_3d [fv-az292-755/conda-py3.8-bottleneck-dask-distributed-netcdf4-numpy-pandas-scipy-sparse]
-              3G             204M     0.07  unstacking.UnstackingSparse.peakmem_unstack_to_sparse_3d [fv-az292-755/conda-py3.8-dask-distributed-netcdf4-numpy-pandas-scipy-sparse]
-      10.2±0.02s         29.7±2ms     0.00  unstacking.UnstackingSparse.time_unstack_to_sparse_3d [fv-az292-755/conda-py3.8-bottleneck-dask-distributed-netcdf4-numpy-pandas-scipy-sparse]
-      10.1±0.05s       27.4±0.6ms     0.00  unstacking.UnstackingSparse.time_unstack_to_sparse_3d [fv-az292-755/conda-py3.8-dask-distributed-netcdf4-numpy-pandas-scipy-sparse]
-        714±20ms         945±30μs     0.00  unstacking.UnstackingSparse.time_unstack_to_sparse_2d [fv-az292-755/conda-py3.8-bottleneck-dask-distributed-netcdf4-numpy-pandas-scipy-sparse]
-         721±8ms         923±30μs     0.00  unstacking.UnstackingSparse.time_unstack_to_sparse_2d [fv-az292-755/conda-py3.8-dask-distributed-netcdf4-numpy-pandas-scipy-sparse]

Quite the improvement indeed. :)
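
The ratio column rounds the timing ratios to 0.00, so the size of the speedup is easier to read as a before/after quotient. A small sketch of that arithmetic, using the central values from the bottleneck rows of the table above and ignoring the ± spreads:

```python
# (before, after) timings in seconds, copied from the bottleneck rows above.
timings = {
    "time_unstack_to_sparse_3d": (10.2, 29.7e-3),
    "time_unstack_to_sparse_2d": (714e-3, 945e-6),
}

# Speedup factor = time before the change / time after the change.
speedups = {name: before / after for name, (before, after) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.0f}x faster")
```

On these numbers the 3D case is roughly 340x faster and the 2D case roughly 750x faster.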

* upstream/main: (39 commits)
  Fixed a mispelling of dimension in dataarray documentation for from_dict (pydata#6020)
  [pre-commit.ci] pre-commit autoupdate (pydata#6014)
  [pre-commit.ci] pre-commit autoupdate (pydata#5990)
  Use set_options for asv bottleneck tests (pydata#5986)
  Fix module name retrieval in `backend.plugins.remove_duplicates()`, plugin tests (pydata#5959)
  Check for py version instead of try/except when importing entry_points (pydata#5988)
  Add "see also" in to_dataframe docs (pydata#5978)
  Alternate method using inline css to hide regular html output in an untrusted notebook (pydata#5880)
  Fix mypy issue with entry_points (pydata#5979)
  Remove pre-commit auto update (pydata#5958)
  Do not change coordinate inplace when throwing error (pydata#5957)
  Create CITATION.cff (pydata#5956)
  Add groupby & resample benchmarks (pydata#5922)
  Fix plot.line crash for data of shape (1, N) in _title_for_slice on format_item (pydata#5948)
  Disable unit test comments (pydata#5946)
  Publish test results from workflow_run only (pydata#5947)
  Generator for groupby reductions (pydata#5871)
  whats-new dev
  whats-new for 0.20.1 (pydata#5943)
  Docs: fix URL for PTSA (pydata#5935)
  ...
@dcherian added the plan to merge label (Final call for comments) Nov 24, 2021
dcherian (Contributor, Author) commented Dec 2, 2021

@pydata/xarray I'm planning to merge on Friday. It's been sitting around for a while and is a giant improvement.

* upstream/main:
  fix grammatical typo in docs (pydata#6034)
  Use condas dask-core in ci instead of dask to speedup ci and reduce dependencies (pydata#6007)
  Use complex nan by default when interpolating out of bounds (pydata#6019)
  Simplify missing value handling in xarray.corr (pydata#6025)
  Add pyXpcm to Related Projects doc page (pydata#6031)
  Make xr.corr and xr.map_blocks work without dask (pydata#5731)
@dcherian merged commit cdfcf37 into pydata:main Dec 3, 2021
@dcherian deleted the sparse-unstack branch December 3, 2021 16:38
Labels: needs review; plan to merge (Final call for comments); run-benchmark (Run the ASV benchmark workflow); topic-arrays (related to flexible array support)
5 participants