Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some alignment optimizations #7382

Merged
merged 6 commits into from
Jan 5, 2023
Merged

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Dec 15, 2022

  • Benchmark added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

May fix some performance regressions, e.g., see #7376 (comment).

@ravwojdyla with this PR ds.assign(foo=~ds["d3"]) in your example should be much faster (on par with version 2022.3.0).

This may happen in some (rare?) cases where the objects to align share
the same indexes.
If all unindexed dimension sizes match the indexed dimension sizes in
the objects to align, we don't need re-indexing.
@benbovy
Copy link
Member Author

benbovy commented Dec 15, 2022

Quick benchmark taking the example in #7376 (it seems even much faster than in version 2022.3.0!)

# version 2022.3.0
%timeit ds.assign(foo=~ds["d3"])
# 22.5 ms ± 1.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

# main branch
%timeit ds.assign(foo=~ds["d3"])
# 193 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# this PR
%timeit ds.assign(foo=~ds["d3"])
# 1.01 ms ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label Dec 15, 2022
@Illviljan
Copy link
Contributor

Illviljan commented Dec 15, 2022

No benchmark is catching this? Maybe we can add a small one in https://github.com/pydata/xarray/blob/main/asv_bench/benchmarks/indexing.py ?

@@ -1419,6 +1419,11 @@ def check_variables():
)

indexes = [e[0] for e in elements]

same_objects = all(indexes[0] is other_idx for other_idx in indexes[1:])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
same_objects = all(indexes[0] is other_idx for other_idx in indexes[1:])
indexes_0 = indexes[0]
same_objects = all(indexes_0 is other_idx for other_idx in indexes[1:])

No need to use getitem several times for the same value. Similar thing can be done in other places in the function as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List indexing is O(1), so should be only a marginal speedup :P

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think the gain in perf would be negligible.

@benbovy
Copy link
Member Author

benbovy commented Dec 19, 2022

I don't know if the optimizations added here will benefit a large set of use cases (it took 6 months before seeing an issue report), but it is worth for at least a few of them. This is ready I think (added some benchmarks).

@benbovy benbovy added the plan to merge Final call for comments label Dec 20, 2022
@Illviljan Illviljan enabled auto-merge (squash) January 5, 2023 20:45
@Illviljan Illviljan disabled auto-merge January 5, 2023 21:25
@Illviljan Illviljan merged commit d6d2450 into pydata:main Jan 5, 2023
@Illviljan
Copy link
Contributor

Thanks, @benbovy !

dcherian added a commit to dcherian/xarray that referenced this pull request Jan 18, 2023
* main: (41 commits)
  v2023.01.0 whats-new (pydata#7440)
  explain keep_attrs in docstring of apply_ufunc (pydata#7445)
  Add sentence to open_dataset docstring (pydata#7438)
  pin scipy version in doc environment (pydata#7436)
  Improve performance for backend datetime handling (pydata#7374)
  fix typo (pydata#7433)
  Add lazy backend ASV test (pydata#7426)
  Pull Request Labeler - Workaround sync-labels bug (pydata#7431)
  see also : groupby in resample doc and vice-versa (pydata#7425)
  Some alignment optimizations (pydata#7382)
  Make `broadcast` and `concat` work with the Array API (pydata#7387)
  remove `numbagg` and `numba` from the upstream-dev CI (pydata#7416)
  [pre-commit.ci] pre-commit autoupdate (pydata#7402)
  Preserve original dtype when accessing MultiIndex levels (pydata#7393)
  [pre-commit.ci] pre-commit autoupdate (pydata#7389)
  [pre-commit.ci] pre-commit autoupdate (pydata#7360)
  COMPAT: Adjust CFTimeIndex.get_loc for pandas 2.0 deprecation enforcement (pydata#7361)
  Avoid loading entire dataset by getting the nbytes in an array (pydata#7356)
  `keep_attrs` for pad (pydata#7267)
  Bump pypa/gh-action-pypi-publish from 1.5.1 to 1.6.4 (pydata#7375)
  ...
@benbovy benbovy deleted the some-align-optimizations branch August 30, 2023 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments run-benchmark Run the ASV benchmark workflow topic-indexing topic-performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants