Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(dot-columns): speed up heavily used .columns #9921

Merged
merged 3 commits into from
Sep 14, 2024

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Aug 26, 2024

Description of changes

This PR changes the Table.columns property to return the underlying
tuple of names from the operation's schema attribute.

Right now, every .columns access incurs the overhead of creating a list,
which remained for backwards compatibility.

I was refactoring selectors in #9917, and noticed some performance differences
that boiled down to accessing this attribute many times in the case of
selectors operating on very wide tables.

The other approach I considered was t continue returning a list, and speed up
the access using functools.cached_property.

While this doesn't break the API, it seemed incredibly wasteful to store two
copies of every table's columns, especially in these slower cases where
tables have a ton of columns.

BREAKING CHANGE: Table.columns is now a tuple instead of a list. Your usage of .columns may require updating if you're depending specifically on .columns being a list.

@cpcloud cpcloud added this to the 10.0 milestone Aug 26, 2024
@cpcloud cpcloud added ux User experience related issues performance Issues related to ibis's performance breaking change Changes that introduce an API break at any level labels Aug 26, 2024
@cpcloud
Copy link
Member Author

cpcloud commented Aug 26, 2024

The difference in performance is pretty stark:

---------------------------------------------------------------------------------- benchmark 'test_dot_columns[1000000]': 2 tests ---------------------------------------------------------------------------------
Name (time in ms)                                Min                Max               Mean            StdDev             Median               IQR            Outliers             OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_dot_columns[1000000] (0001_d552afa)     10.3734 (>1000.0)  14.1146 (>1000.0)  10.6059 (>1000.0)  0.4976 (>1000.0)  10.4641 (>1000.0)  0.1202 (>1000.0)      5;11         94.2872 (0.00)         90           1
test_dot_columns[1000000] (NOW)               0.0003 (1.0)       0.0090 (1.0)       0.0003 (1.0)      0.0001 (1.0)       0.0003 (1.0)      0.0000 (1.0)      946;2493  2,904,766.3008 (1.0)      185529           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------- benchmark 'test_dot_columns[100000]': 2 tests ------------------------------------------------------------------------------------
Name (time in us)                                Min                 Max                Mean             StdDev              Median               IQR             Outliers  OPS (Kops/s)            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_dot_columns[100000] (0001_d552afa)     197.4140 (853.20)   278.9990 (174.87)   207.2664 (849.73)   13.9447 (>1000.0)  201.0715 (829.57)   7.7540 (>1000.0)    186;200        4.8247 (0.00)       1642           1
test_dot_columns[100000] (NOW)                0.2314 (1.0)        1.5954 (1.0)        0.2439 (1.0)       0.0137 (1.0)        0.2424 (1.0)      0.0057 (1.0)      2999;3160    4,099.6842 (1.0)      193051          21
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------- benchmark 'test_dot_columns[10000]': 2 tests ---------------------------------------------------------------------------------
Name (time in us)                              Min                Max               Mean            StdDev             Median               IQR             Outliers  OPS (Kops/s)            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_dot_columns[10000] (0001_d552afa)     19.5370 (61.05)    48.8320 (6.99)     21.2852 (61.88)    1.5586 (25.24)    21.0490 (61.73)    0.4710 (469.09)   2090;3084       46.9810 (0.02)      35120           1
test_dot_columns[10000] (NOW)               0.3200 (1.0)       6.9830 (1.0)       0.3440 (1.0)      0.0617 (1.0)       0.3410 (1.0)      0.0010 (1.0)      601;33246    2,907.2263 (1.0)      106519           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
========================================================================================= 3 passed in 9.38s ==========================================================================================

Sure, we're in microsecond territory, but this property is used quite a bit, and hit pretty heavily in selector processing, which tends to invoke .columns in some form of loop.

@cpcloud cpcloud requested a review from jcrist August 26, 2024 13:44
@cpcloud
Copy link
Member Author

cpcloud commented Aug 26, 2024

Another approach is to replace the use of .columns in internals, (perhaps with a private _columns property), which is annoying but also not a breaking change.

The downside there is that with .columns and ._columns, developers can easily choose the wrong one and get bad performance.

@cpcloud
Copy link
Member Author

cpcloud commented Aug 26, 2024

Moving to draft to avoid merging before 9.4

@cpcloud cpcloud marked this pull request as draft August 26, 2024 15:07
@cpcloud cpcloud force-pushed the speedup-dot-columns branch from 6d12c19 to d938306 Compare September 13, 2024 17:45
@cpcloud cpcloud marked this pull request as ready for review September 13, 2024 17:46
@cpcloud cpcloud force-pushed the speedup-dot-columns branch 2 times, most recently from 6c6ea0d to a10640e Compare September 13, 2024 18:07
@cpcloud cpcloud added the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Sep 13, 2024
@ibis-docs-bot ibis-docs-bot bot removed the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Sep 13, 2024
@cpcloud cpcloud force-pushed the speedup-dot-columns branch from a10640e to 881ef26 Compare September 14, 2024 08:18
@cpcloud
Copy link
Member Author

cpcloud commented Sep 14, 2024

The 3 cloud tests that were failing are now passing.

ibis/backends/snowflake/tests/test_client.py::test_insert PASSED         [ 33%]
ibis/backends/bigquery/tests/system/test_client.py::test_parted_column[date] PASSED [ 66%]
ibis/backends/bigquery/tests/system/test_client.py::test_parted_column[timestamp] PASSED [100%]

========================================================================================= 3 passed in 39.24s =========================================================================================

@cpcloud cpcloud merged commit ff3550c into ibis-project:main Sep 14, 2024
72 checks passed
@cpcloud cpcloud deleted the speedup-dot-columns branch September 14, 2024 08:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change Changes that introduce an API break at any level performance Issues related to ibis's performance ux User experience related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants