Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to eager asof join to speed up arrow queries #583

Merged
merged 6 commits into from
Dec 18, 2022

Conversation

jleibs
Copy link
Member

@jleibs jleibs commented Dec 16, 2022

The previous join operation was quite expensive, asof_join is much less flexible, but significantly more performant.

Before:

obj_mono_points/insert  time:   [107.87 ms 108.56 ms 109.41 ms]
                        thrpt:  [91.400 Kelem/s 92.118 Kelem/s 92.700 Kelem/s]

obj_mono_points/query   time:   [21.120 ms 21.316 ms 21.533 ms]
                        thrpt:  [4.6440 Kelem/s 4.6914 Kelem/s 4.7348 Kelem/s]

obj_batch_points/insert time:   [492.58 µs 493.57 µs 494.80 µs]
                        thrpt:  [20.210 Melem/s 20.261 Melem/s 20.301 Melem/s]

obj_batch_points/query  time:   [285.54 µs 290.11 µs 295.94 µs]
                        thrpt:  [337.91 Kelem/s 344.70 Kelem/s 350.22 Kelem/s]

After:

obj_mono_points/insert  time:   [107.06 ms 107.28 ms 107.52 ms]
                        thrpt:  [93.006 Kelem/s 93.212 Kelem/s 93.407 Kelem/s]
                 change:
                        time:   [-1.9755% -1.1735% -0.4966%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4991% +1.1875% +2.0153%]
                        Change within noise threshold.

obj_mono_points/query   time:   [403.77 µs 406.58 µs 410.60 µs]
                        thrpt:  [243.55 Kelem/s 245.95 Kelem/s 247.67 Kelem/s]
                 change:
                        time:   [-98.111% -98.088% -98.064%] (p = 0.00 < 0.05)
                        thrpt:  [+5065.6% +5130.8% +5194.2%]
                        Performance has improved.

obj_batch_points/insert time:   [496.75 µs 498.94 µs 502.60 µs]
                        thrpt:  [19.897 Melem/s 20.043 Melem/s 20.131 Melem/s]
                 change:
                        time:   [+0.4017% +0.9136% +1.5290%] (p = 0.00 < 0.05)
                        thrpt:  [-1.5060% -0.9053% -0.4001%]
                        Change within noise threshold.

obj_batch_points/query  time:   [6.5295 µs 6.5356 µs 6.5426 µs]
                        thrpt:  [15.285 Melem/s 15.301 Melem/s 15.315 Melem/s]
                 change:
                        time:   [-97.699% -97.638% -97.578%] (p = 0.00 < 0.05)
                        thrpt:  [+4028.7% +4133.6% +4245.8%]
                        Performance has improved.

Checklist

  • I have read and agree to Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I've added a line to CHANGELOG.md (if this is a big enough change to warrant it)

@jleibs jleibs marked this pull request as ready for review December 16, 2022 16:30
@jleibs jleibs force-pushed the jleibs/query_join_asof branch 2 times, most recently from ac112e7 to 7a07aa8 Compare December 16, 2022 18:23
Copy link
Member

@emilk emilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM, and if it passes the tests I guess it works :)

.iter()
.zip(results)
.filter_map(|(component, col)| col.map(|col| (component, col)))
.map(|(&component, col)| Series::try_from((component, col)))
.map(|(&component, col)| {
// TODO(jleibs): Is it possible to have multiple rows here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multiple rows per component? I don't quite understand the context for the question.

adding a debug_assert! is a good start to answering that question

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call & the question both go away with #590

if df.column(Instance::NAME).is_ok() {
// If we have an InstanceKey column already, make sure that it's sorted.
// TODO(jleibs): can remove this once we have a sort guarantee from the store
Ok(df.sort([Instance::NAME], false)?)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Ok(df.sort([Instance::NAME], false)?)
let reverse = false;
Ok(df.sort([Instance::NAME], reverse)?)

@teh-cmc teh-cmc self-requested a review December 18, 2022 00:29
Base automatically changed from jleibs/arrow_bench to main December 18, 2022 14:28
@jleibs jleibs force-pushed the jleibs/query_join_asof branch from 7a07aa8 to f6e62fb Compare December 18, 2022 14:30
@jleibs jleibs merged commit 676343a into main Dec 18, 2022
@jleibs jleibs deleted the jleibs/query_join_asof branch December 18, 2022 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants