Enable larger join tests #645

hendrikmakait · 2023-01-03T19:20:28Z

Closes Enable larger joins for p2p in test_join.py #641

ncclementi · 2023-01-11T16:34:13Z

@hendrikmakait is this still a draft PR, or do we want to get this one going?

hendrikmakait · 2023-01-11T16:51:26Z

@ncclementi: I still have to check why the tests don't run on CI; likely the pytest worker gets OOM-killed.

hendrikmakait · 2023-01-20T16:12:23Z

@ncclementi: As it turns out, one issue with the tests was that the partition size was very small (~8 MiB). I've now adjusted the tests to use larger partitions; let's see what CI says.
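
To illustrate why partition size matters, here is a minimal sketch of the arithmetic: for a fixed dataset size, the number of partitions (and so the number of tasks to schedule) grows as partitions shrink. The ~8 MiB figure comes from the comment above; the 10 GiB total, the 64 MiB alternative, and the helper name `partition_count` are assumptions for illustration only, not values taken from the tests.

```python
MiB = 2**20
GiB = 2**30


def partition_count(total_bytes: int, partition_bytes: int) -> int:
    """Number of partitions needed to hold total_bytes (hypothetical helper)."""
    if partition_bytes <= 0:
        raise ValueError("partition_bytes must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bytes // partition_bytes)


total = 10 * GiB  # assumed dataset size, not taken from the tests
small = partition_count(total, 8 * MiB)   # ~8 MiB partitions, as observed
large = partition_count(total, 64 * MiB)  # assumed larger target size
print(small, large)
```

Under these assumed numbers, the small partitions produce eight times as many pieces to shuffle as the larger ones.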

hendrikmakait · 2023-01-20T16:54:12Z

I've also dropped the 10x join, since it would mean spilling 10-20x the cluster memory to disk, which would slow the tests down significantly. We could (and should) add the occasional BIG data integration test run.

hendrikmakait · 2023-01-23T19:48:20Z

test_join_big shows unnecessarily high memory usage with p2p due to dask/distributed#7496. It still performs better, so I count that as a win.

test_join_big_small does not show any effect because the smaller dataframe is materialized early, which circumvents a distributed join (#669).
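
For context, a rough sketch of why materializing the smaller side avoids a distributed join: once the small table is held fully in memory, each partition of the big table can be joined against it independently, so no rows need to move between partitions. This is plain Python for illustration and not dask's implementation; the function name `broadcast_join` and the sample rows are hypothetical.

```python
from collections import defaultdict


def broadcast_join(big_partitions, small_rows, key):
    """Inner-join each big partition against a fully materialized small side."""
    # Build one in-memory lookup table from the small side.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    # Each big partition is processed on its own; nothing is shuffled.
    joined = []
    for partition in big_partitions:
        out = []
        for row in partition:
            for match in lookup.get(row[key], []):
                out.append({**match, **row})
        joined.append(out)
    return joined


big = [
    [{"predicate": 1, "a": "x"}, {"predicate": 2, "a": "y"}],
    [{"predicate": 2, "a": "z"}, {"predicate": 3, "a": "w"}],
]
small = [{"predicate": 2, "b": "hit"}]
result = broadcast_join(big, small, "predicate")
```

Because no shuffle takes place in this pattern, switching the shuffle method would not be expected to change the result of such a test.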

ncclementi · 2023-01-23T22:34:22Z

> test_join_big shows unnecessarily high memory usage with p2p due to dask/distributed#7496. It still performs better, so I count that as a win.

This makes sense. Hopefully, once that is fixed, we should see a nice drop in memory usage.

ncclementi

This LGTM; the failure is unrelated.
My only comment is that we should probably open a separate issue to track the size=10 case as part of the integration tests, and then mention it on this PR for completeness.

ncclementi · 2023-01-23T22:35:30Z

tests/benchmarks/test_join.py


        # Control cardinality on column to join - this produces cardinality ~ to len(df)
-        df2_big["x2"] = df2_big["x"] * 1e9
-        df2_big = df2_big.astype({"x2": "int"})
+        df2_big["predicate"] = df2_big["0"] * 1e9


Out of curiosity, why did you choose the name "predicate"?

hendrikmakait · 2023-01-24

That's me coming from a DB background. I wanted a name that's more descriptive in this context than x2, and since it's the column used in the join predicate (i.e., the expression used to merge the tables/dataframes), that's what I ended up with. This could also be merge_col or something like that if you find that easier to understand.

ncclementi · 2023-01-24

I see. That's good, no need to change it; it was more out of curiosity, and I just learned something new :)
I'm merging this in.
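
As a small illustration of the cardinality comment in the diff above, the sketch below scales a float column by 1e9 and casts it to an integer, then counts distinct join keys. It uses plain Python lists instead of a dataframe. The assumption that the source column holds uniform floats in [0, 1), the integer cast (mirroring the removed `astype` line), and the row count are all illustrative choices, not details confirmed by the PR.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 10_000              # assumed row count for illustration

# Assumption: the source column holds floats in [0, 1).
column_0 = [rng.random() for _ in range(n)]

# Mirror of `df["0"] * 1e9` followed by an integer cast.
predicate = [int(value * 1e9) for value in column_0]

distinct = len(set(predicate))
print(distinct / n)  # close to 1.0: nearly every row has its own join key
```

With about a billion possible key values and far fewer rows, collisions are rare, which is what "cardinality ~ len(df)" refers to.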

ncclementi · 2023-01-24T15:05:05Z

Thanks @hendrikmakait, let's open a separate issue to track integration tests for the bigger case.

hendrikmakait added 2 commits January 3, 2023 20:16

Enable larger join tests

17a0ecd

Merge branch 'main' into enable-larger-p2p-join-tests

3f16bdb

hendrikmakait self-assigned this Jan 4, 2023

Merge branch 'main' into enable-larger-p2p-join-tests

fe968b5

Adjust join tests

92be440

Remove 10x merge

8b4517d

Fix

e00dcab

hendrikmakait marked this pull request as ready for review January 23, 2023 09:55

hendrikmakait marked this pull request as draft January 23, 2023 13:53

hendrikmakait mentioned this pull request Jan 23, 2023

P2P shuffling and queuing combined may cause high memory usage with dask.dataframe.merge dask/distributed#7496

Closed

hendrikmakait marked this pull request as ready for review January 23, 2023 19:48

hendrikmakait requested a review from ncclementi January 23, 2023 19:48

ncclementi approved these changes Jan 23, 2023

ncclementi merged commit 8da363e into main Jan 24, 2023

hendrikmakait deleted the enable-larger-p2p-join-tests branch March 21, 2024 10:27
