Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable larger join tests #645

Merged
merged 6 commits into from
Jan 24, 2023
Merged

Enable larger join tests #645

merged 6 commits into from
Jan 24, 2023

Conversation

hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait self-assigned this Jan 4, 2023
@ncclementi
Copy link
Contributor

@hendrikmakait is this still a draft PR, or do we want to get this one going?

@hendrikmakait
Copy link
Member Author

@ncclementi: I still have to check why the tests don't run on CI; likely the pytest worker gets OOM-killed.

@hendrikmakait
Copy link
Member Author

@ncclementi: As it turns out, one issue with the tests was that the partition size was very small (~8 MiB). I've adjusted the tests now to use larger partitions, let's see what CI says.

@hendrikmakait
Copy link
Member Author

hendrikmakait commented Jan 20, 2023

I've also dropped the 10x join since it would mean spilling 10-20x the cluster memory to disk which would slow tests down significantly. We could (and should) add the occasional BIG data integration test run.

@hendrikmakait
Copy link
Member Author

test_join_big shows unnecessarily high memory usage with p2p due (dask/distributed#7496). It is still performing better, so I count that as a win.

test_join_big_small does not show any effect due to early materialization of the smaller dataframe which circumvents a distributed join (#669).

@hendrikmakait hendrikmakait marked this pull request as ready for review January 23, 2023 19:48
@ncclementi
Copy link
Contributor

test_join_big shows unnecessarily high memory usage with p2p due (dask/distributed#7496). It is still performing better, so I count that as a win.

This makes sense. Hopefully, when things get fixed we should see a nice drop in the memory.

Copy link
Contributor

@ncclementi ncclementi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, the failure is unrelated.
Only comment is that we should probably add a separate issue to track the case for size=10 as part of integration tests, and then mentioned it on this PR just for completion.


# Control cardinality on column to join - this produces cardinality ~ to len(df)
df2_big["x2"] = df2_big["x"] * 1e9
df2_big = df2_big.astype({"x2": "int"})
df2_big["predicate"] = df2_big["0"] * 1e9
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why did you choose the name "predicate"?

Copy link
Member Author

@hendrikmakait hendrikmakait Jan 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's me coming from a DB background. I wanted a name that's more descriptive in this context than x2 and since it's the column used in the join predicate (i.e., the expression used to merge the tables/dataframes), that's what I ended up with. This could also be merge_col or something like that if you find that easier to understand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. That's good, no need to change it, it was more out of curiosity and I just learn something new :)
I'm merging this in.

@ncclementi ncclementi merged commit 8da363e into main Jan 24, 2023
@ncclementi
Copy link
Contributor

Thanks @hendrikmakait, let's open a separate issue to track integration tests for the bigger case.

@hendrikmakait hendrikmakait deleted the enable-larger-p2p-join-tests branch March 21, 2024 10:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable larger joins for p2p in test_join.py
2 participants