Enable larger join tests #645
@hendrikmakait is this still a draft PR, or do we want to get this one going?
@ncclementi: I still have to check why the tests don't run on CI; likely the pytest worker gets OOM-killed.
@ncclementi: As it turns out, one issue with the tests was that the partition size was very small (~8 MiB). I've now adjusted the tests to use larger partitions; let's see what CI says.
I've also dropped the 10x join, since it would mean spilling 10-20x the cluster memory to disk, which would slow the tests down significantly. We could (and should) add the occasional BIG data integration test run.
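To make the partition-size point concrete, here is a rough back-of-the-envelope sketch in plain Python. The 10 GiB dataset size and the 128 MiB "larger" partition size are assumptions chosen for illustration; only the ~8 MiB figure comes from the comment above, and `partition_count` is a hypothetical helper, not part of the test suite.

```python
import math

MiB = 2**20
GiB = 2**30


def partition_count(total_bytes: int, partition_bytes: int) -> int:
    """Return how many partitions are needed to hold total_bytes."""
    return math.ceil(total_bytes / partition_bytes)


# Hypothetical dataset size, not taken from the PR's test setup.
total = 10 * GiB

small = partition_count(total, 8 * MiB)    # the ~8 MiB size mentioned above
large = partition_count(total, 128 * MiB)  # an assumed larger partition size

print(f"8 MiB partitions:   {small}")
print(f"128 MiB partitions: {large}")
```

Fewer, larger partitions mean fewer tasks for the scheduler to track for the same amount of data, which is presumably part of why the tiny partitions were a problem here.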
This makes sense. Hopefully, when things get fixed, we should see a nice drop in memory usage.
This LGTM, the failure is unrelated.
My only comment is that we should probably open a separate issue to track the size=10 case as part of the integration tests, and then mention it on this PR for completeness.
# Control cardinality on column to join - this produces cardinality ~ to len(df)
df2_big["x2"] = df2_big["x"] * 1e9
df2_big = df2_big.astype({"x2": "int"})
df2_big["predicate"] = df2_big["0"] * 1e9
Out of curiosity, why did you choose the name "predicate"?
That's me coming from a DB background. I wanted a name that's more descriptive in this context than x2, and since it's the column used in the join predicate (i.e., the expression used to merge the tables/dataframes), that's what I ended up with. This could also be merge_col or something like that if you find that easier to understand.
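For anyone else reading along, here is a minimal plain-Python sketch of the two ideas in this thread: scaling floats in [0, 1) by 1e9 and casting to int gives a column whose cardinality is close to the number of rows, and the join predicate is the equality test on that column. This is only an illustration, not the code in the test; the row layout and the `equi_join` helper are hypothetical.

```python
import random

random.seed(0)
n = 10_000

# Floats in [0, 1) scaled by 1e9 and truncated to int: with n much smaller
# than 1e9, nearly every value is distinct, so cardinality is close to n.
left = [{"id": i, "predicate": int(random.random() * 1e9)} for i in range(n)]
distinct = len({row["predicate"] for row in left})


def equi_join(left_rows, right_rows, key):
    """Hash join: pair up rows whose values in `key` are equal."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    return [(l, r) for l in left_rows for r in index.get(l[key], [])]


# A small right-hand table that reuses some of the left-hand key values.
right = [{"predicate": row["predicate"], "payload": row["id"] * 2} for row in left[:100]]
joined = equi_join(left, right, "predicate")

print(f"rows: {n}, distinct predicate values: {distinct}")
print(f"joined pairs: {len(joined)}")
```

With a key this close to unique, each right-hand row matches roughly one left-hand row, so the join output stays about the size of the smaller input rather than blowing up.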
I see. That's good, no need to change it; it was more out of curiosity, and I just learned something new :)
I'm merging this in.
Thanks @hendrikmakait, let's open a separate issue to track integration tests for the bigger case.
p2p
intest_join.py
#641