-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
P2P HashJoin #7514
P2P HashJoin #7514
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 24 files ± 0 24 suites ±0 10h 43m 2s ⏱️ + 37m 14s For more details on these failures, see this check. Results for commit 9d2828d. ± Comparison against base commit 76d0104. ♻️ This comment has been updated with latest results. |
cc @rjzamora @jrbourbeau I wouldn't mind a review from one of you two. Getting this running took a little more code than I wanted to but it should be working now. I'll see how I can remove a bit of boilerplate. I encountered many problems about serialization/stringification of arguments. Are there any best practices around this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test cases and executed merge logic look reasonable. The HLG stuff is somewhat beyond me at the moment.
dd._compat.assert_numpy_array_equal(av, bv) | ||
|
||
|
||
@pytest.mark.parametrize("how", ["inner", "left", "right", "outer"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: We parametrize multiple tests with how
, should we turn this into a fixture?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied most of the tests over so I didn't feel like rewriting them like this
I intend to move forward with this before finishing #7564 We can follow up with a simplification of the layer class afterwards |
cc523d4
to
fd700f8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM assuming CI is green.
@@ -0,0 +1,439 @@ | |||
# mypy: ignore-errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? If so, let's add a follow-up ticket to clean this up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There will be a follow up task to clean up this layer after the pack/unpack protocol is dropped. I didn't want to go through this exercise for almost-dead-code.
Refactoring the layer makes sense as a follow up ticket but I don't think we should have a ticket just for mypy.
# FIXME: Serializing custom objects to PyArrow is not supported in P2P shuffling | ||
if pd.api.types.is_object_dtype(column): | ||
raise TypeError( | ||
f"p2p does not support custom objects found in column '{name}'." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hendrikmakait Sorry I missed this in review of #7425 but is_object_dtype
is way too generic to raise here. This includes very simple string columns and we should definitely be able to shuffle dataframes with string columns. Since this is purely about failing early, I removed this check.
Partially closes #7496
The merge itself is pretty trivial but the boilerplate for the layer is a bit heavy...
I'll pass over it again in hopes to simplify it but the code is already ready for reviews.
I'll need to think about how to test this best. I would at least like to have a couple of basic tests, e.g. to ensure all parameters are properly serialized and forwarded, etc.
cc @hendrikmakait