Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

P2P HashJoin #7514

Merged
merged 3 commits into from
Feb 24, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions distributed/shuffle/__init__.py
Original file line number Diff line number Diff line change
@@ -1,12 +1,15 @@
from __future__ import annotations

from distributed.shuffle._arrow import check_minimal_arrow_version
from distributed.shuffle._merge import HashJoinP2PLayer, hash_join_p2p
from distributed.shuffle._scheduler_extension import ShuffleSchedulerExtension
from distributed.shuffle._shuffle import P2PShuffleLayer, rearrange_by_column_p2p
from distributed.shuffle._worker_extension import ShuffleWorkerExtension

__all__ = [
"check_minimal_arrow_version",
"hash_join_p2p",
"HashJoinP2PLayer",
"P2PShuffleLayer",
"rearrange_by_column_p2p",
"ShuffleSchedulerExtension",
Expand Down
5 changes: 0 additions & 5 deletions distributed/shuffle/_arrow.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,11 +20,6 @@ def check_dtype_support(meta_input: pd.DataFrame) -> None:
raise TypeError(
f"p2p does not support data of type '{column.dtype}' found in column '{name}'."
)
# FIXME: Serializing custom objects to PyArrow is not supported in P2P shuffling
if pd.api.types.is_object_dtype(column):
raise TypeError(
f"p2p does not support custom objects found in column '{name}'."
)
Comment on lines -23 to -27
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hendrikmakait Sorry I missed this in review of #7425 but is_object_dtype is way too generic to raise here. This includes very simple string columns and we should definitely be able to shuffle dataframes with string columns. Since this is purely about failing early, I removed this check.

# FIXME: PyArrow does not support sparse data: https://issues.apache.org/jira/browse/ARROW-8679
if pd.api.types.is_sparse(column):
raise TypeError("p2p does not support sparse data found in column '{name}'")
Expand Down
Loading