Enforce dtypes in P2P shuffle #7879

hendrikmakait · 2023-06-02T15:33:47Z

Closes #7420
Closes dask/dask#10326
@jrbourbeau: As discussed offline yesterday, here's the version that uses meta instead of a pa.Schema

Tests added / passed
Passes pre-commit run --all-files

hendrikmakait · 2023-06-02T15:35:11Z

cc @wence- to keep you in the loop.

hendrikmakait · 2023-06-02T15:36:18Z

distributed/shuffle/_arrow.py

-    return pa.concat_tables(shards)
+    table = pa.concat_tables(shards)
+    df = table.to_pandas(self_destruct=True)
+    return df.astype(meta.dtypes)


This might not be the most performant version, but I also don't know if it's much of a problem. I'll run an A/B test on the existing test suite.

My guess is it'd be good to avoid a cast is possible (at least for strings) via type_mapper= in pa.Table.to_pandas(). For example, as is, this will create object-backed string[python] columns first and then cast them to string[pyarrow].

This definitely isn't a blocker for this PR, but let's add a # TODO: comment if we don't include that logic now

I'll create a follow-up ticket once this is merged.

This seems like a good first pass. Thanks for putting this together so quickly. My guess is that generating the object dtype columns will slow things down, both for normal reasons, and for GIL + networking reasons.

I like that we didn't feel a need to block on this though.

Yes this slows stuff down significantly, mapping the strings (pa.string and pa.large_string) to pd.ArrowDtype makes this zero copy and should speed up follow up as types

hendrikmakait · 2023-06-02T15:37:02Z

distributed/shuffle/_merge.py

    left = ext.get_output_partition(
        shuffle_id_left, barrier_left, output_partition
-    ).drop(columns=_HASH_COLUMN_NAME)
+    ).drop(columns=_HASH_COLUMN_NAME, errors="ignore")


This is inelegant, but so is adding the hash column to the meta. If anybody has strong preferences, please speak up.

This seems fine to me. Thanks for adding the informative comment 👌

jrbourbeau

Nice! Thanks @hendrikmakait. Overall the changes here look good to me. I left some minor comments / questions

Have you tried this out with the example in dask/dask#10326?

jrbourbeau · 2023-06-02T15:43:40Z

distributed/shuffle/_arrow.py

@@ -54,7 +54,9 @@ def convert_partition(data: bytes) -> pa.Table:
    while file.tell() < end:
        sr = pa.RecordBatchStreamReader(file)
        shards.append(sr.read_all())
-    return pa.concat_tables(shards)
+    table = pa.concat_tables(shards)
+    df = table.to_pandas(self_destruct=True)


I'm a little nervous about self_destruct=True as the docstring (https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas) says it's experimental and "If you use the object after calling to_pandas with this option it will crash your program".

I took it as a recommendation from https://arrow.apache.org/docs/python/pandas.html#reducing-memory-use-in-table-to-pandas

From what I understand, it would wreck havoc if we used table after the call to to_pandas. Given that we return after the next line, that shouldn't be a problem.

Thanks for pointing to the extra docs. I guess I was nervous about pyarrow-backed dtypes in pandas specifically. But I would hope in that case memory wouldn't be deallocated if pandas was using it after the pyarrow -> pandas handoff. If we're running this against a few use cases that do things after a shuffle (which is sounds like we are) then we should have sufficient coverage

jrbourbeau · 2023-06-02T15:46:38Z

distributed/shuffle/_arrow.py

-    return pa.concat_tables(shards)
+    table = pa.concat_tables(shards)
+    df = table.to_pandas(self_destruct=True)
+    return df.astype(meta.dtypes)


My guess is it'd be good to avoid a cast is possible (at least for strings) via type_mapper= in pa.Table.to_pandas(). For example, as is, this will create object-backed string[python] columns first and then cast them to string[pyarrow].

This definitely isn't a blocker for this PR, but let's add a # TODO: comment if we don't include that logic now

jrbourbeau · 2023-06-02T15:49:17Z

distributed/shuffle/_merge.py

    left = ext.get_output_partition(
        shuffle_id_left, barrier_left, output_partition
-    ).drop(columns=_HASH_COLUMN_NAME)
+    ).drop(columns=_HASH_COLUMN_NAME, errors="ignore")


This seems fine to me. Thanks for adding the informative comment 👌

jrbourbeau · 2023-06-02T15:51:07Z

distributed/shuffle/_shuffle.py

@@ -100,13 +102,13 @@ def rearrange_by_column_p2p(
 ) -> DataFrame:
    from dask.dataframe import DataFrame

-    check_dtype_support(df._meta)
+    meta = df._meta.copy()


Why the copy here?

I see this is so we can reuse the same meta a few lines below. I'm now wondering why we need a copy there

I think we can skip the copy here as long as we have the other one in place.

jrbourbeau · 2023-06-02T15:55:32Z

distributed/shuffle/_worker_extension.py


            out = await self.offload(_)
        except KeyError:
-            out = self.schema.empty_table().to_pandas()
+            out = self.meta.copy()


Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      20 files ±0       20 suites ±0 11h 29m 14s ⏱️ - 18m 24s
  3 659 tests ±0   3 551 ✔️ +5   108 💤 ±0 0 ❌ - 5
35 380 runs ±0 33 614 ✔️ +4 1 766 💤 +1 0 ❌ - 5

Results for commit d2a69cd. ± Comparison against base commit 8301cb7.

jrbourbeau

Thanks @hendrikmakait!

hendrikmakait added 2 commits June 2, 2023 17:09

Fix string roundtrip in shuffle

ab17955

Fix roundtrip in merge

7a90f3e

hendrikmakait requested a review from fjetter as a code owner June 2, 2023 15:33

hendrikmakait commented Jun 2, 2023

View reviewed changes

hendrikmakait added needs review Needs review from a contributor. shuffle labels Jun 2, 2023

jrbourbeau reviewed Jun 2, 2023

View reviewed changes

Fix tests

57868dd

jrbourbeau removed the needs review Needs review from a contributor. label Jun 2, 2023

Fix copies and meta

d2a69cd

jrbourbeau approved these changes Jun 2, 2023

View reviewed changes

jrbourbeau merged commit 57639c1 into dask:main Jun 2, 2023

mrocklin mentioned this pull request Jun 2, 2023

p2p shuffled pandas data takes more memory dask/dask#10326

Closed

fjetter mentioned this pull request Sep 14, 2023

Categorical column turned to NaN after P2P Shuffle #8186

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enforce dtypes in P2P shuffle #7879

Enforce dtypes in P2P shuffle #7879

hendrikmakait commented Jun 2, 2023 •

edited

Loading

hendrikmakait commented Jun 2, 2023

hendrikmakait Jun 2, 2023

jrbourbeau Jun 2, 2023

hendrikmakait Jun 2, 2023

mrocklin Jun 2, 2023

phofl Jun 12, 2023

hendrikmakait Jun 2, 2023

jrbourbeau Jun 2, 2023

jrbourbeau left a comment

jrbourbeau Jun 2, 2023

hendrikmakait Jun 2, 2023

jrbourbeau Jun 2, 2023

jrbourbeau Jun 2, 2023

jrbourbeau Jun 2, 2023

jrbourbeau Jun 2, 2023

jrbourbeau Jun 2, 2023

hendrikmakait Jun 2, 2023

jrbourbeau Jun 2, 2023

hendrikmakait Jun 2, 2023

jrbourbeau Jun 2, 2023

jrbourbeau commented Jun 2, 2023

hendrikmakait commented Jun 2, 2023

hendrikmakait commented Jun 2, 2023

github-actions bot commented Jun 2, 2023

jrbourbeau left a comment

+                          f"col{next(counter)}": pd.array(
+                              ["lorem ipsum"] * 100,
+                              dtype="string[python]",

Enforce dtypes in P2P shuffle #7879

Enforce dtypes in P2P shuffle #7879

Conversation

hendrikmakait commented Jun 2, 2023 • edited Loading

hendrikmakait commented Jun 2, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jrbourbeau left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jrbourbeau commented Jun 2, 2023

hendrikmakait commented Jun 2, 2023

hendrikmakait commented Jun 2, 2023

github-actions bot commented Jun 2, 2023

Unit Test Results

jrbourbeau left a comment

Choose a reason for hiding this comment

hendrikmakait commented Jun 2, 2023 •

edited

Loading