
p2p shuffled pandas data takes more memory #10326

Closed
mrocklin opened this issue Jun 1, 2023 · 5 comments · Fixed by dask/distributed#7879
Labels: bug (Something is broken), needs triage (Needs a response from a contributor)

Comments

@mrocklin
Member

mrocklin commented Jun 1, 2023

I observe that after I call set_index on the uber-lyft data with p2p, my dataset takes up more memory than before. When I use tasks, it doesn't. cc @hendrikmakait

Reproducible (but not minimal) example:

import dask
from dask.distributed import wait
import dask.dataframe as dd

dask.config.set({"dataframe.convert-string": True})  # use PyArrow strings by default

df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
)

print(df.memory_usage(deep=True).sum().compute() / 1e9)  # about 100

df = df.set_index("request_datetime", shuffle="p2p").persist()

print(df.memory_usage(deep=True).sum().compute() / 1e9)  # about 200

If you try with shuffle="tasks" it doesn't expand that much. I haven't tried this without Arrow.

@github-actions github-actions bot added the needs triage Needs a response from a contributor label Jun 1, 2023
@mrocklin
Member Author

mrocklin commented Jun 1, 2023

Without Arrow involved, the memory does not expand (it's about 200 both before and after).

@hendrikmakait
Member

I haven't looked deeper into this, but from the description it sounds like this is caused by dask/distributed#7420.

@hendrikmakait hendrikmakait added the bug Something is broken label Jun 1, 2023
@mrocklin
Member Author

mrocklin commented Jun 1, 2023 via email

@jrbourbeau
Member

I think this has to do with when we convert pa.Table object to pandas objects with pa.Table.to_pandas(). pa.Table represents both string[python] and string[pyarrow] as pa.string() (which makes sense). By default when pandas creates a string column, it uses string[python], not string[pyarrow] (there's a mode.string_storage pandas option to control that default).

@hendrikmakait and I chatted offline about a possible fix. Right now we're just using a pyarrow schema to keep track of dtypes. I think we should also keep track of meta on the pandas side so we can handle the string case properly. My guess is we grab ._meta at the beginning of the shuffle process and pipe it down to where those pa.Table.to_pandas() calls happen. We can then do something like (pseudocode below):

if <only-a-single-string-type-present>:
    # `types_mapper` is the actual to_pandas keyword; it takes a
    # callable mapping pyarrow types to pandas extension dtypes
    df = table.to_pandas(types_mapper={pa.string(): <pandas-string-dtype>}.get)
else:
    # Mixed string type case: fall back to casting with the tracked meta
    df = table.to_pandas().astype(meta.dtypes.to_dict())

@mrocklin
Member Author

mrocklin commented Jun 2, 2023 via email

3 participants