
p2p shuffled pandas data takes more memory #10326

Closed
mrocklin opened this issue Jun 1, 2023 · 5 comments · Fixed by dask/distributed#7879
Labels: bug (Something is broken), needs triage (Needs a response from a contributor)

Comments

@mrocklin
Member

mrocklin commented Jun 1, 2023

I observe that after I call set_index on the uber-lyft data with p2p, my dataset takes up more memory than before. When I use tasks, it doesn't. cc @hendrikmakait

Reproducible (but not minimal) example:

import dask
from dask.distributed import wait
import dask.dataframe as dd

dask.config.set({"dataframe.convert-string": True})  # use PyArrow strings by default

df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
)

print(df.memory_usage(deep=True).sum().compute() / 1e9)  # about 100

df = df.set_index("request_datetime", shuffle="p2p").persist()

print(df.memory_usage(deep=True).sum().compute() / 1e9)  # about 200

If you try with shuffle="tasks" it doesn't expand that much. I haven't tried this without Arrow.

@github-actions github-actions bot added the needs triage Needs a response from a contributor label Jun 1, 2023
@mrocklin
Member Author

mrocklin commented Jun 1, 2023

Without Arrow involved, the memory does not expand (it's about 200 both before and after).

@hendrikmakait
Member

I haven't looked deeper into this, but from the description it sounds like this is caused by dask/distributed#7420.

@hendrikmakait hendrikmakait added the bug Something is broken label Jun 1, 2023
@mrocklin
Member Author

mrocklin commented Jun 1, 2023 via email

@jrbourbeau
Member

I think this has to do with when we convert pa.Table object to pandas objects with pa.Table.to_pandas(). pa.Table represents both string[python] and string[pyarrow] as pa.string() (which makes sense). By default when pandas creates a string column, it uses string[python], not string[pyarrow] (there's a mode.string_storage pandas option to control that default).

@hendrikmakait and I chatted offline about a possible fix. Right now we're just using a pyarrow schema to keep track of dtypes. I think we should also keep track of meta on the pandas side so we can handle the string case properly. My guess is we grab ._meta at the beginning of the shuffle process and pipe it down to where those pa.Table.to_pandas() calls happen. We can then do something like (pseudocode below):

if <only-a-single-string-type-present>:
    # `types_mapper` is the actual to_pandas keyword; it takes a
    # callable mapping pyarrow types to pandas extension dtypes
    df = table.to_pandas(types_mapper={pa.string(): <pandas-string-dtype>}.get)
else:
    # Mixed string type case: fall back to casting with the tracked meta
    df = table.to_pandas().astype(meta.dtypes.to_dict())

@mrocklin
Member Author

mrocklin commented Jun 2, 2023 via email

3 participants