p2p shuffled pandas data takes more memory #10326
Comments
Without arrow involved the memory does not expand (it's about 200 both before and after) |
I haven't looked deeper into this, but from the description it sounds like this is caused by dask/distributed#7420 |
Ah yes, that would do it.
…On Thu, Jun 1, 2023, 1:04 PM Hendrik Makait ***@***.***> wrote:
I haven't looked deeper into this, but from the description it sounds like
this is caused by dask/distributed#7420
|
I think this has to do with when we convert
@hendrikmakait and I chatted offline about a possible fix. Right now we're just using a pyarrow schema to keep track of dtypes. I think we should also keep track of

    if <only-a-single-string-type-present>:
        df = table.to_pandas(types_mapper={pa.string(): <pandas-string-type>}.get)
    else:
        # Mixed string type case
        df = table.to_pandas().astype(meta.dtypes.to_dict())
|
Thanks for resolving this quickly. It's nice seeing things happen in a day.
…On Fri, Jun 2, 2023 at 7:16 PM James Bourbeau ***@***.***> wrote:
Closed #10326 as completed via dask/distributed#7879.
|
I observe that after I call `set_index` on the uber-lyft data with `p2p`, my dataset takes up more memory than before. When I use `tasks`, it doesn't. cc @hendrikmakait
Reproducible (but not minimal) example:
If you try with `shuffle="tasks"` it doesn't expand that much. I haven't tried this without Arrow.