Deprecate and remove from_legacy_dataframe usage #1168
Conversation
Thanks @phofl
)
def test_dataframe_mode_split_every(pdf, df, split_every, expect_tasks):
    assert_eq(df.to_legacy_dataframe().mode(split_every=split_every), pdf.mode())
It's unclear to me why to_legacy_dataframe was needed here originally. Maybe some sort of workaround that's no longer needed?
To be clear, this is totally fine, just a non-blocking, not very important question
No clue, I was as confused as you are about this. Probably a mistake when adding the test at some point
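For context, a minimal sketch of how the assertion presumably reads once the conversion is dropped (an assumption, not the PR's exact diff; pdf, df, split_every, and expect_tasks are the test's existing fixtures/parameters):

# Hypothetical sketch: the same check without the legacy conversion, assuming
# assert_eq can compare the dask-expr collection to pandas directly.
from dask.dataframe.utils import assert_eq

def test_dataframe_mode_split_every(pdf, df, split_every, expect_tasks):
    assert_eq(df.mode(split_every=split_every), pdf.mode())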
@@ -114,7 +113,6 @@ def read_json(
        path_converter=path_converter,
        **kwargs,
    )
-    return from_legacy_dataframe(df)
Nice. It looks like we can do this now since the changes over in dask/dask#11529 make some methods (e.g. dd.read_json) more dask-expr aware.
Yep
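As a rough illustration of that point (hypothetical file pattern, not from this PR): if dd.read_json already returns a query-planning collection, the wrapper can simply be dropped.

# Hypothetical illustration: dd.read_json is assumed to return a dask-expr
# (query-planning) collection directly after dask/dask#11529, so wrapping the
# result in from_legacy_dataframe is no longer needed.
import dask.dataframe as dd

df = dd.read_json("records-*.json")  # hypothetical path pattern
print(type(df))  # expected to be the query-planning DataFrame collection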
@@ -172,7 +170,7 @@ def to_json(
    from dask.dataframe.io.json import to_json

    return to_json(
-        df.to_legacy_dataframe(),
+        df,
Over in orc.py this is df.optimize() -- should we do something similar here? Or is the .optimize() maybe not needed over there?
This calls to_delayed, which triggers the optimization. I'd rather collect them in a single place
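A small sketch of that reasoning (an assumption based on the comment above, not on the dask-expr source):

# Hypothetical sketch: converting the collection to delayed partitions is
# expected to optimize the expression graph as part of to_delayed, so an
# explicit df.optimize() beforehand would be redundant.
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({"a": range(10)}), npartitions=2)
parts = df.to_delayed()  # optimization assumed to happen inside this call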
        if lengths is True:
            lengths = tuple(self.map_partitions(len, enforce_metadata=False).compute())

        arr = self.values

        chunks = self._validate_chunks(arr, lengths)
        arr._chunks = chunks

        if meta is not None:
            arr._meta = meta

        return arr
Just noting that this is the same implementation as over in dask/dask
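For reference, a small usage sketch of the code path above (hypothetical data; passing lengths=True computes per-partition sizes so the resulting array gets concrete chunks):

# Hypothetical usage: lengths=True triggers the map_partitions(len) pass above,
# giving the dask array concrete chunk sizes instead of unknown (nan) chunks.
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"a": range(10)}), npartitions=3)
arr = ddf.to_dask_array(lengths=True)
print(arr.chunks)  # e.g. ((4, 3, 3), (1,)) rather than ((nan, nan, nan), (1,))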
        if is_extension_array_dtype(self._meta.values):
            warnings.warn(
                "Dask currently has limited support for converting pandas extension dtypes "
                f"to arrays. Converting {self._meta.values.dtype} to object dtype.",
                UserWarning,
            )
        return self.map_partitions(methods.values)
Same here
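A quick illustration of when that branch would fire (hypothetical frame, assuming a nullable Int64 column):

# Hypothetical illustration: accessing .values on an extension-dtype column is
# expected to hit the branch above, warn, and fall back to object dtype.
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": pd.array([1, 2, 3], dtype="Int64")})
ddf = dd.from_pandas(pdf, npartitions=1)
arr = ddf["a"].values  # should emit the UserWarning described above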
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
goes with dask/dask#11529