-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Ray Data] fix problem: to_pandas failed on datasets returned by from_spark #32968
Conversation
Signed-off-by: Zhi Lin <zhi.lin@intel.com>
@kira-lin can raydp save the blocks as pyarrow tables? saving blocks as arbitrary bytes is generally not supported in Ray Datasets I believe. cc @clarkzinzow @c21 |
else: | ||
self._tables.append(block) | ||
self._tables_size_bytes += accessor.size_bytes() | ||
self._num_rows += accessor.num_rows() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kira-lin Why is this necessary? The existing TableBlockBuilder.add_block()
logic should convert the bytes into an Arrow table: delegation, conversion
Is this an issue with the instance check in TableBlockBuilder.add_block()
? If so, we can change the block type supplied by the ArrowBlockBuilder
constructor to its parent's constructor to a (pyarrow.Table, bytes)
tuple, since isinstance()
checks work with a tuple of types, and TableBlockBuilder._block_type
is only used for that instance check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I see. I tried Union, didn't work. but this alone is not enough to solve the problem. Please see the update.
@amogkam RayDP save these blocks in java, and these blocks are in arrow stream format. But due to cross language serialization, the type information is lost, and can only be saved as bytes. I submitted a PR #20242 to add arrow back as a serialization format long time ago. Now serialization 2.0 is not on track, maybe we can just update and merge it? What do you think @jovany-wang @ericl |
Signed-off-by: Zhi Lin <zhi.lin@intel.com>
@jovany-wang I'll probably submit a new one. I need to finish things I'm currently working on, so that'll be at least April |
@kira-lin Sounds good. Feel free to bother me if it's needed. |
@clarkzinzow The failure in CI seems to be not related to my PR, is it? |
can we merge this first? @clarkzinzow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to merge after removing the special ArrowBlockAccessor
case in the block builder!
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Signed-off-by: Zhi Lin <zhi.lin@intel.com>
@kira-lin Looks like lint is failing due to an unused import in |
Addressed |
Signed-off-by: Zhi Lin <zhi.lin@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation and CI looks good!
@clarkzinzow quick note, we are in feature freeze. Please tag me (and I will approve) before merging to |
…_spark (ray-project#32968) Signed-off-by: Zhi Lin <zhi.lin@intel.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Signed-off-by: elliottower <elliot@elliottower.com>
…_spark (ray-project#32968) Signed-off-by: Zhi Lin <zhi.lin@intel.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> Signed-off-by: Jack He <jackhe2345@gmail.com>
Why are these changes needed?
Related issue number
Closes #32967
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.