
[Ray Data] fix problem: to_pandas failed on datasets returned by from_spark #32968

Merged: 10 commits merged into ray-project:master on Mar 28, 2023

Conversation

kira-lin (Contributor) commented on Mar 2, 2023

Why are these changes needed?

Related issue number

Closes #32967

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

kira-lin added 2 commits March 2, 2023 14:57
Signed-off-by: Zhi Lin <zhi.lin@intel.com>
Signed-off-by: Zhi Lin <zhi.lin@intel.com>
amogkam (Contributor) commented on Mar 2, 2023

@kira-lin Can RayDP save the blocks as PyArrow tables? Saving blocks as arbitrary bytes is generally not supported in Ray Datasets, I believe. cc @clarkzinzow @c21

else:
    self._tables.append(block)
    self._tables_size_bytes += accessor.size_bytes()
    self._num_rows += accessor.num_rows()
clarkzinzow (Contributor) commented on Mar 2, 2023


@kira-lin Why is this necessary? The existing TableBlockBuilder.add_block() logic should convert the bytes into an Arrow table (see the delegation and conversion code paths).

Is this an issue with the instance check in TableBlockBuilder.add_block()? If so, we can change the block type that the ArrowBlockBuilder constructor passes to its parent's constructor to a (pyarrow.Table, bytes) tuple, since isinstance() accepts a tuple of types, and TableBlockBuilder._block_type is only used for that instance check.
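
A minimal sketch of what that suggestion could look like (simplified stand-ins for illustration, not the actual Ray classes):

```python
import pyarrow


class TableBlockBuilder:
    """Simplified stand-in for Ray's TableBlockBuilder, for illustration only."""

    def __init__(self, block_type):
        # _block_type is only used for the isinstance() check below, so it can
        # be either a single type or a tuple of types.
        self._block_type = block_type
        self._tables = []

    def add_block(self, block):
        if not isinstance(block, self._block_type):
            raise TypeError(
                f"Got a block of type {type(block)}, expected {self._block_type}."
            )
        self._tables.append(block)


class ArrowBlockBuilder(TableBlockBuilder):
    def __init__(self):
        # The suggested change: accept both Arrow tables and serialized bytes.
        super().__init__((pyarrow.Table, bytes))
```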

kira-lin (Contributor, Author) replied

Oh, I see. I tried Union, but it didn't work. That alone is not enough to solve the problem, though; please see the update.
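
For reference, a small illustration of why a typing.Union likely did not work here while a plain tuple does (assuming the check runs on a Python version where isinstance() rejects subscripted generics):

```python
from typing import Union

import pyarrow

block = b"\x00"  # pretend this is a serialized Arrow block

# On the Python versions in question, isinstance() rejects typing.Union with
# "TypeError: Subscripted generics cannot be used with class and instance checks".
try:
    isinstance(block, Union[pyarrow.Table, bytes])
except TypeError as err:
    print(err)

# A plain tuple of types is the portable way to check against several types.
print(isinstance(block, (pyarrow.Table, bytes)))  # True
```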

Signed-off-by: Zhi Lin <zhi.lin@intel.com>
kira-lin (Contributor, Author) commented on Mar 3, 2023

@amogkam RayDP saves these blocks in Java, and the blocks are in Arrow stream format. But because of cross-language serialization, the type information is lost, and the blocks can only be saved as bytes.
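
For context, bytes in the Arrow IPC stream format can be turned back into a pyarrow.Table with PyArrow's IPC reader. A rough round-trip sketch (illustration only, not RayDP's actual code path):

```python
import pyarrow as pa

# Build a small table and serialize it to Arrow IPC stream bytes, roughly the
# shape of the blocks RayDP hands over from the Java side.
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
stream_bytes = sink.getvalue().to_pybytes()

# After crossing the language boundary the type information is gone (it's just
# bytes), but the payload can still be read back as an Arrow table.
restored = pa.ipc.open_stream(stream_bytes).read_all()
assert restored.equals(table)
```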

I submitted PR #20242 to add Arrow back as a serialization format a long time ago. Now that serialization 2.0 is not on track, maybe we can just update and merge it? What do you think, @jovany-wang @ericl?

jovany-wang (Contributor) commented
Hi @kira-lin, I think it's okay to merge #20242 as the short-term plan to unblock your issue.

And for the long-term plan, I don't know whether the pluggable serialization framework (serialization 2.0) is really needed in the community. @ericl, is it a high-priority item?

jovany-wang (Contributor) commented

@kira-lin Are you going to reopen #20242 or submit a new one? I can do the review for that.

kira-lin (Contributor, Author) commented on Mar 6, 2023

@jovany-wang I'll probably submit a new one. I need to finish the things I'm currently working on first, so that will be April at the earliest.

jovany-wang (Contributor) commented
@kira-lin Sounds good. Feel free to ping me if needed.

kira-lin (Contributor, Author) commented on Mar 6, 2023

@clarkzinzow The CI failure doesn't seem to be related to my PR, does it?

kira-lin (Contributor, Author) commented
Can we merge this first? @clarkzinzow

clarkzinzow (Contributor) left a comment

Looks good to merge after removing the special ArrowBlockAccessor case in the block builder!

Review thread on python/ray/data/_internal/delegating_block_builder.py (outdated, resolved)
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Signed-off-by: Zhi Lin <zhi.lin@intel.com>
clarkzinzow (Contributor) commented
@kira-lin Looks like lint is failing due to an unused import in delegating_block_builder.py: https://buildkite.com/ray-project/oss-ci-build-pr/builds/16356#0187209f-6d76-426c-9666-69d900445ec7

kira-lin (Contributor, Author) commented

> @kira-lin Looks like lint is failing due to an unused import in delegating_block_builder.py: https://buildkite.com/ray-project/oss-ci-build-pr/builds/16356#0187209f-6d76-426c-9666-69d900445ec7

Addressed

Signed-off-by: Zhi Lin <zhi.lin@intel.com>
clarkzinzow (Contributor) left a comment

Implementation and CI look good!

@clarkzinzow clarkzinzow merged commit 23c8012 into ray-project:master Mar 28, 2023
zhe-thoughts (Collaborator) commented

@clarkzinzow Quick note: we are in a feature freeze. Please tag me (and I will approve) before merging to master. Thanks!

elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…_spark (ray-project#32968)

Signed-off-by: Zhi Lin <zhi.lin@intel.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…_spark (ray-project#32968)

Signed-off-by: Zhi Lin <zhi.lin@intel.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
Labels: None yet
Projects: None yet
Development

Successfully merging this pull request may close these issues.

[data] to_pandas failed on datasets returned by from_spark
5 participants