
[Ray Data] fix problem: to_pandas failed on datasets returned by from_spark #32968

Merged: 10 commits merged into ray-project:master on Mar 28, 2023

Conversation

kira-lin (Contributor) commented on Mar 2, 2023

Why are these changes needed?

Related issue number

Closes #32967

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

kira-lin added 2 commits March 2, 2023 14:57
Signed-off-by: Zhi Lin <zhi.lin@intel.com>
Signed-off-by: Zhi Lin <zhi.lin@intel.com>
amogkam (Contributor) commented on Mar 2, 2023

@kira-lin Can RayDP save the blocks as PyArrow tables? Saving blocks as arbitrary bytes is generally not supported in Ray Datasets, I believe. cc @clarkzinzow @c21

else:
    self._tables.append(block)
    self._tables_size_bytes += accessor.size_bytes()
    self._num_rows += accessor.num_rows()
clarkzinzow (Contributor) commented on Mar 2, 2023


@kira-lin Why is this necessary? The existing TableBlockBuilder.add_block() logic should convert the bytes into an Arrow table (see the delegation and conversion code paths).

Is this an issue with the instance check in TableBlockBuilder.add_block()? If so, we can change the block type that the ArrowBlockBuilder constructor passes to its parent's constructor to a (pyarrow.Table, bytes) tuple, since isinstance() accepts a tuple of types, and TableBlockBuilder._block_type is only used for that instance check.
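
A minimal sketch of what that suggestion could look like (simplified stand-ins for illustration, not the actual Ray classes):

```python
import pyarrow


class TableBlockBuilder:
    """Simplified stand-in for Ray's TableBlockBuilder, for illustration only."""

    def __init__(self, block_type):
        # _block_type is only used for the isinstance() check below, so it can
        # be either a single type or a tuple of types.
        self._block_type = block_type
        self._tables = []

    def add_block(self, block):
        if not isinstance(block, self._block_type):
            raise TypeError(
                f"Got a block of type {type(block)}, expected {self._block_type}."
            )
        self._tables.append(block)


class ArrowBlockBuilder(TableBlockBuilder):
    def __init__(self):
        # The suggested change: accept both Arrow tables and serialized bytes.
        super().__init__((pyarrow.Table, bytes))
```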

kira-lin (Contributor, Author) replied

Oh, I see. I tried Union, but it didn't work. That alone is not enough to solve the problem, though; please see the update.
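
For reference, a small illustration of why a typing.Union likely did not work here while a plain tuple does (assuming the check runs on a Python version where isinstance() rejects subscripted generics):

```python
from typing import Union

import pyarrow

block = b"\x00"  # pretend this is a serialized Arrow block

# On the Python versions in question, isinstance() rejects typing.Union with
# "TypeError: Subscripted generics cannot be used with class and instance checks".
try:
    isinstance(block, Union[pyarrow.Table, bytes])
except TypeError as err:
    print(err)

# A plain tuple of types is the portable way to check against several types.
print(isinstance(block, (pyarrow.Table, bytes)))  # True
```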

Signed-off-by: Zhi Lin <zhi.lin@intel.com>
kira-lin (Contributor, Author) commented on Mar 3, 2023

@amogkam RayDP saves these blocks in Java, and the blocks are in Arrow stream format. But because of cross-language serialization, the type information is lost, and the blocks can only be saved as bytes.
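
For context, bytes in the Arrow IPC stream format can be turned back into a pyarrow.Table with PyArrow's IPC reader. A rough round-trip sketch (illustration only, not RayDP's actual code path):

```python
import pyarrow as pa

# Build a small table and serialize it to Arrow IPC stream bytes, roughly the
# shape of the blocks RayDP hands over from the Java side.
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
stream_bytes = sink.getvalue().to_pybytes()

# After crossing the language boundary the type information is gone (it's just
# bytes), but the payload can still be read back as an Arrow table.
restored = pa.ipc.open_stream(stream_bytes).read_all()
assert restored.equals(table)
```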

I submitted PR #20242 to add Arrow back as a serialization format a long time ago. Now that serialization 2.0 is not on track, maybe we can just update and merge it? What do you think, @jovany-wang @ericl?

jovany-wang (Contributor) commented
Hi @kira-lin, I think it's okay to merge #20242 as the short-term plan to unblock your issue.

And for the long-term plan, I don't know whether the pluggable serialization framework (serialization 2.0) is really needed in the community. @ericl, is it a high-priority item?

jovany-wang (Contributor) commented

@kira-lin Are you going to reopen #20242 or submit a new one? I can do the review for that.

kira-lin (Contributor, Author) commented on Mar 6, 2023

@jovany-wang I'll probably submit a new one. I need to finish the things I'm currently working on first, so that will be April at the earliest.

jovany-wang (Contributor) commented
@kira-lin Sounds good. Feel free to ping me if needed.

kira-lin (Contributor, Author) commented on Mar 6, 2023

@clarkzinzow The CI failure doesn't seem to be related to my PR, does it?

kira-lin (Contributor, Author) commented
Can we merge this first? @clarkzinzow

clarkzinzow (Contributor) left a comment

Looks good to merge after removing the special ArrowBlockAccessor case in the block builder!

Review thread on python/ray/data/_internal/delegating_block_builder.py (outdated, resolved)
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Signed-off-by: Zhi Lin <zhi.lin@intel.com>
clarkzinzow (Contributor) commented
@kira-lin Looks like lint is failing due to an unused import in delegating_block_builder.py: https://buildkite.com/ray-project/oss-ci-build-pr/builds/16356#0187209f-6d76-426c-9666-69d900445ec7

kira-lin (Contributor, Author) commented

> @kira-lin Looks like lint is failing due to an unused import in delegating_block_builder.py: https://buildkite.com/ray-project/oss-ci-build-pr/builds/16356#0187209f-6d76-426c-9666-69d900445ec7

Addressed

Signed-off-by: Zhi Lin <zhi.lin@intel.com>
clarkzinzow (Contributor) left a comment

Implementation and CI look good!

@clarkzinzow clarkzinzow merged commit 23c8012 into ray-project:master Mar 28, 2023
zhe-thoughts (Collaborator) commented

@clarkzinzow Quick note: we are in a feature freeze. Please tag me (and I will approve) before merging to master. Thanks!

elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…_spark (ray-project#32968)

Signed-off-by: Zhi Lin <zhi.lin@intel.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…_spark (ray-project#32968)

Signed-off-by: Zhi Lin <zhi.lin@intel.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
Labels: None yet
Projects: None yet
Development

Successfully merging this pull request may close these issues.

[data] to_pandas failed on datasets returned by from_spark
5 participants