How to run this framework in a distributed cluster? #18

yang1fan2 · 2023-09-13T02:28:59Z

Spark? Slurm?

yxdyc · 2023-09-13T06:54:19Z

Thanks for your interest! A PR with a Ray-based implementation is coming soon (@pan-x-c); an exploratory version based on Beam (to support Flink) will follow.

yang1fan2 · 2023-09-13T20:50:30Z

@yxdyc @pan-x-c Thanks for the quick response. A quick, naive question: what are the advantages of using Ray over Spark?

pan-x-c · 2023-09-14T04:01:33Z

Thanks for your question!

We chose Ray because it requires no changes to the implementations of most existing operators, namely Filters and Mappers. This makes migrating Data-Juicer to a distributed cluster nearly seamless and keeps it easy to use and deploy.

We currently do not use Spark, mainly because the Data-Juicer operator interface is incompatible with it, so almost all OPs would need additional development. Feel free to discuss or collaborate on a future Spark version!
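
To make the point about unmodified operators concrete, here is a minimal sketch, not Data-Juicer's actual code. The class names (`LowercaseMapper`, `MinLengthFilter`, `LocalDataset`) and the `process` method are hypothetical stand-ins. The only assumption about the engine is that it exposes row-wise `map` and `filter`; under that assumption, per-sample operators can be passed through as plain callables without being rewritten.

```python
class LowercaseMapper:
    """Hypothetical Mapper: transforms one sample and returns the new sample."""

    def process(self, sample: dict) -> dict:
        return {**sample, "text": sample["text"].lower()}


class MinLengthFilter:
    """Hypothetical Filter: keeps a sample only if its text is long enough."""

    def __init__(self, min_len: int) -> None:
        self.min_len = min_len

    def process(self, sample: dict) -> bool:
        return len(sample["text"]) >= self.min_len


class LocalDataset:
    """Single-process stand-in for an engine with row-wise map/filter."""

    def __init__(self, rows) -> None:
        self.rows = list(rows)

    def map(self, fn) -> "LocalDataset":
        return LocalDataset(fn(row) for row in self.rows)

    def filter(self, fn) -> "LocalDataset":
        return LocalDataset(row for row in self.rows if fn(row))

    def take_all(self) -> list:
        return list(self.rows)


def run_pipeline(dataset, mappers, filters):
    """Apply operators through the dataset's own map/filter methods.

    The operators are handed over as plain per-sample callables, so the
    same operator objects work with any dataset type offering these methods.
    """
    for op in mappers:
        dataset = dataset.map(op.process)
    for op in filters:
        dataset = dataset.filter(op.process)
    return dataset


rows = [{"text": "Hello Ray"}, {"text": "Hi"}]
ops_result = run_pipeline(LocalDataset(rows), [LowercaseMapper()], [MinLengthFilter(5)])
print(ops_result.take_all())
```

Assuming Ray is installed, `ray.data.from_items(rows)` could likely be swapped in for `LocalDataset(rows)`, since Ray Data's `Dataset` also offers row-wise `map`, `filter`, and `take_all`. How the merged PR actually wires operators into Ray may differ from this sketch.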

yxdyc · 2023-09-18T09:41:42Z

The first Ray version has now been merged into master (#21). It supports some Formatters and all Filters and Mappers; we are working on supporting the remaining OPs (Deduplicators and the other Formatters), as well as more experimental Beam versions. Feel free to reopen this issue for further discussion and suggestions!

yxdyc assigned pan-x-c Sep 13, 2023

pan-x-c mentioned this issue Sep 15, 2023

Distributed data processing with Ray #21

Merged

pan-x-c linked a pull request Sep 15, 2023 that will close this issue

yxdyc closed this as completed in #21 Sep 18, 2023

HYLcool added the question Further information is requested label Oct 26, 2023
