How to run this framework in a distributed cluster? #18

Closed
yang1fan2 opened this issue Sep 13, 2023 · 4 comments · Fixed by #21

Comments

@yang1fan2

Spark? Slurm?

@yxdyc
Collaborator

yxdyc commented Sep 13, 2023

Thanks for your interest! We'll have a PR with a Ray-based implementation soon (@pan-x-c); an exploratory version based on Beam (to support Flink) will follow.

@yang1fan2
Author

@yxdyc @pan-x-c Thanks for the quick response. A quick, naive question: what are the advantages of using Ray over Spark?

@pan-x-c
Collaborator

pan-x-c commented Sep 14, 2023

Thanks for your question!

We opted for Ray because it requires no changes to the implementations of most existing operators, namely Filters and Mappers. This allows a near-seamless migration of Data-Juicer to a distributed cluster, which makes Data-Juicer easier to use and deploy.
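
To illustrate the point about Filters and Mappers running unchanged, here is a minimal sketch. It assumes operators are plain per-sample callables; `LowercaseMapper`, `MinLengthFilter`, `run_local`, and `run_on_ray` are hypothetical names for illustration and are not Data-Juicer's actual classes or the code from the PR. The same operator objects are passed to a local loop and to Ray Data's `map` and `filter`.

```python
# Hypothetical stand-ins for per-sample operators; NOT Data-Juicer's real classes.
class LowercaseMapper:
    """Mapper: takes one sample and returns a transformed copy."""

    def process(self, sample: dict) -> dict:
        return {**sample, "text": sample["text"].lower()}


class MinLengthFilter:
    """Filter: takes one sample and returns True if it should be kept."""

    def __init__(self, min_len: int):
        self.min_len = min_len

    def keep(self, sample: dict) -> bool:
        return len(sample["text"]) >= self.min_len


def run_local(samples, mapper, flt):
    """Single-process execution: a plain loop over the samples."""
    mapped = [mapper.process(s) for s in samples]
    return [s for s in mapped if flt.keep(s)]


def run_on_ray(samples, mapper, flt):
    """Distributed execution: the same operator methods, handed to Ray Data."""
    import ray.data  # imported lazily; needs `pip install "ray[data]"`

    ds = ray.data.from_items(samples)             # one row per sample dict
    ds = ds.map(mapper.process).filter(flt.keep)  # operators are unchanged
    return ds.take_all()


samples = [{"text": "Hello World"}, {"text": "Hi"}]
result = run_local(samples, LowercaseMapper(), MinLengthFilter(5))
print(result)  # [{'text': 'hello world'}]
```

Calling `run_on_ray(samples, LowercaseMapper(), MinLengthFilter(5))` on a machine with Ray installed should apply the same two operators through Ray Data; only the execution wrapper differs, not the operators.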

We currently do not use Spark, primarily because the Data-Juicer operator interface is incompatible with it, so almost all OPs would need additional development. Feel free to discuss or collaborate on a future Spark version!

@pan-x-c pan-x-c linked a pull request Sep 15, 2023 that will close this issue
@yxdyc yxdyc closed this as completed in #21 Sep 18, 2023
@yxdyc
Collaborator

yxdyc commented Sep 18, 2023

The first Ray version has now been merged into master (#21); it supports some Formatters and all Filters and Mappers. We are working on supporting the remaining OPs (Deduplicators and the other Formatters) and on more experimental Beam versions. Feel free to reopen this issue for further discussion and suggestions!

@HYLcool HYLcool added the question Further information is requested label Oct 26, 2023