How to run this framework in a distributed cluster? #18
Comments
Thanks for your interest! We'll have a PR with a Ray-based implementation soon, @pan-x-c; an exploratory version based on Beam (to support Flink) will follow.
Thanks for your question! We chose Ray because it requires no changes to the implementations of most existing operators, namely Filters and Mappers. This allows a near-seamless migration of Data-Juicer to a distributed cluster and makes it easier to use and deploy. We do not currently use Spark, mainly because the Data-Juicer operator interface is incompatible with it, so almost all OPs would need additional development. Feel free to discuss or collaborate on a future Spark version!
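To illustrate the point about Filters and Mappers, here is a minimal sketch of why per-sample operators port easily to a map-style executor. It is not Data-Juicer's actual code: the class names `Mapper`, `Filter`, `LowercaseMapper`, `MinLengthFilter` and the helper `run` are hypothetical, and the standard library's `ThreadPoolExecutor` stands in for a distributed backend such as Ray so the example runs without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


class Mapper:
    """Hypothetical base class: transforms one sample into one sample."""

    def process(self, sample: dict) -> dict:
        raise NotImplementedError


class Filter:
    """Hypothetical base class: decides whether one sample is kept."""

    def process(self, sample: dict) -> bool:
        raise NotImplementedError


class LowercaseMapper(Mapper):
    def process(self, sample: dict) -> dict:
        # Return a new dict so the input sample is not mutated.
        return {**sample, "text": sample["text"].lower()}


class MinLengthFilter(Filter):
    def __init__(self, min_len: int):
        self.min_len = min_len

    def process(self, sample: dict) -> bool:
        return len(sample["text"]) >= self.min_len


def run(samples, ops, map_fn=map):
    """Apply each op in order; map_fn decides where the work executes."""
    data = list(samples)
    for op in ops:
        out = list(map_fn(op.process, data))
        if isinstance(op, Filter):
            # Keep only the samples whose filter result is True.
            data = [s for s, keep in zip(data, out) if keep]
        else:
            data = out
    return data


samples = [{"text": "Hello World"}, {"text": "Hi"}]
ops = [LowercaseMapper(), MinLengthFilter(5)]

# Same operators, two executors: the built-in map and a thread pool.
local = run(samples, ops)
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = run(samples, ops, map_fn=pool.map)

print(local)
print(parallel == local)
```

Because each operator only sees one sample at a time, swapping the executor does not touch the operator code. Operators that need a global view of the dataset, such as Deduplicators, do not fit this pattern as directly.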
The first Ray version has now been merged into master. It supports some Formatters and all Filters and Mappers (#21). We are working on supporting the remaining OPs (Deduplicators and the other Formatters) and on more experimental Beam versions. Feel free to reopen this issue for further discussion and suggestions!
Spark? Slurm?