Support MLX on Kubernetes with Kubeflow #2047
Comments
Interesting. Does MLX support multi-node training?
Not yet. We are working on it. It probably makes sense to follow up on this once we have some basic support there.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
It looks like distributed communication with MLX uses MPI: https://ml-explore.github.io/mlx/build/html/usage/distributed.html#getting-started. Once we have implemented the Kubeflow Training V2 APIs, we can make MLX work with MPI.
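Per the MLX distributed documentation linked above, MPI-backed MLX programs are launched with `mpirun`. As a rough sketch (the script name and process count here are illustrative, not from the docs):

```shell
# Launch 4 MLX processes that communicate over MPI.
# train.py is a hypothetical MLX training script.
mpirun -np 4 -- python train.py
```

Inside the script, MLX's distributed API (e.g. `mx.distributed.init()`) picks up the MPI world, which is the piece a Kubeflow MPI integration would need to wire up across pods.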
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
MLX is a new ML framework specifically designed to run on Apple silicon: https://github.com/ml-explore/mlx
It has some differences compared to PyTorch with the `mps` backend: ml-explore/mlx#12 (comment). It would be nice to integrate MLX into the Kubeflow ecosystem for distributed capabilities and to provide a way to run MLX models on Kubernetes.
For example, we can leverage the Kubeflow Training Operator for MLX model training and fine-tuning, and Kubeflow Katib for hyperparameter optimization.
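To make the Training Operator idea concrete, here is a hedged sketch of what an MPIJob manifest for MLX training might look like. The image name, script path, and replica counts are assumptions for illustration (there is no published MLX image), and MLX itself requires Apple silicon nodes:

```yaml
# Hypothetical MPIJob for MLX training via the Kubeflow MPI Operator.
# Image and script are placeholders, not published artifacts.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mlx-train
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: example.com/mlx-train:latest   # placeholder image
              command: ["mpirun", "-np", "2", "python", "/app/train.py"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: worker
              image: example.com/mlx-train:latest   # placeholder image
```

The launcher runs `mpirun` against the worker pods, which is the same launch pattern the MLX distributed docs describe for a single host.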
Since Kind clusters support the ARM architecture, we should explore whether we can use M-series GPUs for MLX model training with Kind in the future.
In addition to that, I have seen examples of folks running Kubernetes across multiple VMs on macOS machines with `kubeadm`. That might be useful when a single machine can't handle a very large ML model.
cc @kubeflow/wg-training-leads @awni