Support MLX on Kubernetes with Kubeflow #2047
Comments
Interesting. Does MLX support multi-node training?
Not yet. We are working on it. It probably makes sense to follow up on this once we have some basic support there.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
It looks like distributed communication with MLX uses MPI: https://ml-explore.github.io/mlx/build/html/usage/distributed.html#getting-started. Once we have implemented the Kubeflow Training V2 APIs, we can make MLX work with MPI.
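Per the MLX distributed documentation linked above, MPI-backed MLX programs are launched with `mpirun`. As a rough sketch (the script name and process count here are illustrative, not from the docs):

```shell
# Launch 4 MLX processes that communicate over MPI.
# train.py is a hypothetical MLX training script.
mpirun -np 4 -- python train.py
```

Inside the script, MLX's distributed API (e.g. `mx.distributed.init()`) picks up the MPI world, which is the piece a Kubeflow MPI integration would need to wire up across pods.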
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
MLX is a new ML framework specifically designed to run on Apple silicon: https://github.com/ml-explore/mlx
It has some differences compared to PyTorch with the `mps` backend: ml-explore/mlx#12 (comment). It would be nice to integrate MLX into the Kubeflow ecosystem for distributed capabilities and to provide a way to run MLX models on Kubernetes.
For example, we can leverage the Kubeflow Training Operator for MLX model training and fine-tuning, and Kubeflow Katib for hyperparameter optimization.
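To make the Training Operator idea concrete, here is a hedged sketch of what an MPIJob manifest for MLX training might look like. The image name, script path, and replica counts are assumptions for illustration (there is no published MLX image), and MLX itself requires Apple silicon nodes:

```yaml
# Hypothetical MPIJob for MLX training via the Kubeflow MPI Operator.
# Image and script are placeholders, not published artifacts.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mlx-train
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: example.com/mlx-train:latest   # placeholder image
              command: ["mpirun", "-np", "2", "python", "/app/train.py"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: worker
              image: example.com/mlx-train:latest   # placeholder image
```

The launcher runs `mpirun` against the worker pods, which is the same launch pattern the MLX distributed docs describe for a single host.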
Since Kind clusters support the ARM architecture, we should explore whether we can use M-series GPUs for MLX model training with Kind in the future.
In addition to that, I have seen examples of folks running Kubernetes across multiple VMs on macOS machines with `kubeadm`. That might be useful when a single machine can't handle a very large ML model.
cc @kubeflow/wg-training-leads @awni