[GSOC] Tracking Issue: Integrate JAX in Kubeflow Training Operator #2145

Open
2 of 12 tasks
sandipanpanda opened this issue Jun 13, 2024 · 0 comments

sandipanpanda (Contributor) commented Jun 13, 2024

Milestone: Integrate JAX in Kubeflow Training Operator

Description

This milestone tracks the integration of JAX into the Kubeflow Training Operator to enable distributed training and fine-tuning jobs on Kubernetes. The work relies on JAX's jax.distributed.initialize API to set up the training processes and on the Kubernetes JobSet API to manage the job lifecycle.
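As a rough illustration of the JAX side of this, the sketch below shows how a worker process might call jax.distributed.initialize from values passed in through environment variables. The variable names (COORDINATOR_ADDRESS, COORDINATOR_PORT, NUM_PROCESSES, PROCESS_ID), the default port, and the helper names are assumptions made for this example, not something this issue or the operator defines.

```python
import os
from typing import Mapping, NamedTuple


class DistributedConfig(NamedTuple):
    coordinator_address: str  # "host:port" of the coordinator (process 0)
    num_processes: int        # total number of JAX processes in the job
    process_id: int           # rank of this process, 0 <= id < num_processes


def config_from_env(env: Mapping[str, str]) -> DistributedConfig:
    """Parse hypothetical variables that an operator could inject into each pod."""
    host = env["COORDINATOR_ADDRESS"]
    port = env.get("COORDINATOR_PORT", "6666")  # assumed default port
    cfg = DistributedConfig(
        coordinator_address=f"{host}:{port}",
        num_processes=int(env["NUM_PROCESSES"]),
        process_id=int(env["PROCESS_ID"]),
    )
    if not 0 <= cfg.process_id < cfg.num_processes:
        raise ValueError(
            f"PROCESS_ID {cfg.process_id} is outside [0, {cfg.num_processes})"
        )
    return cfg


def init_distributed(env: Mapping[str, str] = os.environ) -> None:
    """Connect this process to the other workers before any JAX computation."""
    import jax  # imported lazily so config parsing works without JAX installed

    cfg = config_from_env(env)
    jax.distributed.initialize(
        coordinator_address=cfg.coordinator_address,
        num_processes=cfg.num_processes,
        process_id=cfg.process_id,
    )
```

A training script would call init_distributed() once at startup, before touching any devices. Keeping the parsing separate from the JAX call makes the configuration logic testable without a cluster.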

Checklist

  • Review JAX documentation and distributed training requirements.
  • Review Kubeflow Training Operator and JobSet API documentation.
  • Draft the initial design document for JAX integration (JAX Integration Enhancement Proposal #2125).
  • Create a new Custom Resource Definition (CRD) for JAX jobs (e.g., JaxJob).
  • Update the Kubeflow Training Operator to manage JaxJob resources.
  • Implement webhook validations for JaxJob resources.
  • Implement a mechanism to initialize and manage JAX distributed training processes using jax.distributed.initialize.
  • Extend the Training Operator Python SDK to simplify the creation and management of JaxJob resources.
  • Update Kubeflow Training Operator documentation to include instructions for running JAX jobs.
  • Provide example configurations and training scripts.
  • Ensure backward compatibility with existing Kubeflow components.
  • Release the updated Kubeflow Training Operator with JAX support.
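
To make the CRD, validation, and SDK items more concrete, here is a minimal sketch of what building and checking a JaxJob manifest could look like. Everything specific in it is an assumption for illustration: the kubeflow.org/v1 API version, the jaxReplicaSpecs field (modeled on the replica-spec layout of existing Training Operator job kinds), the "jax" container name, and both helper functions. The operator's actual webhook would not be written in Python; the rules are only shown here in compact form.

```python
from typing import Any, Dict, List


def build_jaxjob_manifest(name: str, image: str, num_workers: int) -> Dict[str, Any]:
    """Build a hypothetical JaxJob manifest as a plain dict."""
    return {
        "apiVersion": "kubeflow.org/v1",  # assumed API group/version
        "kind": "JaxJob",                 # example kind name from the checklist
        "metadata": {"name": name},
        "spec": {
            "jaxReplicaSpecs": {          # hypothetical field name
                "Worker": {
                    "replicas": num_workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {"containers": [{"name": "jax", "image": image}]}
                    },
                }
            }
        },
    }


def validate_jaxjob(manifest: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the manifest passed."""
    errors: List[str] = []
    if manifest.get("kind") != "JaxJob":
        errors.append("kind must be JaxJob")
    worker = manifest.get("spec", {}).get("jaxReplicaSpecs", {}).get("Worker")
    if worker is None:
        errors.append("spec.jaxReplicaSpecs.Worker is required")
        return errors
    replicas = worker.get("replicas")
    if not isinstance(replicas, int) or replicas < 1:
        errors.append("Worker.replicas must be a positive integer")
    containers = worker.get("template", {}).get("spec", {}).get("containers", [])
    if not containers:
        errors.append("Worker template needs at least one container")
    for container in containers:
        if not container.get("image"):
            errors.append(f"container {container.get('name', '<unnamed>')!r} has no image")
    return errors
```

For example, validate_jaxjob(build_jaxjob_manifest("demo", "example.com/jax:latest", 2)) returns an empty list, while setting replicas to 0 produces a single error about the replica count.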

Milestone Due Date

TBD

Assignees

@sandipanpanda

/area gsoc

sandipanpanda changed the title from "Tracking Issue: Integrate JAX in Kubeflow Training Operator" to "[GSOC] Tracking Issue: Integrate JAX in Kubeflow Training Operator" on Jul 10, 2024