feat(framework): Support JAX #1619

gaocegege · 2022-06-22T07:35:19Z

JAX becomes extremely popular these days. Users may expect to run JAX distributed training jobs on Kubernetes with the help of training-operator.

JAX uses a “multi-controller” programming model where each JAX Python process runs independently, sometimes referred to as a Single Program, Multiple Data (SPMD) model. I think it is not hard to support from the operator's perspective.

Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

zw0610 · 2022-06-23T01:15:00Z

Any solid examples showing how a multi-host JAX job runs? Especially the host registering part.

kuizhiqing · 2022-06-23T05:32:15Z

Well, to launch a distributed training with Jax, jax.distributed.initialize api should be used.
A method to implement this would be like this, training operator provide related environ for each container and the user script should handle them as below. src.

coordinator_address= os.environ.get('JAX_COORDINATOR_ADDRESS', None)  # defined internal
num_processes = int(os.environ.get('JAX_NUM_PROCESSES', 1))  # world size
process_id = int(os.environ.get('JAX_PROCESS_ID', 0)) # rank

jax.distributed.initialize(coordinator_address=coordinator_address, num_processes=num_processes, process_id=process_id)

Anyway, I'm not aware of any mature practice of this in production.

andreyvelich · 2023-07-20T16:29:37Z

/help

google-oss-prow · 2023-07-20T16:29:39Z

@andreyvelich:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

github-actions · 2023-10-18T20:01:44Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y · 2023-10-19T16:41:51Z

/lifecycle frozen

andreyvelich · 2023-11-28T17:39:28Z

/assign @Davidnet

For anyone interested in Jax support for Training Operator, please join our AutoML and Training WG Call on November 29th 5:00pm UTC:
https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#heading=h.yvypq06ot57p

We are going to discuss how we can move forward with the Jax support.

tenzen-y · 2023-11-29T21:17:12Z

cc: @mimowo

Michal may be interested in this JAX integration.

yzhao-2023 · 2023-12-22T11:07:22Z

We'll be interested in supporting JAX as well.
Would be interested in contributing developer hours (with mentoring from a qualified Kubeflow maintainer)
@sxwl-donggang

johnugeorge · 2023-12-22T17:44:25Z

Thanks for the interest. Happy to help @yzhao-2023

andreyvelich · 2023-12-22T18:03:40Z

That would be great if you could help us with the Jax implementation in Training Operator.

If you are available @yzhao-2023, please attend one of our upcoming AutoML and Training WG calls.
We can guide you through Training Operator implementation and how we can add Jax support.

kuizhiqing · 2023-12-23T15:28:16Z

That would be great if you could help us with the Jax implementation in Training Operator.

If you are available @yzhao-2023, please attend one of our upcoming AutoML and Training WG calls. We can guide you through Training Operator implementation and how we can add Jax support.

If you prefer 11:00 am UTC, I'd like to be there too.

jdcfd · 2024-02-11T03:45:50Z

Links from the 2023-11-29 Meeting notes:

Developer Guide for Training Operator: https://github.com/kubeflow/training-operator/blob/master/docs/development/developer_guide.md
Distributed Jax: https://jax.readthedocs.io/en/latest/multi_process.html

In the meeting they mentioned that the documentation for the Training Operator was a bit outdated. Has it been updated?
EDIT: I see that the .md file was updated 2 weeks ago so I suppose it is up to date.

jdcfd · 2024-02-11T04:08:31Z

Also, if you end up doing a Training Operator Deep dive session, it would be good if you share it here so anyone wanting to contribute can join or watch a recording later.

octonawish-akcodes · 2024-02-21T03:01:51Z

hi @andreyvelich im interested in this issue for the upcoming term of gsoc, is there any roadmap doc available and can you provide some more context or resources I can look up to better understand it?

andreyvelich · 2024-02-21T12:13:49Z

Hi @jdcfd @octonawish-akcodes, thank you for your interest to work on Jax support in Training Operator!

If you are available, please attend one of the upcoming AutoML and Training WG calls: https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#heading=h.yvypq06ot57p
We will discuss details on how we can add support for JaxJobs

jdcfd · 2024-02-22T06:49:53Z

Sorry I missed it, Wednesdays are very tough for me. I did watch last recording and I will watch today's meeting later this week. Judging by the meeting notes, it seems like the JaxJobs topic wasn't touched this time.

andreyvelich · 2024-02-22T14:55:31Z

Hi @jdcfd, we briefly discussed Jax support in the recent call: https://youtu.be/rXBCliRugNk
We are going to speak more about Jax in the next Training WG community meetings.
/area gsoc

sandipanpanda · 2024-03-10T19:47:29Z

I am interested in collaborating on a design proposal for integrating Jax into Training Operator.

ahg-g · 2024-03-28T21:32:23Z

Why not just use the Job or JobSet API, what is missing?

tenzen-y · 2024-03-28T22:26:32Z

Why not just use the Job or JobSet API, what is missing?

Does it mean that why don't you recommend using Job or JobSet instead of TrainingOperator?

ahg-g · 2024-03-28T22:31:18Z

Yes, for the Jax case, I think Indexed Job will just work and the Job ships with any k8s cluster, so you don't need to install any extra operators. For more advanced setups, like multi-slice TPUs, JobSet works well and it is easy to transition from Job to JobSet.

andreyvelich · 2024-03-28T22:56:54Z

@ahg-g As we discussed recently, we should understand if JobSet can cover all use-cases for Jax and other ML frameworks. I remember previously @tenzen-y was working on to add SuccessPolicy support to the Job APIs, so we can re-use Job in the Training Operator.

Also, we should understand the following:

Does Jax support any specific distributed training capabilities that will require to orchestrate additional Kubernetes resources ? Like MPI-Operator.
Do we need specific resource statuses that will be exclusive to JaxJob but not to other Jobs (e.g. PyTorchJob).

To be clear, I am not against using JobSet as a final entity for distributed ML training on Kubernetes and deprecate framework specific CRs, but we need to discuss pros and cons.

Moreover, when @Jeffwan and @zw0610 designed unified Training Operator they proposed the idea of common CR where we have Frontend Operator to manage framework specific resources and Role operator to manage common resources. By that time (2021) we didn't have JobSet yet.
In that case, the flow looks like this:

JaxJob -> JobSet -> Job -> Pod
PyTorchJob -> JobSet -> Job -> Pod

Let's collaborate together in the upcoming WG Batch and Kubeflow WG Training community calls to discuss our next steps.

cc @bigsur0

tenzen-y · 2024-03-28T23:00:07Z

Yes, for the Jax case, I think Indexed Job will just work and the Job ships with any k8s cluster, so you don't need to install any extra operators. For more advanced setups, like multi-slice TPUs, JobSet works well and it is easy to transition from Job to JobSet.

@ahg-g I think that kubeflow/JaxJob has some advantages: 1. It can use the same CRD as other frameworks like PyTorchJob and TFJob; 2. No need to setup EnvVars and services; 3 It is possible to use higher level Python SDK.

Indeed, some developers prefer to use the plain Job and JobSet for extensibility, but I believe that some developers prefer to use more abstract API.

So, I believe that both approaches are valuable.

tenzen-y · 2024-03-28T23:14:19Z

@ahg-g As we discussed recently, we should understand if JobSet can cover all use-cases for Jax and other ML frameworks. I remember previously @tenzen-y was working on to add SuccessPolicy support to the Job APIs, so we can re-use Job in the Training Operator.

Also, we should understand the following:

Does Jax support any specific distributed training capabilities that will require to orchestrate additional Kubernetes resources ? Like MPI-Operator.

Do we need specific resource statuses that will be exclusive to JaxJob but not to other Jobs (e.g. PyTorchJob).

To be clear, I am not against using JobSet as a final entity for distributed ML training on Kubernetes and deprecate framework specific CRs, but we need to discuss pros and cons.

Moreover, when @Jeffwan and @kuizhiqing designed unified Training Operator they proposed the idea of common CR where we have Frontend Operator to manage framework specific resources and Role operator to manage common resources. By that time (2021) we didn't have JobSet yet. In that case, the flow looks like this:
JaxJob -> JobSet -> Job -> Pod
PyTorchJob -> JobSet -> Job -> Pod
Let's collaborate together in the upcoming WG Batch and Kubeflow WG Training community calls to discuss our next steps.

cc @bigsur0

I totally agree with @andreyvelich.

ahg-g · 2024-03-28T23:30:44Z

Thanks @tenzen-y and @andreyvelich.

My worry is that adding another API on top means another operator and so more sources of errors and additional operational overhead.

The points related to automating the configurations (envVars, configmaps etc.) are valid and it is something we are thinking about solutions for in JobSet. One idea is JobSet "extensions": imagine that the JobSet API includes an opaque class parameter of type "Object" that represents the specific training job you want to run, and we introduce hooks in JobSet operator and webhook to actuate on it.

kind: JobSet
spec:
  class: 
    kind: MPI
    ...

The MPI extension within JobSet would know how to parse this class, and populate JobsSet with all things MPI. This is just a rough idea, devil in the details as usual :)

andreyvelich · 2024-04-02T14:09:46Z

that represents the specific training job you want to run, and we introduce hooks in JobSet operator and webhook to actuate on it.

@ahg-g In that case the mutation webhook will be responsible to orchestrate additional Kubernetes resources for the Job (e.g. ConfigMap, RBAC) ? How we are going to handle orchestration that needs to happen during the Job runtime ? For example, fetch appropriate status or SSH to the pod in case of MPIJob ?

ahg-g · 2024-04-03T06:47:57Z

I meant hooks in the general sense as in places where we invoke the workload specific function, that would be one in the webhook and one in the reconciler.

andreyvelich · 2024-04-03T15:24:52Z

I meant hooks in the general sense as in places where we invoke the workload specific function, that would be one in the webhook and one in the reconciler.

In that case, users should take JobSet controller and re-build image for reconciler to support such execution, right?
Or we will contribute such extensions to the upstream ?

andreyvelich · 2024-05-24T19:49:57Z

/assign @sandipanpanda

andreyvelich · 2024-10-28T10:52:52Z

Thanks to @sandipanpanda for implementing JAXJob support in Training Operator V1: https://www.kubeflow.org/docs/components/training/user-guides/jax/ 🎉

We are planning to implement JAX as a training runtime in the Training Operator V2, as well.

google-oss-prow bot added the help wanted label Jul 20, 2023

github-actions bot added the lifecycle/stale label Oct 18, 2023

google-oss-prow bot added lifecycle/frozen and removed lifecycle/stale labels Oct 19, 2023

google-oss-prow bot assigned Davidnet Nov 28, 2023

google-oss-prow bot added the area/gsoc label Feb 22, 2024

PeterWrighten mentioned this issue Mar 8, 2024

[WIP] Add enhancement for Parameter Distribution kubeflow/katib#2059

Closed

1 task

sandipanpanda mentioned this issue May 22, 2024

JAX Integration Enhancement Proposal #2125

Merged

google-oss-prow bot assigned sandipanpanda May 24, 2024

sandipanpanda mentioned this issue Jul 10, 2024

Add JAX API #2163

Merged

1 task

sandipanpanda mentioned this issue Aug 5, 2024

Add JAX controller #2194

Merged

sandipanpanda mentioned this issue Sep 23, 2024

Add documentation for JAXJob kubeflow/website#3877

Merged

andreyvelich closed this as completed Oct 28, 2024

andreyvelich mentioned this issue Dec 11, 2024

[Release] Training Operator 1.9 Roadmap #2169

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(framework): Support JAX #1619

feat(framework): Support JAX #1619

gaocegege commented Jun 22, 2022

zw0610 commented Jun 23, 2022

kuizhiqing commented Jun 23, 2022

andreyvelich commented Jul 20, 2023

google-oss-prow bot commented Jul 20, 2023

github-actions bot commented Oct 18, 2023

tenzen-y commented Oct 19, 2023

andreyvelich commented Nov 28, 2023

tenzen-y commented Nov 29, 2023

yzhao-2023 commented Dec 22, 2023 •

edited

Loading

johnugeorge commented Dec 22, 2023

andreyvelich commented Dec 22, 2023 •

edited

Loading

kuizhiqing commented Dec 23, 2023

jdcfd commented Feb 11, 2024 •

edited

Loading

jdcfd commented Feb 11, 2024

octonawish-akcodes commented Feb 21, 2024

andreyvelich commented Feb 21, 2024

jdcfd commented Feb 22, 2024

andreyvelich commented Feb 22, 2024

sandipanpanda commented Mar 10, 2024

ahg-g commented Mar 28, 2024

tenzen-y commented Mar 28, 2024

ahg-g commented Mar 28, 2024

andreyvelich commented Mar 28, 2024 •

edited

Loading

tenzen-y commented Mar 28, 2024

tenzen-y commented Mar 28, 2024

ahg-g commented Mar 28, 2024

andreyvelich commented Apr 2, 2024

ahg-g commented Apr 3, 2024

andreyvelich commented Apr 3, 2024

andreyvelich commented May 24, 2024

andreyvelich commented Oct 28, 2024

feat(framework): Support JAX #1619

feat(framework): Support JAX #1619

Comments

gaocegege commented Jun 22, 2022

zw0610 commented Jun 23, 2022

kuizhiqing commented Jun 23, 2022

andreyvelich commented Jul 20, 2023

google-oss-prow bot commented Jul 20, 2023

github-actions bot commented Oct 18, 2023

tenzen-y commented Oct 19, 2023

andreyvelich commented Nov 28, 2023

tenzen-y commented Nov 29, 2023

yzhao-2023 commented Dec 22, 2023 • edited Loading

johnugeorge commented Dec 22, 2023

andreyvelich commented Dec 22, 2023 • edited Loading

kuizhiqing commented Dec 23, 2023

jdcfd commented Feb 11, 2024 • edited Loading

jdcfd commented Feb 11, 2024

octonawish-akcodes commented Feb 21, 2024

andreyvelich commented Feb 21, 2024

jdcfd commented Feb 22, 2024

andreyvelich commented Feb 22, 2024

sandipanpanda commented Mar 10, 2024

ahg-g commented Mar 28, 2024

tenzen-y commented Mar 28, 2024

ahg-g commented Mar 28, 2024

andreyvelich commented Mar 28, 2024 • edited Loading

tenzen-y commented Mar 28, 2024

tenzen-y commented Mar 28, 2024

ahg-g commented Mar 28, 2024

andreyvelich commented Apr 2, 2024

ahg-g commented Apr 3, 2024

andreyvelich commented Apr 3, 2024

andreyvelich commented May 24, 2024

andreyvelich commented Oct 28, 2024

yzhao-2023 commented Dec 22, 2023 •

edited

Loading

andreyvelich commented Dec 22, 2023 •

edited

Loading

jdcfd commented Feb 11, 2024 •

edited

Loading

andreyvelich commented Mar 28, 2024 •

edited

Loading