feat(framework): Support JAX #1619
Comments
Any solid examples showing how a multi-host JAX job runs? Especially the host registration part.
Well, to launch distributed training with JAX:

```python
import os

import jax

coordinator_address = os.environ.get('JAX_COORDINATOR_ADDRESS', None)  # defined internally by our launcher
num_processes = int(os.environ.get('JAX_NUM_PROCESSES', 1))            # world size
process_id = int(os.environ.get('JAX_PROCESS_ID', 0))                  # rank

jax.distributed.initialize(coordinator_address=coordinator_address,
                           num_processes=num_processes,
                           process_id=process_id)
```

Anyway, I'm not aware of any mature practice of this in production.
/help
@andreyvelich: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
/assign @Davidnet
For anyone interested in JAX support in the Training Operator, please join our AutoML and Training WG call on November 29th at 5:00 pm UTC. We are going to discuss how we can move forward with JAX support.
cc: @mimowo
Michal may be interested in this JAX integration.
We'll be interested in supporting JAX as well.
Thanks for the interest. Happy to help, @yzhao-2023.
It would be great if you could help us with the JAX implementation in the Training Operator. If you are available, @yzhao-2023, please attend one of our upcoming AutoML and Training WG calls.
If you prefer
Links from the 2023-11-29 Meeting notes:
In the meeting they mentioned that the documentation for the Training Operator was a bit outdated. Has it been updated?
Also, if you end up doing a Training Operator deep-dive session, it would be good if you could share it here so anyone wanting to contribute can join or watch the recording later.
Hi @andreyvelich, I'm interested in this issue for the upcoming GSoC term. Is there a roadmap doc available, and can you provide some more context or resources I can look at to better understand it?
Hi @jdcfd @octonawish-akcodes, thank you for your interest in working on JAX support in the Training Operator! If you are available, please attend one of the upcoming AutoML and Training WG calls: https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#heading=h.yvypq06ot57p
Sorry I missed it; Wednesdays are very tough for me. I did watch the last recording, and I will watch today's meeting later this week. Judging by the meeting notes, it seems the JaxJobs topic wasn't touched on this time.
Hi @jdcfd, we briefly discussed JAX support in the recent call: https://youtu.be/rXBCliRugNk
I am interested in collaborating on a design proposal for integrating JAX into the Training Operator.
Why not just use the Job or JobSet API? What is missing?
Do you mean that you would recommend using Job or JobSet instead of the Training Operator?
Yes, for the JAX case, I think an Indexed Job will just work, and Job ships with any k8s cluster, so you don't need to install any extra operators. For more advanced setups, like multi-slice TPUs, JobSet works well, and it is easy to transition from Job to JobSet.
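To make the Indexed Job suggestion concrete, here is a rough sketch (mine, not from this discussion) of the initialization each worker pod could run. It assumes an Indexed Job with `completionMode: Indexed` plus a headless Service set as the pod template's subdomain, so the pod with index 0 is reachable at a stable hostname; the names `jax-job` and `jax-workers`, the port, and the `NUM_PROCESSES` variable are made up for the example.

```python
# Sketch: JAX initialization inside an Indexed Job pod.
# JOB_COMPLETION_INDEX is injected by Kubernetes for Indexed Jobs; the
# hostname pattern "<job-name>-<index>.<headless-service>" assumes the
# headless Service name is set as the pod template's subdomain.
import os

import jax

process_id = int(os.environ["JOB_COMPLETION_INDEX"])   # this pod's index = rank
num_processes = int(os.environ["NUM_PROCESSES"])       # e.g. mirrored from spec.completions
coordinator = "jax-job-0.jax-workers:6666"             # pod with index 0 acts as coordinator

jax.distributed.initialize(coordinator_address=coordinator,
                           num_processes=num_processes,
                           process_id=process_id)
```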
@ahg-g As we discussed recently, we should understand the following:
To be clear, I am not against using Job or JobSet. Moreover, when @Jeffwan and @zw0610 designed the unified Training Operator, they proposed the idea of a common CR where we have a Frontend Operator to manage framework-specific resources and a Role Operator to manage common resources. At that time (2021) we didn't have JobSet.
Let's collaborate in the upcoming WG Batch and Kubeflow WG Training community calls to discuss our next steps. cc @bigsur0
@ahg-g I think that kubeflow/JaxJob has some advantages:
1. It can use the same CRD structure as other frameworks like PyTorchJob and TFJob.
2. There is no need to set up EnvVars and Services manually.
3. It is possible to use the higher-level Python SDK.
Indeed, some developers prefer to use plain Job and JobSet for extensibility, but I believe other developers prefer a more abstract API. So I believe that both approaches are valuable.
I totally agree with @andreyvelich.
Thanks @tenzen-y and @andreyvelich. My worry is that adding another API on top means another operator, and thus more sources of errors and additional operational overhead. The points about automating the configuration (envVars, ConfigMaps, etc.) are valid, and that is something we are thinking about solutions for in JobSet. One idea is JobSet "extensions": imagine that the JobSet API includes an opaque class parameter of type "Object" that represents the specific training job you want to run, and we introduce hooks in the JobSet operator and webhook to act on it.
The MPI extension within JobSet would know how to parse this class and populate the JobSet with all things MPI. This is just a rough idea; the devil is in the details, as usual :)
@ahg-g In that case, will the mutating webhook be responsible for orchestrating additional Kubernetes resources for the Job (e.g. ConfigMap, RBAC)? How are we going to handle orchestration that needs to happen during the Job runtime? For example, fetching the appropriate status, or SSHing into the pod in the case of MPIJob?
I meant hooks in the general sense, as in places where we invoke the workload-specific function; that would be one in the webhook and one in the reconciler.
In that case, users would have to take the JobSet controller and rebuild the reconciler image to support such execution, right?
/assign @sandipanpanda
Thanks to @sandipanpanda for implementing JAXJob support in Training Operator V1: https://www.kubeflow.org/docs/components/training/user-guides/jax/ 🎉
We are planning to implement JAX as a training runtime in Training Operator V2 as well.
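For anyone who wants to try the V1 JAXJob programmatically, here is a rough sketch using the Kubernetes Python client. It assumes the JAXJob CRD follows the same replica-spec pattern as the other Training Operator V1 job kinds (group `kubeflow.org`, version `v1`, plural `jaxjobs`, a single `Worker` replica type); the image and names are placeholders, so please check the linked user guide for the authoritative spec.

```python
# Sketch: create a JAXJob custom resource with the Kubernetes Python client.
# Assumptions (not taken from this thread): kubeflow.org/v1, plural "jaxjobs",
# jaxReplicaSpecs with a single "Worker" replica type.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside a cluster

jaxjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "JAXJob",
    "metadata": {"name": "jaxjob-simple", "namespace": "default"},
    "spec": {
        "jaxReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "jax",
                            "image": "docker.io/your-org/jax-training:latest",  # placeholder image
                            "command": ["python", "train.py"],
                        }]
                    }
                },
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="default",
    plural="jaxjobs", body=jaxjob,
)
```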
JAX has become extremely popular these days. Users may expect to run JAX distributed training jobs on Kubernetes with the help of the training-operator.
JAX uses a “multi-controller” programming model where each JAX Python process runs independently, sometimes referred to as a Single Program, Multiple Data (SPMD) model. I think it is not hard to support from the operator's perspective.
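As a small illustration of the multi-controller/SPMD model (my sketch, not part of the original proposal): every process runs the same program, and once `jax.distributed.initialize()` has been called (as in the snippet earlier in this thread), collectives span all devices in the job even though each process only drives its local ones.

```python
# SPMD sketch: every process in the job runs this same program.
# Assumes jax.distributed.initialize() has already been called.
import jax
import jax.numpy as jnp

# Each process feeds only its local devices...
xs = jnp.ones((jax.local_device_count(),))

# ...but the psum collective runs across all devices of all processes,
# so every process sees the same result: the global device count.
total = jax.pmap(lambda x: jax.lax.psum(x, axis_name="i"), axis_name="i")(xs)
print(f"process {jax.process_index()}/{jax.process_count()}: {total}")
```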
Message from the maintainers:
Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.