-
Notifications
You must be signed in to change notification settings - Fork 705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JAX Integration Enhancement Proposal #2125
JAX Integration Enhancement Proposal #2125
Conversation
c45d88e
to
7ca6f7e
Compare
@andreyvelich @tenzen-y do we want to see about using JobSet here? API seems pretty simple so far and JobSet should be a quick drop in from what I see. |
Yes, we will work with @sandipanpanda to refactor this proposal based on our discussion from this doc: https://docs.google.com/document/d/1bha8rB6_iPTi9rXjJMnKi-CLxfL7dwtmQCKH4N6dcsI/edit#heading=h.24ap79rbqoxl. |
Pull Request Test Coverage Report for Build 9207133318Details
💛 - Coveralls |
- **Mitigation**: Implement robust scheduling and resource management policies in the Training Operator. | ||
|
||
## Design Details | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a corresponding table between all JAX environment variables and how to obtain OR configure the JAX parameters like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sandipanpanda Is there any tables?
docs/proposals/jax-integration.md
Outdated
selector: | ||
app: jaxjob-master-${job_id} | ||
ports: | ||
- port: 23456 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document the default port somewhere
@sandipanpanda We agreed that initially we can support Jax within existing Training Operator implementation given the time it takes for us to design and to implement That will unblock @sandipanpanda with this work. |
7ca6f7e
to
2040ac4
Compare
Thanks! I have updated the proposal according to Kubeflow Proposal Document Structure. I am working on finalizing the proposal using the existing Training Operator assets. |
2040ac4
to
abc4ca4
Compare
abc4ca4
to
1e8059d
Compare
1e8059d
to
895620d
Compare
895620d
to
242df25
Compare
/cc @andreyvelich @tenzen-y PTAL |
@sandipanpanda: GitHub didn't allow me to request PR reviews from the following users: PTAL. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, could you mention about validations?
You can refer to the other framework validations like this: https://github.com/kubeflow/training-operator/blob/2b39d3cbc1525dc29eb40aa58b149dbeef00aa0f/pkg/webhooks/pytorch/pytorchjob_webhook.go
- **Mitigation**: Implement robust scheduling and resource management policies in the Training Operator. | ||
|
||
## Design Details | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sandipanpanda Is there any tables?
9c4c9bd
to
e3eaa99
Compare
@tenzen-y PTAL |
Pull Request Test Coverage Report for Build 9898700147Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I never find validation design through this design.
Could you add which validations should we add?
For example, do we need to validate if the JaxJob has cordinator role?
Kubeflow Enhancement Proposal: Integrate JAX with Kubeflow Training Operator Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>
e3eaa99
to
816efcc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Kubeflow Enhancement Proposal: Integrate JAX with Kubeflow Training Operator Signed-off-by: Sandipan Panda <samparksandipan@gmail.com> Signed-off-by: yelias <yossi.elias@nokia.com>
Enhancement Proposal for integrating JAX with Kubeflow Training Operator for distributed training on Kubernetes
Ref:
/area gsoc