Train/Fine-tune API Proposal for LLMs #1945
Conversation
/cc @kubeflow/wg-training-leads @kuizhiqing @tenzen-y |
Pull Request Test Coverage Report for Build 6826741030

Warning: This coverage report may be inaccurate. We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.

💛 - Coveralls
So sorry for the delay. I finally got to this just now.
At first glance, this approach looks good to me.
BTW, can you add goals and non-goals to clarify the objective?
Also, I couldn't tell how we would handle other frameworks, since you mention only PyTorchJob in this proposal. Can you add proposals for the other frameworks as well?
I appreciate @johnugeorge's and @deepak-muley's efforts.
@tenzen-y For this new higher-level API, the framework abstraction for the different model providers is hidden from the user. From the user's perspective, when using a specific model provider it doesn't matter whether the distributed training is deployed with a PyTorchJob or not. For a future use case where a new model provider requires TensorFlow training, we would need to extend the SDK to deploy distributed TF training. I think we can handle this on a case-by-case basis. For Hugging Face, PyTorchJob support should be sufficient to deploy distributed training.
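As an illustration of the point above, a rough sketch of such a dispatch, with entirely hypothetical function names (nothing here is from the proposal itself):

```python
def create_pytorchjob(**kwargs):
    """Placeholder for the SDK call that deploys a distributed PyTorchJob."""
    print("Creating PyTorchJob with", kwargs)

def train(model_provider: str, **kwargs):
    # The user only picks a model provider; the SDK chooses the
    # framework-specific job, so the abstraction stays hidden.
    if model_provider == "huggingface":
        # HF models train on PyTorch, so a PyTorchJob is sufficient.
        create_pytorchjob(**kwargs)
    else:
        # A future provider needing TensorFlow would add a TFJob branch,
        # extended case by case as described above.
        raise NotImplementedError(f"No launcher for provider {model_provider!r}")
```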
@johnugeorge That makes sense. Can you put it in the proposal?
Added it in the limitations section.
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Thanks for your nice proposal. Before diving into the details, I have two questions that I would like to discuss with you.
@kuizhiqing Thanks for the comments.

Did you mean DeepSpeed as a separate launcher, or used with the other supported frameworks? For example, using DeepSpeed with HF should be straightforward with torchrun. If we want a DeepSpeed launcher inside the training-operator, we might have to set up a few more things.
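To make the torchrun point concrete, a minimal sketch of driving DeepSpeed through the Hugging Face Trainer (the file names `train.py` and `ds_config.json` are assumptions, not from this thread):

```python
# train.py -- hypothetical worker script; torchrun starts one copy per GPU.
from transformers import Trainer, TrainingArguments

# Passing a DeepSpeed config to TrainingArguments lets the HF Trainer
# initialize DeepSpeed itself, so no separate DeepSpeed launcher is needed.
args = TrainingArguments(
    output_dir="/workspace/output",
    deepspeed="ds_config.json",  # assumed ZeRO config baked into the image
)

# Model and dataset loading omitted; with them in place it would be:
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```

Each worker pod would then launch this with something like `torchrun --nnodes=$NNODES --nproc-per-node=$GPUS train.py`, which is the setup a PyTorchJob already provides.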
Good point. I had similar thoughts, and I agree with the flexibility idea. But I feel there will be significant overhead if we separate this into a new project now (keeping the projects in sync, etc.). When the higher layer becomes really large, we can separate it out.
I agree with @johnugeorge. I think we should not separate the SDK out of the Training Operator initially, since the SDK is an interface for data scientists to interact with the Kubeflow Training Operator. Ideally, in the long term we could create a Kubeflow Python library for Kubeflow users:

```python
!pip install kubeflow  # Kubeflow cluster should be pre-deployed

from kubeflow import KubeflowClient

client = KubeflowClient()

client.train()  # To train my model
client.tune()   # To tune my model
client.serve()  # To serve my model
```

The question with this approach is how we should support the various component versions (e.g. Training Operator, Katib, and KServe (cc @yuzisun) each have their own control-plane releases). I think we should discuss this separately and figure out a plan to simplify Kubeflow usage. WDYT @kuizhiqing @tenzen-y @johnugeorge?
Thank you for this effort @deepanker13 @johnugeorge!
I left a few comments.
Yes, I mean we can focus on two approaches for LLM training: torchrun mode with PyTorchJob and mpirun/pdsh mode with MPIJob. @johnugeorge @andreyvelich
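To make the two modes concrete, a hypothetical illustration (worker counts and flags are examples) of the container commands each job type might run:

```python
# torchrun mode (PyTorchJob): every worker pod runs its own process group
# and the pods rendezvous with each other directly.
torchrun_command = [
    "torchrun",
    "--nnodes=2",           # number of worker pods
    "--nproc-per-node=4",   # GPUs per pod
    "train.py",
]

# mpirun/pdsh mode (MPIJob): a dedicated launcher pod starts the ranks on
# the worker pods over ssh/pdsh, matching how DeepSpeed's launcher works.
mpirun_command = [
    "mpirun", "-np", "8",   # total ranks across all workers
    "python", "train.py",
]
```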
We can add
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
That makes sense. Ideally, we should provide an all-in-one Kubeflow SDK. We can discuss it outside of this proposal.
Thanks @deepanker13
@andreyvelich @tenzen-y I have made the suggested changes.
Thanks!
/approve
/assign @andreyvelich
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepanker13, tenzen-y

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Thank you for doing this @deepanker13!
Thanks @deepanker13
/hold cancel
Adding a proposal for a new Train API in the training operator SDK for training/fine-tuning LLMs.
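As a closing illustration of the proposed user experience, a hypothetical sketch of what such a high-level call could look like; every parameter name below is an illustrative placeholder, not the final API:

```python
from kubeflow.training import TrainingClient

client = TrainingClient()

# All parameter names are illustrative placeholders.
client.train(
    name="llm-finetune",
    num_workers=2,                    # PyTorchJob worker replicas
    num_procs_per_worker=4,           # processes (GPUs) per worker
    model_provider_parameters=...,    # e.g. a Hugging Face model id
    dataset_provider_parameters=...,  # e.g. an S3 or HF dataset reference
    trainer_parameters=...,           # training args for the chosen provider
)
```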