Problem
I have a model finetuning job that takes around 80 minutes. It has two steps: preprocessing and training.
Both preprocessing and training need GPU models, but preprocessing has multiple steps with multiple models loaded one step at a time, and training works the same way. I've put all of this code inside the predict function and left model load empty.
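To make the current layout concrete, here is a minimal sketch of what I mean (the step names and loaders are hypothetical stand-ins, not my actual code): load() is empty and every model is loaded step by step inside predict().

```python
class Model:
    def __init__(self, **kwargs):
        self._steps_run = []  # track which step models have been loaded

    def load(self):
        # Intentionally empty: the weights to load depend on user input,
        # which is only known at predict time.
        pass

    def predict(self, request: dict) -> dict:
        # Preprocessing: several GPU models loaded one after another
        # (appending a name here stands in for an actual model load).
        for step in ("preprocess_a", "preprocess_b"):
            self._steps_run.append(step)
        # Training: same pattern, more models loaded step by step.
        for step in ("train_base", "train_adapter"):
            self._steps_run.append(step)
        return {"steps_run": list(self._steps_run)}
```

A single call to predict runs the whole ~80-minute pipeline, which is why the request stays open for so long.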
Problems seen
I personally don't think what I'm doing is ideal for Truss, since model load is what lets you optimise GPU utilization.
What should an ideal solution for this look like? I've been checking out Chains, but it seems like too much work on top of what I've already built with Truss over a long time.
I deployed it this way, but the API call takes 80 minutes to execute while the pod delete max time limit is 1 hour. Even though the call is a synchronous API call and training is actively running on the pod, the platform doesn't realise this and brings the pod down before the API returns any result.
Reasons I can't use model load:
My preprocessing step takes user input, so the input can only be passed at predict time. I'm not sure whether I can take inputs at predict, load a model based on them, use model load whenever my preprocessing is happening, and then change the weight files in model load again for training.
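One pattern I've been considering is lazy, input-keyed loading inside predict with a cache, so repeated requests for the same weights reuse the loaded model. This is only a sketch under that assumption; `weights_id` and the loader are hypothetical names, not an existing Truss API:

```python
class Model:
    def __init__(self, **kwargs):
        self._cache = {}  # weights_id -> loaded model

    def load(self):
        # Nothing input-independent to preload in this scenario.
        pass

    def _get_model(self, weights_id: str):
        # Hypothetical lazy loader: load weights the first time an input
        # references them, then reuse the cached copy on later requests.
        if weights_id not in self._cache:
            self._cache[weights_id] = f"model<{weights_id}>"  # stand-in for a real load
        return self._cache[weights_id]

    def predict(self, request: dict) -> dict:
        model = self._get_model(request["weights_id"])
        return {"used": model}
```

The trade-off is that the first request per weights set pays the full load cost inside predict, which is exactly the GPU-utilization problem model load is supposed to avoid.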
Describe the solution you'd like
For now, if I were to stick with the current approach, I see a hacky way: increase the pod inactivity timeout to 2 hours so my finetuning task can complete.
Make sure that if an API call is still in progress, the pod isn't brought down.
Ideally, a feasible way for me to use Truss correctly as it's intended, perhaps using model load.
Describe alternatives you've considered
As of now, I don't see a solution other than a custom host and setup on AWS. I'm not able to find enough documentation on finetuning tasks, or on dynamic weight-file loading based on API inputs.