[Ray component: Ray Train] Non-blocking reporting of the checkpoint to maximize GPU utilization #48801
Labels
- enhancement: Request for new feature and/or capability
- P1: Issue that should be fixed within a few weeks
- train: Ray Train related issue
Description
Hello Ray team, my team and I are using Ray for training. The model we save is 13 GB and takes around 20 minutes to upload to S3 storage; in the meantime the GPU workers sit idle.
To maximize GPU utilization, we would like this upload to happen in the background or asynchronously.
What is the recommended Ray way to do this? If there isn't one, could you add support for it? If this doesn't belong on the Ray side, that's fine too.
Below is a sample of our code:

```python
import pyarrow.fs

# s3_fs is an fsspec S3 filesystem created elsewhere
custom_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3_fs))
```

in the train_func:
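To illustrate the background-upload behavior we are asking for, here is a minimal sketch that hands the slow transfer to a worker thread so the training loop keeps running. This is not Ray Train code; `upload_checkpoint` is a hypothetical stand-in for the real S3 transfer, and the sleep just simulates a slow upload:

```python
import concurrent.futures
import time

def upload_checkpoint(path):
    # Stand-in for the real S3 upload (e.g. via pyarrow/s3fs);
    # here we only simulate a slow transfer.
    time.sleep(0.1)
    return path

uploaded = []
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    futures = []
    for epoch in range(3):
        # ... training step runs here; the GPU stays busy while
        # the previous checkpoint is still being uploaded ...
        ckpt_path = f"/tmp/checkpoint_{epoch}"
        # Submit the upload to a background thread instead of blocking.
        futures.append(pool.submit(upload_checkpoint, ckpt_path))
    # Drain the in-flight uploads before shutting down.
    for f in concurrent.futures.as_completed(futures):
        uploaded.append(f.result())
```

The key point is that `pool.submit` returns immediately, so each training step overlaps with the previous checkpoint's upload; only shutdown waits for the queue to drain.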
Thank you!
Use case
No response