
[Ray Train] Non-blocking reporting of the checkpoint to maximize GPU utilization #48801

Open
azayz opened this issue Nov 19, 2024 · 6 comments
Assignees: justinvyu
Labels: enhancement (request for new feature and/or capability), P1 (issue that should be fixed within a few weeks), train (Ray Train related issue)

Comments

azayz (Contributor) commented on Nov 19, 2024

Description

Hello Ray team, my team and I are using Ray for training. The model we save is about 13 GB and takes around 20 minutes to upload to S3 storage; in the meantime the GPU workers sit idle.

To maximize GPU usage, we want to do this upload in the background (asynchronously).

What is the recommended way to do this in Ray? If Ray doesn't support it today, could you add support? If this doesn't belong on the Ray side, that's fine too.

Below is a sample of our code:

import os

import pyarrow.fs
import s3fs

# S3-backed filesystem used as the checkpoint storage backend
s3_fs = s3fs.S3FileSystem(
    key=os.getenv('AWS_ACCESS_KEY_ID'),
    secret=os.getenv('AWS_SECRET_ACCESS_KEY'),
    endpoint_url=endpoint,
    client_kwargs=region_dict,
    max_concurrency=20,
)

# Wrap the fsspec filesystem so pyarrow (and Ray Train) can use it
custom_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3_fs))
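
Roughly, this filesystem is then handed to the trainer via RunConfig (a minimal sketch; the trainer, scaling config, and bucket path below are illustrative):

from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),  # illustrative
    run_config=RunConfig(
        storage_path="my-bucket/checkpoints",   # bucket/prefix on the custom filesystem (illustrative)
        storage_filesystem=custom_fs,           # the pyarrow-wrapped s3fs from above
    ),
)
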
In the train_func:

time_start = time.time()
save_deepspeed_model(trainer, ckpt_path)
print(
    f"MIDASTOUCH: Files in the save path after custom save: {os.listdir(ckpt_path)}"
)
time_end = time.time()
print(
    f"MIDASTOUCH: Time taken to save the model: {time_end - time_start} seconds"
)

# Report to the train session; this call blocks until the checkpoint
# has been uploaded to the configured storage (S3 in our case).
checkpoint = Checkpoint.from_directory(tmpdir)
print(
    "MIDASTOUCH: Reporting to train session / uploading the checkpoint to S3"
)
time_start = time.time()
print(f"Before reporting: {checkpoint.get_metadata()}")
ray.train.report(metrics=metrics, checkpoint=checkpoint)

# Barrier to ensure all workers finished reporting here
trainer.strategy.barrier()
time_end = time.time()

Thank you!

Use case

No response

azayz added the enhancement and triage labels on Nov 19, 2024
azayz changed the title from "[Ray component: Ray Train]" to "[Ray Train] Non-blocking reporting of the checkpoint to maximize GPU utilization" on Nov 19, 2024
jcotant1 added the train (Ray Train related issue) label on Nov 19, 2024
justinvyu added the P1 label and removed the triage label on Nov 19, 2024
justinvyu self-assigned this on Nov 19, 2024
justinvyu (Contributor) commented

cc @hongpeng-guo

Superskyyy (Contributor) commented on Nov 19, 2024

I believe the asynchronous checkpoint save itself can be done out of band using torch's built-in capabilities, and the artifact upload can be done in an extra thread so that it doesn't block the training loop. Ideally, though, this would be provided as a generic Ray interface built on the object store, to enable distributed checkpointing, merging, and layered storage. I can propose a REP design for that, since we have already experimented a bit in this direction.
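
A rough sketch of the extra-thread idea, in case it helps (upload_checkpoint_async is a hypothetical helper, not a Ray API, and the S3 path is illustrative; for the out-of-band save, recent PyTorch offers torch.distributed.checkpoint.async_save):

import threading

import s3fs


def upload_checkpoint_async(local_dir: str, s3_uri: str) -> threading.Thread:
    """Copy a locally saved checkpoint directory to S3 in a background thread."""
    fs = s3fs.S3FileSystem()  # credentials resolved from the environment

    def _upload():
        # Recursive directory copy; runs concurrently with the training loop
        fs.put(local_dir, s3_uri, recursive=True)

    thread = threading.Thread(target=_upload, daemon=True)
    thread.start()
    return thread


# Inside train_func, after the checkpoint has been written locally:
# upload_thread = upload_checkpoint_async(ckpt_path, 's3://my-bucket/run/step_0010')
# ray.train.report(metrics=metrics)   # metrics only: no blocking checkpoint upload
# ...
# upload_thread.join()                # before exiting, wait for the last upload

The tradeoff is that Ray Train no longer knows about these checkpoints (nothing is attached to ray.train.report), so restoration and cleanup have to be handled manually.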

Superskyyy (Contributor) commented

A best-practices section somewhere in the docs would probably also be helpful for Train users.

justinvyu (Contributor) commented

@Superskyyy That would be great! Maybe you could start with a quick sketch of your proposal as a GitHub issue?

Superskyyy (Contributor) commented

> @Superskyyy That would be great! Maybe you could start with a quick sketch of your proposal as a GitHub issue?

Cool, you mean in the main Ray repo, right? Not in the REP repo.

justinvyu (Contributor) commented

@Superskyyy Yep, just in the Ray repo for now.
