[Ray AIR/Train] Explain the difference between PyTorch Lightning trainer checkpoint config and Ray AIR checkpoint config #36314

Open
scottsun94 opened this issue Jun 12, 2023 · 4 comments
Labels
docs An issue or change related to documentation P1 Issue that should be fixed within a few weeks ray-team-created Ray Team created train Ray Train Related Issue

Comments

@scottsun94
Contributor

Description

According to @woshiyyya

The checkpointing process is:

  1. Lightning saves a checkpoint file according to the config in LightningConfigBuilder.checkpointing().
  2. LightningTrainer reports this new checkpoint to AIR.
  3. AIR saves the AIR checkpoints according to the AIR CheckpointConfig.

  1. We should better document this process in the doc for PyTorch Lightning Trainer.
  2. We should document which parameters in ray.air.CheckpointConfig apply to which trainers.

Link

No response

@scottsun94 scottsun94 added triage Needs triage (eg: priority, bug/not-bug, and owning component) docs An issue or change related to documentation labels Jun 12, 2023
@scottsun94
Contributor Author

cc: @matthewdeng @woshiyyya

@woshiyyya
Member

Actually, we do have a docstring in LightningConfigBuilder that describes this behavior. But perhaps it needs more visibility? https://docs.ray.io/en/latest/train/api/doc/ray.train.lightning.LightningConfigBuilder.checkpointing.html#ray-train-lightning-lightningconfigbuilder-checkpointing

@scottsun94
Contributor Author

To me, it's a bit hard to understand, and it also misses some info (e.g., some parameters in ray.air.configs.CheckpointConfig don't apply when using LightningTrainer).

If I were to rewrite it, here is my 5-min version:

LightningTrainer creates an instance of a ModelCheckpoint callback subclass from these kwargs. It handles checkpointing and metrics logging logics. This method is not a replacement for ray.air.configs.CheckpointConfig. Specify both the Lightning checkpointing strategy and the AIR checkpoint strategy to properly control the checkpointing behavior.

Here is how they work together:

  1. LightningTrainer saves checkpoint files according to the config in LightningConfigBuilder.checkpointing()
  2. LightningTrainer reports saved checkpoint files to AIR.
    • The callback periodically reports the latest metrics and checkpoint to the AIR session via session.report(). The report frequency matches the checkpointing frequency in LightningConfigBuilder.checkpointing(). Make sure that the target metrics (e.g. the metrics named in TuneConfig or CheckpointConfig) are already available when a new checkpoint is saved.
  3. AIR saves the AIR checkpoints according to AIR CheckpointConfig.
    • If CheckpointConfig is not provided, AIR by default stores all the checkpoints reported by LightningTrainer.
    • Only ..., ..., ... parameters apply when using LightningTrainer.
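
To make the three steps above concrete, here is a minimal sketch in plain Python. It does not use Ray or Lightning at all: ToyLightningCheckpointing, ToyAirCheckpointConfig, ToyAir and run_toy_training are hypothetical stand-ins invented for this illustration, and the retention logic is only an assumption about how a "keep the best N" policy behaves, not the actual AIR implementation. The field names mirror every_n_epochs from Lightning's ModelCheckpoint and num_to_keep, checkpoint_score_attribute and checkpoint_score_order from the AIR CheckpointConfig.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyLightningCheckpointing:
    """Hypothetical stand-in for the kwargs given to LightningConfigBuilder.checkpointing()."""
    every_n_epochs: int = 1


@dataclass
class ToyAirCheckpointConfig:
    """Hypothetical stand-in for the AIR CheckpointConfig. num_to_keep=None keeps everything."""
    num_to_keep: Optional[int] = None
    checkpoint_score_attribute: Optional[str] = None
    checkpoint_score_order: str = "max"

    def __post_init__(self) -> None:
        if self.num_to_keep is not None and self.num_to_keep <= 0:
            raise ValueError("num_to_keep must be a positive integer or None")


@dataclass
class ToyAir:
    """Toy model of the AIR side: receives reports and decides what to keep."""
    config: ToyAirCheckpointConfig
    kept: List[Tuple[Dict[str, float], str]] = field(default_factory=list)

    def report(self, metrics: Dict[str, float], checkpoint: str) -> None:
        # Step 3: AIR applies its own retention policy to whatever was reported.
        self.kept.append((dict(metrics), checkpoint))
        cfg = self.config
        if cfg.num_to_keep is None:
            return  # no limit configured: keep every reported checkpoint
        attr = cfg.checkpoint_score_attribute
        if attr is None:
            # No score attribute: keep the most recent checkpoints.
            self.kept = self.kept[-cfg.num_to_keep:]
        else:
            # Raises KeyError if the target metric is missing from a report.
            ranked = sorted(
                self.kept,
                key=lambda item: item[0][attr],
                reverse=(cfg.checkpoint_score_order == "max"),
            )
            self.kept = ranked[:cfg.num_to_keep]


def run_toy_training(
    val_accuracy_per_epoch: List[float],
    lightning_cfg: ToyLightningCheckpointing,
    air: ToyAir,
) -> List[str]:
    for epoch, val_accuracy in enumerate(val_accuracy_per_epoch, start=1):
        if epoch % lightning_cfg.every_n_epochs != 0:
            continue  # nothing saved on the Lightning side, so nothing is reported
        checkpoint = f"epoch={epoch}.ckpt"  # Step 1: Lightning saves a checkpoint file
        metrics = {"epoch": epoch, "val_accuracy": val_accuracy}
        air.report(metrics, checkpoint)  # Step 2: the checkpoint is reported to AIR
    return [ckpt for _, ckpt in air.kept]


accuracies = [0.60, 0.72, 0.68, 0.81, 0.79, 0.80]
lightning_cfg = ToyLightningCheckpointing(every_n_epochs=2)

# With a retention policy: only the two best reported checkpoints survive.
best_two = ToyAir(ToyAirCheckpointConfig(num_to_keep=2, checkpoint_score_attribute="val_accuracy"))
print(run_toy_training(accuracies, lightning_cfg, best_two))  # ['epoch=4.ckpt', 'epoch=6.ckpt']

# Without one: every reported checkpoint is kept.
keep_all = ToyAir(ToyAirCheckpointConfig())
print(run_toy_training(accuracies, lightning_cfg, keep_all))  # ['epoch=2.ckpt', 'epoch=4.ckpt', 'epoch=6.ckpt']
```

In this toy model the Lightning-side setting controls which checkpoints exist and get reported at all, while the AIR-side setting only filters among those reports. That is why both have to be specified to get the intended behavior.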

One thing that I still don't completely understand is "It handles checkpointing and metrics logging logics." What does "metrics logging" mean here?

@woshiyyya
Member

woshiyyya commented Jun 13, 2023

Yes, your thoughts make sense. I am working on unifying the Lightning checkpoint configs and the AIR CheckpointConfig. After that, we can explain the relationship and behavior between them more clearly. #35920

"It handles checkpointing and metrics logging logics."

This actually refers to the metrics dict that is reported to the AIR session, i.e. session.report(metrics, checkpoint).
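
Since the metrics dict is all that AIR sees alongside each checkpoint, one way to catch a metric that is not yet available is to compare the dict against the metric names the configs refer to before reporting. The helper below is a hypothetical sketch for illustration only; missing_target_metrics is not part of Ray or Lightning, and the metric names are made up.

```python
from typing import Dict, Iterable, List


def missing_target_metrics(
    reported_metrics: Dict[str, float],
    target_metrics: Iterable[str],
) -> List[str]:
    """Return the target metric names that are absent from the reported dict."""
    return [name for name in target_metrics if name not in reported_metrics]


# Metrics available at the moment a checkpoint is saved (illustrative values).
metrics = {"epoch": 3, "train_loss": 0.42}

# Names that, say, CheckpointConfig or TuneConfig are assumed to refer to.
targets = ["val_accuracy", "train_loss"]

print(missing_target_metrics(metrics, targets))  # ['val_accuracy']
```

An empty result means every target metric was present in the dict at the time the checkpoint would be reported.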

@hora-anyscale hora-anyscale added train Ray Train Related Issue air labels Jun 20, 2023
@krfricke krfricke added P1 Issue that should be fixed within a few weeks ray-team-created Ray Team created and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 21, 2023
@anyscalesam anyscalesam removed the air label Oct 28, 2023

5 participants