
[Train] Improve lazy checkpointing #32233

Merged — 10 commits merged into ray-project:master on Feb 15, 2023

Conversation

Yard1
Member

@Yard1 Yard1 commented Feb 6, 2023

Signed-off-by: Antoni Baum antoni.baum@protonmail.com

Why are these changes needed?

This PR improves Train lazy checkpointing with NFS setups. Previously, the logic to determine whether lazy checkpointing should be used depended on whether the Train worker actor was on the same node as the Trainable actor. The new logic instead has the Trainable actor drop a marker file in the Trial's directory; if a worker actor can detect that file, it can access the same directory as the Trainable actor.

This PR also fixes lazy checkpointing env var propagation.
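The marker-file detection described above can be sketched as follows. This is a minimal illustration, not the actual Ray implementation: the marker filename and function names here are hypothetical, and Ray's real logic lives inside its Trainable and worker actors.

```python
import os

# Hypothetical marker filename; the name Ray Train actually uses may differ.
LAZY_CKPT_MARKER = ".lazy_checkpoint_marker"


def drop_marker(trial_dir: str) -> str:
    """Run on the Trainable actor: drop a marker file in the trial directory."""
    os.makedirs(trial_dir, exist_ok=True)
    marker_path = os.path.join(trial_dir, LAZY_CKPT_MARKER)
    # Create an empty file; its mere presence is the signal.
    with open(marker_path, "w"):
        pass
    return marker_path


def can_use_lazy_checkpointing(trial_dir: str) -> bool:
    """Run on a Train worker actor: if the marker is visible, the worker
    shares the trial directory (e.g. via NFS) with the Trainable actor,
    so lazy checkpointing can be enabled."""
    return os.path.exists(os.path.join(trial_dir, LAZY_CKPT_MARKER))
```

Unlike a same-node check, this works whenever the directory is shared by any means (same node, NFS, or another shared filesystem).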

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Yard1 Yard1 requested review from krfricke and amogkam February 9, 2023 01:18
@Yard1 Yard1 marked this pull request as ready for review February 9, 2023 01:18
Contributor

@krfricke krfricke left a comment

LGTM!

@krfricke krfricke merged commit 0202379 into ray-project:master Feb 15, 2023
@Yard1 Yard1 deleted the train_improve_lazy_checkpointing branch February 15, 2023 01:34
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: elliottower <elliot@elliottower.com>
3 participants