
[Train] (Bandaid) Mitigate OOMs on checkpointing #33089

Merged
5 commits merged into ray-project:master from Yard1:train_oom_checkpointing_mitigation on Mar 7, 2023

Conversation

Yard1
Member

@Yard1 Yard1 commented Mar 7, 2023

Why are these changes needed?

This PR introduces a tactical mitigation of #33073 by ensuring that the rank 0 worker is colocated with the Trainable if possible. This allows lazy checkpointing to be used, avoiding a situation where the entire checkpoint is loaded into memory just to be passed to the object store.

THIS IS NOT A LONG-TERM FIX. It is necessary to unblock LLM use cases for now; a proper solution should be implemented soon.
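To illustrate the trade-off this PR exploits, here is a minimal standalone sketch (not the actual Ray Train API; `report_checkpoint` and its return shape are hypothetical): when the rank 0 worker shares a node with the Trainable, only a directory path needs to travel, whereas a non-colocated worker must materialize every checkpoint file in memory first, which is what can OOM for multi-GB LLM checkpoints.

```python
import os
import shutil
import tempfile


def report_checkpoint(checkpoint_dir: str, colocated: bool) -> dict:
    """Hypothetical sketch of lazy vs. eager checkpoint hand-off."""
    if colocated:
        # Lazy path: only the path string travels; the Trainable reads
        # files on demand from the shared local filesystem.
        return {"type": "path", "data": checkpoint_dir}
    # Eager path: every file is loaded into memory up front so the bytes
    # can be shipped across node boundaries (e.g., via an object store).
    blob = {}
    for name in os.listdir(checkpoint_dir):
        with open(os.path.join(checkpoint_dir, name), "rb") as f:
            blob[name] = f.read()
    return {"type": "bytes", "data": blob}


# Demo with a tiny fake checkpoint directory.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "model.bin"), "wb") as f:
    f.write(b"x" * 1024)

lazy = report_checkpoint(tmp, colocated=True)
eager = report_checkpoint(tmp, colocated=False)
print(lazy["type"], eager["type"])  # path bytes
shutil.rmtree(tmp)
```

The eager branch's memory footprint scales with checkpoint size, which is why forcing rank 0 onto the Trainable's node keeps the lazy branch available.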

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
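The sign-off item in the checklist above produces the Signed-off-by trailers seen later in this page. A quick self-contained demonstration (run in a throwaway temp repo; the name and email are placeholders):

```shell
# -s appends a "Signed-off-by:" trailer derived from your git identity.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.name "Jane Dev"
git config user.email "jane@example.com"
echo "change" > file.txt
git add file.txt
git commit -q -s -m "example commit"
git log -1 --format=%B   # last line: Signed-off-by: Jane Dev <jane@example.com>
```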

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Yard1 added 2 commits March 7, 2023 01:54
@Yard1 Yard1 requested a review from amogkam March 7, 2023 02:01
@Yard1 Yard1 merged commit f7aa4f6 into ray-project:master Mar 7, 2023
@Yard1 Yard1 deleted the train_oom_checkpointing_mitigation branch March 7, 2023 20:17
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023