
[Train] (Bandaid) Mitigate OOMs on checkpointing #33089

Merged
5 commits merged into ray-project:master from Yard1:train_oom_checkpointing_mitigation on Mar 7, 2023

Conversation

Yard1
Member

@Yard1 Yard1 commented Mar 7, 2023

Why are these changes needed?

This PR introduces a tactical mitigation of #33073 by ensuring that the rank 0 worker is colocated with the Trainable if possible. This allows lazy checkpointing to be used, avoiding a situation where the entire checkpoint is loaded into memory just to be passed to the object store.

THIS IS NOT A LONG-TERM FIX. It is necessary to unblock LLM use cases for now; a proper solution should be implemented soon.
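To illustrate the trade-off this PR exploits, here is a minimal standalone sketch (not the actual Ray Train API; `report_checkpoint` and its return shape are hypothetical): when the rank 0 worker shares a node with the Trainable, only a directory path needs to travel, whereas a non-colocated worker must materialize every checkpoint file in memory first, which is what can OOM for multi-GB LLM checkpoints.

```python
import os
import shutil
import tempfile


def report_checkpoint(checkpoint_dir: str, colocated: bool) -> dict:
    """Hypothetical sketch of lazy vs. eager checkpoint hand-off."""
    if colocated:
        # Lazy path: only the path string travels; the Trainable reads
        # files on demand from the shared local filesystem.
        return {"type": "path", "data": checkpoint_dir}
    # Eager path: every file is loaded into memory up front so the bytes
    # can be shipped across node boundaries (e.g., via an object store).
    blob = {}
    for name in os.listdir(checkpoint_dir):
        with open(os.path.join(checkpoint_dir, name), "rb") as f:
            blob[name] = f.read()
    return {"type": "bytes", "data": blob}


# Demo with a tiny fake checkpoint directory.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "model.bin"), "wb") as f:
    f.write(b"x" * 1024)

lazy = report_checkpoint(tmp, colocated=True)
eager = report_checkpoint(tmp, colocated=False)
print(lazy["type"], eager["type"])  # path bytes
shutil.rmtree(tmp)
```

The eager branch's memory footprint scales with checkpoint size, which is why forcing rank 0 onto the Trainable's node keeps the lazy branch available.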

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
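The sign-off item in the checklist above produces the Signed-off-by trailers seen later in this page. A quick self-contained demonstration (run in a throwaway temp repo; the name and email are placeholders):

```shell
# -s appends a "Signed-off-by:" trailer derived from your git identity.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.name "Jane Dev"
git config user.email "jane@example.com"
echo "change" > file.txt
git add file.txt
git commit -q -s -m "example commit"
git log -1 --format=%B   # last line: Signed-off-by: Jane Dev <jane@example.com>
```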

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Yard1 added 2 commits March 7, 2023 01:54
@Yard1 Yard1 requested a review from amogkam March 7, 2023 02:01
@Yard1 Yard1 merged commit f7aa4f6 into ray-project:master Mar 7, 2023
@Yard1 Yard1 deleted the train_oom_checkpointing_mitigation branch March 7, 2023 20:17
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023