Add --offload-to-disk support to minifier #100546
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100546
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit c0b1355.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
When minifying extremely large repros, the minifier can run out of memory. This is because, for delta debugging, the minifier keeps a copy of every intermediate output in the network. This can easily put you over the memory limit for your GPU. To make matters worse, we cannot easily delta debug in such a situation, as delta debugging involves replacing intermediates with inputs, but doing so can cause an intermediate to become live longer than its actual extent in the original model (since inputs all have to be allocated up front).
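To make the lifetime problem concrete, here is a minimal sketch (not the PR's code; all names are illustrative) of the conventional delta-debugging move of promoting an intermediate to a graph input. Because FX placeholders must sit at the top of the graph, the saved value is allocated up front and stays live for the whole run:

```python
# Illustrative sketch only: the conventional "replace intermediate with input"
# transformation whose memory behavior this PR works around.
import torch
import torch.fx as fx

def inputify_node(gm: fx.GraphModule, node: fx.Node) -> fx.GraphModule:
    """Replace `node` with a fresh placeholder (graph input).

    Placeholders live at the top of an FX graph, so the value that replaces
    `node` must be allocated when execution starts and stays live until its
    last use, even if the original intermediate only existed briefly in the
    middle of the model.
    """
    first = next(iter(gm.graph.nodes))
    with gm.graph.inserting_before(first):
        placeholder = gm.graph.placeholder(node.name + "_saved")
    node.replace_all_uses_with(placeholder)
    gm.graph.erase_node(node)
    # Everything that only existed to compute `node` is now dead.
    gm.graph.eliminate_dead_code()
    gm.recompile()
    return gm
```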
The strategy in this PR is to use `load_tensor` from the previous PR to offer a low-memory mode for delta debugging. Instead of promoting intermediates to inputs, we load them in the middle of the graph in question. If, through DCE, the `load_tensor` call ends up floating to the top of the graph, we can input-ify it. We no longer keep all intermediates in memory; instead we save them to disk. I used this to successfully minify the repro that helped us solve #100332.

The testing is not very good. I can try to add more robust testing, but it would require a fairly involved refactor of the FX minifier. Let me know if that's what you want.
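For contrast, a minimal sketch of the offload-to-disk substitution, assuming intermediates were checkpointed to a `save_dir` during an earlier instrumented run. The `load_tensor` helper below is an illustrative stand-in for the one introduced in the previous PR, not the PR's actual implementation:

```python
# Minimal sketch of the low-memory delta-debugging substitution. `save_dir`
# and `load_tensor` are illustrative stand-ins, not the PR's real machinery.
import os
import torch
import torch.fx as fx

def load_tensor(save_dir: str, name: str) -> torch.Tensor:
    # Read a previously checkpointed intermediate back from disk.
    return torch.load(os.path.join(save_dir, f"{name}.pt"))

def offload_node_to_disk(gm: fx.GraphModule, node: fx.Node,
                         value: torch.Tensor, save_dir: str) -> fx.GraphModule:
    """Replace `node`'s computation with a disk load of its saved value.

    Unlike input-ification, the load happens mid-graph, right where the
    intermediate is first needed, so its live range matches the original
    model's. If DCE later leaves the load with nothing before it (it
    "floats to the top" of the graph), it can safely be input-ified.
    """
    torch.save(value, os.path.join(save_dir, f"{node.name}.pt"))
    with gm.graph.inserting_before(node):
        loaded = gm.graph.call_function(load_tensor, args=(save_dir, node.name))
    node.replace_all_uses_with(loaded)
    gm.graph.erase_node(node)
    # Drop the subgraph that only existed to compute `node`.
    gm.graph.eliminate_dead_code()
    gm.recompile()
    return gm
```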
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire