Make the TF dummies even smaller #24071
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Adding @frostming's fix for Keras 2.13 to this PR as well
Tests are passing now, pinging @ydshieh and @amyeroberts for quick review! Unfortunately it's quite hard for me to test if this PR will fix the entire memory issue in the overnight CI, but we'll try this fix and if the memory issues remain then I'll try some other things too.
I can trigger a run.
LGTM 👍 Thanks for iterating on this
For my own understanding: do we know why the new build logic is triggering the OOM issues (what's the difference from before)? Would these failures happen if just one model was being tested, or only after N models have been tested?
It's more like when several processes run in parallel: on CircleCI, there are 8 pytest processes.
I also have the same question on why the memory usage increases that much. Previously, we didn't really use batch size 1 in the dummies, if I remember correctly.
A run is triggered here. If you need more changes and more runs to check, you can update the branch https://github.com/huggingface/transformers/tree/run_even_lower_tf_dummy_memory on top of this PR branch.
@amyeroberts I don't have a very good intuition for this, actually. I think it's some combination of:
It's also possible that the update sped up building by removing unnecessary build ops left over from TF 1 and by not unnecessarily passing dummies when models were already built. Speeding up builds might mean tests spend more of their time in the actual model calls; if peak memory usage occurs during those calls and we have lots of tests running in parallel, then more tests being in the calls simultaneously could result in higher peak memory usage for the test server. This is all purely speculative on my part, though: I can't reproduce the problem locally, and the nature of the parallel tests makes it hard to point to a single culprit for an OOM error!
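To make the speculation above more concrete, here is a minimal, self-contained sketch (not the actual transformers code; the class and property names are illustrative) of how a TF model gets built by calling it on its dummy inputs, and why shrinking those dummies lowers the peak memory of the build step:

```python
import tensorflow as tf

class ToyTFModel(tf.keras.Model):
    """Illustrative stand-in for a TF model that builds itself on dummies."""

    def __init__(self, vocab_size=100, hidden_size=32):
        super().__init__()
        self.embed = tf.keras.layers.Embedding(vocab_size, hidden_size)
        self.dense = tf.keras.layers.Dense(hidden_size)

    @property
    def dummy_inputs(self):
        # Smallest dummy that still exercises every weight: batch 1, length 2.
        # Peak memory during the build call scales with this shape, so when
        # many test workers are building models at once, tiny dummies keep the
        # combined spike small.
        return {"input_ids": tf.constant([[3, 4]], dtype=tf.int32)}

    def call(self, inputs):
        return self.dense(self.embed(inputs["input_ids"]))

model = ToyTFModel()
model(model.dummy_inputs)  # one forward pass creates all the weights
```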
@ydshieh the new run seems to be passing. There's an unrelated failure in one of the vit-mae tests that I can't reproduce locally, but I think this PR resolves most of the problems!
@Rocketknight1 Unfortunately, it doesn't pass. We have to go to the …, and if you click Download the full output as a file, you will see 😢 😭
Well, to be more sure, I can revert PR #23234 on another branch, so it would be … Do you want me to do this?
No, I'm pretty confident that the change to the dummies is the cause!
@ydshieh Can we reduce the number of parallel workers by 1 for these tests? I think the speed boost from these PRs (plus some other ones I have planned) should compensate for any slowdown we experience, and it would be good to be able to make small changes without breaking fragile parallel tests like these.
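As a rough illustration of the idea (purely hypothetical; the real change would live in the CircleCI configuration, and the test path and extra flags here are made up), running with one fewer pytest-xdist worker means one fewer model is being built and held in memory at any given moment:

```python
# Hypothetical invocation: drop from 8 to 7 parallel workers.
# Assumes the pytest-xdist plugin is installed, which provides the "-n" option.
import pytest

exit_code = pytest.main(["-n", "7", "--dist=loadfile", "tests/"])
```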
Let me open a PR for that :-)
OK
Thanks!
0b47c41 to c7d9962 (Compare)
(rebasing onto @ydshieh's PR to test everything in combination)
* Let's see if we can use the smallest possible dummies
* Make GPT-2's dummies a little longer
* Just use (1,2) as the default shape
* Update other dummies in sync
* Correct imports for Keras 2.13
* Shrink the Wav2Vec2 dummies
cc @ydshieh - this will probably break some things, but if I can make it work it should reduce the memory usage during building a lot
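For a rough before/after picture of the change (the "old" shape below is an assumption for illustration; the thread does not show the exact previous values), the default dummies shrink to a single (1, 2) batch of token IDs:

```python
import tensorflow as tf

# Old-style dummy (assumed shape, for illustration only): a (3, 5) batch of
# token ids, so every model built during the tests ran a forward pass on 15 tokens.
old_dummy = {"input_ids": tf.constant([[7, 6, 0, 0, 1],
                                        [1, 2, 3, 0, 0],
                                        [0, 0, 0, 4, 5]], dtype=tf.int32)}

# New default described in this PR: the smallest shape that still builds every
# weight, i.e. batch size 1 and sequence length 2.
new_dummy = {"input_ids": tf.constant([[7, 6]], dtype=tf.int32)}
```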