Make the TF dummies even smaller #24071
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Adding @frostming's fix for Keras 2.13 to this PR as well
Tests are passing now, pinging @ydshieh and @amyeroberts for quick review! Unfortunately it's quite hard for me to test if this PR will fix the entire memory issue in the overnight CI, but we'll try this fix and if the memory issues remain then I'll try some other things too.
I can trigger a run.
LGTM 👍 Thanks for iterating on this
For my own understanding: do we know why the new build logic is triggering the OOM issues (what's the difference from before)? Would these failures happen if just one model was being tested, or only after N models have been tested?
It's more like when several processes run in parallel: on CircleCI, there are 8 pytest processes.
I also have the same question on why the memory usage increases that much. Previously, we didn't really use batch size 1 in the dummies, if I remember correctly.
A run is triggered here. If you need more changes and more runs to check, you can update the branch https://github.com/huggingface/transformers/tree/run_even_lower_tf_dummy_memory on top of this PR branch.
@amyeroberts I don't have a very good intuition for this, actually. I think it's some combination of:
It's also possible that the update sped up building by removing unnecessary build ops left over from TF 1 and by not unnecessarily passing dummies when models were already built. Speeding up builds might mean tests spend more of their time in the actual model calls; if peak memory usage occurs during those calls and we have lots of tests running in parallel, then more tests being in the calls simultaneously could result in higher peak memory usage for the test server. This is all purely speculative on my part, though: I can't reproduce the problem locally, and the nature of the parallel tests makes it hard to point to a single culprit for an OOM error!
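To make the speculation above more concrete, here is a minimal, self-contained sketch (not the actual transformers code; the class and property names are illustrative) of how a TF model gets built by calling it on its dummy inputs, and why shrinking those dummies lowers the peak memory of the build step:

```python
import tensorflow as tf

class ToyTFModel(tf.keras.Model):
    """Illustrative stand-in for a TF model that builds itself on dummies."""

    def __init__(self, vocab_size=100, hidden_size=32):
        super().__init__()
        self.embed = tf.keras.layers.Embedding(vocab_size, hidden_size)
        self.dense = tf.keras.layers.Dense(hidden_size)

    @property
    def dummy_inputs(self):
        # Smallest dummy that still exercises every weight: batch 1, length 2.
        # Peak memory during the build call scales with this shape, so when
        # many test workers are building models at once, tiny dummies keep the
        # combined spike small.
        return {"input_ids": tf.constant([[3, 4]], dtype=tf.int32)}

    def call(self, inputs):
        return self.dense(self.embed(inputs["input_ids"]))

model = ToyTFModel()
model(model.dummy_inputs)  # one forward pass creates all the weights
```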
@ydshieh the new run seems to be passing. There's an unrelated failure in one of the vit-mae tests that I can't reproduce locally, but I think this PR resolves most of the problems!
@Rocketknight1 Unfortunately, it doesn't pass. We have to go to the …, and if you click Download the full output as a file, you will see 😢 😭
Well, to be more sure, I can revert PR #23234 on another branch, so it would be … Do you want me to do this?
No, I'm pretty confident that the change to the dummies is the cause!
@ydshieh Can we reduce the number of parallel workers by 1 for these tests? I think the speed boost from these PRs (plus some other ones I have planned) should compensate for any slowdown we experience, and it would be good to be able to make small changes without breaking fragile parallel tests like these.
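As a rough illustration of the idea (purely hypothetical; the real change would live in the CircleCI configuration, and the test path and extra flags here are made up), running with one fewer pytest-xdist worker means one fewer model is being built and held in memory at any given moment:

```python
# Hypothetical invocation: drop from 8 to 7 parallel workers.
# Assumes the pytest-xdist plugin is installed, which provides the "-n" option.
import pytest

exit_code = pytest.main(["-n", "7", "--dist=loadfile", "tests/"])
```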
Let me open a PR for that :-)
OK
Thanks!
0b47c41 to c7d9962 (Compare)
(rebasing onto @ydshieh's PR to test everything in combination)
* Let's see if we can use the smallest possible dummies
* Make GPT-2's dummies a little longer
* Just use (1,2) as the default shape
* Update other dummies in sync
* Correct imports for Keras 2.13
* Shrink the Wav2Vec2 dummies
cc @ydshieh - this will probably break some things, but if I can make it work it should reduce the memory usage during building a lot
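For a rough before/after picture of the change (the "old" shape below is an assumption for illustration; the thread does not show the exact previous values), the default dummies shrink to a single (1, 2) batch of token IDs:

```python
import tensorflow as tf

# Old-style dummy (assumed shape, for illustration only): a (3, 5) batch of
# token ids, so every model built during the tests ran a forward pass on 15 tokens.
old_dummy = {"input_ids": tf.constant([[7, 6, 0, 0, 1],
                                        [1, 2, 3, 0, 0],
                                        [0, 0, 0, 4, 5]], dtype=tf.int32)}

# New default described in this PR: the smallest shape that still builds every
# weight, i.e. batch size 1 and sequence length 2.
new_dummy = {"input_ids": tf.constant([[7, 6]], dtype=tf.int32)}
```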