
Flax Whisper gradient checkpointing #22897

Closed

Conversation

@versae (Contributor) commented Apr 20, 2023

It uses flax.linen.remat and follows up on PRs #13657 and #17994.

What does this PR do?

Adds gradient_checkpointing to Flax Whisper models.
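
For context, here is a minimal, self-contained sketch of the technique being applied (toy modules, not the actual FlaxWhisperEncoderLayer / FlaxWhisperDecoderLayer classes): wrapping a layer class in flax.linen.remat makes JAX recompute that layer's activations during the backward pass instead of storing them, trading extra compute for lower memory.

```python
# A minimal sketch of gradient checkpointing with flax.linen.remat.
# ToyLayer / ToyEncoder are illustrative stand-ins, not the Whisper modules.
import jax
import jax.numpy as jnp
import flax.linen as nn


class ToyLayer(nn.Module):
    hidden_size: int = 64

    @nn.compact
    def __call__(self, hidden_states):
        hidden_states = nn.Dense(4 * self.hidden_size)(hidden_states)
        hidden_states = nn.gelu(hidden_states)
        return nn.Dense(self.hidden_size)(hidden_states)


class ToyEncoder(nn.Module):
    num_layers: int = 4
    gradient_checkpointing: bool = False

    def setup(self):
        # nn.remat (alias of nn.checkpoint) rematerializes the wrapped layer's
        # activations in the backward pass instead of keeping them in memory.
        # The real model layers take extra Python-bool arguments (deterministic,
        # output_attentions, ...) that have to be marked static via static_argnums
        # and passed positionally; that is omitted here to keep the toy small.
        layer_cls = nn.remat(ToyLayer) if self.gradient_checkpointing else ToyLayer
        self.layers = [layer_cls(name=str(i)) for i in range(self.num_layers)]

    def __call__(self, hidden_states):
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states


x = jnp.ones((2, 16, 64))
model = ToyEncoder(gradient_checkpointing=True)
params = model.init(jax.random.PRNGKey(0), x)
grads = jax.grad(lambda p: model.apply(p, x).sum())(params)
```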

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@sanchit-gandhi @peregilk

@versae (Contributor) commented Apr 20, 2023

At the moment, the model loads fine but I then get a weird error when training or generating:

/data/venvflax/lib/python3.8/site-packages/transformers/models/whisper/modeling_flax_whisper.py:520 in __call__

   517             residual = hidden_states
   518
   519             hidden_states = self.encoder_attn_layer_norm(hidden_states)
 ❱ 520             hidden_states, cross_attn_weights = self.encoder_attn(
   521                 hidden_states=hidden_states,
   522                 key_value_states=encoder_hidden_states,
   523                 attention_mask=encoder_attention_mask,

/data/venvflax/lib/python3.8/site-packages/transformers/models/whisper/modeling_flax_whisper.py:256 in __call__

   253             elif self.causal:
   254                 attention_mask = causal_mask
   255             elif attention_mask is not None:
 ❱ 256                 attention_mask = jnp.expand_dims(attention_mask, axis=(-3, -2))
   257
   258             # During fast autoregressive decoding, we feed one position at a time,
   259             # and cache the keys and values step by step.

/data/venvflax/lib/python3.8/site-packages/jax/_src/numpy/lax_numpy.py:896 in expand_dims

   893       axis = _ensure_index_tuple(axis)
   894       if hasattr(a, "expand_dims"):
   895           return a.expand_dims(axis)
 ❱ 896       return lax.expand_dims(a, axis)

ValueError: axis -3 is out of bounds for array of dimension 2

I'm not sure what's happening, so I thought maybe @sanchit-gandhi could provide some feedback :)

@HuggingFaceDocBuilderDev commented Apr 20, 2023

The documentation is not available anymore as the PR was closed or merged.

@versae (Contributor) commented Apr 22, 2023

I've been digging, and the only difference I can find is that for some reason the parameters passed to FlaxWhisperDecoderLayerCollection.__call__() from FlaxWhisperDecoder.__call__() differ between this PR's model and the original implementation. I tested this with a tiny model:

Original model:

encoder_attention_mask=None
deterministic=True
output_hidden_states=False

This PR's model:

encoder_attention_mask=True
deterministic=False
output_hidden_states=True

The rest of the parameters are the same: hidden_states, attention_mask, encoder_hidden_states, init_cache, output_attentions, and return_dict. The problem is that while the first decoder layer loads fine, the second one gets an attention_mask value of True for some reason, making any tensor operation on it fail.
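
As a sanity check on that diagnosis: a bare boolean where a 2-D (batch, seq_len) mask is expected is enough to reproduce the exact error in the traceback above, since a 0-dimensional value has no -3/-2 axes to expand into. A quick illustration (not from the PR itself):

```python
import jax.numpy as jnp

# A proper 2-D attention mask of shape (batch, seq_len) expands fine
# to the (batch, 1, 1, seq_len) broadcast shape used by the attention layer.
mask = jnp.ones((2, 7))
print(jnp.expand_dims(mask, axis=(-3, -2)).shape)  # (2, 1, 1, 7)

# A stray boolean is a 0-d array: the expanded result would only have 2
# dimensions, so axis -3 is out of bounds -- the same ValueError as above.
jnp.expand_dims(jnp.asarray(True), axis=(-3, -2))
```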

@versae (Contributor) commented Apr 22, 2023

All tests passing! The main issue was a missing self.gradient_checkpointing in FlaxWhisperPreTrainedModel.__init__(); it took me forever to track down.

I'll clean up the git history mess, but other than that I think it's finally ready :)
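
For anyone running into the same class of bug, the gist of the fix above (a hypothetical sketch with made-up Toy* names, not the PR's actual diff) is that the high-level wrapper has to record the gradient_checkpointing flag and forward it every time it builds the inner nn.Module, so the checkpointed and non-checkpointed code paths never get mixed:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class ToyModule(nn.Module):
    gradient_checkpointing: bool = False

    @nn.compact
    def __call__(self, x):
        dense_cls = nn.remat(nn.Dense) if self.gradient_checkpointing else nn.Dense
        return dense_cls(features=8)(x)


class ToyPreTrainedModel:
    """Hypothetical wrapper in the spirit of FlaxWhisperPreTrainedModel."""

    def __init__(self, gradient_checkpointing: bool = False):
        # The crucial part: keep the flag on the wrapper itself...
        self.gradient_checkpointing = gradient_checkpointing
        # ...and forward it whenever the inner module is (re)built; if it is
        # forgotten, some instantiations silently fall back to the default.
        self.module = ToyModule(gradient_checkpointing=self.gradient_checkpointing)

    def enable_gradient_checkpointing(self):
        self.gradient_checkpointing = True
        self.module = ToyModule(gradient_checkpointing=self.gradient_checkpointing)


model = ToyPreTrainedModel(gradient_checkpointing=True)
params = model.module.init(jax.random.PRNGKey(0), jnp.ones((2, 4)))
```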

@versae versae marked this pull request as ready for review April 22, 2023 18:10
@versae (Contributor) commented Apr 24, 2023

Closing in favor of #22954.

@versae versae closed this Apr 24, 2023
@versae versae deleted the flax-whisper-gradient-checkpointing branch April 27, 2023 08:13