Remove dropout from decoder cell state #15

richardburleigh · 2019-12-07T00:15:24Z

Fix FP16 stagnation at "OVERFLOW! Skipping step. Attempted loss scale.."

thepowerfuldeez · 2019-12-11T07:03:36Z

Thank you! That helped me

candlewill · 2020-06-12T07:01:04Z

but why?

mychiux413 · 2020-07-20T08:22:18Z

It helped me, too.
I also noticed that tacotron2 only applies dropout to the hidden state.

After looking at several studies,
there seems to be no consensus on how to apply dropout to RNNs,
and many papers discuss this.

Here is my opinion:
The intuition behind dropout is: "Can't rely on any one feature, so have to spread out the weights."
So if we apply dropout to the hidden state,
it means we don't want the gates to depend only on a few specific input features.

But the cell state runs directly along the entire chain of the RNN to achieve the long-memory behavior. Therefore, if we apply dropout to the cell state at every recurrent step, this seems to mean that we do not want the memory to persist for too long?
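
The memory argument above can be made concrete with a toy simulation. This is a minimal sketch under idealized assumptions that are not from this thread: the forget gate is fixed at 1, nothing new is written to the cell, and inverted dropout (zero a unit with probability p, scale the survivors by 1 / (1 - p)) is applied to the cell state at every step. The function names and the values passed to them are illustrative only.

```python
import random


def surviving_fraction(num_units, num_steps, p_drop, seed=0):
    """Simulate carrying a cell state through `num_steps` recurrent steps.

    Idealized setting (an assumption, not a real LSTM): forget gate = 1 and
    input gate = 0, so without dropout the cell state would be copied
    unchanged. Inverted dropout is applied to the cell state at every step.
    Returns the fraction of units that are still nonzero at the end.
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    rng = random.Random(seed)
    keep = 1.0 - p_drop
    cell = [1.0] * num_units
    for _ in range(num_steps):
        # Keep each unit with probability `keep` and rescale it, else zero it.
        cell = [c / keep if rng.random() < keep else 0.0 for c in cell]
    return sum(1 for c in cell if c != 0.0) / num_units


def expected_surviving_fraction(num_steps, p_drop):
    """Closed form: a unit must be kept at every one of the steps."""
    return (1.0 - p_drop) ** num_steps


if __name__ == "__main__":
    steps, p = 20, 0.1
    print("simulated:", surviving_fraction(20000, steps, p))
    print("expected: ", expected_surviving_fraction(steps, p))
```

Under these assumptions a unit keeps its memory for T steps with probability (1 - p)^T, so with p = 0.1 only about 12% of the cell units still hold anything after 20 steps. A real LSTM keeps writing new values into the cell, so this only shows how quickly old content is erased, not what the trained model does.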

Jeevesh8 · 2020-09-13T04:39:45Z

@mychiux413 But how would that lead to gradient overflow?
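
One possible contributing mechanism, offered as an assumption and not as a confirmed diagnosis for this repository: inverted dropout multiplies every surviving unit by 1 / (1 - p), so a cell-state unit that survives T consecutive steps with a forget gate near 1 is scaled by (1 / (1 - p))^T. The sketch below counts how many such steps it takes for that factor alone to exceed the largest finite FP16 value (65504). The helper names are hypothetical, and the calculation is about activation magnitudes in an idealized cell, not about the gradients themselves.

```python
FP16_MAX = 65504.0  # largest finite value representable in IEEE 754 half precision


def survivor_scale(num_steps, p_drop):
    """Factor applied to a unit that survives inverted dropout `num_steps` times."""
    return (1.0 / (1.0 - p_drop)) ** num_steps


def steps_until_fp16_overflow(p_drop):
    """Smallest number of steps after which the survivor factor exceeds FP16_MAX.

    Assumes the value starts at 1.0 and is only ever rescaled by dropout
    (forget gate = 1, no new input), which is an idealization.
    """
    if not 0.0 < p_drop < 1.0:
        raise ValueError("p_drop must be in (0, 1)")
    steps = 0
    scale = 1.0
    while scale <= FP16_MAX:
        scale /= (1.0 - p_drop)  # one more step of inverted-dropout rescaling
        steps += 1
    return steps


if __name__ == "__main__":
    for p in (0.1, 0.5):
        print(p, steps_until_fp16_overflow(p))
```

For p = 0.5 the factor doubles each step and passes 65504 at step 16 (2^16 = 65536); for p = 0.1 it takes 106 steps. Any single unit is unlikely to survive that long, and real forget gates are usually below 1, so this shows only that the cell state is not bounded the way the hidden state is (which passes through tanh and an output gate).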

chazo1994 · 2022-01-27T08:55:39Z

@mychiux413 But how is the quality of the FP32 model after changing the code as in this commit?

Remove dropout from decoder cell state

e7afe9c

Fix FP16 stagnation at "OVERFLOW! Skipping step. Attempted loss scale.."

richardburleigh mentioned this pull request Jun 9, 2020

Gradient overflow with Mixed Precision Training #63

Open
