Thank you for the great work and for making it open source! I am trying to fine-tune the Whisper large-v2 model with SpecAugment, and I am wondering which value was used for the masked regions during training. Was it 0? My concern with using 0 for masking is that 0 is also used to pad audio. This could cause the model to "hallucinate" at the end of shorter audio files (less than 30 seconds) during recognition, since the model would have been trained to predict content for masked regions that look the same as padding.
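To make the concern concrete, here is a minimal, dependency-free sketch of SpecAugment-style masking with a configurable fill value. It is not Whisper's training code. The function names (`spec_augment`, `mean_fill`), the toy 4x6 "spectrogram", and the choice of the mean as the fill value are all assumptions for illustration, and plain Python lists stand in for the real log-mel tensor.

```python
import random

PAD_VALUE = 0.0  # assumption: padding uses 0, as described in the post above


def spec_augment(spec, mask_value, max_time_width, max_freq_width, rng):
    """Return a copy of `spec` (mel rows x frames) with one time mask and
    one frequency mask filled with `mask_value`. Hypothetical helper."""
    n_mels, n_frames = len(spec), len(spec[0])
    out = [row[:] for row in spec]

    # Time mask: overwrite a contiguous run of frames in every mel bin.
    t_width = rng.randint(0, min(max_time_width, n_frames))
    t_start = rng.randint(0, n_frames - t_width)
    for row in out:
        row[t_start:t_start + t_width] = [mask_value] * t_width

    # Frequency mask: overwrite a contiguous run of mel bins in every frame.
    f_width = rng.randint(0, min(max_freq_width, n_mels))
    f_start = rng.randint(0, n_mels - f_width)
    for f in range(f_start, f_start + f_width):
        out[f] = [mask_value] * n_frames

    return out


def mean_fill(spec):
    """Mean of all spectrogram values, used here as the mask fill value."""
    values = [v for row in spec for v in row]
    return sum(values) / len(values)


rng = random.Random(0)
# Toy "spectrogram" with strictly negative values, so 0 never occurs naturally.
speech = [[rng.uniform(-1.0, -0.1) for _ in range(6)] for _ in range(4)]

fill = mean_fill(speech)  # differs from PAD_VALUE by construction
augmented = spec_augment(speech, fill, max_time_width=2, max_freq_width=2, rng=rng)

# Pad after masking: padded frames hold PAD_VALUE, masked cells hold `fill`,
# so the two kinds of "missing" input stay distinguishable.
padded = [row + [PAD_VALUE] * 3 for row in augmented]
```

With `mask_value=PAD_VALUE` instead, masked frames and padded frames would be identical, which is exactly the overlap I am worried about.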
Apologies for missing this post. This is a totally valid point, and I should fix the zero-padding method in `transcribe()` (also answered in #838).