Medium.en model just outputting "Okay" for every second in the audio while the base.en model works well #719

Closed
bangpradyumna opened this issue Apr 5, 2023 · 6 comments
Labels
decoding Decoding related issues

Comments

@bangpradyumna

Hello Everyone,
I have a recording that I'm trying to transcribe. I first tried the base.en model, which worked well though not perfectly. I then tried the medium.en model, but it just outputs "Okay" for each second of the audio.

There are 5 or 6 real "Okay"s in the audio, but the medium.en model keeps outputting "Okay" even for lines that the base.en model transcribes correctly.

Screenshot of the base.en model's output, which works well:
image

Screenshot of the medium.en model's output:
image

Any idea what I might be doing wrong?

@carlosbaraza

I'm having the same problem:

[00:37:36.000 --> 00:38:02.000]   >> Thank you.
[00:38:02.000 --> 00:38:28.000]   >> Thank you.
[00:38:28.000 --> 00:38:54.000]   >> Thank you.
[00:38:54.000 --> 00:39:20.000]   >> Thank you.
[00:39:20.000 --> 00:39:46.000]   >> Thank you.
[00:39:46.000 --> 00:40:12.000]   >> Thank you.
[00:40:12.000 --> 00:40:38.000]   >> Thank you.
[00:40:38.000 --> 00:41:04.000]   >> Thank you.
[00:41:04.000 --> 00:41:30.000]   >> Thank you.
[00:41:30.000 --> 00:41:56.000]   >> Thank you.
[00:41:56.000 --> 00:42:22.000]   >> Thank you.
[00:42:22.000 --> 00:42:48.000]   >> Thank you.
[00:42:48.000 --> 00:43:14.000]   >> Thank you.
[00:43:14.000 --> 00:43:40.000]   >> Thank you.
[00:43:40.000 --> 00:44:06.000]   >> Thank you.
[00:44:06.000 --> 00:44:32.000]   >> Thank you.
[00:44:32.000 --> 00:44:58.000]   >> Thank you.
[00:44:58.000 --> 00:45:24.000]   >> Thank you.
[00:45:24.000 --> 00:45:50.000]   >> Thank you.
[00:45:50.000 --> 00:46:16.000]   >> Thank you.
[00:46:16.000 --> 00:46:42.000]   >> Thank you.
[00:46:42.000 --> 00:47:08.000]   >> Thank you.
[00:47:08.000 --> 00:47:34.000]   >> Thank you.
[00:47:34.000 --> 00:48:00.000]   >> Thank you.
[00:48:00.000 --> 00:48:26.000]   >> Thank you.
[00:48:26.000 --> 00:48:52.000]   >> Thank you.
[00:48:52.000 --> 00:49:12.000]   >> Thank you.
[00:49:12.000 --> 00:49:38.000]   >> Thank you.
[00:49:38.000 --> 00:50:04.000]   >> Thank you.
[00:50:04.000 --> 00:50:30.000]   >> Thank you.
[00:50:30.000 --> 00:50:56.000]   >> Thank you.
[00:50:56.000 --> 00:51:22.000]   >> Thank you.
[00:51:22.000 --> 00:51:42.000]   >> Thank you.
[00:51:42.000 --> 00:52:08.000]   >> Thank you.
[00:52:08.000 --> 00:52:34.000]   >> Thank you.
[00:52:34.000 --> 00:53:00.000]   >> Thank you.
[00:53:00.000 --> 00:53:26.000]   >> Thank you.
[00:53:26.000 --> 00:53:52.000]   >> Thank you.
[00:53:52.000 --> 00:54:12.000]   >> Thank you.
[00:54:12.000 --> 00:54:38.000]   >> Thank you.
[00:54:38.000 --> 00:55:04.000]   >> Thank you.
[00:55:04.000 --> 00:55:30.000]   >> Thank you.
[00:55:30.000 --> 00:55:56.000]   >> Thank you.
[00:55:56.000 --> 00:56:22.000]   >> Thank you.
[00:56:22.000 --> 00:56:48.000]   >> Thank you.
[00:56:48.000 --> 00:57:14.000]   >> Thank you.
[00:57:14.000 --> 00:57:40.000]   >> Thank you.
[00:57:40.000 --> 00:58:06.000]   >> Thank you.
[00:58:06.000 --> 00:58:32.000]   >> Thank you.
[00:58:32.000 --> 00:58:58.000]   >> Thank you.
[00:58:58.000 --> 00:59:24.000]   >> Thank you.
[00:59:24.000 --> 00:59:50.000]   >> Thank you.
[00:59:50.000 --> 01:00:16.000]   >> Thank you.
[01:00:16.000 --> 01:00:42.000]   >> Thank you.
[01:00:42.000 --> 01:01:08.000]   >> Thank you.
[01:01:08.000 --> 01:01:34.000]   >> Thank you.
[01:01:34.000 --> 01:02:00.000]   >> Thank you.
[01:02:00.000 --> 01:02:26.000]   >> Thank you.
[01:02:26.000 --> 01:02:52.000]   >> Thank you.
[01:02:52.000 --> 01:03:14.000]   >> Thank you.
[01:03:14.000 --> 01:03:40.000]   >> Thank you.
[01:03:40.000 --> 01:04:06.000]   >> Thank you.
[01:04:06.000 --> 01:04:32.000]   >> Thank you.
[01:04:32.000 --> 01:04:52.000]   >> Thank you.
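
A quick way to spot this kind of output in a long transcript is to look for runs of consecutive segments with identical text. The following is a minimal sketch, assuming the timestamped line format shown above; `find_repeated_runs` and `min_run` are hypothetical names for illustration and are not part of whisper.cpp.

```python
import re
from itertools import groupby

# Matches lines such as "[00:37:36.000 --> 00:38:02.000]   >> Thank you."
SEGMENT_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)

def find_repeated_runs(lines, min_run=3):
    """Return (start, end, text, count) for each run of identical consecutive segments."""
    segments = []
    for line in lines:
        match = SEGMENT_RE.match(line.strip())
        if match:
            segments.append((match.group(1), match.group(2), match.group(3).strip()))

    runs = []
    # groupby only merges neighbours, so a phrase that recurs later starts a new run.
    for text, group in groupby(segments, key=lambda seg: seg[2]):
        group = list(group)
        if len(group) >= min_run:
            runs.append((group[0][0], group[-1][1], text, len(group)))
    return runs
```

A long run does not prove hallucination (a speaker may really repeat a phrase), but it marks the time range worth listening to.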

@abelbabel

I have the same issue. It doesn't seem to be tied to a specific model, and it doesn't happen with every input file.

@abelbabel

abelbabel commented Apr 11, 2023

similar to #731 and #612

@ggerganov ggerganov added the decoding Decoding related issues label Apr 14, 2023
@ggerganov
Owner

I've disabled the decoder fallbacks because the current implementation is very inefficient.
This will be resolved some time in the future.
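
For readers unfamiliar with the term: a decoder fallback generally means re-decoding a segment at a higher sampling temperature when the first result looks degenerate. The sketch below illustrates the idea only and is not whisper.cpp's code. It uses a zlib compression ratio with a 2.4 threshold, which is the default check in the OpenAI Whisper reference implementation; whisper.cpp's own checks may differ. The `decode` callable and the temperature list are assumptions made for illustration.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Highly repetitive text compresses well, so its ratio is large."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def decode_with_fallback(decode, temperatures=(0.0, 0.4, 0.8), max_ratio=2.4):
    """Try each temperature in order and keep the first result that passes the check.

    `decode` is a hypothetical callable mapping a temperature to decoded text.
    If every attempt fails the check, the last attempt is returned.
    """
    text = ""
    for temperature in temperatures:
        text = decode(temperature)
        if compression_ratio(text) <= max_ratio:
            break
    return text
```

Note that such a check looks at one segment at a time, so it catches repetition inside a segment rather than the same short phrase repeated across many segments.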

@abelbabel

It turned out that, in one case, the section where multiple "Okay"s were "hallucinated" contained only loud rumbling and noise (no speech). When I isolated that part and transcribed it on its own, it was detected correctly. I then passed one of the detected noise annotations (such as "(pages rustling)") as the prompt parameter, and the original file was transcribed properly.

Of course, this does not work at large scale.
But maybe it gives an idea of where the problem is ...
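
The workaround above could be scripted roughly as follows. This is a sketch under assumptions: the binary path `./main`, the model path, and the audio file name are placeholders, and `build_whisper_command` is a hypothetical helper. The `-m`, `-f`, and `--prompt` options are those of the whisper.cpp command-line example; check `--help` for your build before relying on them.

```python
import shlex

def build_whisper_command(audio_path, model_path, prompt, binary="./main"):
    """Build an argument list that passes an initial prompt to the CLI.

    All paths are placeholders (assumptions); nothing is executed here.
    """
    return [binary, "-m", model_path, "-f", audio_path, "--prompt", prompt]

cmd = build_whisper_command(
    "recording.wav",              # placeholder audio file
    "models/ggml-medium.en.bin",  # placeholder model path
    "(pages rustling)",           # noise annotation taken from an earlier run
)
print(shlex.join(cmd))  # shell-quoted form, ready to copy into a terminal
```

Passing the list to `subprocess.run(cmd, check=True)` would run it; building the list first keeps the quoting of the parenthesised prompt out of the shell.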

ggerganov added a commit that referenced this issue Apr 15, 2023
I disabled this because there were many complaints about slow decoding.
The current implementation does not allow batching the decoders when
using the "best of" or "beam size" parameters, so the decoding time is
proportional to the number of decoders, which is obviously not great.

However, now there are even more complaints about wrong decodings and
repetition.

So, making a compromise by re-enabling the fallbacks, but defaulting to
just 2 "best of" / "beam size" decoders. Also, the temperature step is
increased from 0.2 to 0.4 - i.e. from maximum of 5 fallbacks to maximum
of 2.

Also, the stream example now has fallbacks enabled by default.

close #471 #477 #508 #612 #719 #731
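
The fallback counts in the commit message follow from simple arithmetic. Here is a small sketch, assuming decoding starts at temperature 0.0 and fallbacks stop once the temperature would exceed 1.0 (both bounds are assumptions inferred from the numbers above); `fallback_temperatures` is a hypothetical helper.

```python
def fallback_temperatures(step, start=0.0, max_temp=1.0):
    """List the temperatures tried after the initial attempt at `start`."""
    temps = []
    k = 1
    # Rounding avoids float drift such as 0.6000000000000001.
    while round(start + k * step, 6) <= max_temp:
        temps.append(round(start + k * step, 6))
        k += 1
    return temps

print(fallback_temperatures(0.2))  # [0.2, 0.4, 0.6, 0.8, 1.0] -> 5 fallbacks
print(fallback_temperatures(0.4))  # [0.4, 0.8] -> 2 fallbacks
```

Since decoding time is described as proportional to the number of decoders, fewer fallbacks and only 2 decoders per attempt bound the worst-case cost per segment.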
@ggerganov
Owner

Should be resolved via f19e23f

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this issue Oct 24, 2023
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this issue Oct 24, 2023
landtanin pushed a commit to landtanin/whisper.cpp that referenced this issue Dec 16, 2023
iThalay pushed a commit to iThalay/whisper.cpp that referenced this issue Sep 23, 2024
4 participants