
Automatically adds "Thank you" #1592

Open
gkarmas opened this issue Dec 4, 2023 · 9 comments · May be fixed by #1768
Labels
enhancement New feature or request question Further information is requested

Comments

@gkarmas

gkarmas commented Dec 4, 2023

Testing the large-v3 model with word-by-word transcript output: when there is no speech at the end of the audio, it always appends "Thank you".

@bobqianic
Collaborator

That's hallucination.

https://arxiv.org/abs/2311.14648

@bobqianic bobqianic added the question Further information is requested label Dec 5, 2023
@gkarmas
Author

gkarmas commented Dec 5, 2023

Interesting, thanks for sharing. Is this fixable in the model itself? I'm stripping it out programmatically for now.
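For reference, here's a minimal sketch of the kind of programmatic stripping mentioned above. This is a hypothetical post-processing helper, not part of whisper.cpp; the function name and the phrase list are assumptions for illustration.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical post-processing helper: remove a known hallucinated
// phrase (e.g. "Thank you.") when it appears at the very end of the
// transcript, then trim any whitespace left behind.
std::string strip_trailing_phrase(std::string text,
                                  const std::vector<std::string>& phrases) {
    // Trim trailing whitespace first.
    while (!text.empty() && std::isspace(static_cast<unsigned char>(text.back()))) {
        text.pop_back();
    }
    for (const auto& p : phrases) {
        if (text.size() >= p.size() &&
            text.compare(text.size() - p.size(), p.size(), p) == 0) {
            text.erase(text.size() - p.size());
            // Trim again so we don't leave a dangling space.
            while (!text.empty() && std::isspace(static_cast<unsigned char>(text.back()))) {
                text.pop_back();
            }
            break;
        }
    }
    return text;
}
```

Note this only removes the phrase at the end of the output, so a genuine "Thank you" in the middle of the audio is left alone.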

@shylock74

openai/whisper#928

@misutoneko

misutoneko commented Dec 5, 2023

As I've mentioned in that openai/whisper thread, I got rid of these with the --suppress_tokens command-line switch.
It looks like whisper.cpp doesn't have that switch, but the BEG token can be suppressed in whisper_process_logits() by adding this line:
logits[vocab.token_beg] = -INFINITY;

It will cause you to get descriptions of sound events instead of "Thank you".
That's a bit easier to deal with, I think.

EDIT: Around line 4600 or so (there are similar lines for other tokens there).
EDIT2: Note that this doesn't work the same way in whisper.cpp; we need something else there. No-timestamps mode wasn't a problem for me, but I guess that's only because my clips are usually very short.

@JRWSP

JRWSP commented Dec 5, 2023

> As I've mentioned in that openai whisper thread, I got rid of these with the --suppress_tokens command line switch. Looks like whisper.cpp doesn't have that, but the BEG token can be suppressed in whisper_process_logits(), just add this line: logits[vocab.token_beg] = -INFINITY;
>
> It will cause you to get descriptions of sound events instead of "Thank you". That's a bit easier to deal with, I think.

Can you give more details on where to add the line? I don't know C++.

@bobqianic
Collaborator

@JRWSP Add it to this function.

static void whisper_process_logits(
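To illustrate what that one line does, here is a standalone sketch (not whisper.cpp's actual sampling code): once a token's logit is set to -INFINITY, a greedy argmax over the logits can never select it, which is how the suppression works.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Standalone illustration of logit suppression. Greedy decoding picks
// the token with the highest logit; forcing a logit to -INFINITY makes
// that token unselectable.
size_t argmax_token(const std::vector<float>& logits) {
    size_t best = 0;
    for (size_t i = 1; i < logits.size(); ++i) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}

std::vector<float> suppress_token(std::vector<float> logits, size_t token_id) {
    // Same idea as `logits[vocab.token_beg] = -INFINITY;` inside
    // whisper_process_logits().
    logits[token_id] = -INFINITY;
    return logits;
}
```

In whisper.cpp itself the suppression line would go alongside the other token-suppression assignments inside whisper_process_logits(), as misutoneko noted above.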

@bobqianic
Collaborator

> As I've mentioned in that openai whisper thread, I got rid of these with the --suppress_tokens command line switch. Looks like whisper.cpp doesn't have that, but the BEG token can be suppressed in whisper_process_logits(), just add this line: logits[vocab.token_beg] = -INFINITY;
>
> It will cause you to get descriptions of sound events instead of "Thank you". That's a bit easier to deal with, I think.

Could you explain how suppressing the BEG (begin-timestamp) token helps reduce hallucinations?

@misutoneko

misutoneko commented Dec 6, 2023

Well, if I've understood this correctly, suppressing the non-speech tokens somehow causes the BEG token to emerge (rather than the NOT / no-timestamps token), and that's what causes these hallucinations.
(I don't think I saw any of these problematic tokens come up after the NOT token.)
So in that sense simply suppressing BEG might help, or it might not.
But the NOT token (no-timestamps mode) might not be very desirable either.

The workaround that I used for whisper/whisper-timestamped was to allow non-speech tokens.
Here's the original thread:
linto-ai/whisper-timestamped#107

I suppose this could all be fixed in the training data too, but that's something we plebs don't get to see.
Btw, I've tested this mostly with the medium and small models (I haven't tried large-v3). The ".en" models use a different token id.

EDIT: OK, it was a nice theory, but it doesn't hold up (for whisper.cpp).
whisper.cpp does have a parameter for non-speech tokens, and they're allowed by default, so something else must be going on.
I actually tried to replicate the --suppress_tokens "" mode with whisper.cpp (by allowing everything through without filtering), but it didn't seem to help much. Maybe there's just a difference between the two codebases in how the calculations are done.

PR #1588 has some samples for testing.

@misutoneko

Hmmm, do these hallucinated tokens always have low probability?
If so, they could easily be filtered out on that basis.
But there's a risk that some useful tokens might get lost as well (with low-quality audio?).
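A sketch of that probability-threshold idea (hypothetical: the pair-based token representation and the threshold value are assumptions for illustration, not whisper.cpp's API):

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical filter: keep only tokens whose probability meets a
// threshold. The trade-off noted above applies: genuine low-confidence
// tokens (e.g. from low-quality audio) would be dropped too.
std::vector<std::pair<std::string, float>> filter_low_prob(
        const std::vector<std::pair<std::string, float>>& tokens,
        float min_prob) {
    std::vector<std::pair<std::string, float>> kept;
    for (const auto& t : tokens) {
        if (t.second >= min_prob) kept.push_back(t);
    }
    return kept;
}
```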

Another idea I haven't seen mentioned is that prompting can sometimes help (for short clips?).
Even if the prompt is just " ", it will change the output.
