Improve decoding #291
Thank you for your immense work and this wonderful project.
First, thanks for all the hard work on this! I am playing around with 1.1.0 as I write this. I still have the issue that was closed in #172. The problem is worse now in that the "echoes" may eat up CPU like crazy. My test case is to repeat the number "six" multiple times. (Sorry about the 666 humor.) I send Whisper an audio block of about 4 seconds with the word "six" repeated at least 3 times. Instead of returning a large number of sixes, Whisper now crunches for up to 1 minute and returns various odd strings.
@RndyP
Thanks for the fixes 👍
* whisper : prepare infra for new decoding strategies
* whisper : apply logit filters and compute logprobs
* whisper : add whisper_get_logits()
* whisper : separate self and cross attention memory
  Initial step needed for supporting parallel decoders
* whisper : move probs_id buffer to whisper_context
* whisper : refactor kv cache into separate struct
* whisper : move self-attention kv cache to whisper_decoder
* whisper : wip decoding parameters + strategies
* whisper : wip decoding parameters + strategies (part 2)
* whisper : wip decoding parameters + strategies (part 3)
* whisper : wip decoding parameters + strategies (part 4)
* whisper : fix prompt_past update to not include prompt_init
* whisper : temperature + best_of support
* whisper : support for compression_ration_threshold
  We actually use entropy, but it is similar
* command : fix example to use logits instead of obsolete probs
* whisper : handle empty sequence ranking
* whisper : add WHISPER_DEBUG + diagnostic prints + new main args
* whisper : minor fixes
* whisper : add beam-search support
* whisper : bug fix when there no previous context
* whisper : add comments
* stream : disable temperature fallback
  For real-time processing, we always want a single decoder running at T=0
* whisper.swiftui : update example - fix paths + add empty folders
ref #278 #133 #172 #255 #270
The goal of this PR is to reach decoding parity with OpenAI's implementation and potentially go beyond it.
There are several ideas for improving the decoding strategy that will be explored.
There is some chance that these ideas will improve segment and token timestamp precision, but no guarantees.
Implemented decoding strategies
Decoded sequences can be discarded based on the average logprob of their tokens. When the average logprob is below the threshold, the model was not very confident in the transcription, and we need to apply a fallback strategy to generate a better sequence.
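The check described above can be sketched in a few lines. This is only an illustration: `average_logprob` and `below_logprob_threshold` are hypothetical helper names, not functions from `whisper.h`, and the plain arithmetic mean is an assumption about how the average is taken.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of whisper.h): arithmetic mean of the
// per-token log-probabilities of one decoded sequence.
double average_logprob(const std::vector<double> & token_logprobs) {
    assert(!token_logprobs.empty());
    const double sum = std::accumulate(token_logprobs.begin(), token_logprobs.end(), 0.0);
    return sum / token_logprobs.size();
}

// Returns true when the sequence should be discarded and a fallback
// (for example, decoding again at a higher temperature) should be tried.
bool below_logprob_threshold(const std::vector<double> & token_logprobs, double threshold) {
    return average_logprob(token_logprobs) < threshold;
}
```

A caller would run this once per decoded sequence and trigger the fallback only when it returns `true`. The threshold value itself is left to the caller.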
This is similar to OpenAI's compression ratio threshold logic, which is used to determine whether a sequence is too repetitive. However, in `whisper.cpp`, instead of using `zlib` compression, we use a basic entropy metric `H = -sum(p*log(p))` of the last 32 tokens in the sequence to determine whether the decoding has degenerated into endless repetition. Low entropy means more repetition. This approach needs further testing, and the entropy threshold will probably need some adjustment.
By default, the decoding starts with `T = 0`, deterministically selecting the best token each time based on the computed logits. Upon failure, we increase the temperature and start sampling the tokens from a discrete probability distribution obtained by scaling the logits with `1/T`.
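The entropy check and the temperature-scaled sampling described above can be sketched as follows. This is a minimal sketch under stated assumptions: `p` is taken to be the empirical frequency of each distinct token id within the window, and the names `tail_entropy`, `softmax_with_temperature` and `sample_token` are hypothetical, not part of the `whisper.cpp` API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// H = -sum(p*log(p)) over the last `window` token ids. Here p is the
// empirical frequency of each distinct id in the window (an assumption of
// this sketch). Low values indicate a repetitive tail.
double tail_entropy(const std::vector<int> & tokens, std::size_t window = 32) {
    const std::size_t n = std::min(window, tokens.size());
    if (n == 0) return 0.0;
    std::map<int, int> counts;
    for (std::size_t i = tokens.size() - n; i < tokens.size(); ++i) {
        counts[tokens[i]]++;
    }
    double h = 0.0;
    for (const auto & kv : counts) {
        const double p = static_cast<double>(kv.second) / n;
        h -= p * std::log(p);
    }
    return h;
}

// Softmax of logits scaled by 1/T (requires T > 0). The maximum logit is
// subtracted first for numerical stability.
std::vector<double> softmax_with_temperature(const std::vector<float> & logits, double T) {
    assert(!logits.empty() && T > 0.0);
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / T);
        sum += probs[i];
    }
    for (double & p : probs) p /= sum;
    return probs;
}

// T = 0: deterministic argmax. T > 0: draw from the scaled distribution.
int sample_token(const std::vector<float> & logits, double T, std::mt19937 & rng) {
    assert(!logits.empty());
    if (T <= 0.0) {
        return static_cast<int>(std::max_element(logits.begin(), logits.end()) - logits.begin());
    }
    const std::vector<double> probs = softmax_with_temperature(logits, T);
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}
```

A sequence that ends in the same token repeated over and over gives an entropy of zero, while a window of all-distinct tokens gives the maximum, `log(window)`. Where exactly to place the threshold between those extremes is the part that needs tuning.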
`Greedy` decoding strategy
Uses `--best-of` independent decoders for `T > 0`. Each decoder keeps a separate decoding sequence. At temperature `T > 0.5` we clear any previous context. The rationale is that sometimes the context can confuse the decoder and drive it into a failure case.
`BeamSearch` decoding strategy
At `T = 0` we start with `--beam-size` independent decoders. Each one generates the top `--beam-size` sequences from its current state. From all generated candidate sequences, we pick the top `--beam-size` based on the logprob sum of their tokens and reassign them to the decoders. Upon failure, we increase the temperature and fall back to the `Greedy` strategy. The `BeamSearch` decoder is `--beam-size` times more computationally heavy than the `Greedy` decoder.
I think it is worth exploring a strategy which initially uses 1 beam at `T = 0` and only activates `--beam-size` decoders upon failure. This would significantly speed up the processing and I hope it will keep the transcription quality high. Will probably add a flag for that.
Development notes
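As a reference while working on this, the BeamSearch candidate selection described earlier (pool the candidates proposed by all decoders, then keep the top `--beam-size` by logprob sum) might look like the sketch below. The `beam_candidate` struct and `select_top_candidates` function are hypothetical names used only for illustration; the actual bookkeeping in `whisper.cpp` may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One candidate continuation: the decoder it came from, the proposed next
// token, and the logprob sum of the whole sequence including that token.
struct beam_candidate {
    int    decoder_idx;
    int    token_id;
    double sum_logprob;
};

// Keep the `beam_size` candidates with the highest logprob sum, best first.
// The survivors are then reassigned to the decoders for the next step.
std::vector<beam_candidate> select_top_candidates(std::vector<beam_candidate> candidates,
                                                  std::size_t beam_size) {
    const std::size_t k = std::min(beam_size, candidates.size());
    std::partial_sort(candidates.begin(), candidates.begin() + k, candidates.end(),
        [](const beam_candidate & a, const beam_candidate & b) {
            return a.sum_logprob > b.sum_logprob;
        });
    candidates.erase(candidates.begin() + k, candidates.end());
    return candidates;
}
```

Because each of the `--beam-size` decoders contributes up to `--beam-size` candidates, the pool holds at most `beam_size * beam_size` entries per step, so a partial sort is enough and a full sort is not needed.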
`best_of` is used only by the Greedy decoder at `temperature > 0`
`beam_size` is used by the BeamSearch decoder
Both `best_of` and `beam_size` require maintaining a separate KV memory for each decoder stream. This needs changes both in the `whisper.h` interface and in `whisper_context` / `whisper_model`. Introduce `whisper_decoder`
The `compression_ratio` heuristic might be out of scope - I cannot implement `zlib.compress` from scratch. Maybe use something simpler, like n-gram entropy?
ChatGPT brainstorming
Clear past prompt only for `temperature >= 0.5`:
`patience` controls the max number of sequences to obtain from the beam search
BeamSearch algorithm is explained here: https://arxiv.org/pdf/2204.05424.pdf
For each decoded sequence, maintain `avg_logprob` of the tokens in order to implement `logprob_threshold` fallback: https://github.com/openai/whisper/blob/0b1ba3d46ebf7fe6f953acfd8cad62a4f851b49f/whisper/transcribe.py#L119-L120
Sequence ranking with and without `length_penalty`: https://github.com/openai/whisper/blob/0b1ba3d46ebf7fe6f953acfd8cad62a4f851b49f/whisper/decoding.py#L169-L192
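For the last note, here is a rough sketch of sequence ranking with and without a length penalty. It follows my reading of the linked `decoding.py` lines, where the logprob sum is divided by the sequence length when no penalty is set and by `((5 + length) / 6) ^ length_penalty` otherwise; please verify this against the linked code. The names `scored_sequence`, `sequence_score` and `best_sequence` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct scored_sequence {
    double      sum_logprob; // sum of the token log-probabilities
    std::size_t length;      // number of decoded tokens
};

// Score = sum_logprob / penalty.
// Without a length penalty, the penalty is the sequence length (so the
// score is the average logprob). With one, it is ((5 + length) / 6)^alpha.
// This formula is an assumption based on the linked reference code.
double sequence_score(const scored_sequence & s, bool use_length_penalty, double length_penalty) {
    assert(s.length > 0);
    const double penalty = use_length_penalty
        ? std::pow((5.0 + s.length) / 6.0, length_penalty)
        : static_cast<double>(s.length);
    return s.sum_logprob / penalty;
}

// Index of the highest-scoring sequence (the first one wins on ties).
std::size_t best_sequence(const std::vector<scored_sequence> & seqs,
                          bool use_length_penalty, double length_penalty) {
    assert(!seqs.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < seqs.size(); ++i) {
        if (sequence_score(seqs[i], use_length_penalty, length_penalty) >
            sequence_score(seqs[best], use_length_penalty, length_penalty)) {
            best = i;
        }
    }
    return best;
}
```

Either form of the penalty keeps longer sequences from being ranked lower simply because they accumulate more negative logprob terms.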