Implement no_speech_thold #2625

jkarthic · 2024-12-13T05:41:21Z

no_speech_thold functionality is on par with OpenAI's whisper

ggerganov · 2024-12-13T09:03:46Z

src/whisper.cpp

+                // Calculate no_speech probability after first decode
+                {
+                    const float * logits = state->logits.data();
+                    const int n_vocab = ctx->vocab.n_vocab;
+
+                    // Find max element for numerical stability
+                    float max_logit = -INFINITY;
+                    for (int i = 0; i < n_vocab; ++i) {
+                        max_logit = std::max(max_logit, logits[i]);
+                    }
+
+                    // Calculate softmax
+                    float sum_exp = 0.0f;
+                    std::vector<float> probs(n_vocab);
+                    for (int i = 0; i < n_vocab; ++i) {
+                        float exp_val = expf(logits[i] - max_logit);
+                        sum_exp += exp_val;
+                        probs[i] = exp_val;
+                    }
+
+                    // Normalize
+                    for (int i = 0; i < n_vocab; ++i) {
+                        probs[i] /= sum_exp;
+                    }
+
+                    // Get probability of no_speech token
+                    state->no_speech_prob = probs[whisper_token_nosp(ctx)];
+                }
+


This likely has to be done inside whisper_process_logits in order to avoid computing the softmax again just for this probability.

Unfortunately, we cannot reuse the softmax computed inside whisper_process_logits, because no_speech_prob has to be calculated before any logits filtering; otherwise we get wrong no_speech_prob values. OpenAI's whisper follows the same approach: https://github.com/openai/whisper/blob/main/whisper/decoding.py#L689-L703
Since no_speech_prob is calculated only for the first token in the sequence, it should not have a big performance impact.
On a related note, I have modularized the probs calculation and am now reusing the same code as whisper_process_logits.
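To illustrate why the probability has to come from the unfiltered logits, here is a minimal, self-contained sketch. The helper name `token_prob_from_raw_logits` is hypothetical and is not the function added in this PR; it only shows a numerically stable softmax evaluated for a single token id, assuming the logits are passed in before any suppression or filtering.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of whisper.cpp): probability of a single
// token under a softmax over the raw logits, i.e. before any filtering.
static float token_prob_from_raw_logits(const std::vector<float> & logits, std::size_t token_id) {
    if (logits.empty() || token_id >= logits.size()) {
        return 0.0f; // nothing sensible to report for an invalid request
    }

    // Subtract the maximum logit for numerical stability.
    const float max_logit = *std::max_element(logits.begin(), logits.end());

    // Accumulate the softmax denominator over the whole vocabulary.
    double sum_exp = 0.0;
    for (const float l : logits) {
        sum_exp += std::exp(static_cast<double>(l - max_logit));
    }

    // Only the requested token needs its numerator evaluated.
    const double num = std::exp(static_cast<double>(logits[token_id] - max_logit));
    return static_cast<float>(num / sum_exp);
}
```

If a competing logit is set to -INFINITY first, as a filtering step might do, the probability of the remaining tokens goes up, which is why a value computed after filtering would not match one computed before it.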

ggerganov · 2024-12-13T09:04:22Z

src/whisper.cpp

@@ -6038,7 +6068,8 @@ int whisper_full_with_state(
            if (it != (int) temperatures.size() - 1) {
                const auto & decoder = state->decoders[best_decoder_id];

-                if (decoder.failed || decoder.sequence.avg_logprobs < params.logprob_thold) {
+                if (decoder.failed ||
+                    (decoder.sequence.avg_logprobs < params.logprob_thold && state->no_speech_prob < params.no_speech_thold)) {


The log message is no longer correct

The comparison for the speech prob is wrong

Thanks. I have corrected the log message to print the no_speech_prob and no_speech_thold values as well.

The comparison logic is on par with the OpenAI implementation. avg_logprobs being lower than the threshold is considered a failure only for a speech segment. If the segment is non-speech, it is considered a successful prediction of "silence". Here is the relevant code from OpenAI; I have just merged it into one condition by inverting the comparison, but the logic is the same.
https://github.com/openai/whisper/blob/main/whisper/transcribe.py#L209-L220
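The merged condition can be sketched as a standalone predicate. Both function names below are hypothetical and exist only for illustration; the first follows the two-step reasoning described above (low log-probability triggers a fallback unless the segment looks like silence), and the second is the single inverted comparison from the diff.

```cpp
// Hypothetical two-step form: flag a low average log-probability first,
// then clear the flag when the segment is judged to be non-speech.
static bool needs_fallback_two_step(bool failed, float avg_logprob, float logprob_thold,
                                    float no_speech_prob, float no_speech_thold) {
    if (failed) {
        return true; // a failed decoder always triggers a retry
    }
    bool needs_fallback = avg_logprob < logprob_thold;
    if (needs_fallback && no_speech_prob >= no_speech_thold) {
        needs_fallback = false; // treated as a successful prediction of silence
    }
    return needs_fallback;
}

// Hypothetical merged form, matching the shape of the condition in the diff.
static bool needs_fallback_merged(bool failed, float avg_logprob, float logprob_thold,
                                  float no_speech_prob, float no_speech_thold) {
    return failed || (avg_logprob < logprob_thold && no_speech_prob < no_speech_thold);
}
```

The two forms agree for every input. The only assumption worth checking against the reference code is how the exact boundary case, where no_speech_prob equals no_speech_thold, is treated.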

ggerganov

Thank you, good PR.

# By Georgi Gerganov (4) and others
# Via GitHub
* ggerganov/master:
  stream : improve consistency in README (ggerganov#2642)
  whisper : support no_speech_thold (ggerganov#2625)
  whisper : add single-timestamp logic (ggerganov#2629)
  readme : fix typo (ggerganov#2637)
  cmake : fix "amd64" processor string (ggerganov#2638)
  vulkan : fix soft_max.comp division by zero (ggerganov#2633)
  common : add cstdio header
  stream : update build instructions
  android : fix build and ci (ggerganov#2624)
  models : fix typo in download-ggml-model.sh (ggerganov#2623)
  ruby : Sync whisper.cpp and model download feature (ggerganov#2617)
  scripts : update to new build system
# Conflicts:
#   src/whisper.cpp

Implement no_speech_thold

72c277f

no_speech_thold functionality is on par with OpenAI's whisper

ggerganov reviewed Dec 13, 2024

Addressed review comments

3448759

ggerganov approved these changes Dec 17, 2024

ggerganov merged commit f897eb7 into ggerganov:master Dec 17, 2024

KitaitiMakoto mentioned this pull request Dec 17, 2024

ruby : Add no_speech_thold #2641

Merged

pminev mentioned this pull request Dec 18, 2024

Update submodules alpaca-core/ilib-whisper.cpp#5

Closed
