Fix hallucinations during silence #2629

Merged (2 commits) on Dec 17, 2024

jkarthic
Contributor

When the predicted tokens end with a single timestamp, the entire 30-second segment should be considered done, to avoid hallucinations in the remaining part of the segment.
This behaviour is on par with openai's whisper. Refer to the logic related to `single_timestamp_ending` in https://github.com/openai/whisper/blob/main/whisper/transcribe.py
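
For readers unfamiliar with that logic, here is a minimal pure-Python sketch of the idea, loosely modeled on the `single_timestamp_ending` handling in openai's transcribe.py. It is not the code from this PR or from whisper.cpp. The constants and the helper names `is_timestamp` and `next_seek` are assumptions made up for illustration, and segment extraction is left out entirely.

```python
# Hypothetical constants for illustration; real values come from the tokenizer and model.
TIMESTAMP_BEGIN = 1000     # token ids >= this are treated as timestamp tokens
SEGMENT_FRAMES = 3000      # one 30 s window, counted in 10 ms frames
FRAMES_PER_TIMESTAMP = 2   # one timestamp step is assumed to be 20 ms


def is_timestamp(token: int) -> bool:
    return token >= TIMESTAMP_BEGIN


def single_timestamp_ending(tokens: list[int]) -> bool:
    """True when the last token is a timestamp and the one before it is not."""
    if len(tokens) < 2:
        return False
    return not is_timestamp(tokens[-2]) and is_timestamp(tokens[-1])


def next_seek(seek: int, tokens: list[int]) -> int:
    """Return the frame offset where the next decoding window should start."""
    # Indices i where tokens[i] and tokens[i + 1] are both timestamps,
    # i.e. the end of one segment followed by the start of the next.
    pairs = [
        i
        for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
        if is_timestamp(a) and is_timestamp(b)
    ]
    if not pairs or single_timestamp_ending(tokens):
        # No usable split point, or a lone trailing timestamp: treat the
        # whole window as done and skip the rest of it.
        return seek + SEGMENT_FRAMES
    # Otherwise resume at the last timestamp that closed a segment.
    last_timestamp = tokens[pairs[-1]] - TIMESTAMP_BEGIN
    return seek + last_timestamp * FRAMES_PER_TIMESTAMP


# Text closed by a lone timestamp: skip the full window.
print(next_seek(0, [1000, 5, 6, 1100, 1100, 7, 8, 1484]))
# Text cut off without a closing timestamp: resume at the last segment boundary.
print(next_seek(0, [1000, 5, 6, 1100, 1100, 7, 8]))
```

Under these assumptions, a window whose tokens end in a lone timestamp advances by the full 30 seconds, so the trailing silence is never decoded again.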
@itsthisjustin

We need this so bad. Hopefully it'll work with the swift package?

@jkarthic
Contributor Author

> We need this so bad. Hopefully it'll work with the swift package?

@itsthisjustin Yes, of course. The fix is in the core whisper.cpp file, so any language binding built on this version/branch will have the issue fixed.

@mrfragger

Gonna test this. Here is the output with 1.7.2:

[00:01:07.360 --> 00:01:07.820] Father, for all-in-sacrifice, for all-in-sacrifice, for all-in-sacrifice, for all-in-sacrifice,
[00:01:07.820 --> 00:01:08.360] for all-in-sacrifice. Yes, hold on, hold on.
[00:01:08.360 --> 00:01:12.360] Father, for all-in-sacrifice, for all-in-sacrifice, for all-in-sacrifice. Yes, hold on.
[00:01:12.360 --> 00:01:14.360] D.C. now, see what's going on.
[00:01:14.360 --> 00:01:14.360] Hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey,
[00:01:14.360 --> 00:01:14.360] hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey,
[00:01:14.360 --> 00:01:14.360] hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey,
[00:01:14.360 --> 00:01:14.360] hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey,
[00:01:14.360 --> 00:01:14.360] hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey,
[00:01:14.360 --> 00:01:14.360] hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey,
[00:01:14.360 --> 00:01:14.360] hey
[00:01:14.360 --> 00:01:15.360] D.C. now, see what's going on.
[00:01:15.360 --> 00:01:16.360] D.C. now, see what's going on.
[00:01:16.360 --> 00:01:17.360] D.C. now, see what's going on.
[00:01:17.360 --> 00:01:18.360] D.C. now, see what's going on.
[00:01:18.360 --> 00:01:19.360] D.C. now, see what's going on.
[00:01:19.360 --> 00:01:20.360] D.C. now, see what's going on.
[00:01:20.360 --> 00:01:21.360] D.C. now, see what's going on.
[00:01:21.360 --> 00:01:22.360] D.C. now, see what's going on.
[00:01:22.360 --> 00:01:23.360] D.C. now, see what's going on.
[00:01:23.360 --> 00:01:24.360] D.C. now, see what's going on.
[00:01:24.360 --> 00:01:25.360] D.C. now, see what's going on.
[00:01:25.360 --> 00:01:26.360] D.C. now, see what's going on.
[00:01:26.360 --> 00:01:27.360] D.C. now, see what's going on.
[00:01:27.360 --> 00:01:28.360] D.C. now, see what's going on.
[00:01:28.360 --> 00:01:29.360] D.C. now, see what's going on.
[00:01:29.360 --> 00:01:30.360] D.C. now, see what's going on.
[00:01:30.360 --> 00:01:31.360] D.C. now, see what's going on.
[00:01:31.360 --> 00:01:32.360] D.C. now, see what's going on.
[00:01:32.360 --> 00:01:33.360] D.C. now, see what's going on.
[00:01:33.360 --> 00:01:34.360] D.C. now, see what's going on.
[00:01:34.360 --> 00:01:35.360] D.C. now, see what's going on.
[00:01:35.360 --> 00:01:36.360] D.C. now, see what's going on.
[00:01:36.360 --> 00:01:37.360] D.C. now, see what's going on.
[00:01:37.360 --> 00:01:38.360] D.C. now, see what's going on.
[00:01:38.360 --> 00:01:39.360] D.C. now, see what's going on.
[00:01:39.360 --> 00:01:40.360] D.C. now, see what's going on.
[00:01:40.360 --> 00:01:41.360] D.C. now, see what's going on.
[00:01:41.360 --> 00:01:42.360] D.C. now, see what's going on.
[00:01:42.360 --> 00:01:43.360] D.C. now, see what's going on.
[00:01:43.360 --> 00:01:44.360] D.C. now, see what's going on.
[00:01:44.360 --> 00:01:45.360] D.C. now, see what's going on.
[00:01:45.360 --> 00:01:46.360] D.C. now, see what's going on.
[00:01:46.360 --> 00:01:47.360] D.C. now, see what's going on.
[00:01:47.360 --> 00:01:48.360] D.C. now, see what's going on.
[00:01:48.360 --> 00:01:49.360] D.C. now, see what's going on.
[00:01:49.360 --> 00:01:50.360] D.C. now, see what's going on.
[00:01:50.360 --> 00:01:51.360] D.C. now, see what's going on.
[00:01:51.360 --> 00:01:52.360] D.C. now, see what's going on.
[00:01:52.360 --> 00:01:53.360] D.C. now, see what's going on.
[00:01:53.360 --> 00:01:54.360] D.C. now, see what's going on.
[00:01:54.360 --> 00:01:55.360] D.C. now, see what's going on.
[00:01:55.360 --> 00:01:56.360] D.C. now, see what's going on.
[00:01:56.360 --> 00:01:57.360] D.C. now, see what's going on.
[00:01:57.360 --> 00:01:58.360] D.C. now, see what's going on.
[00:01:58.360 --> 00:01:59.360] D.C. now, see what's going on.
[00:01:59.360 --> 00:02:00.360] D.C. now, see what's going on.
[00:02:00.360 --> 00:02:01.360] D.C. now, see what's going on.
[00:02:01.360 --> 00:02:02.360] D.C. now, see what's going on.
[00:02:02.360 --> 00:02:03.360] D.C. now, see what's going on.
[00:02:03.360 --> 00:02:04.360] D.C. now, see what's going on.
[00:02:04.360 --> 00:02:05.360] D.C. now, see what's going on.
[00:02:05.360 --> 00:02:06.360] D.C. now, see what's going on.
[00:02:06.360 --> 00:02:07.360] D.C. now, see what's going on.
[00:02:07.360 --> 00:02:08.360] D.C. now, see what's going on.
[00:02:08.360 --> 00:02:09.360] D.C. now, see what's going on.
[00:02:09.360 --> 00:02:10.360] D.C. now, see what's going on.
[00:02:10.360 --> 00:02:11.360] D.C. now, see what's going on.
[00:02:11.360 --> 00:02:12.360] D.C. now, see what's going on.
[00:02:12.360 --> 00:02:13.360] D.C. now, see what's going on.
[00:02:13.360 --> 00:02:14.360] D.C. now, see what's going on.
[00:02:14.360 --> 00:02:15.360] D.C. now, see what's going on.
[00:02:15.360 --> 00:02:16.360] D.C. now, see what's going on.
[00:02:16.360 --> 00:02:17.360] D.C. now, see what's going on.
[00:02:17.360 --> 00:02:18.360] D.C. now, see what's going on.
[00:02:18.360 --> 00:02:19.360] D.C. now, see what's going on.
[00:02:19.360 --> 00:02:20.360] D.C. now, see what's going on.
[00:02:20.360 --> 00:02:21.360] D.C. now, see what's going on.
[00:02:21.360 --> 00:02:22.360] D.C. now, see what's going on.
[00:02:22.360 --> 00:02:23.360] D.C. now, see what's going on.
[00:02:23.360 --> 00:02:24.360] D.C. now, see what's going on.
[00:02:24.360 --> 00:02:25.360] D.C. now, see what's going on.
[00:02:25.360 --> 00:02:26.360] D.C. now, see what's going on.

output_srt: saving output to '0155.srt'

Now let's see with the patch. I downloaded the new whisper.cpp into src, then ran
make clean
make -j
and got the exact same result. This is with large-v3-turbo; only large-v2_q8_0 avoided the repetition. So I believe the repeated phrases come from the models rather than from whisper.cpp. The audiobook I'm working on is 81 hours long, and I break it into 2000 audio segments to avoid long stretches of hallucination. For a 100-hour audiobook, that works out to 3-minute segments.

2000 ( 3m chapters ) = 6,000 minutes or 100 hours
1500 ( 4m chapters ) = 6,000 minutes or 100 hours
1200 ( 5m chapters ) = 6,000 minutes or 100 hours
1000 ( 6m chapters ) = 6,000 minutes or 100 hours
857 ( 7m chapters ) = 6,000 minutes or 100 hours
750 ( 8m chapters ) = 6,000 minutes or 100 hours
666 ( 9m chapters ) = 6,000 minutes or 100 hours
600 (10m chapters ) = 6,000 minutes or 100 hours
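
The counts in this table match integer (floor) division of 6,000 minutes by the chunk length. A small sketch of that arithmetic, where `chunk_count` is a hypothetical helper name and not something from the thread:

```python
TOTAL_MINUTES = 100 * 60  # a 100-hour audiobook, as in the table above


def chunk_count(total_minutes: int, chunk_minutes: int) -> int:
    """Number of whole chunks of the given length that fit in the total."""
    return total_minutes // chunk_minutes


for minutes in range(3, 11):
    print(f"{chunk_count(TOTAL_MINUTES, minutes)} ({minutes}m chunks)")
```

Because the division rounds down, the 7-minute and 9-minute rows cover slightly less than 6,000 minutes (857 × 7 = 5,999 and 666 × 9 = 5,994).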

Duration of audiobook 294660 seconds
Duration of audiobook 81h:51m:00s

Total number of chapters: 187

Average length of chapters
1576 seconds or 00h:26m:16s
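
The duration and average figures above can be checked with a few lines of Python. The helper name `hms` is hypothetical, and rounding the average to the nearest second is an assumption that happens to reproduce the numbers shown:

```python
def hms(seconds: int) -> str:
    """Format a number of seconds in the HHh:MMm:SSs style used above."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}h:{minutes:02d}m:{secs:02d}s"


total_seconds = 294660
chapters = 187
average = round(total_seconds / chapters)  # rounded to the nearest second

print(hms(total_seconds))
print(average, hms(average))
```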

1625 chunks for ~182 secs or 00h:03m:02s splits
1650 chunks for ~180 secs or 00h:03m:00s splits
1675 chunks for ~177 secs or 00h:02m:57s splits
1700 chunks for ~174 secs or 00h:02m:54s splits
1725 chunks for ~172 secs or 00h:02m:52s splits
1750 chunks for ~169 secs or 00h:02m:49s splits
1775 chunks for ~167 secs or 00h:02m:47s splits
1800 chunks for ~165 secs or 00h:02m:45s splits
1825 chunks for ~162 secs or 00h:02m:42s splits
1850 chunks for ~160 secs or 00h:02m:40s splits
1875 chunks for ~158 secs or 00h:02m:38s splits
1900 chunks for ~156 secs or 00h:02m:36s splits
1925 chunks for ~154 secs or 00h:02m:34s splits
1950 chunks for ~152 secs or 00h:02m:32s splits
1975 chunks for ~150 secs or 00h:02m:30s splits
2000 chunks for ~148 secs or 00h:02m:28s splits

@jkarthic
Contributor Author

jkarthic commented Dec 15, 2024

@mrfragger
The issue you are facing might be different from the one I fixed.
input_1734180845782.wav.zip
Please try the wav file above.
Here is the output with 1.7.2:
[00:00:00.000 --> 00:00:03.420] activity is like hey here's a picture of my fridge can you tell me what I'm
[00:00:03.420 --> 00:00:07.140] missing because I'm going grocery shopping and I really need to do
[00:00:07.140 --> 00:00:09.680] recipes.
[00:00:09.680 --> 00:00:11.720] you
[00:00:11.720 --> 00:00:13.780] you
[00:00:13.780 --> 00:00:15.820] you
[00:00:15.820 --> 00:00:25.820] [BLANK_AUDIO]

And here is the output with this branch:
[00:00:00.000 --> 00:00:03.420] activity is like hey here's a picture of my fridge can you tell me what I'm
[00:00:03.420 --> 00:00:07.140] missing because I'm going grocery shopping and I really need to do
[00:00:07.140 --> 00:00:09.680] recipes.

Command line : ./main -t 1 -bs 1 -bo 1 -m ../../models/ggml-small.en.bin input_1734180845782.wav

Please note that the extra hallucinations are removed in this branch.
This PR doesn't try to fix any limitations of the whisper model. It just brings the implementation on par with openai's whisper. I noticed that openai's whisper implementation doesn't produce those extra hallucinations for the attached file. When I looked for the root cause, I found this discrepancy and fixed it.

@jkarthic
Contributor Author

jkarthic commented Dec 15, 2024

@mrfragger
If you can share the file you are testing with, I can run it through openai's whisper implementation to see whether the issue lies in the core whisper model or in a minor bug in the whisper.cpp implementation.

@mrfragger

That portion is a really bad audio recording of a conversation. Anyway, most of the time I eliminate all silence before compiling the audiobook for transcription. I also trim music intros and outros where feasible. I believe your patch addresses the silence, so if it does work for that, it would be a huge boon. So far I've been running your patch for the last 6 or 7 hours with no negative effects or anything unusual.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@ggerganov ggerganov merged commit 2f2841b into ggerganov:master Dec 17, 2024