text output looping/repeating (until end) #26
I have now run a few tests with the workaround mentioned. We still get repeated text output, but it is a lot better than before: instead of repeating for a full hour, the repetition lasts only about 1 minute, and then the transcription catches up with the content.
I'm gonna leave this here because it's relevant, but I don't know how this fix would translate into this Windows port, as I don't see any way to edit the parameters.
For my use case (transcribing from a microphone with the C# NuGet package), setting the … helped. Relevant posts that mention this solution:
Thanks @VRCWizard, that opened a lot of insights for me. I believe setting the max-context (-mc) option to 0 is the equivalent of the condition_on_previous_text option of the Python implementation, which they propose as a solution (the issue @GrahamboJangles mentioned above). I also played with adopting this into my workaround: I leave -mc at its default in order to keep the improved detection quality, but when I detect repeated output, I clear the history instead of seeking:
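The snippet that went with this comment did not survive the export. As a rough sketch of the idea only, with hypothetical names like onSegmentDecoded, lastText and promptPast (the real members in the Const-me code differ), it might look like this:

```cpp
#include <string>
#include <vector>

// Sketch only: onSegmentDecoded, lastText and promptPast are hypothetical
// names, not the actual Const-me/Whisper members. On a repeated segment,
// clear the prompt history instead of seeking.
void onSegmentDecoded( const std::string& text,
                       std::string& lastText,
                       std::vector<int>& promptPast )
{
    if( text == lastText )
        promptPast.clear();  // break the self-reinforcing prompt loop
    lastText = text;
}
```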
This also seems to work, and it feels much better than seeking. EDIT: the following sentences are superseded by new info, which I posted later. Also, if I see correctly, the default setting for this is really high (see ContextImpl.misc.cpp). I don't know what exactly that value is, tokens or characters or something? Anyway, this explains why the issue does not happen for me when I feed the audio starting 1 minute before the spot of interest, but it does happen when I start 2 minutes before. Just for reference, here is the original description of the python implementation on the subject:
This sounds like that mechanism is only helpful within one sentence, not across the past 10 minutes...
It seems like there's an attempt to fix this repetition issue in the official OpenAI-whisper: openai/whisper#1052
Just adding some relevant information. I just found that OpenAI says: "The model will only consider the final 224 tokens of the prompt and ignore anything earlier." On another topic: I cannot yet 100% confirm that clearing the prompt past always helps when the output starts repeating; I hope I'll be able to confirm that soon.
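For illustration, here is a minimal sketch of trimming the prompt history to those final 224 tokens; promptPast is a hypothetical token buffer, not an actual member of this codebase:

```cpp
#include <cstddef>
#include <vector>

// Sketch: keep only the final 224 tokens of the prompt history, mirroring
// the OpenAI statement quoted above. promptPast is a hypothetical buffer.
void trimPromptHistory( std::vector<int>& promptPast )
{
    constexpr std::size_t maxPromptTokens = 224;
    if( promptPast.size() > maxPromptTokens )
        promptPast.erase( promptPast.begin(),
                          promptPast.end() - maxPromptTokens );
}
```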
As most readers here might not be able to add the changes I mentioned above to the code for testing, I thought it might be a good idea to share what I am working with. Anyone who feels adventurous, please try this main.exe; I use it in production. It helps additionally when you set -mc 223, as I mention above; that seems to be what OpenAI does too. Also, as mentioned above, feeding 48 kHz wav files instead of 16 kHz seems to help too (only the original whisper.cpp is limited to 16 kHz). Whisper_last_text_repeated_workaround.zip Source code changes:
-mc 223 or 224?
Tried it out, and it does catch repeats and refreshes; definitely a good fix. I was able to do a 3+ hour video without it getting permanently stuck! It also helps with repetition in live transcribing as well.
Yeah, I also tested it, and it caught a bunch of repeats and restarted without issues. I'm not sure how different it is from just running … Either way, it's a huge improvement over the base version as it stands, so thank you for making it.
Honestly, a very good question. I wrote 223 on purpose because in my last test it made a difference compared to 224, so I thought we must start counting at 0, meaning 223 is actually 224 (pretty sure I'm wrong here). But I also wrote above that I believe to have found out that 224 is the maximum anyway, no matter if you put in a higher number. Also, there is the question, as albino1 says, whether -mc > 0 makes any difference at all, especially for material that cannot easily be put into context, like a movie (compared to news or a discussion; in a movie anything can be spoken at any time, so what kind of context could help to get it right)...
Are you still getting "runFullImpl: failed to generate timestamp token - skipping one second"?
Yup, I didn't really look into that part yet; it's a minor issue for me.
How do you transcribe again the parts that were skipped when "runFullImpl: failed to generate timestamp token - skipping one second" occurs, or when it catches repeats?
I don't; I'm happy to get 99% of the output of a 3 hour clip :D In my case, a human corrects the output afterwards anyway; the timecodes and lots of words are garbage in any case.
Just adding another puzzle piece from the whisper.cpp project, but this only concerns the last 500 samples, so not very helpful I fear (except for streaming mode maybe?):
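The referenced snippet is not reproduced here. As a very rough sketch of my reading of that idea (not the actual whisper.cpp code, and the Token struct is made up for illustration): tokens that end inside the trailing guard window of the current chunk are dropped, so half-heard audio at the chunk boundary does not enter the history.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical token type for illustration only.
struct Token
{
    int id;
    int64_t t1;  // end time of the token, in samples
};

// Drop tokens whose end timestamp falls within the last `guard` samples
// of the current audio window.
std::vector<Token> dropTrailingTokens( const std::vector<Token>& tokens,
                                       int64_t windowEndSample )
{
    const int64_t guard = 500;  // the "last 500 samples" mentioned above
    std::vector<Token> kept;
    for( const Token& tok : tokens )
        if( tok.t1 < windowEndSample - guard )
            kept.push_back( tok );
    return kept;
}
```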
Did you update this in the source code?
Update: the patch above works, but we can do much better. Let's just hope we don't get into a repeating "empty" text output loop, because that one would not be fixed anymore.
And @eagleftw023, sorry for not replying to your request, but it wouldn't really make sense to publish a custom build with the lines I posted above, I guess, because it only helps in the last x milliseconds. (Hmm, maybe it helps in live mode, but not sure about that.)
Thanks a lot, can you share the new patch with the exe?
Nah, sorry, I just realized this code was totally garbage; it needs to remove much more from the history. I am currently testing a more sophisticated method and will post here when it works.

From my current understanding, the whole problem comes from the whisper model, for some reason, giving back the same text twice. That is not a problem as such, but our program collects all the output and feeds it as the prompt for the next text generation. Basically it is no problem if the last text is used as "prompt" for the model, but when the same words/sentences appear multiple times in the prompt, we kind of force the model to output only this. Anyway, I'll update here as soon as I have a working update.
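To make that feedback loop concrete, here is a hedged sketch using the upstream whisper.cpp C API (the Const-me port mirrors this behaviour internally, but its actual code differs): every decoded token is appended to prompt_past and fed back as the prompt of the next whisper_full() call, so a repeated segment keeps reinforcing itself.

```cpp
#include <vector>
#include "whisper.h"  // upstream whisper.cpp C API

// Transcribe one chunk, conditioning the decoder on all previously
// decoded tokens, then append this chunk's tokens to the history.
void transcribeChunk( whisper_context* ctx, const float* samples, int n,
                      std::vector<whisper_token>& prompt_past )
{
    whisper_full_params p = whisper_full_default_params( WHISPER_SAMPLING_GREEDY );
    p.prompt_tokens   = prompt_past.data();
    p.prompt_n_tokens = (int)prompt_past.size();

    if( whisper_full( ctx, p, samples, n ) != 0 )
        return;

    // Collect this chunk's tokens into the prompt for the next chunk.
    // If a segment repeated, the repeats accumulate here and bias the
    // next decode toward the same text.
    for( int s = 0; s < whisper_full_n_segments( ctx ); s++ )
        for( int t = 0; t < whisper_full_n_tokens( ctx, s ); t++ )
            prompt_past.push_back( whisper_full_get_token_id( ctx, s, t ) );
}
```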
I gave up on Const-me and base whisper.cpp when I found this whisper fork, which seems to work with minimal repetition issues: https://github.com/Purfview/whisper-standalone-win/ Maybe it's not as accurate, I don't know, but it's CUDA accelerated, so it's still very fast, and it works without having to spend ages cleaning up the output afterward because of hallucinations.
Can we directly run the exe file to change this parameter?
@zlbbme Sorry, not sure if I understand the question. The .zip file download I provided contains my modification; there is no parameter to turn this off or on...
OK, so here is my latest attempt. Download Whisper Desktop and main.exe for users: Instead of just detecting repeated text, I now always prevent repeated output from being put into the prompt, and IF something repeated AND it is already contained in the prompt, I remove that sentence from the prompt. From my current experiments, it should be working. The benefit over the strategy above is that we delete from the prompt history much less frequently, and only when really needed. Developer note: I also prevent pushing any non-text tokens into the prompt; I am not sure if that leads to the model not outputting any timecode tokens at all anymore.

The reason why I post this kind of unfinished work is that it looks like I will not work much with the Const-me version for a longer time and will instead experiment with faster-whisper (which is more for developers; if you want a GUI for all whisper versions, you can e.g. use subtitleedit). However, it currently looks like I will need to apply the exact same mitigation concepts to any other version of whisper, because all versions except mine above still seem to suffer from the repeat-forever issue. The reason might be that all projects mostly wait for the MAIN whisper project to come up with solutions.

I am sorry for not taking care of the Desktop version users so far; I do not use desktop programs for transcribing, but the changes I made are valid for both the CLI and the Desktop version, so it is no additional effort for me to provide the desktop version as well. Actually, they share the same whisper.dll anyway... The patch for developers:

Last but not least, I am sure this kind of trouble will settle over time. More and more puzzle pieces are coming in from month to month, in the form of code commits in the original whisper project, but lots of research work is still to be done. I am sure that over time they will find adequate solutions to prevent forever-repeated output in a much better way than I do currently. If there is no more activity in this thread, I will close it in a few weeks, because it is not relevant for me anymore. There are enough other issues open about the same topic, and I don't think it is really helpful to keep this issue with my beginner's experiences open forever.
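The actual patch is in the download above; as a simplified sketch of the dedup logic only, with hypothetical names, it works roughly like this:

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Sketch of the strategy described above: a sentence never enters the
// prompt twice, and when a repeat is detected that is already in the
// prompt, it is removed from the prompt instead.
void updatePrompt( const std::string& sentence,
                   std::vector<std::string>& prompt,
                   std::unordered_set<std::string>& seen )
{
    if( seen.insert( sentence ).second )
    {
        // First time we see this sentence: allow it into the prompt.
        prompt.push_back( sentence );
    }
    else
    {
        // Repeat detected: purge the sentence from the prompt so the
        // model is no longer conditioned on it.
        prompt.erase( std::remove( prompt.begin(), prompt.end(), sentence ),
                      prompt.end() );
    }
}
```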
@Const-me Is this safe to use, emcodem's duplicated-tokens-from-prompt.zip? Just being cautious because I'm a noob.
@softlypink If you are asking for security/virus-related reasons: there is no way for anyone to tell whether any compiled exe file is safe to use. Antivirus software attempts to do this, but from my perspective with only very limited success. What A/V software can do for you is detect "known" viruses/behaviour, but in this case you face an exe that maybe 10-20 people are using, so even your antivirus might have a hard time. If you want to be sure about security, there is no way around reading (and understanding) the whole source code of the original program, then applying the changes I posted above and compiling your own version. It is up to you alone to decide which sources of compiled software are OK to use and which are not (some people use Linux for that exact reason).
Closed, because it is solved for me in the code I posted above. Usually when we face the forever-repeat issue, the output is something like "This This This". In this case, we also have a prompt past like "This This This", so the "model" thinks that "This" is very important to us, because we have it a hundred times in the prompt past. Obviously the confidence of the prompted tokens is much higher than the confidence of any "really detected" tokens, so we get into a loop. But again: the inference code decided to throw "This" 100 times into the prompt past; the model is not guilty of that.
May I ask about the function of main.exe?
Also mentioned here:
#23
I ran a few tests on longer video clips (e.g. 2 hours), and mostly it tends to repeat a sentence from a certain point until the end. E.g. after one hour, you see repeated output forever. The timestamps seem to indicate that new text was detected, but the text content is the same as before.
https://1drv.ms/u/s!AkS-A9Jqq09FgzEX78lvh7SiMAYu?e=C6f8GW
In this example, after about 2 minutes, the sentence repeats: "Jagt mich mal mit frahmen nudelholz".
The exact command that I use:
C:\dev\whisper\Whisper\x64\Release\main.exe -f C:\temp\test.wav -l de -m C:\temp\whisper\ggml-large.bin
I ran different tests, cutting portions of an affected file, and it turns out that the issue is not caused by the audio content itself: the affected area will transcribe just fine if I e.g. cut away everything up to 1 minute before it, but if I leave 2 minutes before it, the issue happens.
In ContextImpl.cpp, I tried to catch "repeated text" by copying the latest "text" to the heap as soon as it is complete (about line 740), and before that, comparing whether the text is the same as last time. If yes, seek a little:
delete lasttext;                     // avoid leaking the previous copy
lasttext = new std::string( text );  // keep the segment text for comparison
This seems to work around the issue (it still needs a lot of testing), but I am really not sure if this is the correct way to do it.
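Expanded into a self-contained sketch (hypothetical function name, same heap-copy approach as the snippet above; the real change sits around line 740 of ContextImpl.cpp), the workaround reads roughly like this:

```cpp
#include <string>

// Remember the previous segment's text; return true when the new segment
// matches it, so the caller can seek a little to escape the loop.
static std::string* lasttext = nullptr;

bool isRepeatedSegment( const std::string& text )
{
    const bool repeated = ( lasttext != nullptr && *lasttext == text );
    delete lasttext;                     // free the previous copy
    lasttext = new std::string( text );  // keep the latest text on the heap
    return repeated;
}
```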
Also, a question: is it correct to do such workarounds there (there is another workaround a few lines above), or should the cause be searched for and fixed somewhere else? (Where?)