Speaker-blind speech recognition #144

Open
wants to merge 24 commits into
base: develop

Conversation

juanmc2005
Owner

@juanmc2005 juanmc2005 commented Apr 24, 2023

Depends on #143

Adding a streaming ASR pipeline needed a big refactoring (that began with #143).
This PR continues this effort to allow a new type of pipeline that transcribes speech instead of segmenting it.
A default ASR model based on Whisper is provided, but the dependency is not mandatory.

Additional modifications were also needed to make Whisper compatible with batched inference.
Note that we do not condition Whisper on previous transcriptions here. I expected this to degrade transcription quality, but I found it rather robust in my experiments with the microphone and spontaneous speech in several languages (English, Spanish and French).
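
To illustrate the idea of batching chunks that are decoded independently, here is a minimal sketch. It is not the code in this PR: `pad_or_trim`, `make_batch`, `transcribe_batch` and the `decode_batch` callable are hypothetical names used only for illustration, and audio is represented as plain Python lists. The only facts assumed about Whisper are its 16 kHz sample rate and 30-second input window.

```python
from typing import Callable, List, Sequence

SAMPLE_RATE = 16_000                # Whisper models operate on 16 kHz audio
WINDOW_SAMPLES = 30 * SAMPLE_RATE   # Whisper's fixed 30-second input window


def pad_or_trim(chunk: Sequence[float], length: int = WINDOW_SAMPLES) -> List[float]:
    """Zero-pad or truncate a chunk so every batch item has the same length."""
    samples = list(chunk[:length])
    return samples + [0.0] * (length - len(samples))


def make_batch(chunks: Sequence[Sequence[float]], length: int = WINDOW_SAMPLES) -> List[List[float]]:
    """Bring all chunks to a common length so they can be stacked into one batch."""
    return [pad_or_trim(chunk, length) for chunk in chunks]


def transcribe_batch(
    chunks: Sequence[Sequence[float]],
    decode_batch: Callable[[List[List[float]]], List[str]],  # hypothetical ASR callable
    length: int = WINDOW_SAMPLES,
) -> List[str]:
    """Decode all chunks in a single call.

    No previous transcription is passed to the decoder, so each chunk
    is transcribed on its own and the batch items never depend on each other.
    """
    return decode_batch(make_batch(chunks, length))
```

Because no item depends on the text of another, the chunks can be decoded in any order or in parallel, which is what makes batching straightforward.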

The new Transcription pipeline can also use a segmentation model as a local VAD to skip non-voiced chunks. In my experiments, this worked better and faster than using Whisper's no_speech_prob.
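
The VAD gating idea can be sketched as follows. This is an assumption-laden illustration, not the implementation in this PR: it assumes the segmentation model yields per-frame speech probabilities for each chunk, and `speech_ratio`, `transcribe_if_voiced`, `min_speech_ratio` and the `asr` callable are hypothetical names.

```python
from typing import Callable, Optional, Sequence


def speech_ratio(frame_probs: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of frames whose speech probability reaches the threshold."""
    if not frame_probs:
        return 0.0
    voiced = sum(1 for p in frame_probs if p >= threshold)
    return voiced / len(frame_probs)


def transcribe_if_voiced(
    chunk: Sequence[float],
    frame_probs: Sequence[float],          # assumed output of a segmentation model
    asr: Callable[[Sequence[float]], str],  # hypothetical ASR callable
    threshold: float = 0.5,
    min_speech_ratio: float = 0.1,
) -> Optional[str]:
    """Run ASR only when enough frames are voiced; otherwise skip the chunk."""
    if speech_ratio(frame_probs, threshold) < min_speech_ratio:
        return None  # non-voiced chunk: the ASR model is never called
    return asr(chunk)
```

Skipping the ASR call entirely on non-voiced chunks is where the speed gain would come from in this sketch; both thresholds are illustrative values that would need tuning.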

Transcription is also compatible with diart.stream, diart.benchmark, diart.tune and diart.serve (hence diart.client too).

Still missing

  • README examples and possible restructuring

Changelog

TBD

@juanmc2005 added the bug (Something isn't working), feature (New feature or request), API (Improvements to the API) and refactoring (Internal design improvements that don't change the API) labels Apr 24, 2023
@juanmc2005 added this to the Version 0.8 milestone Apr 24, 2023
@juanmc2005 modified the milestones: Version 0.8, Version 0.9 Oct 11, 2023
@juanmc2005 removed this from the Version 0.9 milestone Nov 2, 2023
@juanmc2005 marked this pull request as ready for review November 9, 2023 22:59
@BlokusPokus

Is this feature considered implemented?

@juanmc2005
Owner Author

@BlokusPokus it seemed to work the last time I tried, but I didn't merge it because I wanted to include a faster implementation of Whisper and I needed to clean up the code. Feel free to try it out, but keep in mind that it's based on a pretty old version of the library. I need to find some time to update this PR. If you feel like it, it would be an amazing contribution!

@GeorgeDeac

Yeah we definitely need a faster-whisper / WhisperLive implementation. WhisperLive also integrated VAD and I see it has some overlapping features.

@ywangwxd

ywangwxd commented Dec 2, 2024

> Yeah we definitely need a faster-whisper / WhisperLive implementation. WhisperLive also integrated VAD and I see it has some overlapping features.

Have you started to work on the faster-whisper implementation?
Is whisperX helpful to you? It uses faster-whisper.

@juanmc2005
Owner Author

@GeorgeDeac @ywangwxd This feature is pretty much finished, but I think it would be nice to replace the default ASR with a faster one, like the ones you mention. Feel free to work on top of this PR, as I'm not as available to work on this as I'd like.
I'd gladly merge a PR that replaces the default ASR and adds a couple of examples to the README. I can take care of resolving the conflicts with the main branch once the feature is in good shape.

@arbuckle

I'd suggest creating some basic tutorial content before merging into main. I tried this branch out and struggled to integrate both Transcription and VAD into a single pipeline, which is necessary when using Whisper in order to avoid hallucinated words in transcripts. (I get "Thank You" a lot when brief silences are transcribed by Whisper.)

I also ran into sampling issues when calling a transcription pipeline via diart.client. These weren't present when doing simple diarization, and I'm not sure whether they're related to my own inexperience with this toolchain or point to an issue with the code. In any case, a robust example of a chained pipeline would be a good addition to the README. I created a couple:

simple: https://gist.github.com/arbuckle/d41cb5e25ccc588f4a98b8430eca40b5
complex: https://gist.github.com/arbuckle/5163d435ba174ee3ae866e789fa03f23

The complex example works, but it feels wrong and the performance is not yet to my satisfaction (the VAD needs to be tuned).

5 participants