Speaker-blind speech recognition #144

juanmc2005 · 2023-04-24T09:59:34Z

Depends on #143

Adding a streaming ASR pipeline needed a big refactoring (that began with #143).
This PR continues this effort to allow a new type of pipeline that transcribes speech instead of segmenting it.
A default ASR model based on Whisper is provided, but the dependency is not mandatory.

Additional modifications were also needed to make Whisper compatible with batched inference.
Note that we do not condition Whisper on previous transcriptions here. I expected this to degrade transcription quality but I found it rather robust in my experiments with the microphone and spontaneous speech in various languages (English, Spanish and French).

The new Transcription pipeline can also use a segmentation model as a local VAD to skip non-voiced chunks. In my experiments, this worked better and faster than using Whisper's no_speech_prob.

Transcription is also compatible with diart.stream, diart.benchmark, diart.tune and diart.serve (hence diart.client too).

Still missing

README examples and possible restructuring

Changelog

TBD

…o feat/vad

…ut a bit quirky

….tune. Fix major bug in Optimizer

BlokusPokus · 2024-04-18T13:22:56Z

Is this feature considered implemented?

juanmc2005 · 2024-04-19T12:30:07Z

@BlokusPokus it seemed to work last time I tried but I didn't merge because I wanted to include a faster implementation of Whisper and I needed to clean up the code. Feel free to try it out but it's a pretty old version of the library. I need to find some time to update this PR. If you feel like it, it would be an amazing contribution!

GeorgeDeac · 2024-10-18T14:20:20Z

Yeah we definitely need a faster-whisper / WhisperLive implementation. WhisperLive also integrated VAD and I see it has some overlapping features.

ywangwxd · 2024-12-02T02:21:53Z

Yeah we definitely need a faster-whisper / WhisperLive implementation. WhisperLive also integrated VAD and I see it has some overlapping features.

Have you started to work on the faster-whisper implementation?
Is whisperX helpful to you? It uses faster-whisper.

juanmc2005 · 2024-12-05T10:54:58Z

@GeorgeDeac @ywangwxd This feature is pretty much finished. But I think it would be nice to replace the default ASR by a faster one, like the ones you mention. Feel free to work on top of this PR, as I'm not too available to work on this as I'd like.
I'd gladly merge a PR to replace the default ASR and add a couple of examples in the README. I can take care of resolving the conflicts with the main branch after the feature is good

arbuckle · 2024-12-13T17:50:36Z

I'd suggest creating some basic tutorial content before merging into main - I tried this branch out and it was a struggle to integrate both Transcription and VAD into a single pipeline - something that's necessary when using Whisper in order to avoid hallucinated words in transcripts. (I get "Thank You" a lot when brief silences are transcribed by whisper)

I also ran into sampling issues when calling a transcription pipeline via diart.client. - these weren't present when doing simple diarization, and I'm not sure whether they're related to my own inexperience with this toolchain or suggestive of an issue with the code. In any case, a robust exemplar for a chained pipeline would be a good addition to the README. I created a couple:

simple: https://gist.github.com/arbuckle/d41cb5e25ccc588f4a98b8430eca40b5
complex: https://gist.github.com/arbuckle/5163d435ba174ee3ae866e789fa03f23

The complex example works, but it feels wrong and the performance is not yet to my satisfaction. (VAD needs to be tuned)

juanmc2005 added 18 commits April 19, 2023 17:41

New feature: streaming voice activity detection. Pipeline name changes

bca2873

Merge branch 'develop' of github.com:juanmc2005/OnlineDiarization int…

5e44ad4

…o feat/vad

Update link in setup.cfg

7447061

Update code snippets in README

4985394

Add minor README modifications

540ad0a

Initial ASR implementation. Broken stuff

8cc9925

First working transcription pipeline. Using diarization is possible b…

1ae4934

…ut a bit quirky

Reduce Whisper VRAM footprint (around 400Mb). Add fp16 option

d8d7342

Change whisper input type based on fp16 parameter

2cfc35d

Implement batched inference for whisper. Re-implement decoding.

a40112c

Minor changes in transcription arguments

e8196a7

Greatly improve transcription pipeline by adding optional VAD

07dd9ae

Move pipelines to diart.pipelines. Add torchmetrics as a dependency

0bf2522

Add websocket compatibility to transcription pipeline

42fe5f7

Transcription pipeline is now fully compatible with diart.stream

49616e5

Make transcription pipeline compatible with diart.benchmark and diart…

babf49d

….tune. Fix major bug in Optimizer

Rename base pipeline and config objects

6609e3c

Merge changes from branch feat/vad

4c1aeba

juanmc2005 added bug Something isn't working feature New feature or request API Improvements to the API refactoring Internal design improvements that don't change the API labels Apr 24, 2023

juanmc2005 added this to the Version 0.8 milestone Apr 24, 2023

juanmc2005 added 6 commits April 24, 2023 12:39

New feature: streaming voice activity detection. Pipeline name changes

d19b044

Update link in setup.cfg

6caa4a4

Update code snippets in README

0993fe8

Add minor README modifications

95d4fae

Rename base pipeline and config objects

569c68f

Update branch with develop

eed864f

juanmc2005 mentioned this pull request Apr 26, 2023

Add speaker-aware transcription #147

Draft

juanmc2005 modified the milestones: Version 0.8, Version 0.9 Oct 11, 2023

juanmc2005 force-pushed the develop branch from b26e60c to 782ce49 Compare October 28, 2023 14:07

juanmc2005 removed this from the Version 0.9 milestone Nov 2, 2023

juanmc2005 marked this pull request as ready for review November 9, 2023 22:59

juanmc2005 force-pushed the develop branch from f531147 to 467997d Compare May 25, 2024 17:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speaker-blind speech recognition #144

Speaker-blind speech recognition #144

juanmc2005 commented Apr 24, 2023 •

edited

Loading

BlokusPokus commented Apr 18, 2024

juanmc2005 commented Apr 19, 2024

GeorgeDeac commented Oct 18, 2024

ywangwxd commented Dec 2, 2024

juanmc2005 commented Dec 5, 2024

arbuckle commented Dec 13, 2024

Speaker-blind speech recognition #144

Are you sure you want to change the base?

Speaker-blind speech recognition #144

Conversation

juanmc2005 commented Apr 24, 2023 • edited Loading

Still missing

Changelog

BlokusPokus commented Apr 18, 2024

juanmc2005 commented Apr 19, 2024

GeorgeDeac commented Oct 18, 2024

ywangwxd commented Dec 2, 2024

juanmc2005 commented Dec 5, 2024

arbuckle commented Dec 13, 2024

juanmc2005 commented Apr 24, 2023 •

edited

Loading