Add speaker-aware transcription #147
base: develop
Conversation
Hey, I am unable to use this:
@C0RE1312 Sounds like a problem with PyTorch not being able to compute the FFT. Have you tried updating the dependencies of both torch and whisper? It's a pretty old PR.
Will this work with faster-whisper or any other faster version of whisper?
BTW, I noticed that the last commit was in April 2023, so this feature has had no new commits for more than a year. Does this mean the implementation is finished but was never merged into the main branch? I noticed that the README of this project has a note stating that this feature is coming soon but not ready yet.
@ywangwxd Unfortunately I haven't had the time to work on this as much as I'd like. I prioritized other things like documentation and testing for #98.
Depends on #144
This PR adds a new `SpeakerAwareTranscription` pipeline that combines streaming diarization and streaming transcription to determine "who says what" in a live conversation. By default, this is shown as colored words in the terminal.

The feature works as expected with `diart.stream` and `diart.serve`/`diart.client`.
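For context, a minimal sketch of how the pipeline could be driven from Python (hypothetical: the import path of `SpeakerAwareTranscription` is assumed, and `MicrophoneAudioSource`/`RealTimeInference` reflect diart's streaming API around the time of this PR, which may have changed since):

```python
# Hypothetical usage sketch for the pipeline added by this PR.
# Import paths and signatures are assumptions based on diart's API of this era.
from diart import SpeakerAwareTranscription  # added by this PR (location assumed)
from diart.inference import RealTimeInference
from diart.sources import MicrophoneAudioSource

pipeline = SpeakerAwareTranscription()
mic = MicrophoneAudioSource(16000)  # sample rate assumed
inference = RealTimeInference(pipeline, mic)
inference()  # streams colored, speaker-attributed words to the terminal
```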
The main thing preventing full compatibility with `diart.benchmark` and `diart.tune` is the evaluation metric. Since the output of the pipeline is annotated text with the format `[speaker0]Hello [speaker1]Hi`, the metric `diart.metrics.WordErrorRate` will count the speaker labels as insertion errors.
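To make the failure mode concrete, here is a toy illustration (hypothetical: it uses the `jiwer` package as a stand-in for diart's own metric, and assumes the labels end up as separate tokens; with the fused format above they would surface as substitutions instead, but plain WER is inflated either way):

```python
# Toy illustration (hypothetical): speaker labels inflate plain WER.
import jiwer

reference = "hello hi"
hypothesis = "[speaker0] hello [speaker1] hi"  # labels as separate tokens (assumption)

# Two label tokens over two reference words: WER = 2 insertions / 2 words = 1.0
print(jiwer.wer(reference, hypothesis))  # 1.0
```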
Next steps: implement a `SpeakerWordErrorRate` that computes the (weighted?) average WER across speakers, along the lines of the sketch below.
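A rough sketch of what such a metric could look like (hypothetical throughout: the class is reduced to a function, the label regex and weighting scheme are assumptions, and `jiwer` stands in for diart's own WER implementation):

```python
import re
from collections import defaultdict

import jiwer  # stand-in for diart.metrics.WordErrorRate (assumption)

# Hypothetical sketch: split annotated text such as "[speaker0]Hello [speaker1]Hi"
# into one transcript per speaker, compute WER per speaker, then average
# weighted by each speaker's reference word count.
LABEL = re.compile(r"\[(\w+)\]")

def split_by_speaker(annotated: str) -> dict:
    """Return {speaker: transcript} parsed from label-annotated text."""
    parts = LABEL.split(annotated)  # ["", "speaker0", "Hello ", "speaker1", "Hi"]
    transcripts = defaultdict(list)
    for speaker, text in zip(parts[1::2], parts[2::2]):
        transcripts[speaker].append(text.strip())
    return {spk: " ".join(chunks) for spk, chunks in transcripts.items()}

def speaker_wer(reference: str, hypothesis: str) -> float:
    ref, hyp = split_by_speaker(reference), split_by_speaker(hypothesis)
    total_words, weighted_sum = 0, 0.0
    for speaker, ref_text in ref.items():
        n = len(ref_text.split())
        hyp_text = hyp.get(speaker, "")
        err = jiwer.wer(ref_text, hyp_text) if hyp_text else 1.0  # absent speaker: all deletions
        weighted_sum += n * err
        total_words += n
    return weighted_sum / total_words if total_words else 0.0

print(speaker_wer("[speaker0]Hello there [speaker1]Hi",
                  "[speaker0]Hello [speaker1]Hi"))  # (2*0.5 + 1*0.0) / 3 ≈ 0.33
```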
Changelog
TBD