fix: compute fbank on selected device #1529

Merged · 2 commits merged into develop on Nov 7, 2023
Conversation

@hbredin (Member) commented Nov 5, 2023

@asr-pub: this is a more generic attempt at solving #1522, as it uses the internal self.device that can be set by WeSpeakerPretrainedEmbedding.to(device) (and does not force using the GPU when available). Can you please try it out and confirm that this solves your issue?

@juanmc2005: a side effect of this PR is that it should also solve #1518. Can you please try it out and confirm?
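
For reference, here is a minimal sketch of the pattern this PR follows (a hypothetical wrapper with assumed names, not the actual pyannote code): the wrapper remembers whatever device it is moved to via .to(device) and moves the waveform there before extracting fbank features, instead of hard-coding CUDA.

import torch
import torchaudio.compliance.kaldi as kaldi

class FbankEmbedding:
    # hypothetical wrapper illustrating the device-handling pattern
    def __init__(self):
        self.device = torch.device("cpu")

    def to(self, device):
        # remember the device selected by the caller; never force CUDA
        self.device = torch.device(device)
        return self

    def compute_fbank(self, waveform):
        # waveform: (num_channels, num_samples), moved to the selected device
        # so that fbank extraction runs there as well
        return kaldi.fbank(waveform.to(self.device), num_mel_bins=80)

# fbank is computed on whichever device the caller picked
embedding = FbankEmbedding().to("cuda" if torch.cuda.is_available() else "cpu")
features = embedding.compute_fbank(torch.randn(1, 16000))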

@juanmc2005 (Collaborator)

@hbredin I confirm this solves #1518

@asr-pub commented Nov 5, 2023

@hbredin Based on my previous testing, besides compute_fbank (which this PR moves to the selected GPU), torch.vstack also runs very slowly on the CPU.

waveform_batch = torch.vstack(waveforms)
# (batch_size, 1, num_samples) torch.Tensor
mask_batch = torch.vstack(masks)
# (batch_size, num_frames) torch.Tensor

@hbredin (Member, Author) commented Nov 5, 2023

I don't observe this behavior on Google Colab (T4):

import torch
waveforms = [torch.randn(1, 160000) for i in range(32)]

%%timeit
torch.vstack(waveforms)
# 3.63 ms ± 380 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

gpu = torch.device("cuda")

%%timeit
torch.vstack([w.to(gpu) for w in waveforms])
# 6.37 ms ± 55.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Can you please double check?

@asr-pub commented Nov 5, 2023

Sure, I'll run the program tomorrow when I'm at work and report the results.

@asr-pub commented Nov 6, 2023

@hbredin Hello, I've tested the behavior of torch.vstack on the CPU. When I run the following code, the program consumes a significant amount of CPU resources, as shown in the figure below: it occupies 128 cores. When multiple processes run torch.vstack on the CPU in parallel, performance becomes very slow. You can run this code on your machine and check the CPU usage with the top command.

import torch
from loguru import logger
import time

a = torch.randn((1, 1, 160000))
b = (a,)

start = time.perf_counter()
for i in range(1000000):
    torch.vstack(b)
end = time.perf_counter()

logger.info(f"takes {end-start:>.2f}")


# output
# 2023-11-06 20:05:48.691 | INFO     | __main__:<module>:13 - takes 44.06

[screenshot: top showing the benchmark process occupying 128 CPU cores]

@hbredin (Member, Author) commented Nov 6, 2023

Thanks. But I think we are still missing a comparison with the GPU, aren't we?

@asr-pub commented Nov 6, 2023

Thanks. But I think we are still missing a comparison with the GPU, aren't we?

GPU A100

import torch
from loguru import logger
import time

a = torch.randn((1, 1, 160000))
a = a.cuda()
b = (a,)

start = time.perf_counter()
for i in range(1000000):
    torch.vstack(b)
end = time.perf_counter()

logger.info(f"takes {end-start:>.2f}")

# output
# 2023-11-06 20:34:31.094 | INFO     | __main__:<module>:14 - takes 12.19


@hbredin (Member, Author) commented Nov 6, 2023

Can you please check what happens when reducing OMP_NUM_THREADS?
See https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#utilize-openmp
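
Not part of this PR, but for reference, a minimal sketch (assuming the benchmark above is saved as a standalone script; the script name and the thread count of 4 are arbitrary): the thread pool can be limited either from the shell or from within Python before any parallel work starts.

# from the shell, before launching the benchmark:
#   export OMP_NUM_THREADS=4
#   python benchmark_vstack.py

# or from inside the script, before any parallel work starts:
import torch

torch.set_num_threads(4)            # intra-op parallelism (OpenMP pool)
torch.set_num_interop_threads(4)    # inter-op parallelism; must be set early
print(torch.get_num_threads(), torch.get_num_interop_threads())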

@asr-pub commented Nov 7, 2023

Can you please check what happens when reducing OMP_NUM_THREADS? See https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#utilize-openmp

With export OMP_NUM_THREADS=64, it occupies 64 cores. The runtime is reduced, which came as a pleasant surprise to me.

2023-11-07 14:20:28.237 | INFO     | __main__:<module>:13 - takes 25.43

@grazder commented Nov 7, 2023

#1523

I tried the fix from #1523 and it didn't work for me (I described it in that issue), so this fix looks more workable to me.

@hbredin merged commit 40fa67b into develop on Nov 7, 2023
3 checks passed
@hbredin deleted the fix/fbank_device branch on Nov 7, 2023