
Running whisper.cpp at Scale and in Parallel #1408

Closed
nishanthrs opened this issue Oct 31, 2023 · 2 comments
Labels
question Further information is requested

Comments

@nishanthrs

Just want to start off by saying this is an amazing project!

I'm trying to use it to transcribe hundreds of audio files, and I'm wondering how I can leverage this library, its parallelism features, and my machine to do this as quickly and efficiently as possible.

I am testing out various commands on 4 audio files (three of them are ~5-10 mins long, one is over an hour).

I first tried the regular command with 4 threads (default):
ls $processed_audios_dir | $whisper_cpp_exec_path -t 4 -m $whisper_cpp_model_path -f "$processed_audios_dir_name/{}.wav" --output-srt
This took around 6-7 mins on my Mac M1 (~300-400% CPU). Scaling up to 5 or 6 threads didn't seem to help much on my machine; in fact, it was slower in many instances.

I then tried GNU parallel:
parallel -j+0 $whisper_cpp_exec_path -m $whisper_cpp_model_path -f {} --output-srt ::: $(ls $processed_audios_dir)
However, this took around 8 mins on my Mac M1 (~500-600% CPU).

Given this, I have a few questions:

  1. Why is GNU parallel slower at transcribing these audio files? Is it because the model has to be loaded multiple times, as referenced in this issue?
  2. When -t <num_threads> is specified, how is the processing work divided up among the threads? Even with multiple input files specified via the -f flag, they still seem to be processed sequentially.
  3. How should the command be configured to run as efficiently as possible on hundreds of audio files?
@bobqianic bobqianic added the question Further information is requested label Oct 31, 2023
@bobqianic
Collaborator

Why is GNU parallel slower at transcribing these audio files? Is it because the model has to be loaded multiple times, as referenced in #22?

Since you've forked multiple processes and created numerous threads, these threads are now competing for resources with each other. The primary factors limiting your inference speed are the rate of matrix multiplication and memory bandwidth. When threads compete, the on-chip cache is flushed out more frequently, which reduces memory locality and, consequently, lowers the FLOPs during matrix multiplication. Additionally, whenever the operating system switches threads on a CPU core, the contexts of these threads have to be stored and then restored, further decreasing the processing speed.
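As a rough sketch of how to act on this (the job/thread split below is illustrative, not a tuned recommendation), you can cap GNU parallel so that jobs times threads stays at or below the physical core count instead of oversubscribing it:

# 2 concurrent whisper.cpp processes x 4 threads each = 8 threads total,
# roughly matching an M1's 8 cores rather than oversubscribing them
parallel -j 2 $whisper_cpp_exec_path -t 4 -m $whisper_cpp_model_path -f {} --output-srt ::: $processed_audios_dir/*.wav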

When -t <num_threads> is specified, how is the processing work divided up among the threads?

It's hard to give a straightforward answer. Each input passes through multiple operators, and I'm not sure how the work is divided up among the threads.

How should the command be configured to run as efficiently as possible on hundreds of audio files?

Whisper.cpp provides the capability for full GPU offloading via Metal, which should represent the fastest method for transcribing hundreds of audio files. To utilize this feature, simply compile the latest master branch on your M1 machine. Setting the -t parameter to 1 should yield the best performance.
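A minimal sketch of that workflow (paths and the model name are placeholders, and the exact build flags may differ on the current master, so treat this as an outline rather than exact instructions):

# build the latest master; on Apple Silicon the Metal backend provides the GPU offloading
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j

# fetch a model, then transcribe with a single CPU thread and let the GPU do the heavy lifting
bash ./models/download-ggml-model.sh base.en
./main -t 1 -m models/ggml-base.en.bin -f /path/to/audio.wav --output-srt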

@nishanthrs
Author

nishanthrs commented Nov 1, 2023

Thanks for the detailed answer! The GNU parallel slowdown makes sense now.

I followed the instructions to use the CoreML model and it runs incredibly fast (~3 mins for -t 4)! Thanks for the pointer to GPU offloading via Metal.
Just had a follow-up question on your last point: how would setting the -t parameter to 1 yield the best performance? Is it because the new CoreML model leverages the GPU, so fewer CPU threads mean less context switching and less competition for resources? It was around the same processing time as -t 4 when I ran it.
