Transcription for Apple Silicon.
The audio is first segmented into small chunks. For each chunk, the silent parts are removed to produce a new audio clip, and text is then extracted from that clip.
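The chunk-then-trim flow described above can be sketched roughly as follows. This is an illustration only, not the code in this repository: the function names, the fixed-size chunking, and the simple amplitude threshold are all assumptions made for the example, and the actual tool may well use a model-based segmenter and voice activity detector instead.

```python
# Illustrative sketch only; NOT the code from mlx_speech2text.
# All function names, sizes, and thresholds below are hypothetical.


def split_into_chunks(samples, chunk_size):
    """Divide the sample list into consecutive chunks of chunk_size samples."""
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


def remove_silence(chunk, frame_size=4, threshold=0.01):
    """Keep only frames whose mean absolute amplitude reaches the threshold."""
    voiced = []
    for i in range(0, len(chunk), frame_size):
        frame = chunk[i:i + frame_size]
        energy = sum(abs(s) for s in frame) / len(frame)
        if energy >= threshold:
            voiced.extend(frame)
    return voiced


def prepare_chunks(samples, chunk_size, frame_size=4, threshold=0.01):
    """Segment, strip silence per chunk, and discard chunks left empty."""
    chunks = split_into_chunks(samples, chunk_size)
    cleaned = [remove_silence(c, frame_size, threshold) for c in chunks]
    return [c for c in cleaned if c]


# Toy signal: two chunks that are half speech, half silence,
# followed by one chunk that is entirely silent.
signal = (
    [0.5, -0.4, 0.3, -0.2] + [0.0] * 4
    + [0.2, 0.1, -0.3, 0.4] + [0.0] * 4
    + [0.0] * 8
)
result = prepare_chunks(signal, chunk_size=8)
```

In this toy run, `result` holds two trimmed chunks and the fully silent third chunk is dropped. Each remaining chunk is what would then be handed to the speech recognition model.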
$ git clone https://github.com/mbotsu/mlx_speech2text.git
$ cd mlx_speech2text
$ pip install -r requirements.txt
// convert the input to a 16 kHz WAV file
$ ffmpeg -i input.mp4 -ar 16000 out.wav
// run the transcription
$ python speech2text.py -i out.wav -o track -v
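Since the conversion step targets 16 kHz, it can be useful to confirm the sample rate of the WAV file before transcribing. The following is a minimal sketch using Python's standard `wave` module. The helper name `wav_info` is hypothetical and not part of this repository, and the sketch assumes the file is plain PCM WAV, which is what ffmpeg normally writes for a `.wav` output.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def is_16k(path):
    """True if the file's sample rate is 16000 Hz."""
    return wav_info(path)[0] == 16000


# Example call (assumes out.wav exists in the current directory):
# print(wav_info("out.wav"), is_16k("out.wav"))
```

If `is_16k` returns False, rerunning the ffmpeg command with `-ar 16000` should produce a file at the expected rate.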