SpeechLLM

SpeechLLM is a multi-modal Large Language Model (LLM) trained to predict metadata from a speaker's turn in a conversation. It uses a speech encoder to transform speech signals into speech embeddings. These embeddings, combined with text instructions, are then processed by the LLM to generate predictions.

The model takes a 16 kHz speech audio file as input and predicts the following:

SpeechActivity: whether the audio signal contains speech (True/False)
Transcript: ASR transcript of the audio
Gender of the speaker (Female/Male)
Age of the speaker (Young/Middle-Age/Senior)
Accent of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
Emotion of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
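
Since the model expects 16 kHz audio, it can help to check a file before running inference. The following is a minimal sketch using only the Python standard library. The helper name `check_wav_format` is hypothetical and not part of this repository, and it assumes an uncompressed PCM WAV file with mono audio.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000  # 16 kHz, the input rate stated above
EXPECTED_CHANNELS = 1          # mono (assumption made for this check)


def check_wav_format(path):
    """Return (ok, details) for a PCM WAV file at `path` (hypothetical helper)."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        n_frames = wav_file.getnframes()
    details = {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_seconds": n_frames / sample_rate if sample_rate else 0.0,
    }
    # The file passes only if both the sample rate and channel count match.
    ok = sample_rate == EXPECTED_SAMPLE_RATE and channels == EXPECTED_CHANNELS
    return ok, details
```

If the check fails, the audio would need to be resampled or downmixed with a tool of your choice before it is passed to the model.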

Usage

# Load the model directly from Hugging Face
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)

# torchaudio.load returns (waveform, sample_rate); the waveform tensor is the first element
waveform, sample_rate = torchaudio.load("path-to-audio.wav")

model.generate_meta(
    audio_path="path-to-audio.wav",  # 16 kHz, mono
    audio_tensor=waveform,  # [Optional] pass either audio_path or audio_tensor
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Model Generation
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
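
The sample generation above is JSON-like but ends with a trailing comma, which strict JSON parsers reject. Below is a minimal sketch of turning such text into a dictionary and flagging labels outside the sets listed earlier. It assumes the generation is available as a string shaped like the sample. The helper names `parse_generation` and `unexpected_labels` are hypothetical and not part of this repository.

```python
import json
import re

# Label sets copied from the list of predicted attributes in this README.
ALLOWED_LABELS = {
    "SpeechActivity": {"True", "False"},
    "Gender": {"Female", "Male"},
    "Age": {"Young", "Middle-Age", "Senior"},
    "Accent": {"Africa", "America", "Celtic", "Europe", "Oceania",
               "South-Asia", "South-East-Asia"},
    "Emotion": {"Happy", "Sad", "Anger", "Neutral", "Frustrated"},
}


def parse_generation(text):
    """Parse a JSON-like generation string into a dict (hypothetical helper)."""
    # Drop a comma that directly precedes a closing brace or bracket.
    # Caveat: this would also alter such a sequence inside a transcript string.
    cleaned = re.sub(r",\s*([}\]])", r"\1", text.strip())
    return json.loads(cleaned)


def unexpected_labels(prediction):
    """Return the fields whose value is not one of the labels listed above."""
    return {
        field: value
        for field, value in prediction.items()
        if field in ALLOWED_LABELS and value not in ALLOWED_LABELS[field]
    }
```

An empty result from `unexpected_labels` means every categorical field matched a documented label. The Transcript field is free text and is not checked.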

Try the model in the Google Colab notebook. Also, check out our blog on SpeechLLM for end-to-end conversational agents (User Speech -> Response).

Model Weights

We released the speechllm-2B and speechllm-1.5B model checkpoints on Hugging Face 🤗.

Model	Speech Encoder	LLM	Checkpoint URL
speechllm-2B	facebook/hubert-xlarge-ll60k	TinyLlama/TinyLlama-1.1B-Chat-v1.0	Huggingface
speechllm-1.5B	microsoft/wavlm-large	TinyLlama/TinyLlama-1.1B-Chat-v1.0	Huggingface

Latest Checkpoint Result

speechllm-2B

Dataset	Type	Word Error Rate	Gender Acc	Age Acc	Accent Acc
librispeech-test-clean	Read Speech	6.73	0.9496
librispeech-test-other	Read Speech	9.13	0.9217
CommonVoice test	Diverse Accent, Age	25.66	0.8680	0.6041	0.6959

speechllm-1.5B

Dataset	Type	Word Error Rate	Gender Acc	Age Acc	Accent Acc
librispeech-test-clean	Read Speech	11.51	0.9594
librispeech-test-other	Read Speech	16.68	0.9297
CommonVoice test	Diverse Accent, Age	26.02	0.9476	0.6498	0.8121
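
For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below illustrates that standard definition. It is not the evaluation code used for the tables above, and the lowercasing and whitespace tokenization are assumptions, since the text normalization used for the reported numbers is not stated here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: lowercase and split on whitespace; real pipelines may
    # also strip punctuation or normalize numbers.
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the distance between the reference words seen
    # so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)
```

The function returns a fraction, so multiply by 100 to express it as a percentage.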

Training

Dataset Preparation and Installation

Install the packages listed in requirements.txt, making sure they match your CUDA version. Then prepare the audio dataset in the same format as data_samples/train.csv and data_samples/dev.csv. If new tasks (e.g., noise or environment class) have to be added, update dataset.py accordingly.

pip install -r requirements.txt
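
Because the dataset has to follow the layout of the sample CSV files, one quick sanity check is to compare the header of your own CSV against data_samples/train.csv. This is a minimal sketch. The helper names `read_header` and `compare_headers` are hypothetical, and it assumes both files are comma-separated with a header row.

```python
import csv


def read_header(csv_path):
    """Return the first row of a CSV file as a list of column names."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def compare_headers(sample_csv, candidate_csv):
    """Return (missing, extra) column names relative to the sample CSV."""
    sample_columns = read_header(sample_csv)
    candidate_columns = read_header(candidate_csv)
    missing = [name for name in sample_columns if name not in candidate_columns]
    extra = [name for name in candidate_columns if name not in sample_columns]
    return missing, extra
```

For example, `compare_headers("data_samples/train.csv", "my_train.csv")` returns two empty lists when the column names match, where `my_train.csv` stands in for your own file.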

Train

Update the config in train.py, such as audio_encoder_name, llm_name, and other hyperparameters.

python train.py

Evaluation

After training, update the checkpoint path and the test dataset path (same format as train/dev.csv).

python test.py

Infer model in Streamlit app

streamlit run app.py

Disclaimer

The models provided in this repository are not perfect and may produce errors in Automatic Speech Recognition (ASR), gender identification, age estimation, accent recognition, and emotion detection. Additionally, these models may exhibit biases related to gender, age, accent, and emotion. Please use with caution, especially in production environments, and be aware of potential inaccuracies and biases.

License

This project is released under the Apache 2.0 license as found in the LICENSE file. The released checkpoints and code are intended for research purposes, subject to the licenses of the facebook/hubert-xlarge-ll60k, microsoft/wavlm-large, and TinyLlama/TinyLlama-1.1B-Chat-v1.0 models.

Cite

@misc{Rajaa_SpeechLLM_Multi-Modal_LLM,
author = {Rajaa, Shangeth and Tushar, Abhinav},
title = {{SpeechLLM: Multi-Modal LLM for Speech Understanding}},
url = {https://github.com/skit-ai/SpeechLLM}
}
