Skip to content

Commit

Permalink
move evaluation scripts (#842)
Browse files Browse the repository at this point in the history
Co-authored-by: root <root@idc708073.jf.intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  • Loading branch information
3 people authored Sep 19, 2024
1 parent 872e93e commit f04f061
Show file tree
Hide file tree
Showing 41 changed files with 478 additions and 7,684 deletions.
51 changes: 51 additions & 0 deletions AudioQnA/benchmark/accuracy/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# AudioQnA accuracy Evaluation

AudioQnA is an example that demonstrates the integration of Generative AI (GenAI) models for performing question-answering (QnA) on audio scene, which contains Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The following is the piepline for evaluating the ASR accuracy.

## Dataset

We evaluate the ASR accuracy on the test set of librispeech [dataset](https://huggingface.co/datasets/andreagasparini/librispeech_test_only), which contains 2620 records of audio and texts.

## Metrics

We evaluate the WER (Word Error Rate) metric of the ASR microservice.

## Evaluation

### Launch ASR microservice

Launch the ASR microserice with the following commands. For more details please refer to [doc](https://github.com/opea-project/GenAIComps/tree/main/comps/asr).

```bash
git clone https://github.com/opea-project/GenAIComps
cd GenAIComps
docker build -t opea/whisper:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/asr/whisper/Dockerfile .
# change the name of model by editing model_name_or_path you want to evaluate
docker run -p 7066:7066 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/whisper:latest --model_name_or_path "openai/whisper-tiny"
```

### Evaluate

Install dependencies:

```
pip install -r requirements.txt
```

Evaluate the performance with the LLM:

```py
# validate the offline model
# python offline_evaluate.py
# validate the online asr microservice accuracy
python online_evaluate.py
```

### Performance Result

Here is the tested result for your reference
|| WER |
| --- | ---- |
|whisper-large-v2| 2.87|
|whisper-large| 2.7 |
|whisper-medium| 3.45 |
35 changes: 35 additions & 0 deletions AudioQnA/benchmark/accuracy/local_eval.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import torch
from datasets import load_dataset
from evaluate import load
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

MODEL_NAME = "openai/whisper-large-v2"

librispeech_test_clean = load_dataset(
"andreagasparini/librispeech_test_only", "clean", split="test", trust_remote_code=True
)
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to(device)


def map_to_pred(batch):
audio = batch["audio"]
input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
batch["reference"] = processor.tokenizer._normalize(batch["text"])

with torch.no_grad():
predicted_ids = model.generate(input_features.to(device))[0]
transcription = processor.decode(predicted_ids)
batch["prediction"] = processor.tokenizer._normalize(transcription)
return batch


result = librispeech_test_clean.map(map_to_pred)

wer = load("wer")
print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
56 changes: 56 additions & 0 deletions AudioQnA/benchmark/accuracy/online_eval.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import base64
import json

import requests
import torch
from datasets import load_dataset
from evaluate import load
from pydub import AudioSegment
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_NAME = "openai/whisper-large-v2"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)

librispeech_test_clean = load_dataset(
"andreagasparini/librispeech_test_only", "clean", split="test", trust_remote_code=True
)


def map_to_pred(batch):
batch["reference"] = processor.tokenizer._normalize(batch["text"])

file_path = batch["file"]
# process the file_path
pidx = file_path.rfind("/")
sidx = file_path.rfind(".")

file_path_prefix = file_path[: pidx + 1]
file_path_suffix = file_path[sidx:]
file_path_mid = file_path[pidx + 1 : sidx]
splits = file_path_mid.split("-")
file_path_mid = f"LibriSpeech/test-clean/{splits[0]}/{splits[1]}/{file_path_mid}"

file_path = file_path_prefix + file_path_mid + file_path_suffix

audio = AudioSegment.from_file(file_path)
audio.export("tmp.wav")
with open("tmp.wav", "rb") as f:
test_audio_base64_str = base64.b64encode(f.read()).decode("utf-8")

inputs = {"audio": test_audio_base64_str}
endpoint = "http://localhost:7066/v1/asr"
response = requests.post(url=endpoint, data=json.dumps(inputs), proxies={"http": None})

result_str = response.json()["asr_result"]

batch["prediction"] = processor.tokenizer._normalize(result_str)
return batch


result = librispeech_test_clean.map(map_to_pred)

wer = load("wer")
print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
8 changes: 8 additions & 0 deletions AudioQnA/benchmark/accuracy/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
datasets
evaluate
jiwer
librosa
pydub
soundfile
torch
transformers
Loading

0 comments on commit f04f061

Please sign in to comment.