
Forced alignment of Nda' Nda', a Cameroonian language

Acoustic Model Training and Evaluation

This project provides scripts for training and evaluating acoustic models using monophones and triphones in various configurations. The scripts handle data preparation, feature extraction, model training, alignment, and evaluation, working on data in the Nda' Nda' language.

Project Structure

configs/
    └── Config.json
evaluation/
    ├── evaluation.sh
    └── make_graph.sh
feature_extractions/
    └── features_extractions.sh
logs/
    └── error.log
main.sh
README.md
requirements.sh
requirements.txt
training/
    ├── acoustic_model/
    └── language_model/
Utils/
    ├── data_preparation/
    ├── init_kaldi_project_structure/
    ├── phone_statistics/
    │   └── phone_statistics.sh
    ├── utils.sh
    ├── view_ark_file.py
    └── word_count_occurence.py

Requirements

  • Kaldi (wsj s5 recipe scripts)
  • Python 3.x
  • NumPy
  • Matplotlib
  • See requirements.txt and requirements.sh for the full list of dependencies

Installation

  1. Ensure Kaldi is installed and properly configured. Follow the Kaldi installation guide for detailed instructions.

  2. Clone the repository:

    git clone https://github.com/mende237/Nda-Nda-Force-Aligner.git
    cd Nda-Nda-Force-Aligner
  3. Install the required Python packages into your Python virtual environment:

    pip install -r requirements.txt
  4. Install the shell dependencies. Either install jq directly:

    sudo apt-get install jq

    or make requirements.sh executable and run it:

    chmod +x requirements.sh
    ./requirements.sh
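
After installation, a quick sanity check can save time later. The snippet below is not part of the repository, and the Kaldi path is the illustrative one used throughout this README:

# Optional: confirm jq and the Kaldi wsj s5 scripts are in place.
jq --version
ls /home/dimitri/kaldi/egs/wsj/s5/steps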

Configuration

Configuring Config.json

The Config.json file contains paths to your Kaldi installation and Python virtual environment. Here is an example configuration:

{
    "kaldi_installation_path" : "/home/dimitri/kaldi",
    "python_virtual_environement_path" : "/home/dimitri/NdaNdaForceAligner/myvenv"
}
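
The project's shell scripts need these values at run time, and jq (installed above) is the standard tool for reading them. A minimal sketch of such a lookup; the variable names are illustrative, not necessarily those used in the repository:

# Illustrative only: read the paths from configs/Config.json with jq.
kaldi_path=$(jq -r '.kaldi_installation_path' configs/Config.json)
venv_path=$(jq -r '.python_virtual_environement_path' configs/Config.json)
echo "Kaldi installation: $kaldi_path"
echo "Python virtualenv:  $venv_path"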

Setting Environment Variables

To set the Kaldi installation path as an environment variable, run the following commands, replacing the path with your own Kaldi installation path:

echo 'export KALDI_INSTALLATION_PATH="/home/dimitri/kaldi"' >> ~/.bashrc
source ~/.bashrc

Usage

  1. Prepare your data by placing it in the appropriate directories. You can download the data from this link. The drive folder contains two subfolders, mono and stereo: mono holds the audio data with one channel, and stereo holds the same data with two channels. Using the mono data is recommended.

  2. Run the initialize.sh script located in the init_kaldi_project_structure/ folder. Change into that directory and run the command below, where <project_name> is the name of the folder that will hold all the data subfolders needed to extract features and to train and test an acoustic model in Kaldi format:

    ./initialize.sh <project_name>

    Figure: Kaldi folder structure

  3. Configure the main.sh script with your project-specific settings. To begin, you only need to set the project folder path and the data root path: set the variable project_name to your <project_name>, and set the variable data_root to the root of the data you downloaded from the link above. Example:

data_root="/home/dimitri/Documents/memoire/data/mono"
project_name="test_MFCC_pitch_tone_1_2"

NB: data_root must point directly to the folder containing the file named utterance.txt.
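
A quick optional check, outside the repository's own scripts, that the path is correct (using the same value you gave data_root):

# Should list the file if data_root points at the right folder.
ls "$data_root/utterance.txt"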

  4. Run the main script:
    ./main.sh

This command will automatically perform all tasks from data preparation to evaluation, including feature extraction and model training (acoustic and language models). Four acoustic models will be trained:

  • Monophone HMM
  • Triphone HMM
  • Triphone HMM + SAT
  • Hybrid HMM-DNN

The training results will be stored respectively in the following files:

  • YOUR_KALDI_INSTALLATION_PATH/egs/PROJECT_NAME/exp/train_mono_50_per_spk/decode/scoring_kaldi/best_wer
  • YOUR_KALDI_INSTALLATION_PATH/egs/PROJECT_NAME/exp/train_tri_delta_delta_50_per_spk/decode/scoring_kaldi/best_wer
  • YOUR_KALDI_INSTALLATION_PATH/egs/PROJECT_NAME/exp/train_tri_sat_50_per_spk/decode/scoring_kaldi/best_wer
  • YOUR_KALDI_INSTALLATION_PATH/egs/PROJECT_NAME/exp/tri4_nnet/decode/scoring_kaldi/best_wer

The evaluation metric is the Word Error Rate (WER): the number of word substitutions, deletions, and insertions in the decoded output divided by the number of words in the reference transcript, expressed as a percentage.
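
A convenience loop, not part of the repository, to print all four results at once; it assumes the KALDI_INSTALLATION_PATH variable from the Configuration section and uses the example project name from above:

# Print each model's best WER (illustrative; adjust project_name).
project_name="test_MFCC_pitch_tone_1_2"
for exp in train_mono_50_per_spk train_tri_delta_delta_50_per_spk \
           train_tri_sat_50_per_spk tri4_nnet; do
    cat "$KALDI_INSTALLATION_PATH/egs/$project_name/exp/$exp/decode/scoring_kaldi/best_wer"
done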

Nda' Nda'

The Nda' Nda' language is spoken in the Western region of Cameroon, spread across four departments: Ndé, with the villages of Bangoua, Bamena, Balengou, Bazou, and Batchingou; Hauts-Plateaux, with the villages of Bangou, Batoufam, and Bandrefam; Koung-Khi, with the villages of Bangang-Fongang and Bangang-Fondji; and Haut-Nkam, with the village of Batcha. In 1990, the number of speakers was estimated at 10,000 (source). It is a tonal language with four tones: high (ˊ), low (ˋ), low-high (ˇ), and high-low (ˆ). For example, kwé means "eaten," mbɛ̀ means "meat," and kúndyə̂ means "bed."

Figure: geographic distribution of the Nda' Nda' language

Experiments

Feature Extractions

In speech processing, the choice of features is crucial to the performance of a speech recognition system. Among the many possible features, this project favors MFCCs, pitch, and delta and delta-delta parameters, with CMVN applied. CMVN, the combination of cepstral mean normalization (CMN) and cepstral variance normalization (CVN), normalizes recording-environment variations such as changes in volume and noise, improving the robustness of the models.

  • MFCC: These coefficients are widely used in speech processing due to their ability to compactly represent spectral information and simulate human sound perception. They effectively capture timbre and tone variations, making them suitable for ASR systems.
  • Pitch: As shown by Ghahremani et al. (2014), pitch features can be useful for ASR systems, especially for tonal languages like Vietnamese and Cantonese. Ignoring this feature could result in the loss of crucial information, particularly in linguistic contexts where tone changes word meanings.
  • Delta and Delta-Delta: In addition to spectral coefficients, first-order (delta) and second-order (delta-delta) regression coefficients are added to heuristically compensate for the conditional independence assumption made by HMM-based acoustic models.

The acoustic feature vector extracted from each recording frame has dimension 13. When combining MFCCs with pitch features, the number of MFCC coefficients is reduced to 10 so that the three pitch-related coefficients keep the total dimension at 13. This dimension was determined empirically, as performance degraded beyond 13. With the delta and delta-delta coefficients appended, the vector dimension grows from 13 to 40.
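
As a rough sketch of how such a pipeline is typically run with the Kaldi wsj s5 helper scripts (the project's actual commands and options live in feature_extractions/features_extractions.sh; everything below is illustrative):

# Illustrative only; conf/mfcc.conf would contain e.g. --num-ceps=10
# when pitch features are appended.
steps/make_mfcc_pitch.sh --nj 4 --mfcc-config conf/mfcc.conf \
    --pitch-config conf/pitch.conf data/train exp/make_mfcc_pitch mfcc_pitch
steps/compute_cmvn_stats.sh data/train exp/make_mfcc_pitch mfcc_pitch
# Delta and delta-delta coefficients are appended later, at training
# time (e.g. by add-deltas inside the training scripts).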

Tone Integration

The integration of tone is performed during the training of the triphone model, specifically during the construction of the phonetic decision tree. Two groups of questions were integrated:

  • Similar to Luong et al. (2016), the first group contains phones carrying the same tone and phones sharing the same base vowel. The table below shows the grouping of phonetic questions: the first column groups vowels carrying the same tone, while the second groups variants of the same base vowel.

    Vowels with the same tone     Same base vowel
    é ə́ í á é ú ó ɛ́ ɨ́ ɔ́          ɔ́ ɔ̀ ɔ̌
    ɔ̀ ɛ̀ ə̀ ò à                    á à
    î ɛ̂ ə̂ ɔ̂                      ə̂ ə́ ə̀
    ɔ̌ ǔ                          í î
                                  ó ò
                                  ɛ́ ɛ̀ ɛ̂
                                  ǔ ú
  • The second group focused on sound categories that describe the articulatory and acoustic characteristics of phonemes. The sound categories used are nasals and fricatives:

Nasals    Fricatives
m n ŋ     f s z
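
In a Kaldi recipe, question sets like these are usually supplied as extra phonetic questions, one group of phones per line (conventionally in data/local/dict/extra_questions.txt; whether this project uses that exact file and these exact symbols is an assumption). The groups above would then look like:

é ə́ í á é ú ó ɛ́ ɨ́ ɔ́
ɔ̀ ɛ̀ ə̀ ò à
î ɛ̂ ə̂ ɔ̂
ɔ̌ ǔ
ɔ́ ɔ̀ ɔ̌
á à
ə̂ ə́ ə̀
í î
ó ò
ɛ́ ɛ̀ ɛ̂
ǔ ú
m n ŋ
f s z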

Phone Statistics

Monophone Distribution

Figure: monophone distribution

Distribution of the 40 Most Common Triphones

Figure: distribution of the 40 most common triphones

Tone Distribution

Figure: tone distribution

Results of Experiments

Model            MFCC     +pitch    +tone
HMM-monophone    12.70    11.89     -
HMM-triphone      9.73     8.92     8.92
HMM + SAT        12.70    12.97     12.70
HMM-DNN          21.35    19.73     18.65

Table: Summary of WER results (%)

Scripts Overview

  • main.sh: Orchestrates the entire process, including data preparation, feature extraction, model training, alignment, and evaluation.

  • Utils/data_preparation/main_data_preparation.sh: Prepares the data for training and evaluation.

  • Utils/feature_extractions/features_extractions.sh: Extracts features from the data for training and testing.

  • training/acoustic_model/monophone_training.sh: Trains monophone acoustic models.

  • training/acoustic_model/triphone_training.sh: Trains triphone acoustic models with various configurations.

  • training/acoustic_model/align.sh: Aligns the data using the trained models.

  • evaluation/evaluation.sh: Evaluates the trained models on test data.

  • evaluation/make_graph.sh: Constructs the graph for decoding.

  • Other scripts: Check other scripts inside the project for additional functionalities.

NB: The code for training the language model was taken from source.

Training Configuration

The main.sh script contains several configuration options, such as:

  • project_name: Name of the project.
  • data_root: Root directory for the data.
  • nbr_job_feature_extraction: Number of jobs for feature extraction.
  • nbr_job_training: Number of jobs for training.
  • add_question: Option to add questions for tone integration.
  • add_pitch_feature: Option to add pitch features.
  • training_type: Type of training to perform (1-5).

For more details, check the main.sh file to see all configurable variables.
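
A hypothetical excerpt of that configurable block; the variable names come from the list above, the two paths from the Usage example, and the remaining values are purely illustrative:

# Illustrative values only; see main.sh for the real defaults.
project_name="test_MFCC_pitch_tone_1_2"
data_root="/home/dimitri/Documents/memoire/data/mono"
nbr_job_feature_extraction=4   # parallel jobs for feature extraction
nbr_job_training=4             # parallel jobs for training
add_question=true              # integrate tone questions into the decision tree
add_pitch_feature=true         # append pitch features to the MFCCs
training_type=5                # 1-5: which training pipeline to run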

Key References

  • Tchoutouo Ketchassop, Anne Christelle. "L'aménagement linguistique de l'aire Nda' Nda' : mesure de l'intelligibilité des parlers à la standardisation." PhD thesis, Université de Dschang, Études Africaines et Mondialisation, 2021. Supervised by Beban Sammy Chumbow.

  • Tchagoua, Guy Merlin. "Usage des langues chez les locuteurs Nda’nda’." Sociolinguistics in African Contexts: Perspectives and Challenges, pages 73-86, 2017. Springer.

  • Luong, Hieu-Thi and Vu, Hai-Quan. "A non-expert Kaldi recipe for Vietnamese speech recognition system." In Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI/OIAF4HLT2016), pages 51-55, 2016.

  • Ghahremani, Pegah, et al. "A pitch extraction algorithm tuned for automatic speech recognition." In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2494-2498. IEEE, 2014.

  • Gales, Mark and Young, Steve. "The application of hidden Markov models in speech recognition." Foundations and Trends® in Signal Processing, vol. 1, no. 3, pages 195-304, 2008. Now Publishers, Inc.

  • Rabiner, Lawrence R and Schafer, Ronald W. "Introduction to digital speech processing." Foundations and Trends® in Signal Processing, vol. 1, no. 1-2, pages 1-194, 2007. Now Publishers, Inc.

  • Morgan, Nelson and Bourlard, Herve. "Continuous speech recognition." IEEE Signal Processing Magazine, vol. 12, no. 3, pages 24-42, 1995. IEEE.

Contributing

Contributions are welcome! Please fork the repository and submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.