Solution for SoICT Hackathon 2023 (Track NLU - Natural Language Understanding)

quocanh34/soict-SLU

SoICT Hackathon 2023 Track NLU Solutions

1. Inference

* Note: Checkpoints links

Run the full inference pipeline with the provided shell scripts:

#set up requirements
chmod +x scripts/run_commands.sh
scripts/run_commands.sh
chmod +x scripts/predict.sh
scripts/predict.sh
The results are written to the file "predictions.jsonl" in the folder training/soict_hackathon_JointIDSF/.
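To confirm that inference produced output, a quick record count can help. This is a minimal sketch, assuming predictions.jsonl follows the usual JSON Lines layout of one record per line; `count_predictions` is a hypothetical helper and not part of the repository.

```shell
# Hypothetical helper (not shipped with the repository): count the records
# in a JSON Lines file. The default path is the one named in this README.
count_predictions() {
    file="${1:-training/soict_hackathon_JointIDSF/predictions.jsonl}"
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file" >&2
        return 1
    fi
    # Assumption: one JSON record per line, so non-blank lines are records.
    grep -c -v '^[[:space:]]*$' "$file"
}
```

Calling `count_predictions` with no argument from the repository root checks the default output path; pass another path to check a different file.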

2. Training

2.1 Train ASR

More detailed training instructions are in the README.md of this folder.

cd training/ASR-Wav2vec-Finetune
chmod +x asr_train.sh
./asr_train.sh
cd ../..

2.2 Train spoken-norm

More detailed training instructions are in the README.md of this folder.

cd training/norm-tuned
chmod +x norm_train.sh
./norm_train.sh
cd ../..

2.3 Train NLU

More detailed training instructions are in the README.md of this folder.

cd training
chmod 755 -R soict_hackathon_JointIDSF
cd soict_hackathon_JointIDSF
# Important: before running nlu_train.sh, delete the models/ and
# data_aug_full_0919_22/ folders if they exist.
rm -rf models/
rm -rf data_aug_full_0919_22/
chmod +x nlu_train.sh
./nlu_train.sh
cd ../..
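The three training steps above share one pattern: enter a folder, make the script executable, run it, and return. A minimal sketch of that pattern follows, assuming you would rather not track the working directory by hand; `run_in_dir` is a hypothetical helper and not part of the repository.

```shell
# Hypothetical helper: run a script from inside its own folder.
# The subshell means the caller's working directory is unchanged,
# even when the script fails partway through.
run_in_dir() {
    dir="$1"
    script="$2"
    (
        cd "$dir" || exit 1
        chmod +x "$script" || exit 1
        "./$script"
    )
}

# Usage mirroring the steps above (commented out because each call
# starts a full training run):
# run_in_dir training/ASR-Wav2vec-Finetune asr_train.sh
# run_in_dir training/norm-tuned norm_train.sh
```

The NLU step still needs its stale folders removed first, so the helper does not replace the cleanup commands in section 2.3.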

3. Data synthesis

3.1 Installation

cd synthesis-data-for-ASR
pip install -r requirements.txt

3.2 Create data

# Set HF_TOKEN to your own Hugging Face access token before running these commands.
CUDA_VISIBLE_DEVICES=0 python create_transcription_wer.py --data_links="thanhduycao/soict_train_dataset" --output_path="thanhduycao/soict_train_dataset_with_wer_validate" --token="$HF_TOKEN" --num_workers=2

CUDA_VISIBLE_DEVICES=0 python lyric-alignment/predict.py --data_links="thanhduycao/soict_train_dataset_with_wer_validate" --output_path="thanhduycao/data_for_synthesis_with_entities_align_v5_validate" --token="$HF_TOKEN" --num_workers=4

CUDA_VISIBLE_DEVICES=0 python create_entity_dataset.py --data_links="thanhduycao/data_for_synthesis_with_entities_align_v5_validate" --output_path="thanhduycao/data_for_synthesis_entities_validate" --token="$HF_TOKEN" --num_workers=1

CUDA_VISIBLE_DEVICES=0 python create_synthesis_dataset.py --data_links="thanhduycao/data_for_synthesis_with_entities_align_v5_validate" --output_path="thanhduycao/data_synthesis_validate" --token="$HF_TOKEN" --num_workers=1
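The commands above form a chain: the later steps read the datasets written by the earlier ones, so a failed step should stop the run. A minimal sketch of a fail-fast wrapper follows, assuming the access token is supplied through an environment variable. The name `HF_TOKEN` and the helpers `require_token` and `run_step` are choices made for this sketch, not part of the repository.

```shell
# Assumption: the Hugging Face token is supplied via the HF_TOKEN variable.
require_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set" >&2
        return 1
    fi
}

# Run one labeled step and report failure. Only the label is printed,
# never the command line, so the token does not end up in logs.
run_step() {
    label="$1"
    shift
    echo "[$label] starting" >&2
    "$@" || { echo "[$label] failed" >&2; return 1; }
}

# Usage (commented out because it downloads and uploads datasets):
# require_token &&
# run_step wer env CUDA_VISIBLE_DEVICES=0 python create_transcription_wer.py \
#     --data_links="thanhduycao/soict_train_dataset" \
#     --output_path="thanhduycao/soict_train_dataset_with_wer_validate" \
#     --token="$HF_TOKEN" --num_workers=2
```

Chaining the remaining steps with `&&` in the same way stops the run at the first failure.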
