Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction (ACL24)
To avoid overlooking human communication nuances and misinterpreting speakers' intentions, we propose PerceptiveAgent, an empathetic multi-modal dialogue system designed to discern deeper or more subtle meanings beyond the literal interpretations of words through the integration of speech modality perception. Employing LLMs as a cognitive core, PerceptiveAgent perceives acoustic information from input speech and generates empathetic responses based on speaking styles described in natural language.
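At a high level, a dialogue turn flows through three stages: a speech captioner that describes how the input was said, an LLM core that reasons over transcripts plus captions, and an expressive synthesizer that renders the reply. The sketch below only illustrates that data flow; every function name and string in it is a placeholder, not the repo's API.

```python
# Illustrative data flow only -- every function here is a placeholder, not the repo's API.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Turn:
    transcript: str   # what was said (ASR text)
    caption: str      # how it was said (speaking style in natural language)

def perceive(audio_path: str) -> Turn:
    """Stage 1 (speech captioner): describe the prosody/emotion of the input speech."""
    return Turn("I'm fine.", "spoken slowly, in a low and tired voice")

def respond(history: List[Turn]) -> Tuple[str, str]:
    """Stage 2 (LLM core): produce a reply plus the speaking style to render it with."""
    return ("You sound exhausted. Want to talk about it?", "gentle, warm, a little slower")

def synthesize(text: str, style: str) -> bytes:
    """Stage 3 (MSMA synthesizer): render the reply expressively (stub)."""
    return b""

turn = perceive("user_turn.wav")
reply, style = respond([turn])
audio = synthesize(reply, style)
print(reply, "|", style)
```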
- Speech Captioner: download it from GoogleDrive
- Pretrained Synthesizer: download it from GoogleDrive
- MSMA Synthesizer (Finetuned Synthesizer): download it from GoogleDrive
```bash
git clone https://github.com/Haoqiu-Yan/PerceptiveAgent.git
cd PerceptiveAgent
```
Due to package compatibility constraints, we create two separate virtual environments. We recommend running on Linux with conda. Both environments use Python 3.8 and Torch 1.13.1 (CUDA).
- Environment for the Speech Captioner & Chatting with LLM

  ```bash
  conda create -n cap38 python=3.8
  conda activate cap38
  pip install -r cap_requirement.txt
  ```
- Environment for the MSMA Synthesizer

  ```bash
  conda create -n syn38 python=3.8
  conda activate syn38
  pip install -r syn_requirement.txt
  ```
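
A quick way to confirm that the active environment matches the versions above (a generic check, not a repo script):

```python
# Generic sanity check: confirm the interpreter and PyTorch build in the active env.
import sys
import torch

print("python :", sys.version.split()[0])        # expect 3.8.x
print("torch  :", torch.__version__)             # expect 1.13.1 (+cu build)
print("cuda ok:", torch.cuda.is_available())     # expect True on a GPU machine
```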
- For the TextrolSpeech dataset, which provides `caption.csv`, the following command reads captions one by one and saves them as `audio_name.json` files in the output directory `$MIXDIR`. The audio files are copied to `$MIXDIR` as well.

  ```bash
  DATA_ROOT=/path/datasets/textrolspeech
  ALLCSV=${DATA_ROOT}/caption/random_train.csv
  MIXDIR=/path/to/save

  python ./captioner/dataset/process.py --dataset textrol --data_dir ${DATA_ROOT} \
      --json_path $ALLCSV \
      --saveto $MIXDIR
  ```
- For other datasets that do not provide a `caption.csv`, create an empty caption file before processing. Take the EXPRESSO dataset as an example:

  ```bash
  python data_preprocess/create_caption_expresso.py \
      --audio-root /path/expresso/merge_audio_48khz/ \
      --transcript-path /path/expresso/read_transcriptions.txt \
      --saveas /path/expresso/caption/random_read_all.csv
  ```

  Then, change the `--dataset` argument to the corresponding dataset:

  ```bash
  DATA_ROOT=/path/expresso
  ALLCSV=/path/expresso/caption/random_read_all.csv
  MIXDIR=/path/to/save

  python ./captioner/dataset/process.py --dataset expresso --data_dir ${DATA_ROOT} \
      --json_path $ALLCSV \
      --saveto $MIXDIR
  ```
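
After preprocessing, `$MIXDIR` should contain one `audio_name.json` file per clip alongside the copied audio. A quick, generic way to spot-check the output (not a repo script; the JSON schema is whatever `process.py` writes):

```python
# Generic spot-check of the preprocessed output directory ($MIXDIR).
import json
from pathlib import Path

mixdir = Path("/path/to/save")                   # the $MIXDIR used above
json_files = sorted(mixdir.glob("*.json"))
print(f"{len(json_files)} caption json files found")

if json_files:
    with open(json_files[0]) as f:
        sample = json.load(f)
    # Print only the top-level keys; the exact schema is produced by process.py.
    print(json_files[0].name, "->",
          list(sample.keys()) if isinstance(sample, dict) else type(sample).__name__)
```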
To update `llama_model` in `configs/capsp_train_gpt2.yaml`, download the pretrained Vicuna weights following the instructions in BuboGPT.

To update `ABSOLUTE_PATH_OF_bubogpt_7b` in `configs/capsp_infer.yaml`, download the pretrained bubogpt_7b checkpoint from the link.
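
Before training, you can verify that both config edits took effect. The snippet below is a generic check, not a repo utility; it only assumes the `llama_model` key and the `ABSOLUTE_PATH_OF_bubogpt_7b` placeholder mentioned above.

```python
# Generic check that the captioner configs were updated (illustrative, not a repo script).
import yaml

def find_key(node, key):
    """Recursively look up `key` anywhere in a nested YAML structure."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        for v in node.values():
            found = find_key(v, key)
            if found is not None:
                return found
    elif isinstance(node, list):
        for v in node:
            found = find_key(v, key)
            if found is not None:
                return found
    return None

with open("configs/capsp_train_gpt2.yaml") as f:
    train_cfg = yaml.safe_load(f)
print("llama_model ->", find_key(train_cfg, "llama_model"))

# The inference config should no longer contain the placeholder path.
with open("configs/capsp_infer.yaml") as f:
    infer_text = f.read()
print("placeholder still present:", "ABSOLUTE_PATH_OF_bubogpt_7b" in infer_text)
```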
Then, run the following command to train a speech captioner:

```bash
bash scripts/captioner_train.sh ${CUDA_ID} ${CUDA_NUM}
```
Run inference with the following command:

```bash
bash scripts/captioner_infer.sh ${CUDA_ID} ${CUDA_NUM}
```
To integrate captions into the dialogue history, chat with ChatGPT using the following command:

```bash
python ./chat_llm/conversation_chat.py \
    --id-convs ./input_egs/id_convs_eg.txt \
    --audio-asr-caption ./input_egs/audio_asr_caption_eg.txt \
    --save-root /path/to/savedir > ./logs/MMDD-2200pm.log
```
Alternatively, to send only the dialogue history (without captions) to ChatGPT, run:

```bash
python ./chat_llm/conversation_chat_text-only.py \
    --id-convs ./input_egs/id_convs_eg.txt \
    --audio-asr-caption ./input_egs/audio_asr_caption_eg.txt \
    --save-root /path/to/savedir > ./logs/MMDD-2200pm.log
```
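
For intuition, this is roughly what "integrating captions into the dialogue history" looks like when calling the Chat Completions API. The prompt wording, model name, and message layout below are illustrative assumptions, not the exact format used by `conversation_chat.py`.

```python
# Illustrative sketch only: how captions might be woven into the prompt sent to ChatGPT.
# The actual prompt format lives in ./chat_llm/conversation_chat.py.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

history = [
    {"speaker": "A", "asr": "I passed the exam.", "caption": "fast, high-pitched, excited"},
    {"speaker": "B", "asr": "Oh, did you?", "caption": "flat, slow, uninterested"},
]

lines = [
    f'{turn["speaker"]}: {turn["asr"]} (speaking style: {turn["caption"]})'
    for turn in history
]
prompt = (
    "Continue the conversation empathetically, taking the speaking styles into account.\n"
    + "\n".join(lines)
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```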
- The training phase takes speech as input. To encode speech into discrete acoustic units, run the following command. The HuBERT model, specified by the argument `DENSE_NAME` in the shell script `expresso_hubert_gen.sh`, is downloaded automatically.

  ```bash
  bash ./scripts/expresso_hubert_gen.sh
  ```
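
  For intuition, the sketch below shows the general speech-to-unit idea (HuBERT frame features quantized into discrete IDs with k-means). It is only an illustration, not the repo's pipeline: `expresso_hubert_gen.sh` uses its own HuBERT checkpoint and pretrained k-means units, and the file name below is a placeholder.

  ```python
  # Illustrative sketch: turn a (reasonably long, mono) speech clip into discrete units.
  import torch
  import torchaudio
  from sklearn.cluster import KMeans

  bundle = torchaudio.pipelines.HUBERT_BASE
  model = bundle.get_model().eval()

  wav, sr = torchaudio.load("example.wav")                 # placeholder file name
  wav = wav.mean(dim=0, keepdim=True)                      # downmix to mono
  if sr != bundle.sample_rate:
      wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)

  with torch.inference_mode():
      features, _ = model.extract_features(wav)            # per-layer frame features
  frames = features[6].squeeze(0).numpy()                  # one intermediate layer, (T, D)

  # The real pipeline quantizes with pretrained k-means centroids; fitting on the fly
  # here is only to show how continuous frames become discrete unit IDs.
  units = KMeans(n_clusters=100, n_init=10, random_state=0).fit_predict(frames)

  # Collapse consecutive duplicates, as is common for unit sequences.
  deduped = [int(units[0])] + [int(u) for prev, u in zip(units, units[1:]) if u != prev]
  print(deduped[:20])
  ```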
- The inference phase takes text as input. To transform text into discrete acoustic units, we trained a text-to-unit (T2U) model, which can be downloaded from GoogleDrive.

  The `spm` model trained by us can also be downloaded from GoogleDrive; otherwise, you can train your own with:

  ```bash
  spm_train --input=$ALL_TEXT --model_prefix=spm_bpe_1k --vocab_size=1000 --character_coverage=1.0 --model_type=bpe
  ```

  Then, run the following command to transform text into units:

  ```bash
  bash ./scripts/t2u_infer.sh
  ```
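
Either way, you can sanity-check the resulting `spm_bpe_1k.model` with the `sentencepiece` Python API before running T2U inference (an illustrative check, not part of the repo's scripts; the sample sentence is arbitrary):

```python
# Illustrative check of the BPE model used by the T2U front end.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_bpe_1k.model")
text = "I am so happy to see you again."
print(sp.encode(text, out_type=str))   # subword pieces
print(sp.encode(text, out_type=int))   # corresponding token ids
```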
- Use the EXPRESSO, LJSpeech and VCTK datasets to pretrain a vocoder conditioned on emotion and speaker labels. Note that `batch_size` in `$CONFIG_FILE` is the product of the per-GPU batch size and the number of GPUs.

  ```bash
  CONFIG_FILE=./configs/synthesizer_pretrain_config.json
  OUTPUT_DIR=/path/to/save

  python -m torch.distributed.launch --nproc_per_node $GPUS --master_port=29502 \
      synthesizer/examples/pretrain/amp_train.py \
      --checkpoint_path $OUTPUT_DIR \
      --config $CONFIG_FILE \
      --training_epochs 2000 \
      --validation_interval 5000 \
      --checkpoint_interval 25000
  ```
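
  For example, a per-GPU batch of 16 on 4 GPUs means `batch_size` should be set to 64. A small illustrative check (only the `batch_size` key is taken from the note above; the GPU count is whatever you pass to `--nproc_per_node`):

  ```python
  # Illustrative only: relate the config's total batch_size to the per-GPU batch.
  import json

  num_gpus = 4                                   # the $GPUS passed to --nproc_per_node
  with open("./configs/synthesizer_pretrain_config.json") as f:
      cfg = json.load(f)

  print("total batch_size:", cfg["batch_size"])
  print("per-GPU batch   :", cfg["batch_size"] // num_gpus)
  ```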
- Use the EXPRESSO dataset to finetune the above vocoder, additionally conditioned on pitch, energy and speed labels. To finetune from the latest checkpoint, add `--from-latest-ckpt` to the following command.

  ```bash
  CONFIG_FILE=./configs/synthesizer_finetune_config.json
  OUTPUT_DIR=/path/to/save

  python -m torch.distributed.launch --nproc_per_node $GPUS \
      synthesizer/examples/mcond_expresso/amp_train.py \
      --checkpoint_path $OUTPUT_DIR \
      --config $CONFIG_FILE \
      --training_epochs 2000 \
      --validation_interval 5000 \
      --checkpoint_interval 25000 \
      # --from-latest-ckpt
  ```
- Infer with the pretrained model.

  ```bash
  CUDA_ID=$1
  GPUS=$2

  INPUT_CODE_FILE=./input_egs/syntheizer_pretrain_val.txt
  ckpt=g_00400000
  CHECKPOINT_FILE=/path/of/ckptdir/${ckpt}
  OUTPUT_DIR=/path/to/savedir
  mkdir $OUTPUT_DIR

  CUDA_VISIBLE_DEVICES=$CUDA_ID python ./synthesizer/examples/pretrain/inference_example.py \
      --input_code_file $INPUT_CODE_FILE \
      --checkpoint_file $CHECKPOINT_FILE \
      --output_dir $OUTPUT_DIR \
      --num-gpu $GPUS
  ```
- Infer with the finetuned model.

  ```bash
  CUDA_ID=$1
  GPUS=$2

  INPUT_CODE_FILE=./input_egs/syntheizer_finetune_dev.txt
  ckpt=g_00200000
  CHECKPOINT_FILE=/path/of/ckptdir/${ckpt}
  OUTPUT_DIR=/path/to/savedir
  mkdir $OUTPUT_DIR

  CUDA_VISIBLE_DEVICES=$CUDA_ID python ./synthesizer/examples/mcond_expresso/inference_example.py \
      --input_code_file $INPUT_CODE_FILE \
      --checkpoint_file $CHECKPOINT_FILE \
      --output_dir $OUTPUT_DIR \
      --num-gpu $GPUS \
      --dur-prediction
  ```
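
To spot-check the synthesized audio, you can list what landed in `$OUTPUT_DIR` (a generic snippet; it assumes the inference scripts write `.wav` files there):

```python
# Generic spot-check of the synthesized audio in $OUTPUT_DIR (assumes .wav output).
from pathlib import Path
import torchaudio

out_dir = Path("/path/to/savedir")                 # the $OUTPUT_DIR used above
for wav_path in sorted(out_dir.glob("*.wav"))[:5]:
    info = torchaudio.info(str(wav_path))
    print(f"{wav_path.name}: {info.num_frames / info.sample_rate:.2f}s @ {info.sample_rate} Hz")
```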
Haoqiu Yan*, Yongxin Zhu*, Kai Zheng, Bing Liu, Haoyu Cao, Deqiang Jiang and Linli Xu†. (*Equal Contribution, †Corresponding Author)
If you use the code or models from this project in your research, please cite our work as follows:
```bibtex
@article{yan2024talk,
  title={Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction},
  author={Yan, Haoqiu and Zhu, Yongxin and Zheng, Kai and Liu, Bing and Cao, Haoyu and Jiang, Deqiang and Xu, Linli},
  journal={arXiv preprint arXiv:2406.12707},
  year={2024}
}
```
PerceptiveAgent is distributed under the Apache License.
This repo is developed based on the following repos: