PerceptiveAgent

Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction (ACL24)

[Arxiv] [Watch Case Study]

To avoid overlooking the nuances of human communication and misinterpreting speakers' intentions, we propose PerceptiveAgent, an empathetic multi-modal dialogue system that discerns deeper or more subtle meanings beyond the literal interpretation of words by integrating speech-modality perception. Employing an LLM as its cognitive core, PerceptiveAgent perceives acoustic information from input speech and generates empathetic responses based on speaking styles described in natural language.

(Figure: PerceptiveAgent framework overview)

Models

  • Speech Captioner: download it from GoogleDrive
  • Pretrained Synthesizer: download it from GoogleDrive
  • MSMA Synthesizer (Finetuned Synthesizer): download it from GoogleDrive
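
The scripts below reference checkpoints by explicit paths, so no fixed layout is required; one illustrative arrangement (the directory names are assumptions, not repo conventions) is:

mkdir -p checkpoints/captioner checkpoints/synthesizer_pretrain checkpoints/msma_synthesizer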

Getting Started

Clone this repository.

git clone https://github.com/Haoqiu-Yan/PerceptiveAgent.git
cd PerceptiveAgent

Configure the environments.

Due to package compatibility constraints, we create two separate virtual environments. We recommend running on Linux with conda. Both environments use Python 3.8 and PyTorch 1.13.1 (CUDA).

  1. Environment of Speech Captioner & Chatting with LLM
conda create -n cap38 python=3.8
conda activate cap38

pip install -r cap_requirement.txt
  2. Environment of MSMA Synthesizer
conda create -n syn38 python=3.8
conda activate syn38

pip install -r syn_requirement.txt
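
Either environment can be sanity-checked after installation; this only verifies the Torch build and CUDA visibility stated above:

conda activate cap38   # or syn38
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"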

Speech Captioner

Data Preprocessing

  1. For the TextrolSpeech dataset, which provides caption.csv, the following command reads the captions one by one and saves each as an audio_name.json file in the output directory $MIXDIR; the corresponding audio files are copied to $MIXDIR as well. (A quick sanity check of the output is sketched after this list.)

    DATA_ROOT=/path/datasets/textrolspeech
    ALLCSV=${DATA_ROOT}/caption/random_train.csv
    MIXDIR=/path/to/save
    python ./captioner/dataset/process.py --dataset textrol --data_dir ${DATA_ROOT} \
        --json_path $ALLCSV \
        --saveto $MIXDIR
  2. For other datasets that do not provide caption.csv, create a caption file before processing. Take the EXPRESSO dataset as an example:

    python data_preprocess/create_caption_expresso.py \
        --audio-root /path/expresso/merge_audio_48khz/ \
        --transcript-path /path/expresso/read_transcriptions.txt \
        --saveas /path/expresso/caption/random_read_all.csv

    Then change the --dataset argument to the corresponding dataset.

    DATA_ROOT=/path/expresso
    ALLCSV=/path/expresso/caption/random_read_all.csv
    MIXDIR=/path/to/save
    python ./captioner/dataset/process.py --dataset expresso --data_dir ${DATA_ROOT} \
        --json_path $ALLCSV \
        --saveto $MIXDIR
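
After preprocessing, every audio file copied into $MIXDIR should sit next to a caption JSON with the same basename. A minimal sanity check (it assumes nothing about the JSON fields, only that the files parse):

ls $MIXDIR | head
python -c "import glob, json; p = sorted(glob.glob('$MIXDIR/*.json'))[0]; print(p); print(json.load(open(p)))"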

Training

To update llama_model in configs/capsp_train_gpt2.yaml, download the pretrained Vicuna weights following the instructions in BuboGPT.

To update ABSOLUTE_PATH_OF_bubogpt_7b in configs/capsp_infer.yaml, download the pretrained bubogpt_7b checkpoint from the link.
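
A hedged sketch of the two edits (the sed patterns assume the keys appear verbatim in the YAML files; editing them by hand works just as well):

VICUNA_DIR=/path/to/vicuna-7b          # local Vicuna weights prepared per BuboGPT (path is illustrative)
BUBOGPT_CKPT=/path/to/bubogpt_7b.pth   # downloaded bubogpt_7b checkpoint (path is illustrative)
sed -i "s|llama_model:.*|llama_model: ${VICUNA_DIR}|" configs/capsp_train_gpt2.yaml
sed -i "s|ABSOLUTE_PATH_OF_bubogpt_7b|${BUBOGPT_CKPT}|" configs/capsp_infer.yaml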

Then, run the following command to train a speech captioner.

bash scripts/captioner_train.sh ${CUDA_ID} ${CUDA_NUM}

Inference

Run inference with the following command:

bash scripts/captioner_infer.sh ${CUDA_ID} ${CUDA_NUM}
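
Both scripts take the same two positional arguments; assuming these are the CUDA device id(s) and the number of GPUs, as the variable names suggest, a single-GPU run looks like:

bash scripts/captioner_train.sh 0 1
bash scripts/captioner_infer.sh 0 1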

Chat with LLM

To integrate captions into the dialogue history, chat with ChatGPT using the following command:

python ./chat_llm/conversation_chat.py \
    --id-convs ./input_egs/id_convs_eg.txt \
    --audio-asr-caption ./input_egs/audio_asr_caption_eg.txt \
    --save-root /path/to/savedir > ./logs/MMDD-2200pm.log

Alternatively, to send only the dialogue history (without captions) to ChatGPT:

python ./chat_llm/conversation_chat_text-only.py \
    --id-convs ./input_egs/id_convs_eg.txt \
    --audio-asr-caption ./input_egs/audio_asr_caption_eg.txt \
    --save-root /path/to/savedir > ./logs/MMDD-2200pm.log
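
Both chat scripts call the OpenAI API; assuming they read the standard OPENAI_API_KEY environment variable (check chat_llm/conversation_chat.py if your setup differs), export the key and create the directories that the commands above expect:

export OPENAI_API_KEY=...        # your own API key
mkdir -p ./logs /path/to/savedir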

MSMA Synthesizer

Data Preprocessing

  1. The training phase takes speech as input. To encode speech into discrete acoustic units, run the following command. The HuBERT model, specified by the DENSE_NAME argument in expresso_hubert_gen.sh, is downloaded automatically.

    bash ./scripts/expresso_hubert_gen.sh
  2. The inference phase takes text as input. To transform text into discrete acoustic units, we train a text-to-unit (T2U) model, which can be downloaded from GoogleDrive.

    Our trained spm (SentencePiece) model can also be downloaded from GoogleDrive; otherwise, train your own model with the command below (an illustrative encoding command follows this list):

    spm_train --input=$ALL_TEXT --model_prefix=spm_bpe_1k --vocab_size=1000 --character_coverage=1.0 --model_type=bpe

    Then, run the following command to transform text into units.

    bash ./scripts/t2u_infer.sh
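
If you need to apply the spm model outside of t2u_infer.sh (the script may already handle tokenization internally; treat this as an illustrative SentencePiece command, with file names assumed):

spm_encode --model=spm_bpe_1k.model --output_format=piece < input.txt > input.bpe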

Training

  1. Use the EXPRESSO, LJSpeech and VCTK datasets to pretrain a vocoder conditioned on emotion and speaker labels. Note: batch_size in $CONFIG_FILE is the per-GPU batch size multiplied by the number of GPUs (e.g., 32 for a per-GPU batch of 8 on 4 GPUs).

    CONFIG_FILE=./configs/synthesizer_pretrain_config.json
    OUTPUT_DIR=/path/to/save
    
    python -m torch.distributed.launch --nproc_per_node $GPUS --master_port=29502 \
            synthesizer/examples/pretrain/amp_train.py \
            --checkpoint_path $OUTPUT_DIR \
            --config $CONFIG_FILE \
            --training_epochs 2000 \
            --validation_interval 5000 \
            --checkpoint_interval 25000
  2. Use the EXPRESSO dataset to finetune the above vocoder, additionally conditioned on pitch, energy and speed labels. To finetune from the latest checkpoint, add --from-latest-ckpt to the following command.

    CONFIG_FILE=./configs/synthesizer_finetune_config.json
    OUTPUT_DIR=/path/to/save
    
    python -m torch.distributed.launch --nproc_per_node $GPUS \
            synthesizer/examples/mcond_expresso/amp_train.py \
            --checkpoint_path $OUTPUT_DIR \
            --config $CONFIG_FILE \
            --training_epochs 2000 \
            --validation_interval 5000 \
            --checkpoint_interval 25000 \
            # --from-latest-ckpt
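
Both launch commands expect GPUS to be set to the number of processes on the node; an illustrative single-node setup for four GPUs, consistent with the batch_size note above:

export CUDA_VISIBLE_DEVICES=0,1,2,3
GPUS=4
# With a per-GPU batch of 8, set "batch_size" to 32 in $CONFIG_FILE (8 x 4 GPUs).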

Inference

  1. Infer with the pretrained model.

    CUDA_ID=$1
    GPUS=$2
    
    INPUT_CODE_FILE=./input_egs/syntheizer_pretrain_val.txt
    ckpt=g_00400000
    CHECKPOINT_FILE=/path/of/ckptdir/${ckpt}
    OUTPUT_DIR=/path/to/savedir
    mkdir $OUTPUT_DIR
    
    CUDA_VISIBLE_DEVICES=$CUDA_ID python ./synthesizer/examples/pretrain/inference_example.py \
        --input_code_file $INPUT_CODE_FILE \
        --checkpoint_file $CHECKPOINT_FILE \
        --output_dir $OUTPUT_DIR \
        --num-gpu $GPUS
  2. Infer with the finetuned model.

    CUDA_ID=$1
    GPUS=$2
    
    INPUT_CODE_FILE=./input_egs/syntheizer_finetune_dev.txt
    ckpt=g_00200000
    CHECKPOINT_FILE=/path/of/ckptdir/${ckpt}
    OUTPUT_DIR=/path/to/savedir
    mkdir $OUTPUT_DIR
    
    CUDA_VISIBLE_DEVICES=$CUDA_ID python ./synthesizer/examples/mcond_expresso/inference_example.py \
        --input_code_file $INPUT_CODE_FILE \
        --checkpoint_file $CHECKPOINT_FILE \
        --output_dir $OUTPUT_DIR \
        --num-gpu $GPUS \
        --dur-prediction 
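
Each block above is written as a script body (CUDA_ID=$1, GPUS=$2); saved as a standalone file (the name here is hypothetical), it is invoked the same way as the captioner scripts:

bash my_synthesizer_infer.sh 0 1   # GPU id 0, one GPU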

Author

Haoqiu Yan*, Yongxin Zhu*, Kai Zheng, Bing Liu, Haoyu Cao, Deqiang Jiang and Linli Xu†. (*Equal Contribution, †Corresponding Author)

How to cite

If you use the code or models from this project in your research, please cite our work as follows:

@article{yan2024talk,
  title={Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction},
  author={Yan, Haoqiu and Zhu, Yongxin and Zheng, Kai and Liu, Bing and Cao, Haoyu and Jiang, Deqiang and Xu, Linli},
  journal={arXiv preprint arXiv:2406.12707},
  year={2024}
}

License

PerceptiveAgent is distributed under the Apache License.

Acknowledgment

This repo is developed based on the following repos:
