This repo implements a family of neural components for various hierarchical dialogue models described in "Observing Dialogue in Therapy: Categorizing and Forecasting Behavioral Codes" by Cao et al., ACL 2019.
@inproceedings{cao2019observing,
author = {Cao, Jie and Tanana, Michael and Imel, Zac E.
and Poitras, Eric and Atkins, David C and Srikumar, Vivek},
title = {Observing Dialogue in Therapy: Categorizing and Forecasting Behavioral Codes},
booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
Besides replicating the results on the psychotherapy dataset used in our paper, we also offer guidelines for building models with SOTA neural components for conversational analysis in other domains.
- Therapist-Observer
- Table of Contents
- Part I. Usage
- Part II. Experiment Designing
- Part VI. Usage for Other Dataset or Tasks
- Known Issues (To be moved to issues)
- Install pyenv or other python environment manager
In our case, we use pyenv and its plugin pyenv-virtualenv to set up the Python environment. Please follow the detailed steps in https://github.com/pyenv/pyenv-virtualenv. Alternative environment managers such as conda will also work.
- Install required packages
pyenv install 2.7.12
# in our default setting, we use `pyenv activate py2.7_tf1.4` to
# activate the environment; please change this according to your preference.
pyenv virtualenv 2.7.12 py2.7_tf1.4
pyenv activate py2.7_tf1.4
pip install tensorflow-gpu==1.4.0 spacy pandas ujson h5py sklearn matplotlib
- Checkout this project.
git clone git@github.com:utahnlp/therapist-observer.git therapist-observer
The tensorflow folder is the source code directory for the neural models.
The Expt folder manages the experiments: it includes all the commands (Expt/psyc_scripts/commands) and config files (Expt/psyc_scripts/configs) used to launch the experiments, and it stores all experiment outputs. In this repo, apart from Expt/psyc_scripts/commands/env.sh, which contains the global variables, all model hyperparameters and related configurations are assigned in the config files in Expt/psyc_scripts/configs, each of which corresponds to one model. For a detailed description of the folders in Expt, please refer to the Expt README file.
The preprocessing pipeline consists of 4 sub-steps, preceded by a data placement step:
0) Put the original data into Expt/data/psyc_ro/download/data_filename
- Data Transformation (trans.sh); check the path in trans.sh
- Dataset Split and Placement (place_data.sh)
- Tokenization (tok.sh)
- Extra Preprocessing (preprocess_dataset.sh)
The following command runs each of them in sequence to complete the preprocessing pipeline.
# the pipeline takes about 30 minutes to finish.
cd Expt/psyc-scripts/commands/
./pre_pipe.sh
When re-executing this, finished sub-tasks will be skipped because the corresponding output folder already exists. To rerun a sub-task, manually delete its output folder first.
For more details on preprocessing, please refer to the README of commands.
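The skip-on-rerun behaviour described above can be pictured with a short Python sketch. This is only an illustration of the idea, not the logic of pre_pipe.sh itself: the helper `run_step` and the folder name `tok` are hypothetical.

```python
import os
import tempfile


def run_step(name, output_dir, action):
    """Run `action` unless `output_dir` already exists (hypothetical helper).

    Returns True if the step ran and False if it was skipped.
    """
    if os.path.isdir(output_dir):
        print("skip %s: %s already exists" % (name, output_dir))
        return False
    os.makedirs(output_dir)
    action(output_dir)
    return True


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    tok_dir = os.path.join(root, "tok")  # hypothetical output folder
    run_step("tok", tok_dir, lambda d: None)  # first call runs the step
    run_step("tok", tok_dir, lambda d: None)  # second call is skipped
```

Deleting the output folder is what makes the step eligible to run again, which matches the manual deletion advice above.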
# download glove.840B.300d into $RO_DATA_DIR,
# WORD_EMB_FILE in each config files will point to the path of this downloaded file
./download_glove.sh
# download elmo weights and options file into $DATA_DIR/psyc_elmo
# ELMO_OPTION_FILE and ELMO_WEIGHT_FILE will point the downloaded elmo weights and options file
./download_elmo.sh
# prepare vocabulary and elmo for training
# generate the vocabulary embeddings in $VOCAB_DIR, as set in the corresponding config file,
# which can be used by any task with $CONTEXT_WINDOW = 8. Here, we take our selected model for categorizing client codes as an example.
./prepare.sh ../configs/categorizing/selected/C_C.sh
# Commands ending with "gpuid" take a second GPUID argument, which is used to set CUDA_VISIBLE_DEVICES.
# ./prepare_gpuid.sh ../configs/categorizing/selected/C_C.sh 1
The above commands mainly prepare the vocabulary and build ELMo embeddings for every sentence and every token. When ELMo is enabled, this step may take about 25 minutes and around 12G of GPU memory.
You only need to redo the preparation when you need to update the embedding, when you have retokenized the data (tok.sh), or when you want to build a vocabulary for a larger context window. Once $VOCAB_DIR is generated, this vocabulary can be used by other recipes by pointing $VOCAB_DIR to this vocab folder.
All the following embedding related configurations in the config file will impact the vocabulary preparation.
- WORD_EMB_FILE
By default, we use glove.840B.300d, which is the default value of WORD_EMB_FILE in our config files. To use another word embedding, change this configuration and run the preparation again.
- ELMO_OPTION_FILE, ELMO_WEIGHT_FILE
By default, these two variables point to the default location of the downloaded ELMo files. If you use a domain-specific ELMo or another pretrained ELMo, make sure to change these two variables in the config file and run the preparation again.
- CONTEXT_WINDOW
The window size can be changed by simply setting, for example, $CONTEXT_WINDOW=16. It is recommended to re-prepare the vocabulary whenever the window size changes, because when generating sliding-window dialogue segments, the words in the last $CONTEXT_WINDOW utterances of a dialogue may have a slight impact on word frequency.
For more details about the configuration, please refer to the README on configs.
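To see why the window size can shift word frequencies, here is a minimal sketch. It assumes one segment per utterance, made of that utterance plus up to `context_window - 1` preceding ones; the segmentation actually used in this repo may differ, and the function names are hypothetical.

```python
from collections import Counter


def sliding_segments(utterances, context_window):
    """Yield one segment per utterance: the utterance itself plus up to
    `context_window - 1` preceding utterances (an assumed scheme)."""
    for end in range(1, len(utterances) + 1):
        start = max(0, end - context_window)
        yield utterances[start:end]


def word_counts(utterances, context_window):
    """Count whitespace-separated tokens over all sliding-window segments."""
    counts = Counter()
    for segment in sliding_segments(utterances, context_window):
        for utterance in segment:
            counts.update(utterance.split())
    return counts


if __name__ == "__main__":
    dialogue = ["mm - hmm", "it 's just", "yeah"]  # toy tokenized utterances
    for window in (2, 3):
        print(window, dict(word_counts(dialogue, window)))
```

In this toy setup, utterances near the dialogue boundaries are repeated a different number of times depending on the window, so their tokens get different counts.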
# every training command takes a single argument, the config file
./train.sh <config_file>
# training from scratch; see `tensorflow/classes/config_reader.py` for details of each argument in config_file
# Again, we use the selected model for categorizing client codes as an example: ../configs/categorizing/selected/C_C.sh
# $CONFIG_DIR will be created; train.log shows the training progress
# $CONFIG_DIR/models/ will store the models and checkpoints, saved every $STEPS_PER_CHECKPINTS batches
./train.sh ../configs/categorizing/selected/C_C.sh
# Commands ending with "gpuid" take a second GPUID argument, which is used to set CUDA_VISIBLE_DEVICES.
./train_gpuid.sh ../configs/categorizing/selected/C_C.sh 1
It is worth mentioning that, during training, the best model with respect to each metric is saved in $CONFIG_DIR/models/. $CONFIG_DIR must be set in the model config file.
model prefix = $ALGO + sub_model_prefix.
$ALGO is just a name to identify your model; see tensorflow/classes/config_reader.py for more details.
$sub_model_prefix is related to the metrics we use for evaluation, and follows the pattern "_A_B":
# A can be in {P, R, F1, R@K}
# B can be in {macro, weighted_macro, micro} and all MISC labels.
Hence, sub_model_prefix can be _F1_macro, which is what we used for our performance evaluation.
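The naming rule can be written out as a small Python sketch. The function names and the example value `my_model` for $ALGO are hypothetical; only the `$ALGO + "_A_B"` pattern comes from the description above.

```python
def sub_model_prefix(metric, target):
    """Build the "_A_B" part, e.g. metric "F1" and target "macro"."""
    return "_%s_%s" % (metric, target)


def model_prefix(algo, metric, target):
    """Concatenate the $ALGO name and the sub-model prefix."""
    return algo + sub_model_prefix(metric, target)


if __name__ == "__main__":
    # "my_model" stands in for whatever $ALGO is set to in the config file.
    print(model_prefix("my_model", "F1", "macro"))
```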
The whole training takes around 20 hours on a V100 GPU. The following commands analyze train.log and print the current best performance.
# for analyzing the training log of Patient (client) models
python $ROOT_DIR/Expt/stats_scripts/stats_P.py train.log
# for analyzing the training log of Therapist models
python $ROOT_DIR/Expt/stats_scripts/stats_T.py train.log
# training from a saved checkpoint, matched by model file name with prefix $MODEL_PREFIX_TO_RESTORE
./train_restore.sh <config_file> sub_model_prefix
# The sub_model_prefix argument is optional; when it is not given, the saved model with the best loss will be loaded.
# However, the model with the smallest loss may not give the best performance. You can resume from the model that is best with respect to a given metric.
./train_restore.sh ../configs/categorizing/hlstm_8_p_semb_ru_elmo_pre1024_focal_rur_add_hs512_f1.sh _F1_macro
# For evaluating from a trained model, sub_model_prefix follows the same guide as train_restore.sh
./dev.sh <config_file> sub_model_prefix
# evaluate on the dev set with the saved model that is best with respect to macro F1.
./dev.sh ../configs/categorizing/selected/C_C.sh _F1_macro
# dev_on_test runs the same evaluation on the test set.
./dev_on_test.sh ../configs/categorizing/C_C.sh _F1_macro
These scripts can be invoked manually once the model to be restored has been saved in the model folder ($CONFIG_DIR/models/). After evaluation, a dev_{model_name}.log will be generated in the $CONFIG_DIR/training folder; results on the dev set are written to $CONFIG_DIR/results, and results on the test set to $CONFIG_DIR/results_on_test.
All selected recipes are in Expt/psyc-scripts/configs/categorizing/selected/ and Expt/psyc-scripts/configs/forecasting/selected/.
You can follow the steps above to cook each of them. It is worth mentioning that, if $VOCAB_DIR is already built, you can skip the preprocessing and preparing steps; only training and evaluation are required. If you would like to try a different tokenization or embedding, redo the pipeline from the corresponding step.
The two tasks in our paper are distinguished by the following configurations in the config file:
# the categorization task uses the last utterance (the response) to be labeled
# the forecasting task does not use the last utterance (the response) to be labeled
# `x` just means switched on; leave it empty to switch off
USE_RESPONSE_U=x
# We always use the speaker information for both context and response
USE_RESPONSE_S=x
# decode_goal in ['SPEAKER','ALL_LABEL','P_LABEL','T_LABEL','SEQ_TAG']
# use T_LABEL for therapist code only
DECODE_GOAL=T_LABEL
# use P_LABEL for patient code only
DECODE_GOAL=P_LABEL
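The effect of USE_RESPONSE_U on what the model gets to read can be sketched as follows. This is an assumption-laden illustration rather than the repo's input pipeline: `flag_is_on` and `select_input_utterances` are hypothetical helpers, and the flag convention ('x' for on, empty for off) is taken from the config comments above.

```python
def flag_is_on(value):
    """Config flags use 'x' (any non-empty value) for on, '' for off."""
    return value != ""


def select_input_utterances(context, response, use_response_u):
    """Return the tokenized utterances visible to the model for one example.

    context: list of tokenized utterance strings seen so far.
    response: the tokenized utterance whose code is to be predicted.
    use_response_u: the raw flag value, 'x' or ''.
    """
    if flag_is_on(use_response_u):
        # categorization: the response text itself is part of the input
        return list(context) + [response]
    # forecasting: only the preceding context is visible
    return list(context)


if __name__ == "__main__":
    ctx = ["mm - hmm"]
    print(select_input_utterances(ctx, "it 's just", "x"))
    print(select_input_utterances(ctx, "it 's just", ""))
```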
We report the performance of the selected models from our paper in the tables below. For a description of each configuration, please refer to the README for config files.
In the names of the selected models, the last character, 'C' or 'T', means client or therapist. The second-to-last character, 'C' or 'F', means the categorizing or forecasting task. The remaining part of the name is an id that distinguishes the different neural architectures. See the paper for more details.
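As a quick illustration of this naming convention, the sketch below splits a model name into its parts. It assumes the fields are separated by underscores, as in names such as GMGRU_H_C_T; the function `parse_model_name` is hypothetical, and baseline rows such as BiGRU_ELMo_C do not follow the pattern and are rejected.

```python
ROLES = {"C": "client", "T": "therapist"}
TASKS = {"C": "categorizing", "F": "forecasting"}


def parse_model_name(name):
    """Split e.g. "GMGRU_H_C_T" into (architecture id, task, role)."""
    parts = name.split("_")
    if len(parts) < 2 or parts[-1] not in ROLES or parts[-2] not in TASKS:
        raise ValueError("name does not end with <task>_<role>: %s" % name)
    return "_".join(parts[:-2]), TASKS[parts[-2]], ROLES[parts[-1]]


if __name__ == "__main__":
    for model in ("GMGRU_H_C_T", "CONCAT_C_C", "HGRU_F_C"):
        print(model, parse_model_name(model))
```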
For client, the best model does not need any word or utterance attention.
Method | macro | FN | CHANGE | SUSTAIN |
---|---|---|---|---|
Majority | 30.6 | 91.7 | 0.0 | 0.0 |
Xiao et al. (2016) | 50.0 | 87.9 | 32.8 | 29.3 |
BiGRU_generic_C | 50.2 | 87.0 | 35.2 | 28.4 |
BiGRU_ELMo_C | 52.9 | 87.6 | 39.2 | 32.0 |
Can et al. (2015) | 44.0 | 91.0 | 20.0 | 21.0 |
Tanana et al. (2016) | 48.3 | 89.0 | 29.0 | 27.0 |
CONCAT_C_C | 51.8 | 86.5 | 38.8 | 30.2 |
GMGRU_H_C_C | 52.6 | 89.5 | 37.1 | 31.1 |
BiDAF_H_C_C | 50.4 | 87.6 | 36.5 | 27.1 |
Our Best | 53.9 | 89.6 | 39.1 | 33.1 |
Change | +3.5 | -2.1 | +3.9 | +3.8 |
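
The macro column in this table is consistent with an unweighted mean of the per-label F1 scores (FN, CHANGE, SUSTAIN). The sketch below checks that reading against three rows of the table; the helper name `macro_f1` is hypothetical, and the scores are copied from the table rather than recomputed from predictions.

```python
def macro_f1(per_label_f1):
    """Unweighted mean of per-label F1 scores."""
    return sum(per_label_f1) / float(len(per_label_f1))


if __name__ == "__main__":
    # (FN, CHANGE, SUSTAIN) values copied from the client table
    rows = {
        "Majority": [91.7, 0.0, 0.0],
        "Xiao et al. (2016)": [87.9, 32.8, 29.3],
        "Our Best": [89.6, 39.1, 33.1],
    }
    for method, scores in rows.items():
        print(method, round(macro_f1(scores), 1))
```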
For the therapist, it uses GMGRUH for word attention and ANCHOR42 for utterance attention.
Method | macro | FA | RES | REC | GI | QUC | QUO | MIA | MIN |
---|---|---|---|---|---|---|---|---|---|
Majority | 5.87 | 47.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Xiao et al. (2016) | 59.3 | 94.7 | 50.2 | 48.3 | 71.9 | 68.7 | 80.1 | 54.0 | 6.5 |
BiGRU_generic_T | 60.2 | 94.5 | 50.5 | 49.3 | 72.0 | 70.7 | 80.1 | 54.0 | 10.8 |
BiGRU_ELMo_T | 62.6 | 94.5 | 51.6 | 49.4 | 70.7 | 72.1 | 80.8 | 57.2 | 24.2 |
Can et al. (2015) | - | 94.0 | 49.0 | 45.0 | 74.0 | 72.0 | 81.0 | - | - |
Tanana et al. (2016) | - | 94.0 | 48.0 | 39.0 | 69.0 | 68.0 | 77.0 | - | - |
CONCAT_C_T | 61.0 | 94.5 | 54.6 | 34.3 | 73.3 | 73.6 | 81.4 | 54.6 | 22.0 |
GMGRU_H_C_T | 64.9 | 94.9 | 56.0 | 54.4 | 75.5 | 75.7 | 83.0 | 58.2 | 21.8 |
BiDAF_H_C_T | 63.8 | 94.7 | 55.9 | 49.7 | 75.4 | 73.8 | 80.0 | 56.2 | 24.0 |
Our Best | 65.4 | 95.0 | 55.7 | 54.9 | 74.2 | 74.8 | 82.6 | 56.6 | 29.7 |
Change | +5.2 | +0.3 | +3.9 | +3.8 | +0.2 | +2.8 | +1.6 | +2.6 | +18.9 |
For both client and therapist, the best model uses no word attention, and uses SELF42 utterance attention.
Method | Dev CHANGE | Dev SUSTAIN | Test macro | Test FN | Test CHANGE | Test SUSTAIN |
---|---|---|---|---|---|---|
CONCAT_F_C | 20.4 | 30.2 | 43.6 | 84.4 | 23.0 | 23.5 |
HGRU_F_C | 19.9 | 31.2 | 44.4 | 85.7 | 24.9 | 22.5 |
GMGRU_H_F_C | 19.4 | 30.5 | 44.3 | 87.1 | 23.3 | 22.4 |
Forecast_C | 21.1 | 31.3 | 44.3 | 85.2 | 24.7 | 22.7 |
All scores are F1 scores, except for R@3.
Method | R@3 | macro | FA | RES | REC | GI | QUC | QUO | MIA | MIN |
---|---|---|---|---|---|---|---|---|---|---|
CONCAT_F_T | 72.5 | 23.5 | 63.5 | 0.6 | 0.0 | 53.7 | 27.0 | 15.0 | 18.2 | 9.0 |
HGRU_generic_F_T | 76.8 | 24.0 | 71.0 | 2.7 | 20.5 | 58.8 | 27.5 | 12.9 | 15.2 | 1.6 |
HGRU_F_T | 76.0 | 28.6 | 71.4 | 12.7 | 24.9 | 58.3 | 28.8 | 5.9 | 17.4 | 9.7 |
GMGRU_H_F_T | 76.6 | 26.6 | 72.6 | 10.2 | 20.6 | 58.8 | 27.4 | 6.0 | 8.9 | 7.9 |
Forecast_T | 77.0 | 31.1 | 71.9 | 19.5 | 24.7 | 59.2 | 29.1 | 16.4 | 15.2 | 12.8 |
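
For readers unfamiliar with R@3, here is a minimal sketch of recall at K, assuming the usual definition: the fraction of examples whose gold label appears among the K highest-ranked predictions. The function name and the toy rankings are hypothetical and are not taken from the evaluation code in this repo.

```python
def recall_at_k(gold_labels, ranked_predictions, k=3):
    """Fraction of examples whose gold label is among the top-k predictions.

    ranked_predictions: one list of labels per example, most probable first.
    """
    hits = sum(1 for gold, ranked in zip(gold_labels, ranked_predictions)
               if gold in ranked[:k])
    return hits / float(len(gold_labels))


if __name__ == "__main__":
    gold = ["GI", "FA", "QUO"]
    ranked = [
        ["FA", "GI", "REC", "RES"],   # gold GI is ranked 2nd: hit
        ["FA", "GI", "QUC"],          # gold FA is ranked 1st: hit
        ["GI", "FA", "QUC", "QUO"],   # gold QUO is ranked 4th: miss
    ]
    print(recall_at_k(gold, ranked, k=3))
```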
Preprocessing your own dataset into a DSTC-like conversational JSON format is the main work required before modeling.
[
{
"correct_seq_labels": [],
"options-for-correct-answers": [
{
"tokenized_utterance": "it 's just",
"codes": [
{
"origin_code": "GI",
"translated_code": "giving_info",
"coder_order": [
{
"order_id": 1,
"coder_id": "ms",
"cid": 72427
}
]
}
],
"uid": "(BAER_936)_31_5_T_49_51",
"agg_label": "giving_info",
"speaker": "T",
"snt_id": 9878
}
],
"example-id": "(BAER_936)_(T, 27, 3)-(T, 31, 51)",
"messages-so-far": [
{
"tokenized_utterance": "mm - hmm",
"codes": [
{
"origin_code": "FA",
"translated_code": "facilitate",
"coder_order": [
{
"order_id": 1,
"coder_id": "ms",
"cid": 72411
}
]
}
],
"uid": "(BAER_936)_27_9_T_3_4",
"agg_label": "facilitate",
"speaker": "T",
"snt_id": 5
},
...
],
"correct_labels": [
3
],
"pred_probs": [
{
"label_index": 2,
"label_name": "reflection_complex",
"prob": 0.2700542211532593
},
{
"label_index": 3,
"label_name": "reflection_simple",
"prob": 0.100542211532593
},
...
]
},
...
]
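A file in this format can be read with the standard `json` module. The sketch below is only a starting point: it uses a cut-down inline sample with the field names shown above (`messages-so-far`, `options-for-correct-answers`, `tokenized_utterance`, `speaker`, `agg_label`), and the helper `iter_examples` is hypothetical rather than part of this repo.

```python
import json

# A cut-down sample that keeps only the fields used below.
SAMPLE = """
[
  {
    "example-id": "demo",
    "messages-so-far": [
      {"tokenized_utterance": "mm - hmm", "speaker": "T", "agg_label": "facilitate"}
    ],
    "options-for-correct-answers": [
      {"tokenized_utterance": "it 's just", "speaker": "T", "agg_label": "giving_info"}
    ]
  }
]
"""


def iter_examples(json_text):
    """Yield (context, response) pairs from a DSTC-like JSON string.

    Each utterance is returned as a (speaker, text, label) tuple.
    """
    def as_tuple(utt):
        return (utt["speaker"], utt["tokenized_utterance"], utt["agg_label"])

    for example in json.loads(json_text):
        context = [as_tuple(u) for u in example["messages-so-far"]]
        response = as_tuple(example["options-for-correct-answers"][0])
        yield context, response


if __name__ == "__main__":
    for context, response in iter_examples(SAMPLE):
        print(len(context), response)
```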
Our current code base uses feed_dict-based TensorFlow inputs. In the future, we will upgrade it with newer TensorFlow features, such as Estimator and TensorFlow Serving.
Our code base allows users to build conversational baseline models without writing much TensorFlow code. For all supported model components, creating a customized config file is the only thing needed to build a model for your dataset.
- Domain Specific GloVe
- Domain Specific ELMo
- Known issues about spaCy with Python 2.7.5
See explosion/spaCy#3734; please use Python 2.7.12. Since Python 2 will be dropped in Jan 2020, we will try to test our code on Python 3 and publish a new repo for Python 3.