This is the official PyTorch implementation of the paper MTP-CLNN; please see the paper for more details.

New intent discovery aims to uncover novel intent categories from user utterances in order to expand the set of supported intent classes. It is a critical task for the development and service expansion of practical dialogue systems. Our method provides new solutions to two important research questions in new intent discovery: (1) how to learn semantic utterance representations, and (2) how to better cluster utterances. Specifically, we first propose a multi-task pre-training (MTP) strategy to leverage rich unlabeled data along with external labeled data for representation learning. Then, we design a new contrastive loss to exploit self-supervisory signals in unlabeled data for clustering.
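The neighbor-based contrastive idea can be sketched as follows. This is a minimal NumPy illustration, not the repository's implementation (see `utils/contrastive.py`); the function name, masking scheme, and default temperature here are assumptions for exposition only.

```python
import numpy as np

def neighbor_contrastive_loss(z, pos_mask, temperature=0.07):
    """Illustrative SupCon-style loss: each anchor is pulled toward its
    positives (e.g. nearest neighbors in the embedding space) and pushed
    away from every other utterance in the batch.

    z:        (N, D) array of L2-normalized utterance embeddings
    pos_mask: (N, N) binary array, pos_mask[i, j] = 1 if j is a positive of i
    """
    n = z.shape[0]
    sim = z @ z.T / temperature                       # pairwise cosine similarities
    logits_mask = 1.0 - np.eye(n)                     # drop self-pairs from the denominator
    log_denom = np.log((np.exp(sim) * logits_mask).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom                        # row-wise log-softmax (self excluded)
    pos_counts = np.maximum(pos_mask.sum(axis=1), 1)  # guard against anchors with no positives
    loss = -(pos_mask * log_prob).sum(axis=1) / pos_counts
    return loss.mean()
```

The loss is small when each anchor's positives are close to it in embedding space and large when the positives are far away, which is what drives the clustering stage.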
The data folder is already included in this repository.
The following environment has been verified to work; other versions may work as well.
- python==3.8
- pytorch==1.10.0
- transformers==4.15.0
- faiss-gpu==1.7.2
- numpy
- pandas
- scikit-learn
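One possible way to set up this environment is sketched below. This is an assumption, not an official install script: package sources and the CUDA build of `faiss-gpu` should be adjusted to your system.

```shell
# Hypothetical setup sketch, assuming conda is available
conda create -n mtp-clnn python=3.8 -y
conda activate mtp-clnn
pip install torch==1.10.0 transformers==4.15.0 numpy pandas scikit-learn
# faiss-gpu is commonly installed from the pytorch conda channel
conda install -c pytorch faiss-gpu=1.7.2 -y
```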
External pre-training is conducted with IntentBert.
You can also download the pre-trained checkpoints from the following link and put them into a `pretrained_models` folder
in the root directory. The link includes the following models:
- IntentBert-banking
- IntentBert-mcid
- IntentBert-stackoverflow
Please organize the files as follows:
```
MTP-CLNN
├── README.md (this file)
├── data (each folder is a dataset)
│   ├── banking
│   ├── mcid
│   ├── stackoverflow
│   └── clinc
├── pretrained_models (external pre-trained models)
│   ├── banking
│   ├── mcid
│   └── stackoverflow
├── saved_models (saved trained CLNN models)
├── scripts (running scripts with hyper-parameters)
├── utils
│   ├── contrastive.py (objective function)
│   ├── memory.py (memory bank for loading neighbors)
│   ├── neighbor_dataset.py
│   └── tools.py
├── clnn.py
├── mtp.py
├── init_parameters.py (hyper-parameters)
├── model.py
└── dataloader.py
```
Run MTP-CLNN on any dataset as follows:

```
bash scripts/clnn_${DATASET_NAME}.sh ${GPU_ID}
```

Replace `${DATASET_NAME}` with the name of the dataset and `${GPU_ID}` with the ID of the GPU you want to run on.
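For example, assuming the script naming convention above, the following would run on the banking dataset with GPU 0 (the dataset name and GPU ID are illustrative):

```shell
# Hypothetical invocation: banking dataset on GPU 0
bash scripts/clnn_banking.sh 0
```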
If you find our work useful, please cite our paper:

```bibtex
@inproceedings{zhang-etal-2022-new,
    title = "New Intent Discovery with Pre-training and Contrastive Learning",
    author = "Zhang, Yuwei and
      Zhang, Haode and
      Zhan, Li-Ming and
      Wu, Xiao-Ming and
      Lam, Albert",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.21",
    pages = "256--269"
}
```
Some of the code was adapted from:
- https://github.com/thuiar/DeepAligned-Clustering
- https://github.com/wvangansbeke/Unsupervised-Classification
- https://github.com/HobbitLong/SupContrast
Yuwei Zhang (zhangyuwei.work@gmail.com)