Check our latest topic modeling toolkit TopMost !
Code for Short Text Topic Modeling with Topic Distribution Quantization and Negative Sampling Decoder
EMNLP2020 | Presentation Video
requirements:
python==3.6
tensorflow-gpu==1.13.1
scipy==1.5.2
scikit-learn==0.23.2
Note: the data in the path ./data has been preprocessed with tokenization, filtering non-Latin characters, etc before.
python utils/preprocess.py --data_path data/stackoverflow --output_dir input/stackoverflow
python run_NQTM.py --data_dir input/stackoverflow --output_dir output
topic coherence: coherence score.
topic diversity:
python utils/TU.py --data_path output/top_words_T15_K50_1th
If you want to use our code, please cite as
@inproceedings{wu2020short,
title = "Short Text Topic Modeling with Topic Distribution Quantization and Negative Sampling Decoder",
author = "Wu, Xiaobao and
Li, Chunping and
Zhu, Yan and
Miao, Yishu",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.138",
doi = "10.18653/v1/2020.emnlp-main.138",
pages = "1772--1782"
}