The source code for Weakly-Supervised Hierarchical Text Classification, published at AAAI 2019.
Before running, install the required packages:

```
$ pip3 install -r requirements.txt
```

Also, be sure to download the NLTK `punkt` tokenizer in Python:

```
import nltk
nltk.download('punkt')
```
The code can then be run with

```
python main.py --dataset ${dataset} --sup_source ${sup_source} --with_eval ${with_eval} --pseudo ${pseudo}
```

where you need to specify the dataset in `${dataset}`, the weak supervision type in `${sup_source}` (one of `['keywords', 'docs']`), the evaluation type in `${with_eval}`, and the pseudo-document generation method in `${pseudo}` (`bow` uses the bag-of-words method introduced in the CIKM paper; `lstm` uses the LSTM language model method introduced in the AAAI paper, which generates better-quality pseudo documents but takes much longer to run because an LSTM language model must be trained).
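For illustration, a concrete run might look like the following. The dataset name `nyt` is only a placeholder for one of the provided dataset directories, and the value of `--with_eval` should match the label files you supply (see the Inputs description below):

```
python main.py --dataset nyt --sup_source keywords --with_eval ${with_eval} --pseudo bow
```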
An example run is provided in `test.sh`, which can be executed by

```
./test.sh
```

More advanced training and hyperparameter settings are explained in the comments in `main.py`.
To run the algorithm, you need to provide the following files under the directory `./${dataset}` (illustrative snippets of the main input formats are shown after this list):

- A corpus (`dataset.txt`) that contains all documents to be classified. Each line in `dataset.txt` corresponds to one document.
- A class hierarchy (`label_hier.txt`) that indicates the parent-children relationships between classes (each class can have at most one parent class). The first class on each line is the parent class, followed by all of its children classes. Tab is used as the delimiter.
- Weak supervision sources (can be either of the following two) for each leaf class in the class hierarchy:
  - Class-related keywords (`keywords.txt`). You need to provide class-related keywords for each leaf class in `keywords.txt`, where each line begins with the class name (which must correspond to that in `label_hier.txt`), followed by a tab, and then the class-related keywords separated by spaces.
  - Labeled documents (`doc_id.txt`). You need to provide labeled document ids for each leaf class in `doc_id.txt`, where each line begins with the class name (which must correspond to that in `label_hier.txt`), followed by a tab, and then the document ids in the corpus (starting from `0`) of the corresponding class, separated by spaces.
- (Optional) Ground truth labels to be used for evaluation (the provided labels will not be used for training). You need to set the evaluation type argument (`--with_eval ${with_eval}`) accordingly:
  - If ground truth labels are available for all documents, put all document labels in `labels.txt`, where the `i`-th line is the class name (which must correspond to that in `label_hier.txt`) of the `i`-th document in `dataset.txt`.
  - If ground truth labels are available for some but not all documents, put the partial labels in `labels_sub.txt`, where each line begins with the class name (which must correspond to that in `label_hier.txt`), followed by a tab, and then the document ids in the corpus (starting from `0`) of the corresponding class, separated by spaces.
  - If no ground truth labels are available, no files are required.
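For illustration, the weak supervision inputs might look like the snippets below. The class names, keywords, and document ids are made up, and the delimiter between the class name and the rest of each line is a tab (rendered as whitespace here); refer to the provided dataset directories for real examples.

`label_hier.txt` (first class on each line is the parent, the rest are its children):

```
sports	hockey	tennis
politics	elections	immigration
```

`keywords.txt` (one line per leaf class):

```
hockey	puck goalie rink
tennis	racket serve grandslam
```

`doc_id.txt` (one line per leaf class; document ids are 0-indexed positions in `dataset.txt`):

```
hockey	0 3 5
tennis	1 2 4
```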
Examples are given under the three dataset directories.
The final results (document labels) will be written to `./${dataset}/out.txt`, where each line is the class label id for the corresponding document. Intermediate results (e.g. trained network weights, self-training logs) will be saved under `./results/${dataset}/${sup_source}/`.
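As a quick sanity check, the predictions can be read back with a few lines of Python (a minimal sketch; `nyt` is again a placeholder for your dataset directory):

```
# Read the predicted labels: one class label id per line, aligned with the line order of dataset.txt.
with open('./nyt/out.txt') as f:   # replace 'nyt' with your dataset directory
    predictions = [line.strip() for line in f]
print('Classified %d documents' % len(predictions))
```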
To execute the code on a new dataset, you need to:

- Create a directory named `${dataset}`.
- Prepare the input files (see the Inputs description above).
- Modify `main.py` to accept the new dataset: add `${dataset}` to argparse, and then specify the parameter settings (e.g. `update_interval`, `pretrain_epochs`) for the new dataset. A sketch of this kind of change is shown below.
You can always refer to the example datasets when adapting the code for a new dataset.
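As a rough illustration, the change to `main.py` might look like the sketch below. This is a minimal, hypothetical example: the actual argument parser in `main.py` has more options, the existing dataset names used here are placeholders, and the hyperparameter values shown are not tuned.

```
import argparse

parser = argparse.ArgumentParser()
# Add the new dataset name to the accepted choices (existing names below are placeholders).
parser.add_argument('--dataset', default='existing_dataset',
                    choices=['existing_dataset', 'my_new_dataset'])
parser.add_argument('--sup_source', default='keywords', choices=['keywords', 'docs'])
args = parser.parse_args()

# Per-dataset parameter settings; the numbers below are placeholders, not tuned values.
if args.dataset == 'my_new_dataset':
    update_interval = 50    # how often self-training targets are refreshed
    pretrain_epochs = 20    # number of epochs for pre-training
```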
Please cite the following papers if you find the code helpful for your research.
```
@inproceedings{meng2018weakly,
  title={Weakly-Supervised Neural Text Classification},
  author={Meng, Yu and Shen, Jiaming and Zhang, Chao and Han, Jiawei},
  booktitle={Proceedings of the 27th ACM International Conference on Information and Knowledge Management},
  pages={983--992},
  year={2018},
  organization={ACM}
}

@inproceedings{meng2019weakly,
  title={Weakly-Supervised Hierarchical Text Classification},
  author={Meng, Yu and Shen, Jiaming and Zhang, Chao and Han, Jiawei},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={33},
  pages={6826--6833},
  year={2019}
}
```