This is the code repository for our KDD'20 paper STEAM: Self-Supervised Taxonomy Expansion with Mini-Paths.
Dependencies:
- Python >= 3.6
- PyTorch >= 1.2
- tqdm
- SciPy
- NumPy
- transformers
Use run.sh from model/ to run the code.
Some key parameters:
{
"epochs": 20, // number of training epochs
"lr": 1e-3, // learning rate
"cudaid": 1, // id of gpu
"dropout": 0.4, // dropout rate
"hidden": 200, // number of hidden units
"weight_decay": 5e-4, // L2 regularization
"fp": "../data_environment_eurovoc_en_0.2", // file path of the data folder
"path_len": 3, // length of mini-path
"lambda1": 0.1, // weight of loss 1 (regularization for base classifiers)
"lambda2": 0.1, // weight of loss 2 (regularization for consistency)
"taxi_feature": 1, // whether to load lexico-syntactic embeddings
"load_gcn": 1 // whether to load GNN-propagated term embeddings
}
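The parameter block above uses // comments, which Python's standard json module rejects. If you want to keep such a block in a file and load it as a dict, the following is a minimal sketch. The helper name load_commented_json is hypothetical and not part of this repository, and how run.sh actually passes these values to the model is not covered here.

```python
import json
import re


def load_commented_json(text: str) -> dict:
    """Parse JSON text whose lines may end with `// ...` comments.

    Hypothetical helper for illustration only. It assumes that no string
    value contains whitespace followed by `//`, which holds for the
    parameters listed in this README.
    """
    cleaned_lines = []
    for line in text.splitlines():
        # Remove a comment that starts the line or follows whitespace.
        cleaned_lines.append(re.sub(r"(^|\s+)//.*$", "", line))
    return json.loads("\n".join(cleaned_lines))


EXAMPLE = """
{
  "epochs": 20,     // number of training epochs
  "lr": 1e-3,       // learning rate
  "path_len": 3,    // length of mini-path
  "fp": "../data_environment_eurovoc_en_0.2"  // data folder
}
"""

if __name__ == "__main__":
    params = load_commented_json(EXAMPLE)
    print(params["epochs"], params["lr"], params["path_len"], params["fp"])
```

Note that strict JSON also requires a comma after every entry except the last one.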
├── model/ - models, losses, and metrics
│ ├── model_fuse.py // main modules of STEAM
│ ├── layers_path.py // neural layers of STEAM
│ ├── run_fuse.sh // script to run the code
│ ├── utils_path.py // utility functions: loading training data and test data, sampling mini-paths
│ └── test_fuse.py // script for testing the model
├── data_science_wordnet_en_0.2/ - folder for science wordnet
│ ├── score_gnn.txt - scores for PGAT-propagated embeddings
│ ├── LD.txt, gene_diff.txt, nfd_norm.txt, LCS.txt, Contains.txt, Suffix.txt, Ends.txt - value matrices of term pairs for 7 lexico-syntactic patterns
│ ├── paths.json - dependency path information for all possible paths
│ ├── paths_index.json - the index information for all dependency paths
│ ├── taxo_path.json - all the paths from the training set of the seed taxonomy
│ └── taxo_node_info.json - all the term information in the seed taxonomy
├── data_environment_eurovoc_en_0.2/ - folder for environment eurovoc
│ └── same structure as the folder above
└── log_results/ - store results
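Before training on a new data folder, it can help to confirm that it contains the files listed in the tree above. The following is a small sketch using only the file names shown there. The function name missing_data_files is hypothetical, and the check only looks at file presence, not at file contents.

```python
from pathlib import Path

# File names taken from the directory tree in this README.
EXPECTED_FILES = [
    "score_gnn.txt",
    "LD.txt",
    "gene_diff.txt",
    "nfd_norm.txt",
    "LCS.txt",
    "Contains.txt",
    "Suffix.txt",
    "Ends.txt",
    "paths.json",
    "paths_index.json",
    "taxo_path.json",
    "taxo_node_info.json",
]


def missing_data_files(data_dir):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files("../data_science_wordnet_en_0.2")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files are present.")
```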
To build your own corpus, follow the steps below:
- For GNN-propagated embeddings:
  - Use model/bert_emb_extractor.py to obtain the BERT embeddings of terms.
  - Please follow the link of the TaxoExpan paper to generate the GNN-propagated embeddings for terms.
- For text corpus / contextual features:
  - To build everything from scratch, first download corpora such as Wikipedia, UMBC, and the 1 Billion Word Language Model Benchmark.
  - To preprocess the corpus, generate a vocabulary file and use the scripts based on LexNET. Please refer to the instructions here. This process may take several hours.
- For lexico-syntactic features:
  - Use model/gen_lexico_features.py to generate linguistic patterns based on the surface names of terms.
  - For term frequency patterns from TAXI, please refer to the instructions here.
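To give a rough idea of what surface-name patterns between two terms can look like, here is a minimal sketch that computes a few string features for a (child, parent) term pair. It is not the code in model/gen_lexico_features.py. The function names, the feature keys, and the exact definitions are assumptions made for illustration; in particular, LCS is interpreted here as the longest common substring, and the gene_diff and nfd_norm features are not covered.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            length = prev[j - 1] + 1 if ca == cb else 0
            curr.append(length)
            best = max(best, length)
        prev = curr
    return best


def surface_features(child: str, parent: str) -> dict:
    """Illustrative surface-name features for one term pair (assumed definitions)."""
    child, parent = child.lower(), parent.lower()
    return {
        "contains": float(parent in child),
        "ends": float(child.endswith(parent)),
        "edit_distance": edit_distance(child, parent),
        "lcs_len": longest_common_substring(child, parent),
    }


if __name__ == "__main__":
    print(surface_features("water pollution", "pollution"))
```

Features like these capture the common case where a child term's name extends its parent's name, as in "water pollution" and "pollution".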
TODO:
- Support more TensorBoard functions
- Use a fixed random seed
If you find this paper useful for your research, please cite it in your publication:
@inproceedings{yu2020steam,
title={STEAM: Self-Supervised Taxonomy Expansion with Mini-Paths},
author={Yu, Yue and Li, Yinghao and Shen, Jiaming and Feng, Hao and Sun, Jimeng and Zhang, Chao},
booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
publisher = {ACM},
year={2020}
}