Skip to content

[KDD 2020] This is the code repository for our KDD'20 paper STEAM: Self-Supervised Taxonomy Expansion with Mini-Paths.

Notifications You must be signed in to change notification settings

yueyu1030/STEAM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

STEAM

This is the code repository for our KDD'20 paper STEAM: Self-Supervised Taxonomy Expansion with Mini-Paths.

Requirements

  • Python >= 3.6
  • PyTorch >= 1.2
  • tqdm
  • Scipy
  • Numpy
  • transformers

Usage

Use run.sh from model/ to run the code. Some Key parameters:

{
  "epochs": 20,             // number of training epochs
  "lr": 1e-3,               // number of learning rate
  "cudaid": 1,              // id of gpu
  "dropout": 0.4,           // dropout rate
  "hidden": 200,            // number of hidden layers
  "weight_decay": 5e-4,     // L2 Regularization
  "fp": "../data_environment_eurovoc_en_0.2",     // file path
  "path_len": 3,            // length of mini-path
  "lambda1": 0.1,           // weight of loss 1 (regularization for base classifiers)
  "lambda2": 0.1,           // weight of loss 2 (regularization for consistency)
  "taxi_feature": 1         // whether to load lexico-syntactic embeddings
  "load_gcn": 1             // whether to load gnn-propogated term embeddings
}

Folder Structure

├── model/ - models, losses, and metrics
│   ├── model_fuse.py // main modules of STEAM
│   ├── layers_path.py // neural layers of STEAM
│   ├── run_fuse.sh // script to run the code
│   ├── utils_path.py // utility functions: loading train data, test data and sample mini-paths
│   └── test_fuse.py // script for testing the model
├── data_science_wordnet_en_0.2/ - folder for science wordnet
│   ├── score_gnn.txt - scores for PGAT propogated embeddings
│   ├── LD.txt, gene_diff.txt, nfd_norm.txt, LCS.txt, Contains.txt, Suffix.txt, Ends.txt  - value matrix of term pairs with 7 lexico-syntactic patterns 
│   ├── paths.json - dependency path information for all possible paths
│   ├── paths_index.json - the index information for all dependency paths
│   ├── taxo_path.json - all the paths from the training set of the seed taxonomy
│   ├── taxo_node_info.json - all the term information in the seed taxonomy
├── data_environment_eurovoc_en_0.2/ - folder for environment wordnet
│   └── structure similar to above one
└── log_results/ - store results

Processing Text Data on Your Own

The way to obtain your own corpus is described as follows

  • For GNN-propagated embeddings:
    • Use model/bert_emb_extractor.py to obtain the BERT Embeddings of terms.
    • Please follow the link of the paper TaxoExpan to generate the GNN-propagated embeddings for terms.
  • For text corpus / contextual features:
  • For Lexico-Syntactic Features:
    • Use model/gen_lexico_features.py to generate linguistic patterns based on surface name of terms.
    • For term frequency patterns from TAXI, please refer to the instructions here.

TODOs

  • Support more tensorboard functions
  • Using fixed random seed

Acknowledgements

If you find this paper useful for your research, please cite the following paper in your publication:

@inproceedings{yu2020steam,
  title={STEAM: Self-Supervised Taxonomy Expansion with Mini-Paths},
  author={Yu, Yue and Li, Yinghao and Shen, Jiaming and Feng, Hao and Sun, Jimeng and Zhang, Chao},
  booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  publisher = {ACM},
  year={2020}
}

Releases

No releases published

Packages

No packages published