Hierarchical Embeddings for Hypernymy Detection and Directionality
Requirements:
- spaCy (version 2.0.11), for parsing
- a plain-text corpus, e.g. a Wikipedia dump
Create the feature files:
python create_features.py -input corpus-file.txt -output output-file-name -pos pos_tag
where pos_tag is either NN (for the noun features) or VB (for the verb features).
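As an illustration of this preprocessing step, the sketch below parses a corpus with spaCy and collects lemmatized context words for every noun or verb. The function name, window size, and exact feature definition are assumptions for illustration, not the actual logic of create_features.py.

    import spacy

    # Illustrative sketch only; the real extraction is in create_features.py.
    nlp = spacy.load("en_core_web_sm")

    def extract_features(corpus_path, pos_tag, window=5):
        """Collect lemmatized context words for each token whose coarse
        POS matches pos_tag ('NN' -> NOUN, 'VB' -> VERB)."""
        target_pos = {"NN": "NOUN", "VB": "VERB"}[pos_tag]
        features = {}
        with open(corpus_path, encoding="utf-8") as f:
            for line in f:
                doc = nlp(line.strip())
                for tok in doc:
                    if tok.pos_ == target_pos:
                        lo, hi = max(0, tok.i - window), tok.i + window + 1
                        ctx = [t.lemma_ for t in doc[lo:hi]
                               if t.i != tok.i and t.is_alpha]
                        features.setdefault(tok.lemma_, []).extend(ctx)
        return features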
See config.cfg to set the arguments for the model, then train the embeddings:
java -jar HyperVec.jar config.cfg vector-size window-size
For example, to train embeddings with 100 dimensions and a window size of 5:
java -jar HyperVec.jar config.cfg 100 5
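For orientation, a config.cfg might contain entries like the hypothetical sketch below. All key names here are made up for illustration; the authoritative arguments are those documented in the config.cfg shipped with the repository.

    # Hypothetical example; consult the repository's config.cfg for the real keys.
    corpus_file = corpus-file.txt
    noun_features = output-file-name.nn
    verb_features = output-file-name.vb
    output_vectors = hypervec.txt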
The embeddings used in our paper can be downloaded with the script get-pretrainedHyperVecEmbeddings/download_embeddings.sh. Note that the script downloads 9 files and concatenates them into a single file (hypervec.txt.gz). The format is the default word2vec text format: the first line contains header information (vocabulary size and vector dimensionality), and every following line contains a word followed by its whitespace-separated vector.
Information about the embeddings: created from the ENCOW14A corpus (14.5 billion tokens), 100 dimensions, symmetric window of 5, 15 negative samples, learning rate 0.025, threshold set to 0.05. The resulting vocabulary contains about 2.7 million words.
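Because the file is plain word2vec text, it can be loaded without special tooling. The loader below is a minimal sketch following the format described above (the function name is ours):

    import gzip
    import numpy as np

    def load_embeddings(path):
        """Read word2vec text format: a "vocab_size dim" header line,
        then one word plus a whitespace-separated vector per line."""
        vectors = {}
        with gzip.open(path, "rt", encoding="utf-8") as f:
            vocab_size, dim = map(int, f.readline().split())
            for line in f:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors

    embeddings = load_embeddings("hypervec.txt.gz")

Alternatively, gensim's KeyedVectors.load_word2vec_format("hypervec.txt.gz", binary=False) reads the same format directly.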
To reproduce our experiments from Table 3, use the code in datasets_classification/, assuming your vector file is located in the same folder and is named hypervec.txt.gz.
java -jar eval-dir.jar hypervec.txt.gz
(Evaluate directionality on BLESS.txt using hyperscore)
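The authoritative scoring is inside eval-dir.jar; purely as an illustration, one plausible hyperscore in the spirit of the model (HyperVec is trained so that hypernyms obtain larger vector norms than their hyponyms) is sketched below.

    import numpy as np

    def hyperscore(hypo_vec, hyper_vec):
        """Illustrative score (an assumption, not the jar's exact measure):
        cosine similarity scaled by the hypernym/hyponym norm ratio, so the
        score is larger in the hyponym -> hypernym direction."""
        cos = np.dot(hypo_vec, hyper_vec) / (
            np.linalg.norm(hypo_vec) * np.linalg.norm(hyper_vec))
        return cos * np.linalg.norm(hyper_vec) / np.linalg.norm(hypo_vec)

    # For a pair (x, y), predict "y is the hypernym" when
    # hyperscore(x, y) > hyperscore(y, x).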
java -jar eval-bless.jar hypervec.txt.gz 2 1000
(Evaluate classification on BIBLESS.txt and AWBLESS.txt, using 2% of the training data and 1000 random iterations)
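The exact protocol is implemented in eval-bless.jar; as a rough, assumed reading of "2% of the training data and 1000 random iterations", the sketch below repeatedly samples 2% of the labeled pairs to tune a score threshold, classifies the rest, and averages accuracy over the iterations.

    import random

    def evaluate(pairs, scores, iterations=1000, train_frac=0.02):
        """Hypothetical protocol sketch. pairs: list of (pair_id, label) with
        label 1 = hypernymy, 0 = other; scores: pair_id -> hyperscore."""
        accuracies = []
        for _ in range(iterations):
            shuffled = random.sample(pairs, len(pairs))
            k = max(1, int(train_frac * len(shuffled)))
            train, test = shuffled[:k], shuffled[k:]
            # Tune the threshold as the midpoint between the class means.
            pos = [scores[p] for p, y in train if y == 1] or [0.0]
            neg = [scores[p] for p, y in train if y == 0] or [0.0]
            thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
            correct = sum((scores[p] > thr) == (y == 1) for p, y in test)
            accuracies.append(correct / len(test))
        return sum(accuracies) / len(accuracies)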
If you use the code or the created feature norms, please cite our paper (BibTeX). The paper is available here: PDF; the poster from EMNLP is available here: Poster.