Analogy Tools

This repository is a collection of resources for word analogy and lexical relation research.

Analogy Questions Dataset: link, huggingface dataset
Lexical Relation Dataset: link, huggingface dataset
RELATIVE embedding model:
- GoogleNews-vectors-negative300 based model. link
- wiki-news-300d-1M based model. link
- glove.840B.300d based model. link
Word embedding model:
- Largest GloVe embedding model shared by Stanford, converted to gensim format. link

Aliases of the released resources provided by third parties:

LICENSE

All resources are released under CC-BY-NC-4.0. They are freely available for academic purposes or individual research, but commercial use is restricted.

Analogy Questions Dataset

We release five word analogy datasets at the following links:

The first file contains the dataset, while the second file contains model predictions from PMI and several word embedding models. Each contains jsonline files for validation and test, in which each line is a dictionary of the following form:

{"stem": ["raphael", "painter"],
 "answer": 2,
 "choice": [["andersen", "plato"],
            ["reading", "berkshire"],
            ["marx", "philosopher"],
            ["tolstoi", "edison"]]}

where stem is the query word pair, choice is the list of candidate word pairs, and answer is the index of the correct candidate, counting from 0. Data statistics are summarized below.

Dataset	Size (valid/test)	Number of choices	Number of relation groups	Original reference
sat	37/337	5	2	Turney (2005)
u2	24/228	5,4,3	9	EnglishForEveryone
u4	48/432	5,4,3	5	EnglishForEveryone
google	50/500	4	2	Mikolov et al. (2013)
bats	199/1799	4	3	Gladkova et al. (2016)
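
As a rough illustration of the record format described above, the sketch below parses jsonline records and scores a predictor against the answer index. The helper names (load_analogy_questions, accuracy, always_first) are made up for this example and are not part of the repository's scripts. The single sample record is the one shown earlier in this section.

```python
import json


def load_analogy_questions(lines):
    """Parse jsonline records (one JSON dictionary per line), skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def accuracy(questions, predict):
    """Fraction of questions where predict(stem, choice) returns the gold index."""
    if not questions:
        return 0.0
    correct = sum(predict(q["stem"], q["choice"]) == q["answer"] for q in questions)
    return correct / len(questions)


# One record, copied from the example in this section.
sample_lines = [
    '{"stem": ["raphael", "painter"], "answer": 2, '
    '"choice": [["andersen", "plato"], ["reading", "berkshire"], '
    '["marx", "philosopher"], ["tolstoi", "edison"]]}'
]
questions = load_analogy_questions(sample_lines)

# "answer" is 0-based, so index 2 selects the third candidate.
gold = questions[0]["choice"][questions[0]["answer"]]


def always_first(stem, choice):
    """Trivial hypothetical baseline that always picks the first candidate."""
    return 0


print(gold, accuracy(questions, always_first))
```

With a real file, the same loader can be fed an open file handle, since iterating over a text file yields one line at a time.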

All data is lowercased except for the Google dataset. The model predictions stored in the dataset can be reproduced with the following script.

python analogy_test.py

When a word embedding model hits an out-of-vocabulary error, we fall back to the PMI prediction, so that the baseline covers every data point and can be compared with other methods.
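
The fallback can be pictured with the minimal sketch below. It makes several assumptions that are not confirmed by analogy_test.py: candidates are scored by cosine similarity between offset vectors (head minus tail), a plain dict stands in for the embedding model, and the whole question falls back as soon as any word is missing. The toy vectors and the fallback index are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def offset(vectors, pair):
    """Offset vector head - tail; raises KeyError if either word is out of vocabulary."""
    head, tail = pair
    return [a - b for a, b in zip(vectors[head], vectors[tail])]


def predict_with_fallback(vectors, question, fallback_prediction):
    """Pick the candidate whose offset is most similar to the stem's offset.

    If any word is missing from the vocabulary, return fallback_prediction
    (for example, the stored PMI prediction for that question).
    """
    try:
        stem = offset(vectors, question["stem"])
        scores = [cosine(stem, offset(vectors, c)) for c in question["choice"]]
    except KeyError:
        return fallback_prediction
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2-dimensional vectors, invented for this example.
toy_vectors = {
    "raphael": [1.0, 0.0], "painter": [0.0, 1.0],
    "marx": [2.0, 0.0], "philosopher": [0.0, 2.0],
    "reading": [0.0, 1.0], "berkshire": [1.0, 0.0],
}
covered = {"stem": ["raphael", "painter"],
           "choice": [["reading", "berkshire"], ["marx", "philosopher"]]}
not_covered = {"stem": ["raphael", "painter"],
               "choice": [["tolstoi", "edison"], ["marx", "philosopher"]]}

embedding_prediction = predict_with_fallback(toy_vectors, covered, fallback_prediction=0)
fallback_used = predict_with_fallback(toy_vectors, not_covered, fallback_prediction=0)
print(embedding_prediction, fallback_used)
```

Returning a prediction for every question, even when the embedding model cannot score it, keeps the denominator of the accuracy identical across methods.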

Please read our paper for more information about the dataset and cite it if you use the dataset:

@inproceedings{ushio-etal-2021-bert,
    title = "{BERT} is to {NLP} what {A}lex{N}et is to {CV}: Can Pre-Trained Language Models Identify Analogies?",
    author = "Ushio, Asahi  and
      Espinosa Anke, Luis  and
      Schockaert, Steven  and
      Camacho-Collados, Jose",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.280",
    doi = "10.18653/v1/2021.acl-long.280",
    pages = "3609--3624",
    abstract = "Analogies play a central role in human commonsense reasoning. The ability to recognize analogies such as {``}eye is to seeing what ear is to hearing{''}, sometimes referred to as analogical proportions, shape how we structure knowledge and understand language. Surprisingly, however, the task of identifying such analogies has not yet received much attention in the language model era. In this paper, we analyze the capabilities of transformer-based language models on this unsupervised task, using benchmarks obtained from educational settings, as well as more commonly used datasets. We find that off-the-shelf language models can identify analogies to a certain extent, but struggle with abstract and complex relations, and results are highly sensitive to model architecture and hyperparameters. Overall the best results were obtained with GPT-2 and RoBERTa, while configurations using BERT were not able to outperform word embedding models. Our results raise important questions for future work about how, and to what extent, pre-trained language models capture knowledge about abstract semantic relations.",
}

Lexical Relation Dataset

These are five datasets for lexical relation classification used in SphereRE: BLESS, CogALexV, EVALution, K&H+N, and ROOT09. Each of them has a test.tsv and a train.tsv. Each line of a tsv file gives the relation type for a pair of words A and B.

A   B   relation_type

For a more detailed discussion, please take a look at the SphereRE paper.

To get the word embedding baseline, run:
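
A minimal sketch of reading the three-column format shown above is given below. It assumes tab-separated lines without a header row, which should be checked against the actual files. The function name read_relation_tsv and the sample rows and labels are invented for illustration.

```python
import csv
import io
from collections import Counter


def read_relation_tsv(handle):
    """Yield (word_a, word_b, relation_type) triples from tab-separated lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) == 3:  # skip blank or malformed lines
            yield tuple(row)


# Invented rows, used here in place of a real train.tsv.
sample = io.StringIO(
    "cat\tanimal\thypernym\n"
    "wheel\tcar\tmeronym\n"
    "dog\tanimal\thypernym\n"
)
triples = list(read_relation_tsv(sample))

# Label distribution, useful for checking class balance.
label_counts = Counter(label for _, _, label in triples)
print(triples[0], dict(label_counts))
```

For a file on disk, the handle would come from open(path, newline="", encoding="utf-8"), where path points to one of the train.tsv or test.tsv files.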

python lexical_relation.py

When the model hits an out-of-vocabulary error during evaluation, we predict the most frequent label in the training data, so that the baseline covers every data point and can be compared with other methods.
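
The majority-label fallback can be sketched as follows. This is an illustration under stated assumptions, not the code in lexical_relation.py: the pair features are assumed to be the concatenation of the two word vectors, a plain dict stands in for the embedding model, and toy_classify is a stand-in for a trained classifier. All labels and vectors are invented.

```python
from collections import Counter


def most_frequent_label(train_labels):
    """Return the label that occurs most often in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def predict_relation(vectors, classify, word_a, word_b, default_label):
    """Classify a word pair, or return default_label if either word is out of vocabulary."""
    if word_a not in vectors or word_b not in vectors:
        return default_label
    # Assumed feature construction: concatenate the two word vectors.
    features = list(vectors[word_a]) + list(vectors[word_b])
    return classify(features)


train_labels = ["hypernym", "random", "random", "meronym"]
default_label = most_frequent_label(train_labels)

toy_vectors = {"cat": [1.0, 0.0], "animal": [0.9, 0.1]}


def toy_classify(features):
    """Hypothetical stand-in for a trained classifier."""
    return "hypernym"


in_vocab = predict_relation(toy_vectors, toy_classify, "cat", "animal", default_label)
out_of_vocab = predict_relation(toy_vectors, toy_classify, "cat", "axolotl", default_label)
print(in_vocab, out_of_vocab)
```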

RELATIVE Embedding

RELATIVE embedding models extract relation embeddings from an anchor word embedding model by aggregating the words that co-occur between the two words of a pair in a large corpus. We present three models, each corresponding to a major public pretrained word embedding model: GoogleNews-vectors-negative300, wiki-news-300d-1M, and glove.840B.300d. The binary files can be loaded with gensim:

In [1]: from gensim.models import KeyedVectors
In [2]: relative_model = KeyedVectors.load_word2vec_format('relative_init.glove.bin', binary=True)
In [3]: relative_model['paris__france']
Out[3]:
array([-1.16878878e-02, ... 7.91083463e-03], dtype=float32)  # 300 dim array

Note that the two words of a pair are joined by __ and the whole vocabulary is uncased. Multi-token words should be joined by _, as in new_york__tokyo for the relation between New York and Tokyo.

To reproduce the RELATIVE model, run the following script.
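
The naming convention above can be captured in a small helper. The function relative_key is hypothetical and not part of the repository. It lowercases each word and joins multi-token words with a single underscore before joining the pair with a double underscore.

```python
def relative_key(head, tail):
    """Build a lookup key such as 'new_york__tokyo' from two (possibly multi-token) words."""

    def normalize(word):
        # Lowercase, then join whitespace-separated tokens with a single underscore.
        return "_".join(word.lower().split())

    return f"{normalize(head)}__{normalize(tail)}"


print(relative_key("Paris", "France"))
print(relative_key("New York", "Tokyo"))
```

The resulting string can then be used as the key when indexing the loaded model. Pairs that are missing from the vocabulary would still need separate handling.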

python calculate_relative_embedding.py
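
To give an intuition for the aggregation described at the start of this section, here is a deliberately simplified sketch. It is not the procedure in calculate_relative_embedding.py or the official implementation: it takes an unweighted average of the anchor vectors of the words found between the two words of a pair in each sentence, then averages over sentences. The function names, the toy corpus, and the toy vectors are all invented.

```python
def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def relation_vector(head, tail, sentences, vectors):
    """Average the vectors of words occurring between head and tail across sentences.

    Returns None if the pair never co-occurs with a known word in between.
    """
    contexts = []
    for tokens in sentences:
        if head not in tokens or tail not in tokens:
            continue
        # Use the first occurrence of each word and take the span between them.
        start, end = sorted((tokens.index(head), tokens.index(tail)))
        between = [vectors[t] for t in tokens[start + 1:end] if t in vectors]
        if between:
            contexts.append(mean_vector(between))
    return mean_vector(contexts) if contexts else None


# Toy corpus and 2-dimensional anchor vectors, invented for this example.
toy_sentences = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["tokyo", "is", "in", "japan"],
]
toy_vectors = {"is": [1.0, 0.0], "capital": [0.0, 1.0]}

paris_france = relation_vector("paris", "france", toy_sentences, toy_vectors)
paris_japan = relation_vector("paris", "japan", toy_sentences, toy_vectors)
print(paris_france, paris_japan)
```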

Please refer to the official implementation and the paper for further information about RELATIVE embedding.
