A Python-based tool for generating a corpus of pseudo-English sentences with experimenter-controlled statistics.
This repository generates a corpus for training and evaluating distributional semantic models on a task that requires inferring a missing adjunct (e.g. instrument, location).
Install the repository into your virtual environment, e.g. with pip (`pip install .` from the repository root).
To use the corpus in your own project, see `load_corpus.py`.
Then, iterate over the generated sentences:

```python
for sentence in corpus.get_sentences():
    print(sentence)
```
Note: Use a different random seed to produce corpora that differ in their random draws from the uniform distributions over agents and themes. Generating multiple corpora in this way enables statistical hypothesis testing.
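To illustrate the role of the seed, here is a minimal, self-contained sketch of seeded sentence generation. The function `make_corpus`, the word lists, and the sentence template are hypothetical stand-ins for this repository's actual generator, not its API:

```python
import random

def make_corpus(seed, num_sentences=5):
    """Generate toy agent-verb-theme sentences, drawing agents and
    themes uniformly at random with a fixed seed (hypothetical
    stand-in for the repository's generator)."""
    rng = random.Random(seed)
    agents = ["cook", "farmer", "nurse", "pilot"]
    verbs = ["cut", "grow", "carry"]
    themes = ["bread", "corn", "crate", "fruit"]
    return [
        f"{rng.choice(agents)} {rng.choice(verbs)} {rng.choice(themes)} ."
        for _ in range(num_sentences)
    ]

# The same seed reproduces a corpus exactly; different seeds yield
# corpora that differ only in their random draws.
assert make_corpus(seed=1) == make_corpus(seed=1)
assert make_corpus(seed=1) != make_corpus(seed=2)
```

Because each corpus is fully determined by its seed, a set of seeds defines a set of exchangeable corpus samples over which test statistics can be computed.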
Developed on Ubuntu 18.04 and Python 3.7