Lexical Semantic Change Detection (LSCD) is a field of NLP that studies methods for automating the analysis of changes in word meanings over time. In recent years, this field has seen considerable development in terms of models, datasets and tasks [1], which has made it hard to keep a good overview of the field. Additionally, given the multitude of possible options for preprocessing, data cleaning, dataset versions, model parameter choices or tuning, clustering algorithms, and change measures, a shared testbed with a common evaluation setup is needed in order to reproduce experimental results precisely. Hence, we present a benchmark repository implementing evaluation procedures for models on most available LSCD datasets.
To get started, make sure that you have Python 3.10.0. After that, clone the repository, change into it, and create a new virtual environment:
```bash
git clone https://github.com/ChangeIsKey/LSCDBenchmark.git
cd LSCDBenchmark
virtualenv venv --python="/path/to/python/3.10.0"
source venv/bin/activate
pip install -r requirements.txt --find-links https://download.pytorch.org/whl/torch_stable.html
```
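As an optional sanity check (assuming the commands above succeeded), you can confirm the Python version and that the core dependencies import cleanly:

```bash
# Confirm the virtual environment runs Python 3.10
python --version

# Check that PyTorch and Hydra are importable from the environment
python -c "import torch, hydra; print(torch.__version__, hydra.__version__)"
```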
`LSCDBenchmark` relies heavily on Hydra for easily configuring experiments. By running `python main.py`, the tool will guide you through specifying some of its required parameters. The main parameters are:
- `dataset`
- `evaluation`
- `task`
From the shell, Hydra will ask you to provide values for each of these parameters and will offer a list of options. Once you select a value for each of them, you might have to input further, deeply nested required parameters. Since these commands can get quite verbose, consider defining a script if you constantly find yourself typing the same command (see the wrapper-script sketch further below).
An example using the dataset `dwug_de`, with the model `apd_compare_all`, BERT as a WiC model, and evaluation against graded change labels would be the following:
```bash
python main.py \
dataset=dwug_de_210 \
dataset/split=dev \
dataset/spelling_normalization=german \
dataset/preprocessing=raw \
task=lscd_graded \
task/lscd_graded@task.model=apd_compare_all \
task/wic@task.model.wic=contextual_embedder \
task/wic/metric@task.model.wic.similarity_metric=cosine \
task.model.wic.ckpt=bert-base-german-cased \
task.model.wic.gpu=0 \
'dataset.test_on=[abbauen,abdecken,"abgebrüht"]' \
evaluation=change_graded
```
Here, we chose `contextual_embedder` as a word-in-context (WiC) model. This model requires a `ckpt` parameter, which can be any model stored on the Huggingface Hub, like `bert-base-cased`, `bert-base-uncased`, `xlm-roberta-large`, or `dccuchile/bert-base-spanish-wwm-cased`. `contextual_embedder` also accepts a `gpu` parameter, which takes an integer representing the ID of a specific GPU (there might be multiple on a single machine).
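As suggested above, commands like this are worth wrapping in a script if you run them repeatedly. The following is a minimal sketch; the filename `run_dwug_de.sh` and the `CKPT`/`GPU` environment variables are illustrative and not part of the benchmark itself:

```bash
#!/usr/bin/env bash
# run_dwug_de.sh -- hypothetical wrapper around the example command above.
# Usage: CKPT=xlm-roberta-large GPU=1 ./run_dwug_de.sh
set -e

CKPT="${CKPT:-bert-base-german-cased}"   # any Huggingface Hub model ID
GPU="${GPU:-0}"                          # ID of the GPU to run on

python main.py \
  dataset=dwug_de_210 \
  dataset/split=dev \
  dataset/spelling_normalization=german \
  dataset/preprocessing=raw \
  task=lscd_graded \
  task/lscd_graded@task.model=apd_compare_all \
  task/wic@task.model.wic=contextual_embedder \
  task/wic/metric@task.model.wic.similarity_metric=cosine \
  task.model.wic.ckpt="$CKPT" \
  task.model.wic.gpu="$GPU" \
  'dataset.test_on=[abbauen,abdecken,"abgebrüht"]' \
  evaluation=change_graded
```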
If you don't want to evaluate a model, you can use tilde notation (`~`) to remove a certain required parameter. For example, to run the previous command without any evaluation, you can run the following:
```bash
python main.py \
dataset=dwug_de_210 \
dataset/split=dev \
dataset/spelling_normalization=german \
dataset/preprocessing=normalization \
task=lscd_graded \
task/lscd_graded@task.model=apd_compare_all \
task/wic@task.model.wic=contextual_embedder \
task/wic/metric@task.model.wic.similarity_metric=cosine \
task.model.wic.ckpt=bert-base-german-cased \
~evaluation
```
Note: you may have to adjust the CUDA version specified in `requirements.txt` to match the GPU and driver on your machine.
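A quick way to check which CUDA version your driver supports and whether PyTorch can see the GPU (a sketch; the exact version pin you need in `requirements.txt` depends on your machine):

```bash
# Show the installed driver and the CUDA version it supports
nvidia-smi

# After installation, verify that PyTorch was built against a matching CUDA version
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
```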
[1] Dominik Schlechtweg. 2023. Human and Computational Measurement of Lexical Semantic Change. PhD thesis. University of Stuttgart.