This repository contains the code for our submission to the SMART 2022 Task, and builds upon the CLOCQ repository. It contains code for a post-hoc pruning module, that operates on the CLOCQ outputs and prunes noisy linkings, to improve precision on entity linking tasks.
In case of any questions, please let us know.
Please first clone and install the CLOCQ code CLOCQ repository for accessing the public CLOCQ API. Downloading any data is not required.
Clone the repo via:
git clone https://github.com/PhilippChr/CLOCQ-pruning-module.git
cd CLOCQ-pruning-module/
conda create --name clocq python=3.8
conda activate clocq
# install dependencies
git clone https://github.com/PhilippChr/CLOCQ.git
cd CLOCQ/
pip install -e .
cd ..
pip install transformers
Pytorch is required as well:
# install PyTorch without CUDA
conda install pytorch torchvision torchaudio -c pytorch
# install PyTorch for CUDA 10.2 (using GPU)
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
# install PyTorch for CUDA 11.3 (using GPU)
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
First download the SMART data and put it in ./data
.
You can use the following script, that also downloads the trained model and initializes required folders.
bash initialize.sh
Next, split the original train set in a train and dev split:
python prepare_smart_data.py
Then, run original CLOCQ code on the data to identify potential linkings. The parameters can be adjusted in the config.
python run_clocq.py <PATH_TO_INPUT> <PATH_TO_OUTPUT> [<PATH_TO_CONFIG>]
For example:
python run_clocq.py data/SMART2022-EL-wikidata-train-split.json results/SMART2022-EL-wikidata-train-split-clocq.json
python run_clocq.py data/SMART2022-EL-wikidata-dev-split.json results/SMART2022-EL-wikidata-dev-split-clocq.json
python run_clocq.py data/SMART2022-EL-wikidata-test.json results/SMART2022-EL-wikidata-test-split-clocq.json
Then, train the pruning module:
python pruning_module.py --train <PATH_TO_TRAIN> <PATH_TO_DEV> [<PATH_TO_CONFIG>]
For example:
python pruning_module.py --train results/SMART2022-EL-wikidata-train-split-clocq.json results/SMART2022-EL-wikidata-dev-split-clocq.json
Finally, run the pruning module via:
python pruning_module.py --inference <PATH_TO_INPUT> <PATH_TO_OUTPUT> [<PATH_TO_CONFIG>]
For example:
python pruning_module.py --inference results/SMART2022-EL-wikidata-test-split-clocq.json results/SMART2022-EL-wikidata-test-final-results.json
You can find the results in the specified output file (results/SMART2022-EL-wikidata-test-final-results.json
in the example).