Source code for the experiments of our ACL 2020 article.
Abstract: Entity linking (EL) is concerned with disambiguating entity mentions in a text against knowledge bases (KB). It is crucial in a considerable number of fields like humanities, technical writing and biomedical sciences to enrich texts with semantics and discover more knowledge. The use of EL in such domains requires handling noisy texts, low-resource settings and domain-specific KBs. Existing approaches are mostly inappropriate for this, as they depend on training data. However, in the above scenario, there exists hardly any annotated data, and it needs to be created from scratch. We therefore present a novel domain-agnostic Human-In-The-Loop annotation approach: we use recommenders that suggest potential concepts and adaptive candidate ranking, thereby speeding up the overall annotation process and making it less tedious for users. We evaluate our ranking approach in a simulation on difficult texts and show that it greatly outperforms a strong baseline in ranking accuracy. In a user study, the annotation speed improves by 35% compared to annotating without interactive support; users report that they strongly prefer our system.
- Contact person: Jan-Christoph Klie, ukp@mrklie.com
- UKP Lab: http://www.ukp.tu-darmstadt.de/
- TU Darmstadt: http://www.tu-darmstadt.de/
Drop me a line or report an issue if something is broken (and shouldn't be) or if you have any questions.
For license information, please see the LICENSE and README files.
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
This repository contains two projects: `data-converter`, a Java application for converting the data, and `linker`, a Python project which contains all relevant experiments.
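For orientation, here is a rough repository layout, reconstructed only from the paths mentioned in this README (other files and folders may exist as well):

```
data-converter/        # Java project: converts the 1641 and WWO raw data
  wwo/                 # place the raw WWO data here
fuseki/                # scripts for the Fuseki knowledge base server
linker/                # Python project: all experiments
  generated/           # converted corpora and knowledge bases
  gleipnir/converter/  # knowledge base generation scripts
  scripts/             # entry points: create_dataset.py, evaluate_ranking.py, simulation.py
  results/             # one output folder per run (${TIMESTAMP})
```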
In the `linker` folder, run `pip install -r requirements.txt`.
Extract the `zero_to_hero.zip` from here into `linker/generated`.
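Put together, a minimal setup sketch; the local path of the downloaded `zero_to_hero.zip` is a placeholder:

```bash
cd linker
pip install -r requirements.txt
# Extract the pregenerated data into linker/generated
unzip /path/to/zero_to_hero.zip -d generated/
```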
To evaluate the different rankers on the full train/dev/test split, change the models and datasets you want to evaluate and run `linker/scripts/evaluate_ranking.py`.
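A minimal invocation sketch, assuming the script is run from the repository root and that the model/dataset selection is edited inside the script itself:

```bash
# Edit the models and datasets to evaluate in the script first, then:
python linker/scripts/evaluate_ranking.py
```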
Change the models and datasets you want to evaluate, then run `linker/scripts/simulation.py`. The results can be found under `linker/results/${TIMESTAMP}`.
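As above, a sketch assuming invocation from the repository root:

```bash
python linker/scripts/simulation.py
# Each run writes its output into a timestamped folder:
ls linker/results/
```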
This is only needed if you want to recreate the 1641 and WWO datasets. In most cases, use the datasets provided.
TBD for licensing reasons.
The following steps describe how to convert the depositions NIF file to documents and how to generate the knowledge base. The data comes from here. We use DKPro for the NIF-to-text conversion. Please refer to the supplementary material for further explanations regarding preprocessing.
- Import the `pom.xml` in `data-converter` in your favorite Java IDE (I like IntelliJ IDEA). This should automatically download all dependencies.
- Run `de.tudarmstadt.ukp.gleipnir.depositions.App1641`.
- The corpus should be written to `linker/generated/depositions` and already be split.
- Run `linker/gleipnir/converter/depositions.py` to generate the knowledge base. The file is written to `linker/generated/depositions/kb/depositions_kb.ttl`.
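If you prefer the command line to an IDE, the steps above roughly correspond to the following sketch; the `mvn exec:java` invocation is an assumption (running `App1641` from the IDE works just as well):

```bash
cd data-converter
mvn compile exec:java -Dexec.mainClass="de.tudarmstadt.ukp.gleipnir.depositions.App1641"
cd ..
# Generate the knowledge base from the converted corpus
python linker/gleipnir/converter/depositions.py
```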
The following steps describe how to convert the WWO data to documents and how to generate the knowledge base. We use DKPro for the TEI-to-text conversion. Please refer to the supplementary material for further explanations regarding preprocessing.
- Copy the WWO data you obtained from the Women Writers Project to `data-converter/wwo`. It should look like the following:

  ```
  $ data-converter/wwo: ls -l
  README aggregate common-boilerplate.xml files personography.xml persons.diff schema words
  ```

- Run `de.tudarmstadt.ukp.gleipnir.wwo.AppWwo`.
- The corpus should be written to `linker/generated/wwo` and already be split.
- Run `linker/gleipnir/converter/wwo.py` to generate the WWO knowledge base from the personography. It will be written to `linker/generated/wwo/personography.ttl`.
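The corresponding command-line sketch, under the same assumptions as for the 1641 conversion above:

```bash
cd data-converter
mvn compile exec:java -Dexec.mainClass="de.tudarmstadt.ukp.gleipnir.wwo.AppWwo"
cd ..
# Build the knowledge base from the personography
python linker/gleipnir/converter/wwo.py
```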
For WWO and 1641, we use Fuseki as the knowledge base server. To set it up, follow these steps:
- Download Apache Jena and Fuseki 3.12.0 from the project page. The version is important.
- Build the search index by running `build_index.sh` in the `fuseki` folder.
- The knowledge base can then be started by running `run_fuseki.sh 1641` or `run_fuseki.sh wwo`.
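An example session, assuming the scripts are executable and run from within the `fuseki` folder:

```bash
cd fuseki
./build_index.sh       # build the search index once
./run_fuseki.sh 1641   # serve the 1641 knowledge base ...
# ./run_fuseki.sh wwo  # ... or the WWO one
```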
We precompute datasets and their features before running experiments. Adjust which datasets you want to run. We cache requests to knowledge bases to make this feasible and to avoid stressing the endpoint too much. If you run the wrong KB with a dataset, remove the cache folder. Start the knowledge base as described in Knowledge base setup, then create the datasets by running `linker/scripts/create_dataset.py`.
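A sketch of the overall order, assuming invocation from the repository root; the location of the cache folder is not specified here, so it is left as a placeholder:

```bash
# 1. Start the matching knowledge base (from the fuseki folder, see above)
(cd fuseki && ./run_fuseki.sh 1641) &
# 2. Precompute the datasets and their features
python linker/scripts/create_dataset.py
# If you ran a dataset against the wrong KB, remove the cache folder first:
# rm -r <cache folder>
```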
This project uses data from DBpedia for 1641. Please refer to their license when using the data generated by this project. All rights to the WWO data are held by the Women Writers Project.