# Generating Data
If you want to generate the data files which ELEVANT needs, instead of downloading them, follow the instructions below. Data generation has to be done only once unless you want to update the generated data files to a more recent Wikidata or Wikipedia version.
Note that if you want to generate all data files yourself, you need to start the Docker container with the following command instead of the shorter command given in the Quick Start guide:
```
docker run -it -p 8000:8000 \
    -v <data_directory>:/data \
    -v $(pwd)/evaluation-results/:/home/evaluation-results \
    -v $(pwd)/benchmarks/:/home/benchmarks \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v $(pwd)/wikidata-types/:/home/wikidata-types \
    -e WIKIDATA_TYPES_PATH=$(pwd) elevant
```
Mounting the Docker socket is necessary so that a QLever Docker container can be started as a sibling of the ELEVANT Docker container rather than as a child.
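If you want to check that the socket mount works, one quick test (assuming the `docker` CLI is available inside the container) is to list the host's containers from within the ELEVANT container:

```
# Inside the ELEVANT container: because /var/run/docker.sock is mounted,
# this talks to the *host's* Docker daemon, so the ELEVANT container
# itself should appear in the output.
docker ps
```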
If, for some reason, you don't want to run the data generation within the Docker container, make sure to set the `DATA_DIR` variable in the Makefile to your `<data_directory>`. In the Docker container, `DATA_DIR` is automatically set to `/data/`, so you don't have to do anything.
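For example, if your data directory were `/home/user/elevant-data` (a hypothetical path), the corresponding line in the Makefile would look roughly like this:

```
# In the Makefile (only needed when running outside Docker).
# The path below is an example; use your actual <data_directory>.
DATA_DIR = /home/user/elevant-data
```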
NOTE: The following steps will overwrite existing Wikidata and Wikipedia mappings in your `<data_directory>`, so make sure this is what you want to do.
To generate the Wikidata mappings, run

```
make generate_wikidata_mappings
```
This will use the Wikidata SPARQL endpoint defined in the Makefile variable `WIKIDATA_SPARQL_ENDPOINT` and download the Wikidata mappings. It will then generate Python dbm databases from the Wikidata mappings for fast loading and reduced RAM usage. The database generation will take several hours. Finally, it will create two additional files from the downloaded Wikidata mappings which are only needed if you want to use coreference resolution in addition to entity linking.
See Wikidata Mappings for a description of files generated in this step.
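Since `WIKIDATA_SPARQL_ENDPOINT` is an ordinary Makefile variable, you should be able to override it for a single run on the command line instead of editing the Makefile; the endpoint URL below is just an illustration:

```
# Override the SPARQL endpoint for this invocation only (example URL).
make generate_wikidata_mappings \
    WIKIDATA_SPARQL_ENDPOINT=https://qlever.cs.uni-freiburg.de/api/wikidata
```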
To generate the Wikipedia mappings, run

```
make generate_wikipedia_mappings
```
This will download and extract the most recent Wikipedia dump, split it into training, development and test files, and generate the Wikipedia mappings. NOTE: This step needs the Wikidata mappings, so make sure you build or download them beforehand (see the sketch after this step).
See Wikipedia Mappings for a description of files generated in this step.
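If you are generating everything from scratch, one way to make this ordering explicit is to chain the two targets, so the Wikipedia step only runs once the Wikidata mappings have been built successfully:

```
# Build the Wikidata mappings first; the Wikipedia step depends on them.
make generate_wikidata_mappings && make generate_wikipedia_mappings
```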
To generate the entity-type mapping, run

```
make generate_entity_types_mapping
```
This will run the steps described in detail in `wikidata-types/README.md`. Roughly, it pulls the QLever Docker image from DockerHub, builds a QLever index with corrections from `wikidata-types/corrections.txt` and issues a query for all Wikidata entities and all their types from a given whitelist of types (`wikidata-types/types.txt`). The resulting file is moved to `<data_directory>/wikidata-mappings/entity-types.tsv`. The file is then transformed into a Python dbm database `<data_directory>/wikidata-mappings/qid_to_whitelist_types.db`, which can take several hours.
Building the entity-types mapping requires about 25 GB of RAM and 100 GB of disk space and assumes that there is a running QLever instance for Wikidata under the URL specified by the variable `API_WIKIDATA` in `wikidata-types/Makefile` (by default, this is set to https://qlever.cs.uni-freiburg.de/api/wikidata).
See Entity-Type Mapping for a description of the file generated in this step.
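If you want to query your own QLever instance instead of the public one, the variable can be pointed at it in `wikidata-types/Makefile`; the host and port below are placeholders, not values from the repository:

```
# In wikidata-types/Makefile; replace host and port with those of
# your own running QLever instance for Wikidata.
API_WIKIDATA = http://localhost:7001/api/wikidata
```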
You can free up some disk space after running the steps mentioned above by running

```
make cleanup
```
This will remove the `entity-types.tsv` file as well as those Wikidata mappings that have been transformed into database files.
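If you want to see how much space this frees, a simple before-and-after check (with `<data_directory>` standing in for your actual path) is:

```
du -sh <data_directory>   # disk usage before cleanup
make cleanup
du -sh <data_directory>   # disk usage after cleanup
```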