Generating Data

If you want to generate the data files that ELEVANT needs yourself, instead of downloading them, follow the instructions below. Data generation only has to be done once, unless you want to update the generated data files to a more recent Wikidata or Wikipedia version.

Note that if you want to generate all data files yourself, you need to start the Docker container with the following command instead of the shorter command given in the Quick Start guide:

docker run -it -p 8000:8000 \
    -v <data_directory>:/data \
    -v $(pwd)/evaluation-results/:/home/evaluation-results \
    -v $(pwd)/benchmarks/:/home/benchmarks \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v $(pwd)/wikidata-types/:/home/wikidata-types \
    -e WIKIDATA_TYPES_PATH=$(pwd) elevant

Mounting the Docker socket is necessary so that a QLever Docker container can be started as a sibling, rather than a child, of the ELEVANT Docker container.
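
As a quick sanity check, you can run the following from inside the ELEVANT container (this assumes the Docker CLI is available there, which should be the case, since the container starts QLever containers this way):

docker ps

If the socket is mounted correctly, this lists the containers running on the host, including the ELEVANT container itself.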

If, for some reason, you don't want to run the data generation within the Docker container, make sure to set the DATA_DIR variable in the Makefile to your <data_directory>. Inside the Docker container, DATA_DIR is automatically set to /data/, so nothing needs to be adjusted there.
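
Instead of editing the Makefile, you can also override DATA_DIR (or other Makefile variables such as WIKIDATA_SPARQL_ENDPOINT) on the command line for a single invocation. This is standard Make behavior and works as long as the Makefile does not use the override directive:

make DATA_DIR=<data_directory>/ generate_wikidata_mappings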

NOTE: The following steps will overwrite existing Wikidata and Wikipedia mappings in your <data_directory>, so make sure this is what you want to do.
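
If you want to keep the current mappings, back them up first. The paths below are illustrative; adjust them to the subdirectories that actually exist in your <data_directory>:

cp -r <data_directory>/wikidata-mappings <data_directory>/wikidata-mappings.bak
cp -r <data_directory>/wikipedia-mappings <data_directory>/wikipedia-mappings.bak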

Generate Wikidata Mappings

To generate the Wikidata mappings, run

make generate_wikidata_mappings

This will use the Wikidata SPARQL endpoint defined in the Makefile variable WIKIDATA_SPARQL_ENDPOINT to download the Wikidata mappings. It will then generate Python dbm databases from the Wikidata mappings for fast loading and reduced RAM usage; the database generation alone takes several hours. Finally, it will create two additional files from the downloaded Wikidata mappings, which are only needed if you want to use coreference resolution in addition to entity linking.
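
Since the download and database generation take hours, it can be worth checking first that the configured endpoint is reachable. The following works against any SPARQL endpoint that accepts GET requests (QLever does); replace the placeholder with the value of WIKIDATA_SPARQL_ENDPOINT from the Makefile:

curl -s --get <wikidata_sparql_endpoint> --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 1"

A successful request returns a small result set instead of an error message.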

See Wikidata Mappings for a description of the files generated in this step.

Generate Wikipedia Mappings

To generate the Wikipedia mappings, run

make generate_wikipedia_mappings

This will download and extract the most recent Wikipedia dump, split it into training, development, and test files, and generate the Wikipedia mappings. NOTE: This step needs the Wikidata mappings, so make sure to generate or download them beforehand.
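
Because Make processes the goals it is given in order, both generation steps can be chained in a single command that respects this dependency:

make generate_wikidata_mappings generate_wikipedia_mappings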

See Wikipedia Mappings for a description of the files generated in this step.

Generate Entity-Type Mapping

To generate the entity-type mapping, run

make generate_entity_types_mapping

This will run the steps described in detail in wikidata-types/README.md. Roughly, it pulls the QLever Docker image from Docker Hub, builds a QLever index with corrections from wikidata-types/corrections.txt, and issues a query for all Wikidata entities and all their types from a given whitelist of types (wikidata-types/types.txt). The resulting file is moved to <data_directory>/wikidata-mappings/entity-types.tsv. The file is then transformed into a Python dbm database <data_directory>/wikidata-mappings/qid_to_whitelist_types.db, which can take several hours.

Building the entity-types mapping requires about 25 GB of RAM and 100 GB of disk space, and assumes that a QLever instance for Wikidata is running under the URL specified by the variable API_WIKIDATA in wikidata-types/Makefile (by default, this is set to https://qlever.cs.uni-freiburg.de/api/wikidata).
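
If you run your own QLever instance for Wikidata, you can either set API_WIKIDATA directly in wikidata-types/Makefile or, assuming the top-level target invokes the sub-Makefile via GNU Make's recursive $(MAKE) (in which case command-line variables propagate to the sub-make), override it for a single run:

make generate_entity_types_mapping API_WIKIDATA=<your_qlever_api_url>

If the override does not reach the sub-Makefile in your setup, editing wikidata-types/Makefile always works.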

See Entity-Type Mapping for a description of the file generated in this step.

Cleanup

You can free up some disk space after running the steps mentioned above by running

make cleanup

This will remove entity-types.tsv as well as those Wikidata mappings that have been transformed into database files.
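
To see how much disk space this actually frees, you can compare the size of the mappings directory before and after (a generic check, not specific to ELEVANT):

du -sh <data_directory>/wikidata-mappings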