wd2tantivy

A program for generating a tantivy index from a Wikidata dump.

Usage

Clone the repository and run the following command to install the package inside a virtual environment:

poetry install

wd2tantivy takes as input only the gzip-compressed (.gz) Wikidata truthy dump in N-Triples format. You can download it with the following command:

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.gz
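
For reference, a label triple in the truthy dump looks roughly like this (illustrative line, not taken from an actual dump):

<http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en .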

wd2tantivy uses spaCy to lemmatize the aliases, so you must first download the spaCy model you wish to use.

For example:

poetry run python -m spacy download en_core_web_lg
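
To see what lemmatization does to an alias, here is a minimal spaCy sketch (illustrative only; wd2tantivy performs the equivalent step internally):

import spacy

# Load the model downloaded above.
nlp = spacy.load("en_core_web_lg")

# Each alias token is reduced to its base form, so a query such as
# "cities" can match an alias containing "city".
doc = nlp("cities named after rivers")
print([token.lemma_ for token in doc])  # typically ['city', 'name', 'after', 'river']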

After downloading the dump and the model, you can generate the tantivy index with the following command:

pigz -dc latest-truthy.nt.gz | \
poetry run wd2tantivy --language "${LANGUAGE}" \
                      --spacy-model "${SPACY_MODEL}" \
                      --output "${OUTPUT_DIR}"

Here, ${LANGUAGE} is a BCP-47 language code (e.g., en for English) and ${SPACY_MODEL} is the name of the spaCy model downloaded earlier (e.g., en_core_web_lg).

The tantivy index will be written into ${OUTPUT_DIR}.

Each document in the index contains three stored and indexed fields (see the query sketch below):

  • qid (integer)
  • preferred name (NFC-normalized, UTF-8-encoded text)
  • alias (lemmatized, NFC-normalized, UTF-8-encoded text); this field can hold multiple values
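
As a quick sanity check, you can query the finished index with the tantivy Python bindings. This is a minimal sketch, assuming the index opens with tantivy-py's Index.open and that the fields are exposed under the names qid, name, and alias; verify both against the schema wd2tantivy actually writes:

import tantivy

# Open the index written to --output (hypothetical path).
index = tantivy.Index.open("./index")
searcher = index.searcher()

# Search the name and alias fields; the field names are assumptions.
query = index.parse_query("douglas adams", ["name", "alias"])
for score, address in searcher.search(query, limit=5).hits:
    doc = searcher.doc(address)
    print(score, doc.to_dict())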

Performance

wd2tantivy uses as many threads as there are logical CPU cores. On a dump from March 2023 containing ~100,000,000 nodes, indexing English takes ~5 hours with peak memory usage of ~70 GB on an AMD Ryzen Threadripper 3970X and an SSD.
