A program for efficiently extracting the graph structure from a Wikidata truthy N-Triples dump.
You can install wd2graph by running the following command:
cargo install wd2graph
Of course, you can also build it from source.
wd2graph requires only the compressed (.gz) Wikidata truthy dump in the N-Triples format as input. You can download it with the following command:
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.gz
After downloading the dump, you can extract the graph data with the following command:
wd2graph --input latest-truthy.nt.gz \
--output-graph graph.parquet \
--output-nodes nodes.parquet
The outputs are written as zstd-compressed Apache Parquet files.
The file given as the --output-nodes argument contains a single column named qid (UInt32) filled with all of the QIDs.
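As an illustration of how a QID relates to a UInt32 value, the sketch below assumes that the qid column stores the number that follows the "Q" (so Q42 would be stored as 42). That mapping is an assumption, and the function names parse_qid and format_qid are hypothetical helpers, not part of wd2graph.

```rust
/// Parse an identifier such as "Q42" into its numeric part (42).
/// Returns None if the "Q" prefix is missing, the rest is not made of
/// ASCII digits only, or the number does not fit into a u32.
fn parse_qid(s: &str) -> Option<u32> {
    let digits = s.strip_prefix('Q')?;
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}

/// Turn a numeric value back into the usual identifier string.
fn format_qid(n: u32) -> String {
    format!("Q{n}")
}

fn main() {
    // Round trip under the assumed mapping between "Q42" and 42.
    let n = parse_qid("Q42").expect("valid QID");
    println!("{} <-> {}", format_qid(n), n);
}
```

With a mapping like this, a value read from the nodes file can be turned back into a Wikidata identifier by prepending "Q".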
The file given as the --output-graph argument contains three columns named lhs (UInt32), property (UInt32), and rhs (UInt32), filled with triples representing directed edges. lhs and rhs are the QIDs, while property is the PID.
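To make the shape of an edge concrete, here is a minimal sketch of how one line of the truthy N-Triples dump could be reduced to an (lhs, property, rhs) triple. It is not wd2graph's actual implementation. It assumes that only statements linking one item to another through a direct ("truthy") property become edges, and that everything else, such as labels and literal values, is skipped. The names numeric_id and parse_edge are hypothetical.

```rust
// URI prefixes used for items and truthy properties in the Wikidata RDF export.
const ENTITY_PREFIX: &str = "<http://www.wikidata.org/entity/Q";
const DIRECT_PROP_PREFIX: &str = "<http://www.wikidata.org/prop/direct/P";

/// Extract the numeric ID from a term such as
/// `<http://www.wikidata.org/entity/Q42>` given the expected prefix.
fn numeric_id(term: &str, prefix: &str) -> Option<u32> {
    let digits = term.strip_prefix(prefix)?.strip_suffix('>')?;
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}

/// Convert one N-Triples line into (lhs, property, rhs).
/// Returns None when the line is not an item-to-item statement,
/// for example when the object is a literal.
fn parse_edge(line: &str) -> Option<(u32, u32, u32)> {
    let mut parts = line.trim().splitn(3, ' ');
    let subject = parts.next()?;
    let predicate = parts.next()?;
    // The object is followed by the " ." that terminates the statement.
    let object = parts.next()?.trim_end().strip_suffix('.')?.trim_end();
    Some((
        numeric_id(subject, ENTITY_PREFIX)?,
        numeric_id(predicate, DIRECT_PROP_PREFIX)?,
        numeric_id(object, ENTITY_PREFIX)?,
    ))
}

fn main() {
    let line = "<http://www.wikidata.org/entity/Q42> \
                <http://www.wikidata.org/prop/direct/P31> \
                <http://www.wikidata.org/entity/Q5> .";
    // Q42 (lhs) --P31--> Q5 (rhs)
    println!("{:?}", parse_edge(line));
}
```

Read this way, each row of the graph file is one directed edge from lhs to rhs, labeled with the property that connects them.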
wd2graph uses a single thread. On a dump from March 2023, containing ~100,000,000 nodes and ~700,000,000 edges, it takes ~16 minutes to complete, with a peak memory usage of ~22 GB, on an AMD Ryzen Threadripper 3970X CPU and an SSD.
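For a rough sense of scale, the approximate figures above can be turned into back-of-the-envelope numbers. The sketch below computes only the raw, uncompressed size of the UInt32 columns and the average edge throughput. It says nothing about how wd2graph actually lays out data in memory or how large the compressed Parquet files are. The function names are hypothetical.

```rust
/// Raw size in bytes of `columns` UInt32 columns with `rows` rows each.
fn raw_column_bytes(rows: u64, columns: u64) -> u64 {
    rows * columns * std::mem::size_of::<u32>() as u64
}

/// Average number of edges processed per second (integer division).
fn edges_per_second(edges: u64, seconds: u64) -> u64 {
    edges / seconds
}

fn main() {
    let nodes: u64 = 100_000_000; // ~100,000,000 nodes
    let edges: u64 = 700_000_000; // ~700,000,000 edges
    let seconds: u64 = 16 * 60; // ~16 minutes

    // One qid column for the nodes, three columns (lhs, property, rhs) for the edges.
    println!("nodes, raw: {} bytes", raw_column_bytes(nodes, 1));
    println!("edges, raw: {} bytes", raw_column_bytes(edges, 3));
    println!("throughput: {} edges/s", edges_per_second(edges, seconds));
}
```

Under these approximations, the edge columns alone come to about 8.4 GB uncompressed and the node column to about 0.4 GB, with an average of roughly 730,000 edges per second.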