cyanic-selkie/wd2graph

wd2graph

A program for efficiently extracting the graph structure from a Wikidata truthy N-Triples dump.

Usage

You can install wd2graph by running the following command:

cargo install wd2graph

You can also build it from source.

wd2graph requires only the compressed (.gz) Wikidata truthy dump in the N-Triples format as input. You can download it with the following command:

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.gz

After downloading the dump, you can extract the graph data with the following command:

wd2graph --input latest-truthy.nt.gz \
         --output-graph graph.parquet \
         --output-nodes nodes.parquet

The outputs are written as zstd-compressed Apache Parquet files.

The file given as the --output-nodes argument contains a single column named qid (UInt32) filled with all of the QIDs.

The file given as the --output-graph argument contains three columns named lhs (UInt32), property (UInt32), and rhs (UInt32). Each row is a triple representing a directed edge: lhs and rhs are QIDs, while property is the PID.
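
To illustrate how a line of the dump relates to a row of the graph file, here is a minimal Python sketch. It is not wd2graph's implementation. It assumes that entity URIs in the truthy dump have the form http://www.wikidata.org/entity/Q..., that truthy properties have the form http://www.wikidata.org/prop/direct/P..., and that the UInt32 values hold the numeric part of each ID. The names EDGE_PATTERN, parse_edge, and iter_edges are hypothetical and exist only for this example.

```python
import gzip
import re
from typing import Iterator, Optional, Tuple

# Assumed URI layout of the truthy dump: entities under /entity/Q<id>,
# truthy ("direct") properties under /prop/direct/P<id>.
EDGE_PATTERN = re.compile(
    r"^<http://www\.wikidata\.org/entity/Q(\d+)> "
    r"<http://www\.wikidata\.org/prop/direct/P(\d+)> "
    r"<http://www\.wikidata\.org/entity/Q(\d+)> \.$"
)


def parse_edge(line: str) -> Optional[Tuple[int, int, int]]:
    """Return (lhs, property, rhs) for an entity-to-entity statement, else None."""
    match = EDGE_PATTERN.match(line.strip())
    if match is None:
        # Literals (labels, dates, quantities, ...) are not graph edges.
        return None
    lhs, prop, rhs = (int(group) for group in match.groups())
    return lhs, prop, rhs


def iter_edges(path: str) -> Iterator[Tuple[int, int, int]]:
    """Stream edges from a gzip-compressed N-Triples file, one line at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            edge = parse_edge(line)
            if edge is not None:
                yield edge
```

Under these assumptions, the statement "Q42 is an instance of (P31) Q5" would become the row lhs = 42, property = 31, rhs = 5, while a statement whose object is a literal, such as a label, would produce no row.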

Performance

wd2graph is single-threaded. On a dump from March 2023 containing ~100,000,000 nodes and ~700,000,000 edges, it completes in ~16 minutes with a peak memory usage of ~22 GB on an AMD Ryzen Threadripper 3970X CPU with an SSD.
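
For a rough sense of scale, the figures above can be turned into a back-of-the-envelope estimate. The Python sketch below computes the raw size of the UInt32 columns before Parquet encoding and zstd compression, along with the average throughput. It uses only the approximate numbers quoted above, so the results are estimates rather than measurements. The actual file sizes on disk will differ.

```python
NODES = 100_000_000        # approximate node count quoted above
EDGES = 700_000_000        # approximate edge count quoted above
BYTES_PER_UINT32 = 4
RUNTIME_SECONDS = 16 * 60  # ~16 minutes

# One UInt32 column (qid) in the nodes file.
nodes_bytes = NODES * BYTES_PER_UINT32
# Three UInt32 columns (lhs, property, rhs) in the graph file.
graph_bytes = EDGES * 3 * BYTES_PER_UINT32
# Average number of edges emitted per second of wall-clock time.
edges_per_second = EDGES / RUNTIME_SECONDS

print(f"nodes, raw: {nodes_bytes / 1e9:.1f} GB")
print(f"graph, raw: {graph_bytes / 1e9:.1f} GB")
print(f"throughput: {edges_per_second:,.0f} edges/s")
```

This works out to roughly 0.4 GB of raw node data, 8.4 GB of raw edge data, and about 730,000 edges per second on the hardware described.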
