PyTorch-BigGraph (PBG) is a project by Facebook Research, a "distributed system for learning graph embeddings for large graphs" based on the paper PyTorch-BigGraph: A Large-scale Graph Embedding Framework. As an example dataset, the authors trained a PBG model on the full Wikidata graph.
In this repository, you'll find a guide on how to import the complete Wikidata PBG model into Weaviate and search through the entire dataset in < 50 milliseconds (excluding internet latency). The demo GraphQL queries below contain both pure vector searches and mixed scalar and vector search queries.
If you like what you see, a ⭐ on the Weaviate GitHub repo is appreciated, as is joining our Slack.
Additional links:
- 💡 Live Demo HTML front-end
- 💡 Live Demo Weaviate GraphQL front-end
- 💡 Live Demo Weaviate RESTful Endpoint
- Weaviate documentation
- Weaviate on GitHub
- Complete English-language Wikipedia vectorized in Weaviate (similar project)
- The folks from Facebook Research who trained the PBG
- Thanks to the team of Obsei for sharing the idea on our Slack channel
description | value |
---|---|
Data objects imported | 78,404,883 |
Machine | 16 CPUs, 128 GB memory |
Weaviate version | v1.8.0-rc.2 |
Dataset size | 125 GB |
Note:
- This dataset is indexed on a single Weaviate node to show what a single Weaviate instance can handle. You can also set up a Weaviate Kubernetes cluster and import the complete dataset that way.
You can import the data yourself in two ways: by running the Python script included in this repo, or by restoring a Weaviate backup (the fastest option).
$ wget https://dl.fbaipublicfiles.com/torchbiggraph/wikidata_translation_v1.tsv.gz
$ gzip -d wikidata_translation_v1.tsv.gz
$ pip3 install -r requirements.txt
$ docker-compose up -d
$ python3 import.py
The import takes a few hours, so you probably want to run it in the background:
$ nohup python3 -u import.py &
Note:
- The script assumes that the TSV file is called wikidata_translation_v1.tsv
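For reference, here is a minimal sketch of what such an import could look like with the Python weaviate-client (v3) batch API. It assumes each TSV line holds an entity identifier followed by its tab-separated vector components, and an Entity class with a url property (matching the GraphQL examples below); the actual import.py in this repo may structure things differently.

```python
# Minimal sketch, assuming each TSV line is: <entity identifier>\t<float>\t<float>...
# Class name "Entity" and property "url" follow the GraphQL demo queries below;
# the real import.py may handle UUIDs, Label objects, and batching differently.
import weaviate

client = weaviate.Client("http://localhost:8080")
client.batch.configure(batch_size=256)

with open("wikidata_translation_v1.tsv") as tsv, client.batch as batch:
    for line in tsv:
        parts = line.rstrip("\n").split("\t")
        entity_id, vector = parts[0], [float(x) for x in parts[1:]]
        batch.add_data_object(
            data_object={"url": entity_id},  # scalar payload
            class_name="Entity",             # assumed class name
            vector=vector,                   # pre-computed PBG embedding
        )
```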
You can download a backup and restore it. This is by far the fastest way to get the dataset up and running.
# clone this repository
$ git clone https://github.com/semi-technologies/biggraph-wikidata-search-with-weaviate
# download the Weaviate backup
$ curl https://storage.googleapis.com/semi-technologies-public-data/weaviate-1.8.0-rc.2-backup-wikipedia-pytorch-biggraph.tar.gz -O
# untar the backup (125G unpacked)
$ tar -xvzf weaviate-1.8.0-rc.2-backup-wikipedia-pytorch-biggraph.tar.gz
# get the unpacked directory
$ echo $(pwd)/var/weaviate
# use the above result (e.g., /home/foobar/weaviate-disk/var/weaviate)
# update the volumes section in docker-compose.yml (NOT PERSISTENCE_DATA_PATH!) so that the above path is mounted to /var/lib/weaviate inside the container
# (e.g., volumes: ['/home/foobar/weaviate-disk/var/weaviate:/var/lib/weaviate'])
# With 16 CPUs this process takes about 12 to 15 minutes
# start the container
$ docker-compose up -d
Notes:
- Weaviate needs some time to restore the backup; you can follow the status of the restore in the Docker logs (or poll the readiness endpoint, as sketched after these notes). For more verbose logging, add LOG_LEVEL: 'debug' to docker-compose.yml.
- This setup is tested with Ubuntu 20.04.3 LTS and the Weaviate version in the attached Docker Compose file.
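If you'd rather poll than tail the Docker logs, a small script along these lines works against Weaviate's standard readiness endpoint (a sketch; adjust host and port to your setup):

```python
# Poll Weaviate's readiness endpoint until the restored instance accepts requests.
# /v1/.well-known/ready returns 200 once Weaviate is ready to serve traffic.
import time
import requests

URL = "http://localhost:8080/v1/.well-known/ready"  # adjust host/port if needed

while True:
    try:
        if requests.get(URL, timeout=2).status_code == 200:
            print("Weaviate is ready")
            break
    except requests.exceptions.RequestException:
        pass  # container may still be starting
    time.sleep(10)
```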
##
# The one and only Stanley Kubrick 🚀⬛🐒
##
{
Get {
Entity(
nearObject: {id: "7392bc9d-a3c0-4738-9d25-a473245971c5", certainty: 0.75}
limit: 24
) {
url
_additional {
id
certainty
}
}
Label(nearObject: {id: "7392bc9d-a3c0-4738-9d25-a473245971c5", certainty: 0.8}) {
content
language
_additional {
id
certainty
}
}
}
}
##
# Na na na na na na na na na na na na na na na na... BATMAN! 🦇
##
{
Get {
Entity(
nearObject: {id: "72784488-d8a9-4fa5-8c5c-208465a31fe2", certainty: 0.75}
limit: 3
) {
url
_additional {
id
certainty
vector
}
}
}
}
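The intro mentions mixing scalar and vector search. As a sketch of what such a query could look like (not part of the original demo queries), the nearObject search can be combined with a where filter on the Label class's language property and sent to the RESTful GraphQL endpoint from Python. The UUID and certainty reuse the Batman example above; whether the filter needs valueString or valueText depends on how the schema defines language.

```python
# Mixed scalar + vector search: the same nearObject search as above, combined
# with a `where` filter on the Label `language` property (sketch; use valueText
# instead of valueString if the schema defines `language` as a text property).
import requests

query = """
{
  Get {
    Label(
      nearObject: {id: "72784488-d8a9-4fa5-8c5c-208465a31fe2", certainty: 0.75}
      where: {path: ["language"], operator: Equal, valueString: "en"}
      limit: 10
    ) {
      content
      language
      _additional { id certainty }
    }
  }
}
"""

resp = requests.post("http://localhost:8080/v1/graphql", json={"query": query})
print(resp.json())
```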