PyTorch-BigGraph (PBG) is a project by Facebook Research, a "distributed system for learning graph embeddings for large graphs" based on the paper PyTorch-BigGraph: A Large-scale Graph Embedding Framework. As an example dataset, the authors trained a PBG model on the full Wikidata graph.
In this repository, you'll find a guide on how to import the complete Wikidata PBG model into Weaviate and search through the entire dataset in < 50 milliseconds (excluding internet latency). The demo GraphQL queries below contain both pure vector searches and mixed scalar and vector search queries.
If you like what you see, a ⭐ on the Weaviate GitHub repo is appreciated, as is joining our Slack.
Additional links:
- 💡 Live Demo HTML front-end
- 💡 Live Demo Weaviate GraphQL front-end
- 💡 Live Demo Weaviate RESTful Endpoint
- Weaviate documentation
- Weaviate on GitHub
- Complete English-language Wikipedia vectorized in Weaviate (similar project)
- The folks from Facebook Research who trained the PBG
- Thanks to the team of Obsei for sharing the idea on our Slack channel
description | value |
---|---|
Data objects imported | 78,404,883 |
Machine | 16 CPUs, 128 GB memory |
Weaviate version | v1.8.0-rc.2 |
Dataset size | 125 GB |
Note:
- This dataset is indexed on a single Weaviate node to show what a single Weaviate instance can handle. You can also set up a Weaviate Kubernetes cluster and import the complete dataset that way.
You can import the data yourself in two ways: by running the Python script included in this repo, or by restoring a Weaviate backup (the fastest option).
$ wget https://dl.fbaipublicfiles.com/torchbiggraph/wikidata_translation_v1.tsv.gz
$ gzip -d wikidata_translation_v1.tsv.gz
$ pip3 install -r requirements.txt
$ docker-compose up -d
$ python3 import.py
The import takes a few hours, so you probably want to run it in the background:
$ nohup python3 -u import.py &
Note:
- The script assumes that the TSV file is called wikidata_translation_v1.tsv
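For reference, here is a minimal sketch of what such an import could look like with the Python weaviate-client (v3) batch API. It assumes each TSV line holds an entity identifier followed by its tab-separated vector components, and an Entity class with a url property (matching the GraphQL examples below); the actual import.py in this repo may structure things differently.

```python
# Minimal sketch, assuming each TSV line is: <entity identifier>\t<float>\t<float>...
# Class name "Entity" and property "url" follow the GraphQL demo queries below;
# the real import.py may handle UUIDs, Label objects, and batching differently.
import weaviate

client = weaviate.Client("http://localhost:8080")
client.batch.configure(batch_size=256)

with open("wikidata_translation_v1.tsv") as tsv, client.batch as batch:
    for line in tsv:
        parts = line.rstrip("\n").split("\t")
        entity_id, vector = parts[0], [float(x) for x in parts[1:]]
        batch.add_data_object(
            data_object={"url": entity_id},  # scalar payload
            class_name="Entity",             # assumed class name
            vector=vector,                   # pre-computed PBG embedding
        )
```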
You can download a backup and restore it. This is by far the fastest way to get the dataset up and running.
# clone this repository
$ git clone https://github.com/semi-technologies/biggraph-wikidata-search-with-weaviate
# download the Weaviate backup
$ curl https://storage.googleapis.com/semi-technologies-public-data/weaviate-1.8.0-rc.2-backup-wikipedia-pytorch-biggraph.tar.gz -O
# untar the backup (125G unpacked)
$ tar -xvzf weaviate-1.8.0-rc.2-backup-wikipedia-pytorch-biggraph.tar.gz
# get the unpacked directory
$ echo $(pwd)/var/weaviate
# use the above result (e.g., /home/foobar/weaviate-disk/var/weaviate)
# update the volumes section in docker-compose.yml (NOT PERSISTENCE_DATA_PATH!) so that the above path is mounted to /var/lib/weaviate inside the container
# (e.g., volumes: ['/home/foobar/weaviate-disk/var/weaviate:/var/lib/weaviate'])
# With 16 CPUs this process takes about 12 to 15 minutes
# start the container
$ docker-compose up -d
Notes:
- Weaviate needs some time to restore the backup; you can follow the status of the restore in the Docker logs (or poll the readiness endpoint, as sketched after these notes). For more verbose logging, add LOG_LEVEL: 'debug' to docker-compose.yml.
- This setup is tested with Ubuntu 20.04.3 LTS and the Weaviate version in the attached Docker Compose file.
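If you'd rather poll than tail the Docker logs, a small script along these lines works against Weaviate's standard readiness endpoint (a sketch; adjust host and port to your setup):

```python
# Poll Weaviate's readiness endpoint until the restored instance accepts requests.
# /v1/.well-known/ready returns 200 once Weaviate is ready to serve traffic.
import time
import requests

URL = "http://localhost:8080/v1/.well-known/ready"  # adjust host/port if needed

while True:
    try:
        if requests.get(URL, timeout=2).status_code == 200:
            print("Weaviate is ready")
            break
    except requests.exceptions.RequestException:
        pass  # container may still be starting
    time.sleep(10)
```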
##
# The one and only Stanley Kubrick 🚀⬛🐒
##
{
Get {
Entity(
nearObject: {id: "7392bc9d-a3c0-4738-9d25-a473245971c5", certainty: 0.75}
limit: 24
) {
url
_additional {
id
certainty
}
}
Label(nearObject: {id: "7392bc9d-a3c0-4738-9d25-a473245971c5", certainty: 0.8}) {
content
language
_additional {
id
certainty
}
}
}
}
##
# Na na na na na na na na na na na na na na na na... BATMAN! 🦇
##
{
Get {
Entity(
nearObject: {id: "72784488-d8a9-4fa5-8c5c-208465a31fe2", certainty: 0.75}
limit: 3
) {
url
_additional {
id
certainty
vector
}
}
}
}
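The intro mentions mixing scalar and vector search. As a sketch of what such a query could look like (not part of the original demo queries), the nearObject search can be combined with a where filter on the Label class's language property and sent to the RESTful GraphQL endpoint from Python. The UUID and certainty reuse the Batman example above; whether the filter needs valueString or valueText depends on how the schema defines language.

```python
# Mixed scalar + vector search: the same nearObject search as above, combined
# with a `where` filter on the Label `language` property (sketch; use valueText
# instead of valueString if the schema defines `language` as a text property).
import requests

query = """
{
  Get {
    Label(
      nearObject: {id: "72784488-d8a9-4fa5-8c5c-208465a31fe2", certainty: 0.75}
      where: {path: ["language"], operator: Equal, valueString: "en"}
      limit: 10
    ) {
      content
      language
      _additional { id certainty }
    }
  }
}
"""

resp = requests.post("http://localhost:8080/v1/graphql", json={"query": query})
print(resp.json())
```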