Build process

Robyn Speer edited this page Sep 14, 2020 · 25 revisions

Setting up a complete installation of ConceptNet requires some Python code, some associated technology such as PostgreSQL, and various dependencies. This guide walks you through the setup.

This is no longer our recommended way to run ConceptNet. We would rather automate the dependencies, instead of having to describe all the steps here. The conceptnet-deployment repository describes how to set up ConceptNet using either Packer or Puppet, which will take care of almost all of these steps for you.

Okay but I still want to do this the hard way

If you are running this on an existing computer, you will need:

  • A Unix system with command-line tools like sort and grep
  • Python 3.7 or later, with development headers (python3-dev)
  • A Python environment where you can install packages without sudo (for example, using virtualenv)
  • PostgreSQL 10 or later, and the ability to create databases
    • Set up PostgreSQL's permissions so that you can run "createdb conceptnet5" as your current user, without sudo.
  • Git
  • 300 GB of free disk space
  • At least 30 GB of available RAM
  • The time and bandwidth to download 24 GB of raw data
  • The numpy and scipy libraries
  • The libhdf5-dev library for reading and writing HDF5 tables
  • The libmecab-dev library for tokenizing Japanese, and its dictionary, mecab-ipadic-utf8
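Some of the requirements above can be checked from Python before you start. The following is a minimal sketch, not part of ConceptNet: the function name `check_requirements` and the choice of tools to look for are assumptions made for illustration. It only covers the Python version, free disk space, and command-line tools, because the standard library has no portable way to check available RAM or installed development headers.

```python
import shutil
import sys

# Thresholds taken from the requirements list above.
MIN_PYTHON = (3, 7)
MIN_FREE_DISK_GB = 300
# Hypothetical selection of executables to look for on PATH.
REQUIRED_TOOLS = ["sort", "grep", "git", "createdb"]


def check_requirements(path=".", tools=REQUIRED_TOOLS):
    """Return a dict mapping each check name to True (passed) or False."""
    free_gb = shutil.disk_usage(path).free / 1e9
    results = {
        "python": sys.version_info[:2] >= MIN_PYTHON,
        "disk": free_gb >= MIN_FREE_DISK_GB,
    }
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name:10s} {'ok' if ok else 'MISSING'}")
```

Run it from the directory (or drive) where you plan to keep the data, since the disk check measures the filesystem that contains `path`.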

Installing code and dependencies

Check out the source code of ConceptNet from Git:

git clone git@github.com:commonsense/conceptnet5
cd conceptnet5

Make sure that the development libraries that ConceptNet needs are available. For example, on Ubuntu:

sudo apt install build-essential python3-pip python3-dev libhdf5-dev libmecab-dev mecab-ipadic-utf8

mecab-ipadic-utf8 is the Japanese dictionary needed by MeCab to tokenize Japanese text. If you're on a non-Ubuntu system, the package may be called something else. Be sure to get the UTF-8 version. ConceptNet uses UTF-8 consistently. The default EUC-JP version of IPADic will not work.

If you are installing a version of ConceptNet 5 prior to 5.5.5, such as to reproduce a published result, you should run pip install xmltodict==0.10.2 to satisfy its dependency on a library that has made breaking changes since then.

Setting up PostgreSQL

Install PostgreSQL 10 or later. This command, for example, will install PostgreSQL 10 on Ubuntu:

sudo apt install postgresql-10

You'll need to configure PostgreSQL's permissions so that you can create and write to a database as your current user. The details of this are outside the scope of this tutorial. See "How to install and use PostgreSQL on Ubuntu" for a walkthrough, though that article is dated.

Your PostgreSQL user account has to be able to access the database by connecting to a local address, not just using the "Unix domain socket" that the psql command uses. You'll either need to set a password on your PostgreSQL account and store that in the CONCEPTNET_DB_PASSWORD environment variable, or follow a guide such as this one to not require a password when connecting locally.
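To see what "connecting to a local address" means in practice, here is a rough sketch of a connection check. It is an illustration, not ConceptNet's own connection code: the function names are hypothetical, and the use of psycopg2 as the client library is an assumption. Passing a host name such as localhost makes the client connect over TCP instead of the Unix domain socket, which is the case described above.

```python
import os


def connection_params(dbname="conceptnet5", host="localhost", user=None):
    """Build keyword arguments for a TCP (not Unix-socket) connection."""
    params = {"dbname": dbname, "host": host}
    if user is not None:
        params["user"] = user
    password = os.environ.get("CONCEPTNET_DB_PASSWORD")
    if password:
        # Only pass a password when one is set; otherwise PostgreSQL must
        # be configured to accept local connections without one.
        params["password"] = password
    return params


def can_connect():
    """Try a trivial query over the local address; raises on failure."""
    import psycopg2  # assumption: installed separately with pip

    conn = psycopg2.connect(**connection_params())
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            return cur.fetchone() == (1,)
    finally:
        conn.close()
```

If `can_connect()` raises an authentication error while psql works without a password, the socket connection is permitted but the local-address connection is not yet configured.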

Create a PostgreSQL database named conceptnet5 that you have the ability to write to:

createdb conceptnet5

Create a data directory within conceptnet5 that will contain ConceptNet's data. If necessary, make it a symbolic link to a hard drive with more space on it.

mkdir data
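If you prefer to script this step, the sketch below creates the data directory, or a symbolic link to a directory on another drive. The helper name `make_data_dir` and the example target path are hypothetical.

```python
from pathlib import Path


def make_data_dir(repo_dir, target=None):
    """Create repo_dir/data, optionally as a symlink to `target`."""
    data = Path(repo_dir) / "data"
    if data.exists() or data.is_symlink():
        return data  # leave an existing directory or link alone
    if target is None:
        data.mkdir()
    else:
        target = Path(target)
        # Create the real directory on the larger drive, then link to it.
        target.mkdir(parents=True, exist_ok=True)
        data.symlink_to(target.resolve(), target_is_directory=True)
    return data
```

For example, `make_data_dir(".", "/mnt/bigdisk/conceptnet-data")` run from the conceptnet5 directory would link data to a directory on a drive mounted at /mnt/bigdisk (a made-up mount point).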

Install ConceptNet as a Python package in your environment, including the optional "vectors" dependencies:

pip install -e '.[vectors]'

Running the build

Now that you've either done the manual installation described in the section above, or used Puppet to automate it, you can run the build process, which creates the ConceptNet graph from raw data. This process uses Snakemake, a build tool for reproducible data science.

Start the build by running:

./build.sh

Testing

You can test that the ConceptNet code and build process work as expected by running the test suite using pytest. The actual database doesn't necessarily have to be built, because the tests run a small example build as part of their setup.

First install the test dependencies:

pip install pytest PyLD

Then you can run the test suite:

pytest

If you have built the full ConceptNet database, you can also run the tests, normally skipped, that check that the database is working correctly:

pytest --fulldb

What you get

Here are some useful outputs of the build process:

  • The conceptnet5 PostgreSQL database, containing an index of all the edges
  • assertions/assertions.csv: A CSV file of all the assertions in ConceptNet
  • assertions/assertions.msgpack: The same data in the more efficient (and less readable) msgpack format
  • edges/: The edges from individual sources that these assertions were built from.
  • stats/: Some text files that count the distribution of different languages, relations, and datasets in the built data.
  • assoc/reduced.csv: A tabular text file of just the concept-to-concept associations (plus additional 'negated concept' nodes that represent negative relations), filtered for concepts that are referred to frequently enough
  • vectors/mini.h5: A vector space of high-quality word embeddings built from an ensemble of ConceptNet, word2vec, and GloVe, stored as a Pandas data frame in HDF5 format

Some other files you can build by request (type snakemake followed by the file name):

  • data/vectors/numberbatch.h5: the full ConceptNet Numberbatch matrix, with a larger vocabulary and more precision than vectors/mini.h5
  • data/stats/evaluation.h5: evaluation results comparing numberbatch.h5 to other pre-computed word embeddings
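As a quick way to inspect the assertions file listed above, here is a small reader sketch. It assumes the file is tab-separated with five fields per line (assertion URI, relation, start node, end node, and a JSON object of extra information such as the weight); check a few lines of your own build before relying on that layout. The function name and the sample line are made up for illustration.

```python
import io
import json

# A made-up line in the assumed five-field, tab-separated layout.
SAMPLE = (
    "/a/[/r/IsA/,/c/en/cat/,/c/en/animal/]\t/r/IsA\t/c/en/cat\t"
    '/c/en/animal\t{"weight": 2.0}\n'
)


def read_assertions(lines):
    """Yield (relation, start, end, weight) for each tab-separated line."""
    for line in lines:
        uri, rel, start, end, info = line.rstrip("\n").split("\t")
        # The last field is JSON; "weight" may be absent, giving None.
        yield rel, start, end, json.loads(info).get("weight")


if __name__ == "__main__":
    for row in read_assertions(io.StringIO(SAMPLE)):
        print(row)
```

To read a real build, pass an open file instead of the sample, for example `open("data/assertions/assertions.csv", encoding="utf-8")`, assuming the outputs live under the data directory created earlier. The reader streams line by line, so it does not need to hold the whole file in memory.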

Running the Web server

If you ran the Puppet installation, then the Web server that serves the API will be running for you, and all you need to do is restart the process:

sudo systemctl restart conceptnet

Otherwise, you've got more installation steps. Install the sub-package for the Web server:

cd web
pip install -e .

You can serve the API by running it as a Python script. You have to be in the web subdirectory of the repository (the one we just changed into above), or else it won't be able to find its files:

python conceptnet_web/api.py

This will run the API inside Flask's simple Web server. The Puppet version of the setup configures a more efficient Web server, using Nginx and uWSGI. You could configure these yourself, but at this point you're probably better off using conceptnet-deployment.
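Once the server is running, you can check it by requesting a concept. The sketch below uses only the standard library. The base URL is an assumption: use whatever address Flask prints when it starts. The helper names are hypothetical, and the text normalization (lowercasing, spaces to underscores) is a simplification of how ConceptNet builds term URIs.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: replace with the address shown in the server's startup output.
BASE_URL = "http://localhost:8084"


def concept_url(base_url, language, term):
    """Build the URL for a concept such as /c/en/ice_cream."""
    text = term.strip().lower().replace(" ", "_")
    return f"{base_url.rstrip('/')}/c/{quote(language)}/{quote(text)}"


def lookup(base_url, language, term):
    """Fetch a concept from a running API server and decode the JSON."""
    with urlopen(concept_url(base_url, language, term)) as response:
        return json.load(response)
```

For example, `lookup(BASE_URL, "en", "example")` should return a dictionary describing the concept if the server is up and the database has been built.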