
Running Dorylus from Scratch

Back to Artifact (Home)

If you are running Dorylus from your own instance, you will first need to install the dependencies and then configure your AWS account. The following instructions were written using Ubuntu 20.04.

For dependencies, run the following two commands:

sudo apt install awscli
sudo apt install python3-venv

Then clone the repository and set up a Python virtual environment:

git clone https://github.com/uclasystem/dorylus.git

cd dorylus/
python3 -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt

Next you will need to run the aws configure command, which will allow you to access AWS services. It will ask for your AWS ID, AWS Secret Key, and an AWS region. Make sure that the AWS profile associated with the AWS ID has permissions to manage EC2 instances, such as launching them, starting and stopping them, and getting all their information (this is used to configure the cluster). In addition, make sure it has permissions for creating, updating, and launching Lambda functions.
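
A typical aws configure session looks like the following (the values shown are placeholders; substitute your own credentials and region):

$ aws configure
AWS Access Key ID [None]: <your access key ID>
AWS Secret Access Key [None]: <your secret key>
Default region name [None]: us-east-2
Default output format [None]: json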

Using the 'ec2man' module

We use a Python module called ec2man to manage our EC2 instances. To set it up, the first step is to create a file called ec2man/profile, which will contain some basic information about the user. For OSDI Artifact Evaluators using our preconfigured instance, this step has already been done.

The file should contain the following information:

$ vim ec2man/profile

default                         # AWS profile to use in ~/.aws/credentials
ubuntu                          # Linux username on cluster
/home/ubuntu/.ssh/id_rsa        # SSH key used to access cluster
us-east-2                       # AWS region

The next step is to configure the machines that will be used for training. There are two different types of machines, which we refer to as contexts: the graph context (the workers that perform the model computation) and the weight context (the parameter servers).

In addition, the graph servers come in three variants: a serverless backend, a CPU-only backend, and a GPU-only backend.

For OSDI Artifact Evaluators, we provide pre-compiled versions of each of the different variants of the system. However, we also provide instructions on how to reproduce the procedure, including setting up an instance that has all necessary dependencies installed and then compiling Dorylus directly on the cluster.

The next step is to set up a file called ec2man/machines, which will be used for keeping track of the different instances in our training cluster. There are two main types of resources (which we refer to as contexts) required by Dorylus: the Graph Servers and the Weight Servers. To set up a cluster of servers, run the command:

python3 -m ec2man allocate --ami [AMI] --type [ec2 instance type] --cnt [# servers] --ctx [context]

We have provided base AMIs in the AWS region us-east-2 that have all the dependencies required for Dorylus:

  • For the serverless based backend: ami-07aec0eb32327b38d
  • For the CPU based backend: ami-04934e5a63d144d88
  • For the GPU based backend: ami-01c390943eecea45c
  • For the Weight Server: ami-0901fc9a7bc310a8a

For example, to create a group of 4 serverless graph servers that use the c5.xlarge instance type and 2 weight servers of type c5.large, run:

python3 -m ec2man allocate --ami ami-07aec0eb32327b38d --type c5.xlarge --cnt 4 --ctx graph
python3 -m ec2man allocate --ami ami-0901fc9a7bc310a8a --type c5.large --cnt 2 --ctx weight

This should add entries to the ec2man/machines file, and the cluster should now be ready to use. (For non-Artifact Evaluators, the Building Dorylus section below describes how to set up an instance that has the dependencies and code for Dorylus. To recreate the process described above, skip to the Building Dorylus section, create an AMI from the machine that has Dorylus prepared, and then use the allocate command to set up the cluster using this AMI.)

To configure the module with the current machines, run the command:

python3 -m ec2man setup

This will create both graph and weight contexts for Dorylus to use for training and retrieve all the required information about the instances. After this step, you can make sure that it has been configured correctly by connecting to a few servers, for example using python3 -m ec2man graph 0 ssh to SSH into the graph server with id 0. Run python3 -m ec2man help to get a full list of options.

Configuring the Cluster

To make sure that each instance is aware of the other instances in the cluster, next run

$ ./gnnman/setup-cluster

This script allows the servers to be aware of each other and of the roles each server plays. It will also configure the runtime parameters found in the run/ directory. More information about those can be found in the Building Dorylus section.

Preparing the Data

There are 4 main inputs to Dorylus:

  • The graph structure
  • The graph partition info
  • The input features (input tensor)
  • The training labels

Graph Input: To prepare an input graph for Dorylus, format it as a binary edge list, with vertices numbered contiguously from 0 to |V| and stored as 4-byte values. This should be in a file called graph.bsnap.
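
As an illustration, a minimal Python sketch that writes an edge list in this binary format might look like the following (it assumes unsigned 4-byte little-endian vertex IDs; verify the byte order against your build of Dorylus):

import struct

# A tiny hypothetical edge list; vertex IDs are numbered from 0 with no gaps.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

with open("graph.bsnap", "wb") as f:
    for src, dst in edges:
        # Each edge is two 4-byte unsigned values: source, then destination.
        f.write(struct.pack("<II", src, dst))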

Partition Info: Dorylus uses edge-cut based partitioning. While we limit partitioning to edge cuts, we allow flexibility in the algorithm used to compute the partitions, since partitioning is performed at runtime. We use the partitioner from Gemini for our partitioning strategy. The user needs to supply a text partition file in the form of:

0
3
2
1
2
3
...

where the line numbers correspond to the vertex ID and the value at each line corresponds to the partition to which the vertex is assigned. This file should be called graph.bsnap.parts.
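
For illustration only, the following sketch emits a valid partition file using a naive round-robin assignment; the results in the paper use Gemini's partitioner, so treat this purely as a format example:

num_vertices = 8    # |V| for a hypothetical graph
num_parts = 4       # number of partitions

with open("graph.bsnap.parts", "w") as f:
    for vid in range(num_vertices):
        # Line number = vertex ID; value = assigned partition.
        f.write(f"{vid % num_parts}\n")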

Input Features: The input features take the form of a tensor of size |V| x d where d is the number of input features. This should be a binary file of format:

[numFeats][vid0_feats][vid1_feats][vid2_feats]...

This file should be called features.bsnap.

Training Labels: The training labels should be of the form

[num_labels][label0][label1][label2]...

This file should be called labels.bsnap.
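
A minimal Python sketch for writing both of these files might look as follows (it assumes 4-byte little-endian integers for the headers and labels and 4-byte floats for the feature values; check these assumptions against your dataset loader before use):

import random
import struct

num_vertices = 8
num_feats = 16     # d, the number of input features
num_classes = 4    # assumed meaning of num_labels: the number of label classes

# features.bsnap: [numFeats][vid0_feats][vid1_feats]...
with open("features.bsnap", "wb") as f:
    f.write(struct.pack("<I", num_feats))
    for _ in range(num_vertices):
        feats = [random.random() for _ in range(num_feats)]
        f.write(struct.pack(f"<{num_feats}f", *feats))

# labels.bsnap: [num_labels][label0][label1]...
with open("labels.bsnap", "wb") as f:
    f.write(struct.pack("<I", num_classes))
    for _ in range(num_vertices):
        f.write(struct.pack("<I", random.randrange(num_classes)))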

Finally, given a dataset, say amazon, the dataset directory should be prepared with the following structure:

amazon
|-- features.bsnap
|-- graph.bsnap                 # File names should be exactly the same.
|-- labels.bsnap
|-- parts_<#partitions>         # Need a partition folder for every configuration
    |-- graph.bsnap.edges       # Symlink to ../graph.bsnap
    |-- graph.bsnap.parts

Dorylus will partition the files at runtime in a preprocessing step, so there is no need to create the partitions yourself. Once the dataset has been prepared, the easiest way to access it is to host it on an NFS instance.
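
To put the layout together, a small sketch along these lines would create the expected structure (shown for a hypothetical 4-partition configuration; the file names come from the tree above):

import os

dataset = "amazon"
num_parts = 4

parts_dir = os.path.join(dataset, f"parts_{num_parts}")
os.makedirs(parts_dir, exist_ok=True)

# graph.bsnap.edges is a symlink back to ../graph.bsnap.
link = os.path.join(parts_dir, "graph.bsnap.edges")
if not os.path.lexists(link):
    os.symlink("../graph.bsnap", link)

# graph.bsnap, features.bsnap, and labels.bsnap go at the top level of the
# dataset directory; graph.bsnap.parts goes inside parts_<#partitions>.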

For OSDI Artifact Evaluators we provide an NFS server which has the prepared data that we used to create the results in the paper. However, we also include instructions in our optional section to show how new datasets should be prepared.

The NFS server's EC2 instance ID is i-098eda4a36fda9788. To add this instance to your cluster, run the following commands:

python3 -m ec2man add nfs i-098eda4a36fda9788
python3 -m ec2man setup

The first command adds an existing EC2 instance to the cluster under a specific context, in this case a special context called nfs. We then run setup again to register the change.

After the NFS server has been added, make it available to the other servers by running:

local$ ./gnnman/mount-nfs-server

Building Dorylus

After you have finished configuring the graph and weight contexts and have run ./gnnman/setup-cluster, you can start building Dorylus.

Put the Source Code onto the Machines

Transfer the source code to the cluster using the command:

local$ ./gnnman/send-source [--force]        # '--force' deletes existing code

This utility uses rsync, so the code can be efficiently synchronized between the local and remote copies.

Installing Dependencies

You will need to install some libraries in order for Dorylus to work. These can be easily installed using the following script:

local$ ./gnnman/install-dep

If this fails for some reason, you may need to log into each node and run:

  • remote$ ./gnnman/helpers/graphserver.install
  • remote$ ./gnnman/helpers/weightserver.install

Setting up the Parameters Files

The system also reads a set of files at runtime from the run/ directory to configure certain parameters, such as the ports used on each machine or the layer configuration. The defaults provided in the repo should work on most EC2 machines, but if there are conflicts, please change them.

Compiling the Code

To build and synchronize Dorylus among all nodes in a cluster, we build it on the main node (node 0) and send the binary to all other nodes in the cluster. Due to this, it is very important to first run

local$ ./gnnman/setup-cluster

before building so that the cluster is configured properly and the binary will be sent to the correct nodes. This step will also send all of the parameter files mentioned previously, as well as the IP addresses of all the machines in the cluster.

The final step is to run:

local$ ./gnnman/build-system [graph|weight] [cpu|gpu]

This command allows you to build the system only for a specific context (graph or weight) or to build for a specific backend. Not specifying either cpu or gpu for the backend builds the Lambda-based backend.
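
For example, following the syntax above:

local$ ./gnnman/build-system graph           # Lambda-based backend for the graph servers
local$ ./gnnman/build-system graph gpu       # GPU backend for the graph servers
local$ ./gnnman/build-system weight          # build only the weight servers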

Once you have finished up to this point you can create an AMI of your instance to launch a cluster with Dorylus precompiled.

Setting up Lambda Functions

If you are running Artifact Evaluation for OSDI 2021, then you should have access to our account, which has the Lambda functions preinstalled.

If you would like to set up the Lambda functions yourself, you can start by allocating an instance to build them using the allocate functionality described above, or you can use one of the weight or graph servers. First, make sure to install the dependencies:

remote$ ./funcs/manage-funcs.install

Once the dependencies have been installed, you can do the following:

remote$ cd src/funcs
remote$ ./<function-name>/upload-func
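
For example, if your checkout contains a function directory named gcn (a hypothetical name here; substitute whichever function directories actually exist under src/funcs), the upload would be:

remote$ cd src/funcs
remote$ ./gcn/upload-func        # 'gcn' is a placeholder function name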