If running Dorylus from your own instance, you will first need to install the dependencies and then configure your AWS account. The following instructions were written using Ubuntu 20.04.
For dependencies, run the following two commands:
sudo apt install awscli
sudo apt install python3-venv
Then clone the repository and set up the Python environment:
git clone https://github.com/uclasystem/dorylus.git
cd dorylus/
python3 -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt
Next you will need to run the aws configure command, which will allow you to access AWS services.
It will require your AWS Access Key ID, AWS Secret Access Key, and an AWS region.
Make sure that the AWS profile associated with this access key has permissions to manage EC2 instances, such as launching them, starting and stopping them, and getting all their information (this is used to configure the cluster). In addition, make sure it has permissions for creating, updating, and launching Lambda functions.
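A typical aws configure session looks like the following; the key values shown here are placeholders, and the region matches the one used later in these instructions:
$ aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: ****************************************
Default region name [None]: us-east-2
Default output format [None]: json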
We use a Python module called ec2man to manage our EC2 instances. To set it up, the first step is to create a file called ec2man/profile, which will contain some basic information about your user.
For OSDI Artifact Evaluators using our preconfigured instance, this step has already been done.
The file should contain the following information:
$ vim ec2man/profile
default # AWS profile to use in ~/.aws/credentials
ubuntu # Linux username on cluster
/home/ubuntu/.ssh/id_rsa # SSH key used to access cluster
us-east-2 # AWS region
The next step is to configure the machines that will be used for training. There are two different types of machines, which we refer to as contexts: the graph context, which contains the workers that do the model computation, and the weight context, which contains the parameter servers. In addition, the graph servers come in three variants: a serverless backend, a CPU-only backend, and a GPU-only backend.
For OSDI Artifact Evaluators we provide pre-compiled versions of each of the different variants of the system. However, we do provide instructions on how to reproduce the procedure, including setting up an instance which has all necessary dependencies installed, then compiling Dorylus directly on the cluster.
The next step is to set up a file called ec2man/machines, which will be used for keeping track of the different instances in the training cluster. There are two main types of resources (which we refer to as contexts) required by Dorylus: the Graph Servers and the Weight Servers.
To set up a cluster of servers, run the command
python3 -m ec2man allocate --ami [AMI] --type [ec2 instance type] --cnt [# servers] --ctx [context]
We have provided base AMIs in the AWS region us-east-2 that have all the dependencies required for Dorylus:
- For the serverless based backend:
ami-07aec0eb32327b38d
- For the CPU based backend:
ami-04934e5a63d144d88
- For the GPU based backend:
ami-01c390943eecea45c
- For the Weight Server:
ami-0901fc9a7bc310a8a
For example, to create a group of 4 serverless graph servers that use the c5.xlarge instance type and 2 weight servers of type c5.large, run
python3 -m ec2man allocate --ami ami-07aec0eb32327b38d --type c5.xlarge --cnt 4 --ctx graph
python3 -m ec2man allocate --ami ami-0901fc9a7bc310a8a --type c5.large --cnt 2 --ctx weight
This should add entries to the ec2man/machines file, and now the cluster should be ready to use.
For non-Artifact Evaluators, we describe in the Building Dorylus section below how to set up an instance that has the dependencies and code for Dorylus. To recreate the process described above, skip to the Building Dorylus section, create an AMI from the machine that has Dorylus prepared, and then use the allocate command to set up the cluster using this AMI.
To configure the module with the current machines, run the command
python3 -m ec2man setup
This will create both graph and weight contexts for Dorylus to use for training and retrieve all the required information about the instances.
After this step, you can make sure that it has been configured correctly by connecting to a few servers, for example using python3 -m ec2man graph 0 ssh to SSH into the graph server with id 0. Run python3 -m ec2man help to get a full list of options.
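For example, to spot-check one server of each type (assuming the weight context follows the same context/index/command pattern as the graph context shown above):
python3 -m ec2man graph 0 ssh    # SSH into graph server 0
python3 -m ec2man weight 0 ssh   # assumed: same pattern for weight server 0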
To make sure that each instance is aware of the other instances in the cluster, next run
$ ./gnnman/setup-cluster
This script will allow the servers to be aware of each other and of the roles each server plays. It will also configure the runtime parameters found in the run/ directory. More information about those can be found in the Building Dorylus section.
There are 4 main inputs to Dorylus:
- The graph structure
- The graph partition info
- The input features (input tensor)
- The training labels
Graph Input: To prepare an input graph for Dorylus, the format should be a binary edge list with vertices numbered from 0 to |V| with no gaps, using 4-byte values. This should be in a file called graph.bsnap.
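As a rough illustration, the following Python sketch converts a plain-text edge list (one "src dst" pair per line) into this binary layout. The 4-byte little-endian unsigned integer encoding is an assumption; verify it against the loader in the Dorylus source before relying on it.
import struct

# Hypothetical converter: text edge list ("src dst" per line) -> graph.bsnap.
# Assumes each endpoint is encoded as a 4-byte little-endian unsigned integer.
def write_binary_edge_list(txt_path, out_path="graph.bsnap"):
    with open(txt_path) as fin, open(out_path, "wb") as fout:
        for line in fin:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            src, dst = int(parts[0]), int(parts[1])
            fout.write(struct.pack("<II", src, dst))

write_binary_edge_list("edges.txt")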
Partition Info: Dorylus uses edge-cut based partitioning. While we limit partitioning to edge cuts, we give flexibility in the choice of algorithm used to perform the partitioning by running partitioning at runtime. We use the partitioner from Gemini for our partitioning strategy. The user needs to supply a text partition file in the form of:
0
3
2
1
2
3
...
where the line numbers correspond to the vertex IDs and the value on each line corresponds to the partition to which the vertex is assigned. This file should be called graph.bsnap.parts.
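For illustration only, the sketch below writes such a file using a naive round-robin assignment; the real evaluation uses the Gemini partitioner, so treat the assignment logic here purely as a placeholder for the file format.
# Hypothetical example: line i of graph.bsnap.parts holds the partition of vertex i.
# Round-robin assignment is used only to demonstrate the format.
num_vertices = 1000   # placeholder vertex count
num_parts = 4         # placeholder number of partitions
with open("graph.bsnap.parts", "w") as f:
    for v in range(num_vertices):
        f.write(f"{v % num_parts}\n")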
Input Features: The input features take the form of a tensor of size |V| x d, where d is the number of input features. This should be a binary file of format:
[numFeats][vid0_feats][vid1_feats][vid2_feats]...
This file should be called features.bsnap.
Training Labels: The training labels should be of the form:
[num_labels][label0][label1][label2]...
This file should be called labels.bsnap.
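The sketch below writes toy features.bsnap and labels.bsnap files in these layouts. The exact field widths are assumptions (4-byte integer headers, 4-byte float features, 4-byte integer labels); check them against the Dorylus input loader.
import struct

features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # toy |V| x d feature tensor
labels = [0, 1]                                  # one label per vertex

# features.bsnap: [numFeats][vid0_feats][vid1_feats]...
with open("features.bsnap", "wb") as f:
    f.write(struct.pack("<I", len(features[0])))
    for feats in features:
        f.write(struct.pack(f"<{len(feats)}f", *feats))

# labels.bsnap: [num_labels][label0][label1]...
with open("labels.bsnap", "wb") as f:
    f.write(struct.pack("<I", len(labels)))
    for label in labels:
        f.write(struct.pack("<I", label))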
Finally, given a dataset, say amazon, the dataset directory should be prepared with the following structure:
amazon
|-- features.bsnap
|-- graph.bsnap                # File names should be exactly the same.
|-- labels.bsnap
|-- parts_<#partitions>        # Need a partition folder for every configuration
    |-- graph.bsnap.edges      # Symlink to ../graph.bsnap
    |-- graph.bsnap.parts
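For example, assuming the binary files prepared above sit in the current directory and you want a 4-way partition, the layout could be created with commands like:
local$ mkdir -p amazon/parts_4
local$ mv features.bsnap graph.bsnap labels.bsnap amazon/        # files prepared as described above
local$ ln -s ../graph.bsnap amazon/parts_4/graph.bsnap.edges     # symlink back to the shared graph file
local$ mv graph.bsnap.parts amazon/parts_4/                      # partition file computed for 4 partitions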
Dorylus will partition the files at runtime in a preprocessing step, so there is no need to create the partitions yourself. Once the data set has been prepared, the easiest way to access it is to host it on an NFS instance.
For OSDI Artifact Evaluators we provide an NFS server which has the prepared data that we used to create the results in the paper. However, we also include instructions in our optional section to show how new datasets should be prepared.
The NFS server's EC2 Instance ID is i-098eda4a36fda9788. To add this instance to your cluster, run the following commands.
python3 -m ec2man add nfs i-098eda4a36fda9788
python3 -m ec2man setup
The add command is used to add an existing EC2 instance to the cluster under a specific context, in this case a special context called nfs. We also need to run setup again to register the changes.
After the NFS server has been added, to make it available to the servers simply run
local$ ./gnnman/mount-nfs-server
After you have finished configuring the graph and weight contexts and have run ./gnnman/setup-cluster, you can start building Dorylus.
Transfer the source code to the cluster using the command:
local$ ./gnnman/send-source [--force] # '--force' deletes existing code
This utility uses rsync so code can be efficiently synchronized between the local and remote copies.
You will need to install some libraries in order for Dorylus to work. These can be easily installed using the following script.
local$ ./gnnman/install-dep
If for some reason this fails, you may need to go onto each node and run:
remote$ ./gnnman/helpers/graphserver.install
remote$ ./gnnman/helpers/weightserver.install
The system also uses a set of files that are read at runtime from the run/ directory to configure certain parameters, such as the ports used on each machine or the layer configuration. The defaults provided in the repo should work on most EC2 machines, but if there are conflicts please change them.
To build and synchronize Dorylus among all nodes in a cluster, we build it on the main node (node 0) and send the binary to all other nodes in the cluster. Due to this, it is very important to first run
local$ ./gnnman/setup-cluster
before building so that the cluster is configured properly and the binary will be sent to the correct nodes. This step will also send all of the parameter files mentioned previously, as well as the IP addresses of all the machines in the cluster.
The final step is to run:
local$ ./gnnman/build-system [graph|weight] [cpu|gpu]
This command allows you to build the system only for a specific context (graph or weight) or to build for a specific backend. Not specifying either cpu or gpu for the backend builds the Lambda-based backend.
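For example, based on the options above, building the default Lambda-based backend for both contexts and then a CPU-only build of just the graph servers might look like:
local$ ./gnnman/build-system              # Lambda-based backend for both contexts
local$ ./gnnman/build-system graph cpu    # CPU backend, graph servers only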
Once you have finished up to this point you can create an AMI of your instance to launch a cluster with Dorylus precompiled.
If you are running Artifact Evaluation for OSDI 2021, then you should have access to our account which has the lambda functions preinstalled.
If you would like to set up the Lambda functions, you can start by allocating an instance to build them using the allocate functionality described above, or you can use one of the weight or graph servers.
First, make sure to install the dependencies:
remote$ ./funcs/manage-funcs.install
Once the dependencies have been installed, you can do the following:
remote$ cd src/funcs
remote$ ./<function-name>/upload-func