
5. Run Dorylus

JOHN THORPE edited this page Apr 2, 2021 · 1 revision

Chapter 5. Run Dorylus

Previous page: 4. Setup Lambda Functions | Next page: None | Home: Home

Before running Dorylus, make sure that the system has been built properly (see 2. Build Dorylus), the dataset has been uploaded to all graphserver nodes (see 3. Prepare Input Dataset), and the Lambda functions are ready on AWS (see 4. Setup Lambda Functions). All remote machines within the same context should run the same executable, and all graphservers should use the same full copy of the input dataset.

Don't forget to rebuild whenever you change something in the source code.

Set Up the Parameter Files

Refer to the 2. Build Dorylus chapter for how to send the parameter files.

We run setup-cluster before building the system because the master node needs a proper dshmachines file to sync the executable. All graphserver nodes receive the same copy of the graph dshmachines file, and all weightserver nodes receive the same copy of the weight dshmachines file. The coordserver runs alone and does not need this information.

The layerconfig file, provided by the user, depends on the dataset and specifies the graph convolutional network (GCN) you will be running. The file contains l lines, each holding a single number, as shown below. This means your GCN has l layers, where the i-th layer has the feature dimension (the length of the per-vertex feature vector) given on the i-th line. The number of propagation steps is l - 1; in the example below, l = 3, so there are 2 propagation steps.

602
300
2
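As a quick sanity check before sending the parameter files, the relationship between lines, layers, and propagation steps can be sketched in a few lines of shell. This is only an illustrative sketch: `summarize_layerconfig` is a hypothetical helper that is not part of Dorylus, and it assumes the file holds exactly one positive integer per line (blank lines are ignored).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Dorylus): summarize a layerconfig file.
# Assumes one positive integer per line; blank lines are skipped.
summarize_layerconfig() {
    local file="$1"
    awk '
        # Skip blank lines so a trailing newline does not count as a layer.
        /^[[:space:]]*$/ { next }
        # Reject any line that is not a single positive integer.
        !/^[[:space:]]*[1-9][0-9]*[[:space:]]*$/ {
            printf "line %d: not a positive integer: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
            exit 1
        }
        # Record the feature dimension of the next layer.
        { dims[++n] = $1 }
        END {
            if (bad) exit 1
            if (n == 0) { print "empty layerconfig" > "/dev/stderr"; exit 1 }
            for (i = 1; i <= n; i++)
                printf "layer %d: feature dimension %d\n", i, dims[i]
            printf "layers: %d\n", n
            printf "propagation steps: %d\n", n - 1
        }
    ' "$file"
}
```

Running `summarize_layerconfig layerconfig` on the three-line example above would list the three feature dimensions and report 3 layers and 2 propagation steps. A line that is not a positive integer makes the function return a non-zero status.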

Remember to update the parameter configuration whenever you change any of these settings. Only the port and layerconfig information needs to be set manually; other information, such as dshmachines and IP addresses, is handled automatically by the setup-cluster command.

Run the System from Remote

You can run the system from remote shells. Each context is invoked from its master node [0]. The coordination server and the weight servers should be started before the graph servers.

To run the weight servers, on weight context master node [0]:

Weight$ cd dorylus/
Weight$ ./run/run-onnode weight <Dataset>

To run the graph servers, on graph context master node [0]:

Graph$ cd dorylus/
Graph$ ./run/run-onnode graph <Dataset> <--l=num_lambdas> [--e=num_epochs] [--p] [--s=staleness_bound]     # Dataset name should match what you specified on 'send-dataset'.
    --p: Enable async-pipeline
Graph$ ./run/run-onnode graph <Dataset> cpu # For CPU version
Graph$ ./run/run-onnode graph <Dataset> gpu # For GPU version

NOTE: To use the CPU or GPU version, you must first build it with ./gnnman/build-system graph [MODE].

When the graph server finishes, termination messages will be sent automatically to the coordserver and the weightservers.

Kill Zombie Server Processes

If you run into errors such as "Text file busy" or "Port / Address in use", server processes are probably still running on the worker nodes. Kill them with:

Remote$ ./gnnman/kill-zombies <Context>     # Specify the corresponding context you want to kill.

The run-onnode script automatically kills zombie processes before running, so you do not need to worry about it every time you run.

Check Logs & Output Results

Running logs and output results are stored on the graph master node [0]. Check them by:

Graph$ vim ~/logfiles/<RunMark>.<IK>.log                # Log file of run # <RunMark>.
Graph$ vim ~/outfiles/<RunMark>.<IK>/output_<Idx>       # Output results of node <Idx> of run # <RunMark>.
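When many runs have accumulated, it can be handy to jump straight to the newest log. The sketch below is a hypothetical helper, not part of Dorylus: `latest_log` is an invented name, and it assumes only what is stated above, namely that logs live in `~/logfiles` and end in `.log`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Dorylus): print the most recently
# modified *.log file in a directory (default: ~/logfiles).
latest_log() {
    local dir="${1:-$HOME/logfiles}"
    local newest="" f
    for f in "$dir"/*.log; do
        [ -e "$f" ] || continue     # the glob matched nothing
        # Keep f if it is the first candidate or newer than the current one.
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    [ -n "$newest" ] || return 1    # no log files found
    printf '%s\n' "$newest"
}
```

For example, `vim "$(latest_log)"` on the graph master node would open the newest log, provided at least one run has produced a log file.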

Clear the output, log, and temporary files on the graph master node with:

Graph$ ./gnnman/clear-out
