Add GraphLearn-for-PyTorch(GLT) distributed examples #7402

Merged
merged 70 commits into from
Aug 4, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -7,6 +7,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

### Added

- Added a distributed example using `graphlearn-for-pytorch` ([#7402](https://github.com/pyg-team/pytorch_geometric/pull/7402))
- Integrated `neg_sampling_ratio` into `TemporalDataLoader` ([#7644](https://github.com/pyg-team/pytorch_geometric/pull/7644))
- Added `faiss`-based `KNNIndex` classes for L2 or maximum inner product search ([#7842](https://github.com/pyg-team/pytorch_geometric/pull/7842))
- Added the `OSE_GVCS` dataset ([#7811](https://github.com/pyg-team/pytorch_geometric/pull/7811))
98 changes: 98 additions & 0 deletions examples/distributed/graphlearn_for_pytorch/README.md
@@ -0,0 +1,98 @@
# Using GraphLearn-for-PyTorch (GLT) for Distributed Training with PyG

**[GraphLearn-for-PyTorch (GLT)](https://github.com/alibaba/graphlearn-for-pytorch)** is a graph learning library for PyTorch that makes distributed GNN training easy and efficient.
GLT leverages GPUs to accelerate graph sampling and utilizes UVA and GPU caches to reduce data conversion and transfer costs during graph sampling and model training.
Most of the APIs of GLT are compatible with PyG, so PyG users only need to modify a few lines of their PyG code to train their model with GLT.

## Requirements

- `python >= 3.6`
- `torch >= 1.12`
- `graphlearn-torch`

## Distributed (Multi-Node) Example

This example shows how to leverage [GraphLearn-for-PyTorch (GLT)](https://github.com/alibaba/graphlearn-for-pytorch) to train PyG models in a distributed scenario with GPUs. The dataset in this example is `ogbn-products` from the [Open Graph Benchmark](https://ogb.stanford.edu/), but you can also train on `ogbn-papers100M` with only minor modifications.

You can either run the example step by step as described below or directly make use of our [`launch.py`](launch.py) script.
The training results will be saved to `dist_sage_sup.txt`.

### Running the Example

#### Step 1: Prepare and partition the data

Here, we use `ogbn-products` and partition it into two partitions:

```bash
python partition_ogbn_dataset.py --dataset=ogbn-products --root_dir=../../../data/ogbn-products --num_partitions=2
```

#### Step 2: Run the example in each training node

For example, running the example in two nodes each with two GPUs:

```bash
# Node 0:
CUDA_VISIBLE_DEVICES=0,1 python dist_train_sage_supervised.py \
--num_nodes=2 --node_rank=0 --master_addr=localhost \
--dataset=ogbn-products --dataset_root_dir=../../../data/ogbn-products \
--in_channel=100 --out_channel=47

# Node 1:
CUDA_VISIBLE_DEVICES=2,3 python dist_train_sage_supervised.py \
--num_nodes=2 --node_rank=1 --master_addr=localhost \
--dataset=ogbn-products --dataset_root_dir=../../../data/ogbn-products \
--in_channel=100 --out_channel=47
```
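With two nodes and two GPUs each, four training processes participate in the job. Each process needs a unique global rank for the PyTorch process group, derived from the node rank and the local GPU index. A minimal sketch (the helper name is illustrative, not GLT's actual API):

```python
def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Map a (node, local GPU) pair to a unique process-group rank."""
    return node_rank * gpus_per_node + local_rank

# Node 0 hosts ranks 0 and 1; node 1 hosts ranks 2 and 3.
ranks = [global_rank(n, g, 2) for n in range(2) for g in range(2)]
print(ranks)  # [0, 1, 2, 3]
```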

**Notes:**

1. You should change the `master_addr` to the IP of `node#0`.
2. Since there is randomness during data partitioning, please ensure all nodes are using the same partitioned data when running `dist_train_sage_supervised.py`.
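Because partitioning is randomized, one quick way to verify that every node holds identical partition output before training is to compare a directory checksum across machines. A stdlib-only sketch (the partition directory layout is hypothetical):

```python
import hashlib
import os

def dir_checksum(root: str) -> str:
    """Hash relative file paths and contents under root in a deterministic order."""
    digest = hashlib.sha256()
    for dirpath, _, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()
```

Run it on each node against the partition output directory and compare the resulting hex strings; any mismatch means the nodes were partitioned independently.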

### Using the `launch.py` Script

#### Step 1: Setup a distributed file system

**Note**: You may skip this step if you have already set up folders that are synchronized across machines.

To perform distributed sampling, files and code need to be accessible across multiple machines.
A distributed file system (*e.g.*, [NFS](https://wiki.archlinux.org/index.php/NFS), [SSHFS](https://www.digitalocean.com/community/tutorials/how-to-use-sshfs-to-mount-remote-file-systems-over-ssh), [Ceph](https://docs.ceph.com/en/latest/install), ...) exempts you from synchronizing files such as partition information.

#### Step 2: Prepare and partition the data

In distributed training (under the worker mode), each node in the cluster holds a partition of the graph.
Thus, before the training starts, we partition the `ogbn-products` dataset into multiple partitions, each of which corresponds to a specific training worker.

The partitioning occurs in three steps:
1. Run the partition algorithm to assign nodes to partitions.
2. Construct the partitioned graph structure based on the node assignment.
3. Split the node features and edge features into partitions.
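The three steps above can be sketched in plain Python. Random node assignment stands in for GLT's actual partition algorithm, which also balances edges and writes results to disk:

```python
import random

num_nodes, num_partitions = 8, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]

# Step 1: assign each node to a partition.
random.seed(0)  # fixed seed so every machine produces the same assignment
node2part = [random.randrange(num_partitions) for _ in range(num_nodes)]

# Step 2: build each partition's graph from the edges whose destination lives there.
part_edges = {p: [] for p in range(num_partitions)}
for src, dst in edges:
    part_edges[node2part[dst]].append((src, dst))

# Step 3: split node features so each partition stores only its own rows.
features = {v: [float(v)] for v in range(num_nodes)}
part_feats = {p: {v: features[v] for v in range(num_nodes) if node2part[v] == p}
              for p in range(num_partitions)}
```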

GLT supports caching graph topology and frequently accessed features in GPU to accelerate GPU sampling and feature collection.
For feature caching, we adopt a pre-sampling-based approach to determine the hotness of nodes, and cache features for nodes with higher hotness while loading the graph.
The uncached features are stored in pinned memory for efficient access via UVA.
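The pre-sampling idea can be illustrated in a few lines: run some sampling rounds, count how often each node is touched, and cache the features of the most frequently visited ("hot") nodes. This is a toy sketch; GLT's real implementation operates on GPU tensors:

```python
import random
from collections import Counter

random.seed(0)
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1], 4: []}

# Pre-sampling: count node visits over simulated one-hop sampling rounds.
hotness = Counter()
for _ in range(100):
    seed = random.choice(list(neighbors))
    for hop in neighbors[seed]:
        hotness[hop] += 1

# Cache the top-k hottest nodes; the rest stay in pinned host memory for UVA access.
cache_size = 2
hot_nodes = [n for n, _ in hotness.most_common(cache_size)]
```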

For further information about partitioning, please refer to the [official tutorial](https://github.com/alibaba/graphlearn-for-pytorch/blob/main/docs/tutorial/dist.md).

Here, we use `ogbn-products` and partition it into two partitions:

```bash
python partition_ogbn_dataset.py --dataset=ogbn-products --root_dir=../../../data/ogbn-products --num_partitions=2
```

#### Step 3: Set up the configuration file

An example configuration file is given in [`dist_train_sage_sup_config.yml`](dist_train_sage_sup_config.yml).
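The per-node lists in the configuration must line up one-to-one, so it is worth sanity-checking them before launching. A minimal sketch using a plain dict in place of the parsed YAML (PyYAML's `yaml.safe_load` would yield the same structure; the values are placeholders):

```python
config = {
    "nodes": ["0.0.0.0", "1.1.1.1"],
    "ports": [22, 22],
    "usernames": ["user0", "user1"],
    "python_bins": ["/path/to/python", "/path/to/python"],
}

def check_config(cfg: dict) -> int:
    """Ensure every per-node list has exactly one entry per node."""
    n = len(cfg["nodes"])
    for key in ("ports", "usernames", "python_bins"):
        assert len(cfg[key]) == n, f"{key} must list one value per node"
    return n

print(check_config(config))  # 2
```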

#### Step 4: Launch the distributed training

```bash
pip install paramiko
pip install click
apt install tmux
python launch.py --config=dist_train_sage_sup_config.yml --master_addr=0.0.0.0 --master_port=11234
```

Here, `master_addr` is for the master RPC address, and `master_port` is for PyTorch's process group initialization across training processes.
Note that you should change the `master_addr` to the IP of `node#0`.
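Under the hood, a launcher like [`launch.py`](launch.py) connects to each node over SSH (hence the `paramiko` dependency) and starts the training script inside `tmux` with per-node arguments. A simplified, stdlib-only sketch of how such per-node commands could be assembled; the flag names mirror the example above, but the helper itself is hypothetical:

```python
def build_node_command(node_rank: int, num_nodes: int, master_addr: str,
                       python_bin: str = "/usr/bin/python") -> str:
    """Assemble the training command a single node would run."""
    return (
        f"{python_bin} dist_train_sage_supervised.py "
        f"--num_nodes={num_nodes} --node_rank={node_rank} "
        f"--master_addr={master_addr} --dataset=ogbn-products"
    )

# One command per node; only --node_rank differs between them.
cmds = [build_node_command(rank, 2, "10.0.0.1") for rank in range(2)]
```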
@@ -0,0 +1,38 @@
# IP addresses of all nodes.
# Note: `nodes`, `ports`, and `usernames` together form `username@node:port` SSH targets.
nodes:
- 0.0.0.0
- 1.1.1.1

# SSH ports for each node:
ports: [22, 22]

# Username for remote IPs:
usernames:
- your_username_for_node_0
- your_username_for_node_1

# Path to Python with GLT environment for each node:
python_bins:
- /path/to/python
- /path/to/python

# The dataset name, e.g., ogbn-products, ogbn-papers100M.
# Note: make sure the name of dataset_root_dir is the same as the dataset name.
dataset: ogbn-products

# `in_channel` and `out_channel` of the dataset, e.g.,:
# - ogbn-products: in_channel=100, out_channel=47
# - ogbn-papers100M: in_channel=128, out_channel=172
in_channel: 100
out_channel: 47

# Path to the pytorch_geometric directory:
dst_paths:
- /path/to/pytorch_geometric
- /path/to/pytorch_geometric

# Setup visible CUDA devices for each node:
visible_devices:
- 0,1,2,3
- 0,1,2,3