Add GraphLearn-for-PyTorch(GLT) distributed examples #7402

Merged
merged 70 commits into from
Aug 4, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -7,6 +7,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

### Added

- Added a distributed example using `graphlearn-for-pytorch` ([#7402](https://github.com/pyg-team/pytorch_geometric/pull/7402))
- Integrated `neg_sampling_ratio` into `TemporalDataLoader` ([#7644](https://github.com/pyg-team/pytorch_geometric/pull/7644))
- Added `faiss`-based `KNNIndex` classes for L2 or maximum inner product search ([#7842](https://github.com/pyg-team/pytorch_geometric/pull/7842))
- Added the `OSE_GVCS` dataset ([#7811](https://github.com/pyg-team/pytorch_geometric/pull/7811))
98 changes: 98 additions & 0 deletions examples/distributed/graphlearn_for_pytorch/README.md
@@ -0,0 +1,98 @@
# Using GraphLearn-for-PyTorch (GLT) for Distributed Training with PyG

**[GraphLearn-for-PyTorch (GLT)](https://github.com/alibaba/graphlearn-for-pytorch)** is a graph learning library for PyTorch that makes distributed GNN training easy and efficient.
GLT leverages GPUs to accelerate graph sampling and utilizes UVA and GPU caches to reduce data conversion and transfer costs during graph sampling and model training.
Most of the APIs of GLT are compatible with PyG, so PyG users only need to modify a few lines of their PyG code to train their model with GLT.

## Requirements

- `python >= 3.6`
- `torch >= 1.12`
- `graphlearn-torch`

## Distributed (Multi-Node) Example

This example shows how to leverage [GraphLearn-for-PyTorch (GLT)](https://github.com/alibaba/graphlearn-for-pytorch) to train PyG models in a distributed scenario with GPUs. The dataset in this example is `ogbn-products` from the [Open Graph Benchmark](https://ogb.stanford.edu/), but you can also train on `ogbn-papers100M` with only minor modifications.

You can either run the example step by step as described below or directly make use of our [`launch.py`](launch.py) script.
The training results will be saved to `dist_sage_sup.txt`.

### Running the Example

#### Step 1: Prepare and partition the data

Here, we use `ogbn-products` and partition it into two partitions:

```bash
python partition_ogbn_dataset.py --dataset=ogbn-products --root_dir=../../../data/ogbn-products --num_partitions=2
```

#### Step 2: Run the example in each training node

For example, running the example in two nodes each with two GPUs:

```bash
# Node 0:
CUDA_VISIBLE_DEVICES=0,1 python dist_train_sage_supervised.py \
--num_nodes=2 --node_rank=0 --master_addr=localhost \
--dataset=ogbn-products --dataset_root_dir=../../../data/ogbn-products \
--in_channel=100 --out_channel=47

# Node 1:
CUDA_VISIBLE_DEVICES=2,3 python dist_train_sage_supervised.py \
--num_nodes=2 --node_rank=1 --master_addr=localhost \
--dataset=ogbn-products --dataset_root_dir=../../../data/ogbn-products \
--in_channel=100 --out_channel=47
```
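With two nodes and two GPUs each, four training processes participate in the job. Each process needs a unique global rank for the PyTorch process group, derived from the node rank and the local GPU index. A minimal sketch (the helper name is illustrative, not GLT's actual API):

```python
def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Map a (node, local GPU) pair to a unique process-group rank."""
    return node_rank * gpus_per_node + local_rank

# Node 0 hosts ranks 0 and 1; node 1 hosts ranks 2 and 3.
ranks = [global_rank(n, g, 2) for n in range(2) for g in range(2)]
print(ranks)  # [0, 1, 2, 3]
```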

**Notes:**

1. You should change the `master_addr` to the IP of `node#0`.
2. Since there is randomness during data partitioning, please ensure all nodes are using the same partitioned data when running `dist_train_sage_supervised.py`.
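Because partitioning is randomized, one quick way to verify that every node holds identical partition output before training is to compare a directory checksum across machines. A stdlib-only sketch (the partition directory layout is hypothetical):

```python
import hashlib
import os

def dir_checksum(root: str) -> str:
    """Hash relative file paths and contents under root in a deterministic order."""
    digest = hashlib.sha256()
    for dirpath, _, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()
```

Run it on each node against the partition output directory and compare the resulting hex strings; any mismatch means the nodes were partitioned independently.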

### Using the `launch.py` Script

#### Step 1: Setup a distributed file system

**Note**: You may skip this step if you have already set up folders that are synchronized across machines.

To perform distributed sampling, files and code need to be accessible across multiple machines.
A distributed file system (*e.g.*, [NFS](https://wiki.archlinux.org/index.php/NFS), [SSHFS](https://www.digitalocean.com/community/tutorials/how-to-use-sshfs-to-mount-remote-file-systems-over-ssh), [Ceph](https://docs.ceph.com/en/latest/install), ...) exempts you from synchronizing files such as partition information.

#### Step 2: Prepare and partition the data

In distributed training (under the worker mode), each node in the cluster holds a partition of the graph.
Thus, before the training starts, we partition the `ogbn-products` dataset into multiple partitions, each of which corresponds to a specific training worker.

The partitioning occurs in three steps:
1. Run the partition algorithm to assign nodes to partitions.
2. Construct the partitioned graph structure based on the node assignment.
3. Split the node features and edge features into partitions.
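The three steps above can be sketched in plain Python. Random node assignment stands in for GLT's actual partition algorithm, which also balances edges and writes results to disk:

```python
import random

num_nodes, num_partitions = 8, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]

# Step 1: assign each node to a partition.
random.seed(0)  # fixed seed so every machine produces the same assignment
node2part = [random.randrange(num_partitions) for _ in range(num_nodes)]

# Step 2: build each partition's graph from the edges whose destination lives there.
part_edges = {p: [] for p in range(num_partitions)}
for src, dst in edges:
    part_edges[node2part[dst]].append((src, dst))

# Step 3: split node features so each partition stores only its own rows.
features = {v: [float(v)] for v in range(num_nodes)}
part_feats = {p: {v: features[v] for v in range(num_nodes) if node2part[v] == p}
              for p in range(num_partitions)}
```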

GLT supports caching graph topology and frequently accessed features in GPU to accelerate GPU sampling and feature collection.
For feature caching, we adopt a pre-sampling-based approach to determine the hotness of nodes, and cache features for nodes with higher hotness while loading the graph.
The uncached features are stored in pinned memory for efficient access via UVA.
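The pre-sampling idea can be illustrated in a few lines: run some sampling rounds, count how often each node is touched, and cache the features of the most frequently visited ("hot") nodes. This is a toy sketch; GLT's real implementation operates on GPU tensors:

```python
import random
from collections import Counter

random.seed(0)
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1], 4: []}

# Pre-sampling: count node visits over simulated one-hop sampling rounds.
hotness = Counter()
for _ in range(100):
    seed = random.choice(list(neighbors))
    for hop in neighbors[seed]:
        hotness[hop] += 1

# Cache the top-k hottest nodes; the rest stay in pinned host memory for UVA access.
cache_size = 2
hot_nodes = [n for n, _ in hotness.most_common(cache_size)]
```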

For further information about partitioning, please refer to the [official tutorial](https://github.com/alibaba/graphlearn-for-pytorch/blob/main/docs/tutorial/dist.md).

Here, we use `ogbn-products` and partition it into two partitions:

```bash
python partition_ogbn_dataset.py --dataset=ogbn-products --root_dir=../../../data/ogbn-products --num_partitions=2
```

#### Step 3: Set up the configuration file

An example configuration file is given in [`dist_train_sage_sup_config.yml`](dist_train_sage_sup_config.yml).
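The per-node lists in the configuration must line up one-to-one, so it is worth sanity-checking them before launching. A minimal sketch using a plain dict in place of the parsed YAML (PyYAML's `yaml.safe_load` would yield the same structure; the values are placeholders):

```python
config = {
    "nodes": ["0.0.0.0", "1.1.1.1"],
    "ports": [22, 22],
    "usernames": ["user0", "user1"],
    "python_bins": ["/path/to/python", "/path/to/python"],
}

def check_config(cfg: dict) -> int:
    """Ensure every per-node list has exactly one entry per node."""
    n = len(cfg["nodes"])
    for key in ("ports", "usernames", "python_bins"):
        assert len(cfg[key]) == n, f"{key} must list one value per node"
    return n

print(check_config(config))  # 2
```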

#### Step 4: Launch the distributed training

```bash
pip install paramiko
pip install click
apt install tmux
python launch.py --config=dist_train_sage_sup_config.yml --master_addr=0.0.0.0 --master_port=11234
```

Here, `master_addr` is for the master RPC address, and `master_port` is for PyTorch's process group initialization across training processes.
Note that you should change the `master_addr` to the IP of `node#0`.
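Under the hood, a launcher like [`launch.py`](launch.py) connects to each node over SSH (hence the `paramiko` dependency) and starts the training script inside `tmux` with per-node arguments. A simplified, stdlib-only sketch of how such per-node commands could be assembled; the flag names mirror the example above, but the helper itself is hypothetical:

```python
def build_node_command(node_rank: int, num_nodes: int, master_addr: str,
                       python_bin: str = "/usr/bin/python") -> str:
    """Assemble the training command a single node would run."""
    return (
        f"{python_bin} dist_train_sage_supervised.py "
        f"--num_nodes={num_nodes} --node_rank={node_rank} "
        f"--master_addr={master_addr} --dataset=ogbn-products"
    )

# One command per node; only --node_rank differs between them.
cmds = [build_node_command(rank, 2, "10.0.0.1") for rank in range(2)]
```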
@@ -0,0 +1,38 @@
# IP addresses of all nodes.
# Note: `nodes`, `ports`, and `usernames` together form `username@node:port` SSH targets.
nodes:
- 0.0.0.0
- 1.1.1.1

# SSH ports for each node:
ports: [22, 22]

# Username for remote IPs:
usernames:
- your_username_for_node_0
- your_username_for_node_1

# Path to Python with GLT environment for each node:
python_bins:
- /path/to/python
- /path/to/python

# The dataset name, e.g., ogbn-products, ogbn-papers100M.
# Note: make sure the name of dataset_root_dir is the same as the dataset name.
dataset: ogbn-products

# `in_channel` and `out_channel` of the dataset, e.g.,:
# - ogbn-products: in_channel=100, out_channel=47
# - ogbn-papers100M: in_channel=128, out_channel=172
in_channel: 100
out_channel: 47

# Path to the pytorch_geometric directory:
dst_paths:
- /path/to/pytorch_geometric
- /path/to/pytorch_geometric

# Setup visible CUDA devices for each node:
visible_devices:
- 0,1,2,3
- 0,1,2,3