
[Example] Add WholeGraph to accelerate PyG dataloaders with GPUs #9714

Open · wants to merge 27 commits into base: master

Conversation

@chang-l (Author) commented Oct 17, 2024

This PR demonstrates how to integrate NVIDIA WholeGraph into PyG's graph-store and feature-store base classes, providing a modular, PyG-like way to extend PyG's dataloaders for better GPU utilization. WholeGraph handles optimized data access on NVIDIA hardware and manages graph and feature storage, with optional sharding across distributed disk, host RAM, or device memory.

Compared to existing examples, there are three key differences:

  • The WholeGraph library does not provide a dataloader; instead, it hosts the underlying distributed graph and feature storage together with efficient primitive operations on them (e.g., GPU-accelerated embedding retrieval and graph sampling).

  • It is efficient, minimizing CPU interruptions, and can be built into PyG's feature store and graph store, making it compatible with existing PyG-native dataloaders. Please see the feature_store.py and graph_store.py implementations (and the usage sketch after this list).

  • There is no distinction between single-GPU, multi-GPU, and multi-node multi-GPU training with this new feature store or graph store. Users do not need to partition the graph or hand-craft third-party launch scripts; everything falls under the traditional PyTorch DDP workflow. The examples (papers100m_dist_wholegraph_nc.py and benchmark_data.py) show how to achieve this starting from any existing PyG DDP example.
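
A minimal sketch of the intended usage, assuming the stores added in this PR (feature_store.py / graph_store.py) implement PyG's standard FeatureStore/GraphStore interfaces. The class names, import paths, and constructor arguments below are illustrative placeholders, not the exact API of this PR:

```python
import torch
from torch_geometric.loader import NeighborLoader

from feature_store import WholeGraphFeatureStore  # hypothetical name/path
from graph_store import WholeGraphGraphStore      # hypothetical name/path

# WholeGraph-backed stores: features and topology may be sharded across
# host RAM and device memory, with GPU-accelerated gather/sampling.
feature_store = WholeGraphFeatureStore(...)  # illustrative constructor
graph_store = WholeGraphGraphStore(...)      # illustrative constructor

# Any PyG-native dataloader that accepts a (FeatureStore, GraphStore) tuple
# works unchanged, so training remains a plain DDP workflow (launched per
# rank, e.g. via `torchrun --nproc_per_node=<num_gpus> train.py`).
loader = NeighborLoader(
    data=(feature_store, graph_store),
    num_neighbors=[15, 10],
    input_nodes=torch.arange(1024),
    batch_size=256,
)

for batch in loader:
    pass  # standard PyG mini-batch training step goes here
```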

Running the benchmark script (benchmark_data.py), we observed 2X, 5X, and 9X speedups on a single GPU with NVIDIA T4, A100, and H100, respectively, compared to the native PyG NeighborLoader. With 4 GPUs, the speedups increase to 6.4X, 15X, and 35X, respectively (numbers may vary depending on the CPU used for the baseline run).

Meanwhile, given the compatibility demonstrated in this PR and its performance benefits, I'd like to propose integrating WholeGraph, as an option, to back data.FeatureStore/HeteroData.FeatureStore first, and supporting the WholeMemory type as a new option in the index_select function (alongside the existing `if isinstance(value, Tensor):` branch), making it (UVA) accessible to more users.
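
A minimal sketch of what such a dispatch could look like, assuming a WholeGraph/WholeMemory embedding object that exposes a GPU gather method; the branch and method name below are illustrative, not the actual PyG or pylibwholegraph API:

```python
import torch
from torch import Tensor


def index_select(value, index: Tensor, dim: int = 0) -> Tensor:
    # Existing fast path: plain in-memory tensors.
    if isinstance(value, Tensor):
        return value.index_select(dim, index.to(value.device))
    # Hypothetical new branch: a UVA-accessible WholeMemory-backed embedding
    # that gathers rows on the GPU even when the underlying storage is
    # sharded across host and device memory.
    if hasattr(value, "gather"):
        return value.gather(index.cuda())
    raise TypeError(f"index_select: unsupported value type {type(value)}")
```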

cc. @puririshi98 @TristonC @alexbarghi-nv @linhu-nv @rusty1s

@chang-l requested a review from wsad1 as a code owner on October 17, 2024, 23:04
@puririshi98 (Contributor) left a comment

Overall looks good to me. @rusty1s I wonder if you think it would be a better fit to have these helper files directly integrated into torch_geometric.distributed.wholegraph or something like that.

Also, @chang-l, please remove these stale examples from examples/multi_gpu/:
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/data_parallel.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_batching.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_sampling.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_sampling_multinode.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_sampling_multinode.sbatch
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/mag240m_graphsage.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/papers100m_gcn.py
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/papers100m_gcn_multinode.py

For the taobao and pcqm4m examples in the folder, I think it would be best to add a comment at the top mentioning that mp.spawn is deprecated and pointing to your new examples.
Please also update the README of that folder accordingly.

Lastly, please add a similar deprecation comment and pointer to the new examples for these two:
https://github.com/pyg-team/pytorch_geometric/blob/master/docs/source/tutorial/multi_gpu_vanilla.rst
https://github.com/pyg-team/pytorch_geometric/blob/master/docs/source/tutorial/multi_node_multi_gpu_vanilla.rst
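
For reference, a sketch of the kind of header note being requested; the wording and the file it points to are placeholders, not text from this PR:

```python
# NOTE: This example launches workers with `torch.multiprocessing.spawn`,
# which is no longer the recommended approach; prefer `torchrun`. For an
# updated multi-GPU workflow, see the WholeGraph-backed examples added in
# PR #9714 (e.g., papers100m_dist_wholegraph_nc.py).
```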

@alexbarghi-nv (Contributor) commented Oct 22, 2024


@puririshi98 can we hold off on this for now? We are having a meeting in a couple hours to discuss this PR and how we want to go about it.

@alexbarghi-nv (Contributor) commented:
@puririshi98 @chang-l We can go ahead and instruct users to use torchrun and the example WG Graph/Feature stores. At some point, we will replace the ones in the examples directory with official ones that are part of cugraph. Our long-term strategy, I think, based on our discussion, is to have this take over feature storage in cuGraph. The cuGraph loaders will remain for users that need them for extreme scale applications. Then, for sampling, we will eventually replace the WholeGraph samplers with cuGraph ones once our C++ code can support custom partitioning schemes.

@chang-l (Author) commented Oct 22, 2024

Okay, I guess from our side we can keep this PR as it is (as one of the distributed examples) for now, gradually merge it into cuGraph along the way, and keep the examples up to date. Sounds good? @alexbarghi-nv @puririshi98 @TristonC @BradReesWork

@alexbarghi-nv (Contributor) commented:
@chang-l sounds good to me 👍

@chang-l (Author) commented Oct 22, 2024

@puririshi98 Thank you, Rishi, for the suggestions. I will file another PR to update and reorganize the existing multi-GPU/multi-node examples.

@chang-l (Author) commented Nov 6, 2024

Per @puririshi98's suggestion, I updated the comments for most examples in this folder.
We will continue working on improving our single-node, multi-GPU examples.

@alexbarghi-nv (Contributor) left a comment


👍

An inline review comment on the diff, at this context:

return node, row, col, edge, batch, num_sampled_nodes, num_sampled_edges


def sample_nodes_wmb_fn(wg_sampler, seeds, fanouts):

This function is not used in the code.
