
[Benchmark] Adding inference benchmark suite #4915

Merged: 16 commits merged into pyg-team:master on Jul 21, 2022

Conversation

@mszarma (Contributor) commented on Jul 4, 2022

Adds an inference benchmark suite covering different num_layers, hidden_channels and batch_sizes.

Datasets: reddit, ogbn-products, ogbn-mag
Models: gcn, gat, edge_conv, pna_conv, rgat as to_hetero(gat), rgcn as to_hetero(graphsage)
Loader: NeighborLoader
Modes: measure time for graph sampling + inference, or pure GNN mode: prepare the batches up front and measure only model inference time

Co-authored-by: @kgajdamo, @dszwicht
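For illustration, a minimal sketch of the two timing modes, assuming a NeighborLoader named loader and a homogeneous model; the function names and the exact model call are illustrative, not the benchmark's actual code:

import time

import torch


@torch.no_grad()
def time_sampling_plus_inference(model, loader, device):
    # Mode 1: graph sampling and inference are timed together.
    start = time.perf_counter()
    for batch in loader:
        batch = batch.to(device)
        model(batch.x, batch.edge_index)
    return time.perf_counter() - start


@torch.no_grad()
def time_pure_inference(model, loader, device):
    # Mode 2 ("pure GNN"): batches are prepared first, so only the forward
    # passes are timed.
    batches = [batch.to(device) for batch in loader]
    start = time.perf_counter()
    for batch in batches:
        model(batch.x, batch.edge_index)
    return time.perf_counter() - start

The only difference between the two modes is whether batch preparation happens inside or outside the timed region.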

codecov bot commented on Jul 4, 2022

Codecov Report

Merging #4915 (c97fa76) into master (b757161) will decrease coverage by 0.01%.
The diff coverage is 77.27%.

@@            Coverage Diff             @@
##           master    #4915      +/-   ##
==========================================
- Coverage   82.83%   82.82%   -0.02%     
==========================================
  Files         331      331              
  Lines       18047    18069      +22     
==========================================
+ Hits        14949    14965      +16     
- Misses       3098     3104       +6     
Impacted Files                              Coverage Δ
torch_geometric/nn/models/basic_gnn.py     88.88% <50.00%> (-3.42%) ⬇️
torch_geometric/nn/conv/pna_conv.py        92.68% <100.00%> (+1.25%) ⬆️

mszarma and others added 2 commits on July 4, 2022 17:04
Adding inference folder
Adding inference benchmark for different hyperparameters and batch sizes:
datasets: reddit, ogbn-products, ogbn-mag
models: gcn, gat, edge_conv, pna_conv, to_hetero(gat), to_hetero(graphsage)
using NeighborLoader

Co-authored-by: kgajdamo, dszwicht
mszarma and others added 2 commits on July 7, 2022 11:30
NeighborLoader takes a long time to initialize, and there is no need to
re-create it every time the number of layers changes. So, to speed up the
benchmark runs, it can be hoisted to an outer loop.
@rusty1s (Member) left a comment:

Thank you! This looks good. I think it would be great to re-use some existing code from torch_geometric.nn.models.basic_gnn (instead of writing new models for every benchmark). Happy to add an "inference-mode" to these models directly within PyG.

Resolved review threads (outdated) on: benchmark/inference/edgeconv.py, benchmark/inference/gat.py (2), benchmark/inference/utils.py (3), benchmark/inference/inference_benchmark.py (2)
@mszarma force-pushed the inference_benchmark branch from 6d21c30 to 81c4d01 on July 15, 2022 15:52
Resolved review threads (outdated) on torch_geometric/nn/models/basic_gnn.py (3)
@@ -169,3 +170,19 @@ def __repr__(self):
        return (f'{self.__class__.__name__}({self.in_channels}, '
                f'{self.out_channels}, towers={self.towers}, '
                f'edge_dim={self.edge_dim})')

    @staticmethod
Member:

Can we briefly test this in nn/conv/test_pna_conv?

Contributor (author):

Done

Resolved review threads on: torch_geometric/nn/conv/pna_conv.py, benchmark/inference/hetero_sage.py (4), benchmark/inference/hetero_gat.py
@mszarma force-pushed the inference_benchmark branch from fa7beaf to 1c6f95e on July 19, 2022 10:43
@mananshah99 (Contributor) left a comment:

Thank you very much for the updates! Left some comments, please let us know if you have any questions.

Review thread (resolved) on benchmark/inference/inference_benchmark.py
Comment on lines 28 to 34
hetero = True if dataset_name == 'ogbn-mag' else False
mask = ('paper', None) if hetero else None
degree = None

data = dataset[0].to(device)
inputs_channels = data.x_dict['paper'].size(
    -1) if hetero else dataset.num_features
Contributor:

Lines 29 and 33-34 seem specific to ogbn-mag (instead of hetero), and will likely break if we add more heterogeneous datasets in the future. Can we condition on if dataset_name == 'ogbn-mag' instead?

Contributor (author):

Good point. Done
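For reference, the change the reviewer asks for amounts to something like the following (an illustrative sketch, not the exact diff):

hetero = dataset_name == 'ogbn-mag'
degree = None
data = dataset[0].to(device)

if dataset_name == 'ogbn-mag':
    # ogbn-mag specific: the prediction target is the 'paper' node type.
    mask = ('paper', None)
    inputs_channels = data.x_dict['paper'].size(-1)
else:
    mask = None
    inputs_channels = dataset.num_features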

mask = ('paper', None) if hetero else None
degree = None

data = dataset[0].to(device)
Contributor:

Can we return dataset[0] from get_dataset, so this logic can be handled properly for datasets that may override __getitem__?

Contributor (author):

Done
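For illustration, one shape such a helper could take (hypothetical code; the PR's actual get_dataset in benchmark/inference/utils.py may differ, e.g. in the transforms it applies):

from torch_geometric.datasets import Reddit
from ogb.nodeproppred import PygNodePropPredDataset


def get_dataset(name, root):
    # Return the materialized Data object alongside the dataset, so callers
    # never index the dataset themselves (some datasets override __getitem__).
    if name == 'reddit':
        dataset = Reddit(f'{root}/Reddit')
    else:  # 'ogbn-products' or 'ogbn-mag'
        dataset = PygNodePropPredDataset(name, root)
    return dataset, dataset[0]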

Review thread (resolved, outdated) on torch_geometric/nn/models/basic_gnn.py
Comment on lines 45 to 64
subgraph_loader = NeighborLoader(
    copy.copy(data),
    num_neighbors=[-1],
    input_nodes=mask,
    batch_size=batch_size,
    shuffle=False,
    num_workers=args.num_workers,
)
subgraph_loader.data.n_id = torch.arange(data.num_nodes)

for layers in args.num_layers:
    if hetero:
        subgraph_loader = NeighborLoader(
            copy.copy(data),
            num_neighbors=[args.hetero_num_neighbors] * layers,
            input_nodes=mask,
            batch_size=batch_size,
            shuffle=False,
            num_workers=args.num_workers,
        )
Contributor:

Not sure I understand the differences for NeighborLoader between homogeneous and heterogeneous graphs, as well as the interplay between num_neighbors and num_layers. For the purposes of this benchmark, it should be the case that len(num_neighbors) == len(num_layers) (for both homogeneous and heterogeneous graphs).

If we want to be able to specify the number of neighbors for each layer, we can consolidate this logic and just have num_neighbors = args.num_neighbors. Does that make sense?

Contributor (author):

@mananshah99 Maybe I can point out a few things to make the situation clearer:

  • num_layers is a benchmark parameter: it is a list of layer counts to benchmark. Each value is passed to the model constructor (basic_gnn and hetero_*) to define how many layers to create, so the loop benchmarks models with different numbers of layers separately.
  • Inference for basic_gnn is performed layer-wise, while inference for the hetero models is performed batch-wise. In the layer-wise approach we always work only on the nearest neighborhood (the NeighborLoader's num_neighbors always has length 1). In the batch-wise approach the loader's num_neighbors must be as deep as the number of layers.
  • The hetero_num_neighbors argument was added only because, right now, batch-wise inference for hetero models takes very long. It gives users/CI an easy way to reduce num_neighbors and run a smaller benchmark.
  • Creating loaders takes time, and we run many benchmarks for different setups in the loop. To save time, for homogeneous workloads (which do not depend on the layer count) we create the loader before the num_layers loop, but for hetero models the layer count must be known when the NeighborLoader is created, so we create it inside the num_layers loop.

Let me know if that explains the situation and approaches.
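For illustration, a simplified sketch of the two patterns, assuming a subgraph_loader built as in the snippet above (with data.n_id set) and a 'paper' prediction target for the heterogeneous case; this is not the PR's actual code, and nonlinearities/normalization between layers are omitted for brevity:

import torch


@torch.no_grad()
def layerwise_inference(convs, x_all, subgraph_loader, device):
    # Homogeneous case: one sweep over all nodes per layer, so the loader only
    # ever samples the 1-hop neighborhood (num_neighbors=[-1]).
    for conv in convs:
        outs = []
        for batch in subgraph_loader:
            x = x_all[batch.n_id].to(device)
            x = conv(x, batch.edge_index.to(device))
            outs.append(x[:batch.batch_size].cpu())
        x_all = torch.cat(outs, dim=0)
    return x_all


@torch.no_grad()
def batchwise_inference(model, subgraph_loader, device):
    # Heterogeneous case: the full model runs on every sampled subgraph, so the
    # loader has to sample as many hops as the model has layers.
    outs = []
    for batch in subgraph_loader:
        batch = batch.to(device)
        out = model(batch.x_dict, batch.edge_index_dict)
        outs.append(out['paper'][:batch['paper'].batch_size].cpu())
    return torch.cat(outs, dim=0)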

Member:

For homogeneous graphs, we do inference via layer-wise computation. As such, I think the code is correct. We should nonetheless document this properly here to avoid confusion.

Comment on lines 108 to 109
default=['edge_conv', 'gat', 'gcn', 'pna_conv', 'rgat',
         'rgcn'], type=str)
Contributor:

Can we automatically obtain this from supported_sets.values()?

Contributor (author):

The idea of this default param is to give an option to change which benchmark suites run during checks etc., for the user's convenience: it can be changed on the CLI or directly in the script. I would personally prefer to have it written explicitly in the list; pulling it from supported_sets would not give that flexibility.
WDYT?

Member:

I think we would need to get the union of all values. +1

Member:

Probably okay to leave it as it is for now :)
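For reference, deriving the defaults from supported_sets would look roughly like this, assuming supported_sets maps dataset names to lists of supported model names as the discussion implies (the mapping shown here is hypothetical):

import argparse

# Hypothetical structure, for illustration only:
supported_sets = {
    'reddit': ['edge_conv', 'gat', 'gcn', 'pna_conv'],
    'ogbn-products': ['edge_conv', 'gat', 'gcn', 'pna_conv'],
    'ogbn-mag': ['rgat', 'rgcn'],
}

default_datasets = list(supported_sets.keys())
# Union of all values, as suggested above:
default_models = sorted({m for models in supported_sets.values() for m in models})

argparser = argparse.ArgumentParser()
argparser.add_argument('--datasets', nargs='+', default=default_datasets, type=str)
argparser.add_argument('--models', nargs='+', default=default_models, type=str)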

Comment on lines 103 to 105
argparser.add_argument('--datasets', nargs='+',
                       default=['ogbn-mag', 'ogbn-products',
                                'reddit'], type=str)
Contributor:

Can we automatically obtain this from supported_sets.keys()?

Contributor (author):

Please see the comment above.

Member:

+1.

argparser.add_argument(
    '--hetero-num-neighbors', default=-1, type=int,
    help='number of neighbors to sample per layer for hetero workloads')
argparser.add_argument('--num-workers', default=2, type=int)
Contributor:

Is this a reasonable default (instead of cpu_count // 2, for example)?

Contributor (author):

Shortly: for now, this should be good enough for NeighborLoader; we can always change it later once we do more workload analysis and improve the software stack.
More: data loading is tightly coupled with the model inference/training loop; the data loader's "producing" efficiency interacts with the model's "consuming" efficiency. Empirically, X num_workers can be optimal for CPU core counts in a range Y to Z, but the exact values vary and depend on things like hardware (e.g. single vs. multi socket, memory) and OS/environment settings.
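For comparison, the reviewer's alternative default would look something like this (illustrative only):

import argparse
import os

argparser = argparse.ArgumentParser()
# Half the available CPU cores, and at least 1:
argparser.add_argument('--num-workers', type=int,
                       default=max(1, (os.cpu_count() or 2) // 2))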

    '--models', nargs='+',
    default=['edge_conv', 'gat', 'gcn', 'pna_conv', 'rgat',
             'rgcn'], type=str)
argparser.add_argument('--root', default='../../data', type=str)
Contributor:

Can we document this parameter?

Contributor (author):

Done

argparser.add_argument('--root', default='../../data', type=str)
argparser.add_argument('--eval-batch-sizes', nargs='+',
                       default=[512, 1024, 2048, 4096, 8192], type=int)
argparser.add_argument('--num-layers', nargs='+', default=[2, 3], type=int)
Contributor:

Same as comments above, let's unify this (along with hetero num neighbors) for both homogeneous and heterogeneous graphs.

Contributor (author):

Please see the comment above.

@rusty1s (Member) left a comment:

I think this looks mostly great. Left a few minor comments.

Review thread (resolved) on benchmark/inference/hetero_gat.py
print(f'Model {name} not supported!')

if name == 'rgat':
    model = model_type(metadata, params['hidden_channels'],
Member:

Suggested change
-    model = model_type(metadata, params['hidden_channels'],
+    return model_type(metadata, params['hidden_channels'],

It is recommended to return immediately. The following elif can be converted to if.

Contributor (author):

Done


def get_model(name, params, metadata=None):
    try:
        model_type = models_dict[name]
Member:

Suggested change
-    model_type = models_dict[name]
+    model_type = models_dict.get(name, None)

to avoid the try/catch? Personal preference though, I guess.

Contributor (author):

Done


def get_model(name, params, metadata=None):
    try:
        model_type = models_dict[name]
@rusty1s (Member) commented on Jul 21, 2022:

Suggested change
-    model_type = models_dict[name]
+    Model = models_dict[name]

Can we make this uppercase so that it is clear that it returns a class?

Contributor (author):

Good point - done.

kwargs['heads'] = params['num_heads']
model = model_type(params['inputs_channels'],
                   params['hidden_channels'], params['num_layers'],
                   params['output_channels'], **kwargs)
Member:

Suggested change
-                   params['output_channels'], **kwargs)
+                   params['output_channels'], heads=params['num_heads'])

I think this is cleaner than first constructing a dictionary.

Contributor (author):

Done
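Putting the get_model suggestions from this review together, the helper could end up looking roughly like this (a hypothetical consolidation, not the PR's final code; the heterogeneous variants, which additionally take metadata, are omitted):

from torch_geometric.nn.models import GAT, GCN

models_dict = {'gcn': GCN, 'gat': GAT}  # subset of the PR's mapping, for illustration


def get_model(name, params, metadata=None):
    Model = models_dict.get(name, None)  # dict.get() instead of try/except
    if Model is None:
        raise ValueError(f'Model {name} not supported!')

    if name == 'gat':
        # Pass 'heads' directly instead of building a kwargs dict first.
        return Model(params['inputs_channels'], params['hidden_channels'],
                     params['num_layers'], params['output_channels'],
                     heads=params['num_heads'])

    return Model(params['inputs_channels'], params['hidden_channels'],
                 params['num_layers'], params['output_channels'])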

    shuffle=False,
    num_workers=args.num_workers,
)
subgraph_loader.data.n_id = torch.arange(data.num_nodes)
Member:

Can be dropped IMO.

Contributor (author):

Done


    shuffle=False,
    num_workers=args.num_workers,
)
subgraph_loader.data.n_id = torch.arange(
Member:

Drop as well?

Contributor (author):

Done



@rusty1s (Member) commented on Jul 21, 2022:

Please also add your PR to the CHANGELOG.md :)

@mszarma merged commit 11f16e3 into pyg-team:master on Jul 21, 2022