Add `Summary` for PyG datasets #5438

hatemhelal · 2022-09-14T07:20:12Z

This PR adds the ability to collect summary statistics for PyG datasets. It is initially implemented as a standalone class, and I'm posting it here for feedback. Various questions I'm thinking about:

Would this make more sense as a method on the torch_geometric.data.Dataset interface?
Is it ok to use the pandas and tabulate packages? I see them in the full requirements and could change the implementation to import them lazily and raise an error if those packages aren't available.
The summary still needs to be made more useful for single-graph datasets.
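
For the optional-dependency question above, one option is to defer the import until the summary is actually requested. A minimal sketch, assuming a hypothetical helper named `require_package` (not part of PyG) and using only `importlib` from the standard library:

```python
import importlib
from types import ModuleType


def require_package(name: str) -> ModuleType:
    """Import `name` on first use, with a clearer error if it is missing.

    `require_package` is a hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"'{name}' is required for this feature; "
            f"install it with `pip install {name}`"
        ) from err


# A standard-library module stands in for pandas/tabulate here.
json = require_package("json")
print(json.dumps({"ok": True}))
```

Calling the helper inside the method that needs the package keeps `import torch_geometric` working for users who have not installed the optional extras.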

Here is a quick demo on QM9:

from torch_geometric.datasets import QM9
from torch_geometric.data import Summary

dataset = QM9("data/QM9")
s = Summary(dataset)
print(s)

outputs:

100%|██████████| 130831/130831 [00:16<00:00, 7841.34it/s]
Summary QM9(130831)
         nodes     edges
----  --------  --------
mean  18.0325   37.3269
std    2.94371   6.29847
min    3         4
25%   16        34
50%   18        38
75%   20        42
max   29        56

Can also query properties on the Summary object:

s.max_num_nodes, s.max_num_edges

outputs:

(29, 56)
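
The rows in the table above (mean, std, min, quartiles, max) are standard descriptive statistics over the per-graph node and edge counts. A minimal sketch of how such a table could be computed, using only the standard library and hypothetical counts rather than real QM9 values; it is not the implementation in this PR:

```python
import statistics
from typing import Dict, Sequence


def describe(values: Sequence[float]) -> Dict[str, float]:
    """Return mean, sample std, min, quartiles and max for one column."""
    # "inclusive" interpolates linearly between the observed data points.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample std (n - 1 denominator)
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


# Hypothetical per-graph (num_nodes, num_edges) counts.
graph_sizes = [(3, 4), (5, 8), (4, 6), (6, 10)]
num_nodes = [n for n, _ in graph_sizes]
num_edges = [e for _, e in graph_sizes]

stats = {"nodes": describe(num_nodes), "edges": describe(num_edges)}
for row in ("mean", "std", "min", "25%", "50%", "75%", "max"):
    print(f"{row:<5}{stats['nodes'][row]:>10.4g}{stats['edges'][row]:>10.4g}")
```

With a real dataset, the two lists would be filled by iterating over the graphs once, which is the pass the progress bar in the demo reports on.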

codecov · 2022-09-14T07:25:13Z

Codecov Report

Merging #5438 (ef98779) into master (8ac0a48) will increase coverage by 0.11%.
The diff coverage is 96.92%.

❗ Current head ef98779 differs from pull request most recent head 27c398a. Consider uploading reports for the commit 27c398a to get more accurate results

@@            Coverage Diff             @@
##           master    #5438      +/-   ##
==========================================
+ Coverage   83.33%   83.44%   +0.11%     
==========================================
  Files         348      348              
  Lines       18939    18914      -25     
==========================================
+ Hits        15782    15783       +1     
+ Misses       3157     3131      -26

Impacted Files	Coverage Δ
torch_geometric/data/summary.py	`96.72% <96.72%> (ø)`
torch_geometric/data/__init__.py	`100.00% <100.00%> (ø)`
torch_geometric/data/dataset.py	`92.00% <100.00%> (+0.16%)`	⬆️
torch_geometric/utils/mask.py	`90.90% <0.00%> (-9.10%)`	⬇️
torch_geometric/data/storage.py	`81.76% <0.00%> (-0.32%)`	⬇️
torch_geometric/transforms/__init__.py	`100.00% <0.00%> (ø)`
torch_geometric/nn/to_hetero_transformer.py	`95.37% <0.00%> (ø)`
...h_geometric/nn/to_hetero_with_bases_transformer.py	`95.28% <0.00%> (ø)`
torch_geometric/transforms/mask.py
torch_geometric/data/lightning_datamodule.py	`49.12% <0.00%> (+6.33%)`	⬆️

EdisonLeeeee · 2022-09-14T15:27:22Z

Personally, I like this feature! How about adding it as a basic method summary for the dataset class?

dataset = QM9("data/QM9")
s = dataset.summary()
print(s)

Not a strong preference, but I think it would be convenient for users.

hatemhelal · 2022-09-14T20:47:48Z

Thanks @EdisonLeeeee, I updated the patch with your suggestion and updated some of the tests. I found that the type annotation in Summary resulted in a circular module dependency, so I removed the type annotation for now.
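
A common way to keep an annotation without creating a runtime import cycle is to guard the import with `typing.TYPE_CHECKING` and quote the annotation. A minimal sketch of that pattern, with a toy `Summary` class; it is not necessarily what this PR ended up doing:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime, so the
    # import cannot take part in a runtime import cycle.
    from torch_geometric.data import Dataset


class Summary:
    """Toy stand-in for the class discussed in this PR."""

    def __init__(self, dataset: "Dataset") -> None:
        # The quoted annotation is stored as a string and is not resolved
        # when the class is defined.
        self.num_graphs = len(dataset)


# Any sized object works at runtime; a list stands in for a dataset here.
summary = Summary([object(), object(), object()])
print(summary.num_graphs)
```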

I think I'd also like to add a minimum dataset size for showing the progress bar, since it looks a bit odd in the unit tests.
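
The minimum-size idea could look roughly like the sketch below. The helper name `maybe_progress` and the `min_size` default are hypothetical, and `tqdm` is treated as optional:

```python
def maybe_progress(items, min_size: int = 1000):
    """Wrap `items` in a tqdm bar only if it is large and tqdm is installed.

    `maybe_progress` and the `min_size` default are hypothetical.
    """
    if len(items) < min_size:
        return items  # small inputs: iterate silently
    try:
        from tqdm import tqdm  # optional dependency
    except ImportError:
        return items
    return tqdm(items)


small = list(range(10))
print(maybe_progress(small) is small)
```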

Happy to hear any other thoughts you have 🙏

EdisonLeeeee · 2022-09-15T14:00:46Z

Hi @hatemhelal

You've done an amazing job! Left some minor suggestions:

Could it support more statistics beyond nodes and edges in the summarized results? For example, density/sparsity, which can be calculated simply by

df['density'] = df['edges'] / df['nodes']**2

How about merging the *_num_nodes properties (e.g., min_num_nodes, max_num_nodes, etc.) into a single method? That way, you would not need to implement each of them whenever new statistics are added.
Do you think it is necessary to include some global statistics such as num_classes and num_features in the summarized results?
To improve efficiency, we can add lru_cache to the summary method so that the result can be reused without computing it again:

    @lru_cache(maxsize=1)
    def summary(self) -> Summary:
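
The caching suggestion above can be sketched with a toy class. `ToyDataset` is a hypothetical stand-in, not a PyG class:

```python
from functools import lru_cache


class ToyDataset:
    """Hypothetical stand-in for a dataset; stores per-graph node counts."""

    def __init__(self, num_nodes):
        self.num_nodes = list(num_nodes)
        self.calls = 0  # counts how often the statistics are recomputed

    @lru_cache(maxsize=1)
    def summary(self):
        self.calls += 1
        return {"min": min(self.num_nodes), "max": max(self.num_nodes)}


dataset = ToyDataset([3, 18, 29])
first = dataset.summary()
second = dataset.summary()  # served from the cache, no recomputation
print(first, dataset.calls)
```

Two caveats apply to `lru_cache` on a method: the cache is keyed on `self`, so the instance must be hashable and is kept alive by the cache, and with `maxsize=1` the single slot is shared by all instances, so summarizing a second dataset evicts the first result.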

rusty1s

Made some modifications to the Summary to make use of data classes. Hope the changes are okay :)
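
A data-class layout for such a summary might look like the sketch below. The class and field names are illustrative and are not copied from the merged code:

```python
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Stats:
    """Descriptive statistics for one quantity (illustrative fields)."""

    mean: float
    std: float
    min: float
    max: float

    @classmethod
    def from_values(cls, values: Sequence[float]) -> "Stats":
        return cls(
            mean=statistics.mean(values),
            std=statistics.stdev(values),  # sample standard deviation
            min=min(values),
            max=max(values),
        )


@dataclass
class ToySummary:
    name: str
    num_graphs: int
    num_nodes: Stats
    num_edges: Stats


nodes, edges = [3, 5, 4], [4, 8, 6]  # hypothetical per-graph counts
s = ToySummary("toy", len(nodes), Stats.from_values(nodes), Stats.from_values(edges))
print(s.num_nodes.max, s.num_edges.max)
```

Grouping the values per quantity means a new statistic becomes one extra field on `Stats` instead of a new property for each of nodes and edges.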

rusty1s · 2022-09-16T14:48:07Z

@EdisonLeeeee These are all great suggestions to include in Summary. Happy to include them in a follow-up PR.

hatemhelal · 2022-09-16T14:51:19Z

All good @rusty1s, thanks @EdisonLeeeee for the kind words too!

hatemhelal force-pushed the dataset-summary branch from 9e186ac to 9b1a555 Compare September 14, 2022 09:39

hatemhelal force-pushed the dataset-summary branch from 9b1a555 to a0a6086 Compare September 14, 2022 20:45

hatemhelal marked this pull request as ready for review September 15, 2022 10:32

hatemhelal added 10 commits September 15, 2022 14:36

Add Summary for PyG datasets

fb217c3

make desc private

e69dc36

add summary method

e518c5e

adding tests + removing type annotation to avoid circular dep

256be26

lazy import of pandas and tabulate

5555f71

add switch for progress bar

d8ccd4b

moved summary tests and only run them with the required packages

bf651d9

update changelog

0afda00

address coverage gaps

a054f15

rtest for repr

ef98779

hatemhelal force-pushed the dataset-summary branch from 583cad5 to ef98779 Compare September 15, 2022 13:36

rusty1s assigned hatemhelal Sep 15, 2022

rusty1s added feature 1 - Priority P1 dataset labels Sep 15, 2022

update

acab272

rusty1s approved these changes Sep 16, 2022

Merge branch 'master' into dataset-summary

27c398a

rusty1s enabled auto-merge (squash) September 16, 2022 14:48

rusty1s merged commit c2def91 into pyg-team:master Sep 16, 2022

hatemhelal deleted the dataset-summary branch September 21, 2022 19:05
