Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Summary for PyG datasets #5438

Merged
merged 12 commits into from
Sep 16, 2022
Merged

Conversation

hatemhelal
Copy link
Contributor

This PR adds the ability to collect summary statistics for PyG datasets. Initially implemented as a standalone class and posting here for feedback. Various questions I'm thinking about:

  • Would this make more sense as a method on the torch_geometric.data.Dataset interface?
  • Is it ok to use the pandas and tabulate packages? I see them in the full requirements and could change the implementation to lazy import / error if those packages aren't available.
  • Need to make the summary more useful for single-graph datasets.

Here is a quick demo on QM9:

from torch_geometric.datasets import QM9
from torch_geometric.data import Summary

dataset = QM9("data/QM9")
s = Summary(dataset)
print(s)

outputs:

100%|██████████| 130831/130831 [00:16<00:00, 7841.34it/s]
Summary QM9(130831)
         nodes     edges
----  --------  --------
mean  18.0325   37.3269
std    2.94371   6.29847
min    3         4
25%   16        34
50%   18        38
75%   20        42
max   29        56

Can also query properties on the Summary object:

s.max_num_nodes, s.max_num_edges

outputs

(29, 56)

@codecov
Copy link

codecov bot commented Sep 14, 2022

Codecov Report

Merging #5438 (ef98779) into master (8ac0a48) will increase coverage by 0.11%.
The diff coverage is 96.92%.

❗ Current head ef98779 differs from pull request most recent head 27c398a. Consider uploading reports for the commit 27c398a to get more accurate results

@@            Coverage Diff             @@
##           master    #5438      +/-   ##
==========================================
+ Coverage   83.33%   83.44%   +0.11%     
==========================================
  Files         348      348              
  Lines       18939    18914      -25     
==========================================
+ Hits        15782    15783       +1     
+ Misses       3157     3131      -26     
Impacted Files Coverage Δ
torch_geometric/data/summary.py 96.72% <96.72%> (ø)
torch_geometric/data/__init__.py 100.00% <100.00%> (ø)
torch_geometric/data/dataset.py 92.00% <100.00%> (+0.16%) ⬆️
torch_geometric/utils/mask.py 90.90% <0.00%> (-9.10%) ⬇️
torch_geometric/data/storage.py 81.76% <0.00%> (-0.32%) ⬇️
torch_geometric/transforms/__init__.py 100.00% <0.00%> (ø)
torch_geometric/nn/to_hetero_transformer.py 95.37% <0.00%> (ø)
...h_geometric/nn/to_hetero_with_bases_transformer.py 95.28% <0.00%> (ø)
torch_geometric/transforms/mask.py
torch_geometric/data/lightning_datamodule.py 49.12% <0.00%> (+6.33%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@EdisonLeeeee
Copy link
Contributor

EdisonLeeeee commented Sep 14, 2022

Personally, I like this feature! How about adding it as a basic method summary for the dataset class?

dataset = QM9("data/QM9")
s = dataset.summary()
print(s)

Not a strong preference, but I think it would be convenient for users.

@hatemhelal
Copy link
Contributor Author

Thanks @EdisonLeeeee I updated the patch with your suggestion and updated some of the tests. I found that the type annotation in Summary resulted in a circular module dependencies so I removed the type annotation for now.

I think I'd also like to add a minimum dataset size for showing the progress bar since it looks a bit odd in the unittests.

Happy to hear any other thoughts you have 🙏

@hatemhelal hatemhelal marked this pull request as ready for review September 15, 2022 10:32
@EdisonLeeeee
Copy link
Contributor

EdisonLeeeee commented Sep 15, 2022

Hi @hatemhelal

You've done an amazing job! Left some minor suggestions:

  • Would it support more statistics beyond nodes and edges in the summarized results? For example, density/sparsity, which can be simply calculated by
df['density'] = df['edges'] / df['nodes']**2
  • How about merging these properties *_num_nodes (e.g., min_num_nodes, max_num_nodes, etc) as one single method? In this way, you can get rid of implementing each of them whenever new statistics are added.

  • Do you think it is necessary to include some global statistics such as num_classes and num_features in the summarized results?

  • To improve the efficiency, we can add lru_cache for the summary method such that one can reuse the result without computing it again:

    @lru_cache(maxsize=1)
    def summary(self) -> Summary:

Copy link
Member

@rusty1s rusty1s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some modifications to the Summary to make use of data classes. Hope the changes are okay :)

@rusty1s
Copy link
Member

rusty1s commented Sep 16, 2022

@EdisonLeeeee These are all great suggestions to include into Summary. Happy to include them in a follow-up PR.

@rusty1s rusty1s enabled auto-merge (squash) September 16, 2022 14:48
@rusty1s rusty1s merged commit c2def91 into pyg-team:master Sep 16, 2022
@hatemhelal
Copy link
Contributor Author

All good @rusty1s, thanks @EdisonLeeeee for the kind words too!

@hatemhelal hatemhelal deleted the dataset-summary branch September 21, 2022 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants