Add Summary for PyG datasets #5438
Codecov Report
@@ Coverage Diff @@
## master #5438 +/- ##
==========================================
+ Coverage 83.33% 83.44% +0.11%
==========================================
Files 348 348
Lines 18939 18914 -25
==========================================
+ Hits 15782 15783 +1
+ Misses 3157 3131 -26
9e186ac to 9b1a555
Personally, I like this feature! How about adding it as a basic method:
dataset = QM9("data/QM9")
s = dataset.summary()
print(s)
Not a strong preference, but I think it would be convenient for users.
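The method-style interface suggested above could look roughly like the following minimal sketch. `ToyDataset`, the fields of this `Summary`, and the `from_dataset` constructor are hypothetical stand-ins chosen for illustration; they are not the PR's actual implementation, and graphs are reduced to `(num_nodes, num_edges)` pairs so the example runs without PyG.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class Summary:
    """Hypothetical container for per-dataset statistics."""
    name: str
    num_graphs: int
    mean_nodes: float
    mean_edges: float

    @classmethod
    def from_dataset(cls, dataset: "ToyDataset") -> "Summary":
        # Collect the statistics in a single place so the dataset
        # method below stays a one-line delegation.
        return cls(
            name=type(dataset).__name__,
            num_graphs=len(dataset.graphs),
            mean_nodes=mean(n for n, _ in dataset.graphs),
            mean_edges=mean(e for _, e in dataset.graphs),
        )


class ToyDataset:
    """Stand-in for a dataset; each graph is (num_nodes, num_edges)."""

    def __init__(self, graphs: List[Tuple[int, int]]):
        self.graphs = graphs

    def summary(self) -> Summary:
        # The convenience method users would call: dataset.summary()
        return Summary.from_dataset(self)


s = ToyDataset([(3, 4), (5, 8)]).summary()
print(s)
```

Keeping the statistics logic in the summary class and exposing only a thin method on the dataset means the standalone class remains usable on its own.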
9b1a555 to a0a6086
Thanks @EdisonLeeeee, I updated the patch with your suggestion and updated some of the tests. I found that the type annotation in
I think I'd also like to add a minimum dataset size for showing the progress bar, since it looks a bit odd in the unit tests. Happy to hear any other thoughts you have 🙏
583cad5 to ef98779
Hi @hatemhelal, you've done an amazing job! Left some minor suggestions:
df['density'] = df['edges'] / df['nodes']**2
@lru_cache(maxsize=1)
def summary(self) -> Summary:
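To illustrate the two suggestions together, here is a minimal sketch that computes the same `edges / nodes**2` density ratio and caches the result of `summary()` with `lru_cache`. `GraphStats` and `ToyDataset` are hypothetical names, and plain Python objects stand in for the pandas DataFrame used in the suggestion.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import List, Tuple


@dataclass(frozen=True)
class GraphStats:
    """Hypothetical per-graph record."""
    nodes: int
    edges: int

    @property
    def density(self) -> float:
        # Same ratio as the suggested df['edges'] / df['nodes']**2.
        return self.edges / self.nodes ** 2


class ToyDataset:
    """Stand-in for a dataset; each graph is (num_nodes, num_edges)."""

    def __init__(self, graphs: List[Tuple[int, int]]):
        self.graphs = graphs
        self.passes = 0  # counts how often the statistics are recomputed

    @lru_cache(maxsize=1)
    def summary(self) -> Tuple[GraphStats, ...]:
        # The cache key is `self`, so a repeated call on the same
        # instance returns the stored result without another pass.
        self.passes += 1
        return tuple(GraphStats(n, e) for n, e in self.graphs)


ds = ToyDataset([(4, 8), (5, 10)])
first = ds.summary()
second = ds.summary()  # served from the cache
print(ds.passes, [g.density for g in first])
```

One caveat with this pattern: the cache lives on the function rather than the instance, so with `maxsize=1` calling `summary()` on a second dataset evicts the first result, and the cache keeps a reference to the last dataset it saw.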
Made some modifications to the Summary to make use of data classes. Hope the changes are okay :)
@EdisonLeeeee These are all great suggestions to include into
All good @rusty1s, thanks @EdisonLeeeee for the kind words too!
This PR adds the ability to collect summary statistics for PyG datasets. It is initially implemented as a standalone class, and I'm posting it here for feedback. Various questions I'm thinking about:
torch_geometric.data.Dataset interface?
pandas and tabulate packages? I see them in the full requirements and could change the implementation to lazy import / error if those packages aren't available.
Here is a quick demo on QM9:
outputs:
Can also query properties on the Summary object:
outputs
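The "lazy import / error" option raised in the description could be sketched as below. `lazy_import` is a hypothetical helper written for illustration, not a function from the PR or from PyG; it defers the import until the feature is used and raises a descriptive error when the optional package is missing.

```python
import importlib
from types import ModuleType


def lazy_import(name: str, feature: str) -> ModuleType:
    """Import an optional dependency only when `feature` needs it."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Re-raise with a message that tells the user which feature
        # needs the package and how to install it.
        raise ImportError(
            f"{feature} requires the optional package '{name}'; "
            f"install it with `pip install {name}`"
        ) from exc


# Usage: a standard-library module stands in for pandas or tabulate here.
json_module = lazy_import("json", "demo")
print(json_module.dumps([1]))
```

With this approach, importing the dataset module itself never fails; only a call that actually needs the optional packages would raise.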