AQSOL dataset #4626

vijaydwivedi75 · 2022-05-12T03:51:05Z

This PR adds the AQSOL dataset from the updated Benchmarking GNNs. It is based on AqSolDB, a standardized database of 9,982 molecular graphs with their aqueous solubility values, collected from 9 different data sources.

The aqueous solubility targets are collected from experimental measurements and standardized to LogS units in AqSolDB. These final values are the property to regress in the AQSOL dataset, similar to ZINC. This version of the dataset filters out a few graphs with no bonds/edges and a small number of graphs with missing node feature values.

After filtering, 9,823 molecular graphs remain. For each molecular graph, the node features are the types of heavy atoms and the edge features are the types of bonds between them, as in ZINC.

Dataset overview:

Task: Graph Regression
Size of Dataset: 9,982 molecules (9,823 after cleaning).
Split: Scaffold split (8:1:1) following the same code as OGB.
After cleaning: 7,831 train / 996 val / 996 test
Number of (unique) atoms: 65 (Dict below)
Number of (unique) bonds: 5 (Dict below)
Performance Metric: MAE, same as ZINC
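
As a quick sanity check on the numbers above, the split sizes sum to the 9,823 cleaned graphs and come close to the 8:1:1 ratio. The sketch below also spells out MAE for the regression target. The function name `mean_absolute_error` and the sample LogS values are illustrative assumptions, not code or data from this PR.

```python
# Split sizes as reported in this PR description (after cleaning).
split_sizes = {"train": 7831, "val": 996, "test": 996}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}


def mean_absolute_error(predictions, targets):
    """Mean absolute error between two equal-length sequences of floats."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot compute MAE of empty sequences")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


print(total)  # 9823
print({name: round(frac, 3) for name, frac in fractions.items()})
# Made-up LogS values, used only to exercise the function.
print(mean_absolute_error([-2.0, -3.5, 0.5], [-2.5, -3.0, 0.0]))  # 0.5
```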

Atom Dict: {'Br': 0, 'C': 1, 'N': 2, 'O': 3, 'Cl': 4, 'Zn': 5, 'F': 6, 'P': 7, 'S': 8, 'Na': 9,
'Al': 10, 'Si': 11, 'Mo': 12, 'Ca': 13, 'W': 14, 'Pb': 15, 'B': 16, 'V': 17, 'Co': 18,
'Mg': 19, 'Bi': 20, 'Fe': 21, 'Ba': 22, 'K': 23, 'Ti': 24, 'Sn': 25, 'Cd': 26, 'I': 27,
'Re': 28, 'Sr': 29, 'H': 30, 'Cu': 31, 'Ni': 32, 'Lu': 33, 'Pr': 34, 'Te': 35, 'Ce': 36,
'Nd': 37, 'Gd': 38, 'Zr': 39, 'Mn': 40, 'As': 41, 'Hg': 42, 'Sb': 43, 'Cr': 44, 'Se': 45,
'La': 46, 'Dy': 47, 'Y': 48, 'Pd': 49, 'Ag': 50, 'In': 51, 'Li': 52, 'Rh': 53, 'Nb': 54,
'Hf': 55, 'Cs': 56, 'Ru': 57, 'Au': 58, 'Sm': 59, 'Ta': 60, 'Pt': 61, 'Ir': 62, 'Be': 63, 'Ge': 64}
    
Bond Dict: {'NONE': 0, 'SINGLE': 1, 'DOUBLE': 2, 'AROMATIC': 3, 'TRIPLE': 4}
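
To illustrate how the integer codes in the two dictionaries map back to symbols, here is a minimal sketch. Only the first few entries of the Atom Dict are copied for brevity, and the helpers `invert` and `decode` as well as the sample molecule encoding are hypothetical, not part of the PR.

```python
# A few entries copied from the Atom Dict above (the full dict has 65 entries).
ATOM_DICT = {"Br": 0, "C": 1, "N": 2, "O": 3, "Cl": 4}
# Bond Dict copied in full from the PR description.
BOND_DICT = {"NONE": 0, "SINGLE": 1, "DOUBLE": 2, "AROMATIC": 3, "TRIPLE": 4}


def invert(mapping):
    """Build the index -> name lookup for a name -> index dictionary."""
    inverse = {index: name for name, index in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("indices are not unique")
    return inverse


def decode(indices, mapping):
    """Translate a list of integer codes back to their names."""
    inverse = invert(mapping)
    return [inverse[i] for i in indices]


# Hypothetical encoding of the heavy atoms of acetamide (C, C, O, N)
# with bonds C-C (single), C=O (double), C-N (single).
atoms = decode([1, 1, 3, 2], ATOM_DICT)
bonds = decode([1, 2, 1], BOND_DICT)
print(atoms)  # ['C', 'C', 'O', 'N']
print(bonds)  # ['SINGLE', 'DOUBLE', 'SINGLE']
```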

codecov · 2022-05-12T03:53:57Z

Codecov Report

Merging #4626 (3bd43ab) into master (c20f8df) will decrease coverage by 0.04%.
The diff coverage is n/a.

❗ Current head 3bd43ab differs from pull request most recent head 48562fa. Consider uploading reports for the commit 48562fa to get more accurate results

@@            Coverage Diff             @@
##           master    #4626      +/-   ##
==========================================
- Coverage   82.93%   82.88%   -0.05%     
==========================================
  Files         316      316              
  Lines       16750    16677      -73     
==========================================
- Hits        13891    13823      -68     
+ Misses       2859     2854       -5

| Impacted Files | Coverage Δ | |
|---|---|---|
| torch_geometric/graphgym/logger.py | `79.59% <0.00%> (-2.28%)` | ⬇️ |
| torch_geometric/graphgym/model_builder.py | `97.05% <0.00%> (-1.13%)` | ⬇️ |
| torch_geometric/loader/utils.py | `77.77% <0.00%> (-0.49%)` | ⬇️ |
| torch_geometric/data/hetero_data.py | `93.84% <0.00%> (-0.09%)` | ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Padarn · 2022-05-14T00:02:13Z

torch_geometric/datasets/aqsol.py

+        os.unlink(path)
+
+    def process(self):
+        for split in ['train', 'val', 'test']:


I guess you don't really need to process the splits you're not going to load, but maybe it's worth just doing it all ahead of time.
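
The trade-off raised here can be sketched as follows: processing can be restricted to the requested split, while a later call picks up whatever is still missing. This is only an illustration under assumptions. The function `process_splits`, the `.pkl` output naming, and the toy data are hypothetical and are not how the PR's `process()` is written.

```python
import os
import pickle
import tempfile

SPLITS = ("train", "val", "test")


def process_splits(raw_dir, processed_dir, splits=SPLITS):
    """Convert raw pickles into processed files, skipping those already done.

    Returns the list of splits that were actually processed in this call.
    """
    os.makedirs(processed_dir, exist_ok=True)
    done = []
    for split in splits:
        out_path = os.path.join(processed_dir, f"{split}.pkl")
        if os.path.exists(out_path):
            continue  # already processed by an earlier call
        with open(os.path.join(raw_dir, f"{split}.pickle"), "rb") as f:
            graphs = pickle.load(f)
        # Placeholder for the real per-graph conversion step.
        with open(out_path, "wb") as f:
            pickle.dump(list(graphs), f)
        done.append(split)
    return done


def demo():
    """Process one split lazily, then the remaining ones eagerly."""
    with tempfile.TemporaryDirectory() as root:
        raw_dir = os.path.join(root, "raw")
        processed_dir = os.path.join(root, "processed")
        os.makedirs(raw_dir)
        for split in SPLITS:
            with open(os.path.join(raw_dir, f"{split}.pickle"), "wb") as f:
                pickle.dump([("toy graph", split)], f)
        lazy = process_splits(raw_dir, processed_dir, splits=("train",))
        eager = process_splits(raw_dir, processed_dir)
        return lazy, eager


print(demo())  # (['train'], ['val', 'test'])
```

Processing everything up front costs more on first use but keeps later split switches cheap, which is the "ahead of time" option mentioned in the comment.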

Padarn · 2022-05-14T00:03:07Z

torch_geometric/datasets/aqsol.py

+    def process(self):
+        for split in ['train', 'val', 'test']:
+            with open(osp.join(self.raw_dir, f'{split}.pickle'), 'rb') as f:
+                graphs = pickle.load(f)


I feel loading pickles is a bit dangerous in general (it doesn't work well across Python versions). Do you know if the data is available in a non-pickle format?
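
One way to avoid shipping pickles would be a one-off conversion to a plain-text format such as JSON. The sketch below assumes the `(x, edge_attr, edge_index, y)` tuple layout described later in this thread and assumes the values are plain Python lists. If the real files hold arrays or tensors, they would first need converting (for example with `.tolist()`). The function names are hypothetical.

```python
import json
import os
import pickle
import tempfile


def pickle_to_json(pickle_path, json_path):
    """Re-save a pickled list of (x, edge_attr, edge_index, y) tuples as JSON."""
    with open(pickle_path, "rb") as f:
        # Unpickling can execute arbitrary code, so only load trusted files.
        graphs = pickle.load(f)
    records = [
        {"x": x, "edge_attr": edge_attr, "edge_index": edge_index, "y": y}
        for x, edge_attr, edge_index, y in graphs
    ]
    with open(json_path, "w") as f:
        json.dump(records, f)
    return len(records)


def load_json_graphs(json_path):
    """Load the JSON records written by pickle_to_json."""
    with open(json_path) as f:
        return json.load(f)


def roundtrip_demo():
    """Write a toy pickle, convert it, and read the JSON back."""
    toy = [([[1], [3]], [1, 1], [[0, 1], [1, 0]], [-0.5])]  # made-up graph
    with tempfile.TemporaryDirectory() as root:
        pickle_path = os.path.join(root, "toy.pickle")
        json_path = os.path.join(root, "toy.json")
        with open(pickle_path, "wb") as f:
            pickle.dump(toy, f)
        count = pickle_to_json(pickle_path, json_path)
        return count, load_json_graphs(json_path)


print(roundtrip_demo())
```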

Padarn · 2022-05-14T00:05:20Z

Other than a slight preference not to load pickled data (which maybe we cannot avoid here), this looks good to me!

vijaydwivedi75 · 2022-05-14T20:39:46Z

Thanks @Padarn for your comments. The PR is similar to how the ZINC dataset is maintained, which also loads from pickle files.

The pickle file in AQSOL (this PR) is a list of graph objects, each of which is a tuple holding the graph information (the tuple layout is described in the code comments and repeated below).

# Each `graph` is a tuple (x, edge_attr, edge_index, y)
#     Shape of x : [num_nodes, 1]
#     Shape of edge_attr : [num_edges]
#     Shape of edge_index : [2, num_edges]
#     Shape of y : [1]
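
The tuple layout above can be checked with a small validator. This is a sketch that treats each field as nested Python lists. The function `check_graph`, the toy molecule, and the choice to store each bond in both directions are assumptions made for illustration and are not confirmed by the PR.

```python
def check_graph(graph):
    """Validate one (x, edge_attr, edge_index, y) tuple given as nested lists.

    Returns (num_nodes, num_edges) if all shapes are consistent.
    """
    x, edge_attr, edge_index, y = graph
    num_nodes = len(x)
    num_edges = len(edge_attr)
    if any(len(row) != 1 for row in x):
        raise ValueError("x must have shape [num_nodes, 1]")
    if len(edge_index) != 2 or any(len(row) != num_edges for row in edge_index):
        raise ValueError("edge_index must have shape [2, num_edges]")
    if len(y) != 1:
        raise ValueError("y must have shape [1]")
    if any(not 0 <= node < num_nodes for row in edge_index for node in row):
        raise ValueError("edge_index refers to a node that does not exist")
    return num_nodes, num_edges


# Toy graph: heavy atoms C, C, O (codes 1, 1, 3 in the Atom Dict) joined by
# single bonds (code 1), each bond listed in both directions (an assumption).
toy_graph = (
    [[1], [1], [3]],               # x
    [1, 1, 1, 1],                  # edge_attr
    [[0, 1, 1, 2], [1, 0, 2, 1]],  # edge_index
    [0.0],                         # y, a made-up LogS value
)
print(check_graph(toy_graph))  # (3, 4)
```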

rusty1s · 2022-05-15T05:06:07Z

Thank you!

vijaydwivedi75 and others added 3 commits May 12, 2022 11:40

AQSOL dataset

42bb44d

minor edit

bccb130

[pre-commit.ci] auto fixes from pre-commit.com hooks

d5e2a64

vijaydwivedi75 and others added 6 commits May 12, 2022 12:02

changelog

277298e

formatting

0c5a4ff

[pre-commit.ci] auto fixes from pre-commit.com hooks

1822d34

minor doc formatting

3cc93d6

[pre-commit.ci] auto fixes from pre-commit.com hooks

cd9e804

doc formatting

3bd43ab

Padarn reviewed May 14, 2022

update

fb01c7b

rusty1s approved these changes May 15, 2022

rusty1s and others added 4 commits May 14, 2022 22:04

Merge branch 'master' into aqsol

69a2c2e

update

b0d9d9e

Merge branch 'aqsol' of github.com:vijaydwivedi75/pytorch_geometric into aqsol

edd3fd7

[pre-commit.ci] auto fixes from pre-commit.com hooks

48562fa

rusty1s assigned vijaydwivedi75 May 15, 2022

rusty1s added the "feature", "0 - Priority P0", and "dataset" labels May 15, 2022

rusty1s merged commit 90fa81d into pyg-team:master May 15, 2022

vijaydwivedi75 deleted the aqsol branch May 15, 2022 07:21
