AQSOL dataset #4626

Merged
merged 14 commits into from
May 15, 2022

vijaydwivedi75
Contributor

This PR adds the AQSOL dataset from the updated Benchmarking GNNs. It is based on AqSolDB, a standardized database of 9,982 molecular graphs with their aqueous solubility values, collected from 9 different data sources.

The aqueous solubility targets are collected from experimental measurements and standardized to LogS units in AqSolDB. These final values are the property to regress in the AQSOL dataset, similar to ZINC. This version of the dataset filters out a few graphs with no bonds/edges and a small number of graphs with missing node feature values.

The resulting dataset contains 9,823 molecular graphs. For each molecular graph, the node features are the types of heavy atoms and the edge features are the types of bonds between them, as in ZINC.

Dataset overview:

  • Task: Graph Regression
  • Size of Dataset: 9,982 molecules (9,823 after cleaning).
  • Split: Scaffold split (8:1:1) following the same code as OGB.
  • After cleaning: 7,831 train / 996 val / 996 test
  • Number of (unique) atoms: 65 (Dict below)
  • Number of (unique) bonds: 5 (Dict below)
  • Performance Metric: MAE, same as ZINC
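
The split and metric entries in the overview can be illustrated with a small sketch. This is not the OGB code and not the code in this PR: `scaffold_split` and `mean_absolute_error` are hypothetical helpers, and the scaffold of each molecule is assumed to be precomputed as a string (computing it would normally need a chemistry toolkit). The idea shown is only that molecules sharing a scaffold stay in the same split, with the largest scaffold groups filling the training set first.

```python
from collections import defaultdict


def scaffold_split(scaffolds, ratios=(8, 1, 1)):
    """Split molecule indices so that each scaffold lands in exactly one split.

    `scaffolds` is a list of precomputed scaffold strings, one per molecule
    (an assumption of this sketch). `ratios` gives train:val:test weights.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by first index to stay deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n, total = len(scaffolds), sum(ratios)
    train_cutoff = n * ratios[0] // total
    valid_cutoff = n * (ratios[0] + ratios[1]) // total

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


def mean_absolute_error(preds, targets):
    """MAE between predicted and reference LogS values."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


# Toy data: four scaffolds of sizes 5, 3, 1 and 1.
toy_scaffolds = ['a'] * 5 + ['b'] * 3 + ['c'] + ['d']
train_idx, valid_idx, test_idx = scaffold_split(toy_scaffolds)
```

Because whole scaffold groups are assigned at once, the realized split sizes only approximate 8:1:1 on real data.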
Atom Dict: {'Br': 0, 'C': 1, 'N': 2, 'O': 3, 'Cl': 4, 'Zn': 5, 'F': 6, 'P': 7, 'S': 8, 'Na': 9,
'Al': 10, 'Si': 11, 'Mo': 12, 'Ca': 13, 'W': 14, 'Pb': 15, 'B': 16, 'V': 17, 'Co': 18,
'Mg': 19, 'Bi': 20, 'Fe': 21, 'Ba': 22, 'K': 23, 'Ti': 24, 'Sn': 25, 'Cd': 26, 'I': 27,
'Re': 28, 'Sr': 29, 'H': 30, 'Cu': 31, 'Ni': 32, 'Lu': 33, 'Pr': 34, 'Te': 35, 'Ce': 36,
'Nd': 37, 'Gd': 38, 'Zr': 39, 'Mn': 40, 'As': 41, 'Hg': 42, 'Sb': 43, 'Cr': 44, 'Se': 45,
'La': 46, 'Dy': 47, 'Y': 48, 'Pd': 49, 'Ag': 50, 'In': 51, 'Li': 52, 'Rh': 53, 'Nb': 54,
'Hf': 55, 'Cs': 56, 'Ru': 57, 'Au': 58, 'Sm': 59, 'Ta': 60, 'Pt': 61, 'Ir': 62, 'Be': 63, 'Ge': 64}
    
Bond Dict: {'NONE': 0, 'SINGLE': 1, 'DOUBLE': 2, 'AROMATIC': 3, 'TRIPLE': 4}
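
As a rough illustration of how the two dictionaries turn a molecule into integer features, here is a minimal sketch. `encode_molecule` is a hypothetical helper, only the first five Atom Dict entries are copied in, and storing each bond in both directions is an assumption of this sketch rather than something stated in the PR.

```python
# Subset of the Atom Dict above (first five entries) and the full Bond Dict.
ATOM_DICT = {'Br': 0, 'C': 1, 'N': 2, 'O': 3, 'Cl': 4}
BOND_DICT = {'NONE': 0, 'SINGLE': 1, 'DOUBLE': 2, 'AROMATIC': 3, 'TRIPLE': 4}


def encode_molecule(atoms, bonds):
    """Encode atoms and bonds as integer features.

    atoms: list of element symbols, one per node.
    bonds: list of (i, j, bond_type) triples.
    Returns (x, edge_attr, edge_index) as plain Python lists.
    """
    x = [[ATOM_DICT[symbol]] for symbol in atoms]  # shape [num_nodes, 1]
    src, dst, edge_attr = [], [], []
    for i, j, bond_type in bonds:
        # Assumption: every bond is stored once per direction.
        for a, b in ((i, j), (j, i)):
            src.append(a)
            dst.append(b)
            edge_attr.append(BOND_DICT[bond_type])
    return x, edge_attr, [src, dst]


# Ethanol heavy atoms: C-C-O with two single bonds.
x, edge_attr, edge_index = encode_molecule(
    ['C', 'C', 'O'],
    [(0, 1, 'SINGLE'), (1, 2, 'SINGLE')],
)
```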

@codecov

codecov bot commented May 12, 2022

Codecov Report

Merging #4626 (3bd43ab) into master (c20f8df) will decrease coverage by 0.04%.
The diff coverage is n/a.

❗ Current head 3bd43ab differs from pull request most recent head 48562fa. Consider uploading reports for the commit 48562fa to get more accurate results

@@            Coverage Diff             @@
##           master    #4626      +/-   ##
==========================================
- Coverage   82.93%   82.88%   -0.05%     
==========================================
  Files         316      316              
  Lines       16750    16677      -73     
==========================================
- Hits        13891    13823      -68     
+ Misses       2859     2854       -5     
Impacted Files                               Coverage Δ
torch_geometric/graphgym/logger.py           79.59% <0.00%> (-2.28%) ⬇️
torch_geometric/graphgym/model_builder.py    97.05% <0.00%> (-1.13%) ⬇️
torch_geometric/loader/utils.py              77.77% <0.00%> (-0.49%) ⬇️
torch_geometric/data/hetero_data.py          93.84% <0.00%> (-0.09%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

        os.unlink(path)

    def process(self):
        for split in ['train', 'val', 'test']:

I guess you don't really need to process the splits you're not going to load - but maybe it's also worth just doing it ahead of time.

    def process(self):
        for split in ['train', 'val', 'test']:
            with open(osp.join(self.raw_dir, f'{split}.pickle'), 'rb') as f:
                graphs = pickle.load(f)

I feel loading pickles is a bit dangerous in general (it doesn't work well across Python versions) - do you know if the data is available in a non-pickle format?
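
For illustration only, one possible non-pickle format would be JSON. The sketch below is not part of this PR: `graphs_to_json` and `graphs_from_json` are hypothetical helpers, and it assumes each graph is a tuple (x, edge_attr, edge_index, y) whose entries have already been converted to plain nested Python lists.

```python
import json


def graphs_to_json(graphs):
    """Serialize a list of (x, edge_attr, edge_index, y) tuples to JSON text."""
    records = [
        {"x": x, "edge_attr": edge_attr, "edge_index": edge_index, "y": y}
        for x, edge_attr, edge_index, y in graphs
    ]
    return json.dumps(records)


def graphs_from_json(text):
    """Rebuild the list of graph tuples from JSON text."""
    return [
        (r["x"], r["edge_attr"], r["edge_index"], r["y"])
        for r in json.loads(text)
    ]


# Round trip on one toy graph with three nodes and four directed edges.
toy_graphs = [([[1], [1], [3]], [1, 1, 1, 1], [[0, 1, 1, 2], [1, 0, 2, 1]], [-0.5])]
restored = graphs_from_json(graphs_to_json(toy_graphs))
```

JSON text is independent of the Python version and cannot execute code on load, at the cost of larger files and an extra conversion step for array data.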

@Padarn
Contributor

Padarn commented May 14, 2022

Other than a slight preference not to load pickled data (which maybe we cannot avoid here), this looks good to me!

@vijaydwivedi75
Contributor Author

Thanks @Padarn for your comments. This PR follows how the ZINC dataset is maintained, which also loads from pickle files.

The pickle file in AQSOL (this PR) is a list of graph objects, each of which is a tuple holding the graph information (the tuple layout is documented in the code comments and repeated below).

# Each `graph` is a tuple (x, edge_attr, edge_index, y)
#     Shape of x : [num_nodes, 1]
#     Shape of edge_attr : [num_edges]
#     Shape of edge_index : [2, num_edges]
#     Shape of y : [1]
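
To make the tuple layout concrete, here is a minimal sketch that checks the shapes listed above and drops graphs with no edges or with edges pointing at nodes that have no feature row. It uses plain nested lists instead of arrays or tensors, `is_valid_graph` is a hypothetical helper, and the "missing node feature" check is one interpretation of the filtering described in this PR, not its actual code.

```python
def is_valid_graph(graph):
    """Check one (x, edge_attr, edge_index, y) tuple given as nested lists."""
    x, edge_attr, edge_index, y = graph
    if len(edge_index) != 2:              # edge_index: [2, num_edges]
        return False
    src, dst = edge_index
    num_edges = len(src)
    if num_edges == 0:                    # skip graphs with no bonds/edges
        return False
    if len(dst) != num_edges or len(edge_attr) != num_edges:
        return False
    if len(y) != 1 or any(len(row) != 1 for row in x):
        return False                      # y: [1], x: [num_nodes, 1]
    # Assumed meaning of "missing node features": an edge refers to a node
    # index that has no row in x.
    highest = max(max(src), max(dst))
    lowest = min(min(src), min(dst))
    return lowest >= 0 and highest < len(x)


good = ([[1], [1], [3]], [1, 1, 1, 1], [[0, 1, 1, 2], [1, 0, 2, 1]], [-0.5])
no_edges = ([[1]], [], [[], []], [0.0])
missing_node = ([[1], [1]], [1, 1], [[0, 2], [2, 0]], [0.0])

kept = [g for g in (good, no_edges, missing_node) if is_valid_graph(g)]
```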

@rusty1s
Member

rusty1s commented May 15, 2022

Thank you!

@rusty1s rusty1s merged commit 90fa81d into pyg-team:master May 15, 2022
@vijaydwivedi75 vijaydwivedi75 deleted the aqsol branch May 15, 2022 07:21