AQSOL dataset #4626
for more information, see https://pre-commit.ci
Codecov Report
@@            Coverage Diff             @@
##           master    #4626      +/-   ##
==========================================
- Coverage   82.93%   82.88%   -0.05%
==========================================
  Files         316      316
  Lines       16750    16677      -73
==========================================
- Hits        13891    13823      -68
+ Misses       2859     2854       -5
Continue to review full report at Codecov.
torch_geometric/datasets/aqsol.py
        os.unlink(path)

    def process(self):
        for split in ['train', 'val', 'test']:
I guess you don't really need to process the splits you're not going to load, but maybe it's worth just doing it ahead of time.
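The trade-off raised in this comment can be sketched without PyG. This is a minimal stand-alone illustration, not the code in this PR: `process_split`, `process_all`, and `load_split` are hypothetical helpers, and the raw files are stand-in pickles written to a temporary directory.

```python
import os
import os.path as osp
import pickle
import tempfile

SPLITS = ['train', 'val', 'test']


def process_split(raw_dir, processed_dir, split):
    """Convert one raw split into its processed file (hypothetical helper)."""
    with open(osp.join(raw_dir, f'{split}.pickle'), 'rb') as f:
        graphs = pickle.load(f)
    out_path = osp.join(processed_dir, f'{split}.pickle')
    with open(out_path, 'wb') as f:
        pickle.dump(graphs, f)
    return out_path


def process_all(raw_dir, processed_dir):
    """Eager variant: process every split up front."""
    return [process_split(raw_dir, processed_dir, s) for s in SPLITS]


def load_split(raw_dir, processed_dir, split):
    """Lazy variant: process a split only when it is first requested."""
    path = osp.join(processed_dir, f'{split}.pickle')
    if not osp.exists(path):
        process_split(raw_dir, processed_dir, split)
    with open(path, 'rb') as f:
        return pickle.load(f)


def demo():
    with tempfile.TemporaryDirectory() as root:
        raw_dir = osp.join(root, 'raw')
        processed_dir = osp.join(root, 'processed')
        os.makedirs(raw_dir)
        os.makedirs(processed_dir)
        # Stand-in raw data: one tiny pickle per split.
        for split in SPLITS:
            with open(osp.join(raw_dir, f'{split}.pickle'), 'wb') as f:
                pickle.dump([split], f)
        loaded = load_split(raw_dir, processed_dir, 'val')
        lazy_files = sorted(os.listdir(processed_dir))
        process_all(raw_dir, processed_dir)
        eager_files = sorted(os.listdir(processed_dir))
    return loaded, lazy_files, eager_files
```

The lazy variant touches only the requested split, while the eager variant pays the full cost once so that every later load is a plain file read.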
    def process(self):
        for split in ['train', 'val', 'test']:
            with open(osp.join(self.raw_dir, f'{split}.pickle'), 'rb') as f:
                graphs = pickle.load(f)
I feel loading pickles is a bit dangerous in general (they don't work well across Python versions). Do you know if the data is available in a non-pickle format?
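For illustration, a one-off conversion from pickle to a plain-text format could look like the sketch below. It assumes each record holds only plain lists and numbers (NumPy arrays would first need converting, e.g. with `.tolist()`). The record layout and the `pickle_to_json` and `load_json` helpers are hypothetical, not part of this PR. Note that unpickling can also execute arbitrary code, so it should only be done on files from a trusted source.

```python
import json
import os.path as osp
import pickle
import tempfile


def pickle_to_json(pickle_path, json_path):
    """Convert a trusted pickle of plain Python containers to JSON."""
    with open(pickle_path, 'rb') as f:
        graphs = pickle.load(f)  # only safe for files from a trusted source
    with open(json_path, 'w') as f:
        json.dump(graphs, f)


def load_json(json_path):
    """Load the converted data; no code is executed during parsing."""
    with open(json_path) as f:
        return json.load(f)


def demo():
    # Hypothetical graph record: (node types, edge list, target value).
    graphs = [([0, 0, 1], [[0, 1], [1, 2]], -0.77)]
    with tempfile.TemporaryDirectory() as root:
        pickle_path = osp.join(root, 'train.pickle')
        json_path = osp.join(root, 'train.json')
        with open(pickle_path, 'wb') as f:
            pickle.dump(graphs, f)
        pickle_to_json(pickle_path, json_path)
        return load_json(json_path)
```

JSON has no tuple type, so tuples come back as lists after the round trip; downstream code would need to index records by position rather than rely on the tuple type.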
Other than a slight preference not to load pickled data (which maybe we cannot avoid here), this looks good to me!
Thanks @Padarn for your comments. The PR follows how the ZINC dataset is maintained, which also loads from pickle files. The pickle file in AQSOL (this PR) is a list of graph objects, each of which is a tuple of the graph info (the tuple information is mentioned in the code comments as well as below).
Thank you!
This PR adds the AQSOL dataset from the updated Benchmarking GNNs. It is based on AqSolDB, a standardized database of 9,982 molecular graphs with their aqueous solubility values, collected from 9 different data sources.
The aqueous solubility targets are collected from experimental measurements and standardized to LogS units in AqSolDB. These final values are the property to regress in the AQSOL dataset, similar to ZINC. This version of the dataset filters out a few graphs with no bonds/edges and a small number of graphs with missing node feature values.
The resulting dataset contains 9,823 molecular graphs. For each molecular graph, the node features are the types of heavy atoms and the edge features are the types of bonds between them, as in ZINC.
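As a rough illustration of this representation, the sketch below encodes a toy molecule as integer atom types, a two-row edge list, and integer bond types, and applies the kind of filtering described above. The vocabularies, the `encode_molecule` and `keep_graph` helpers, the dictionary layout, and the target value are all hypothetical placeholders. Storing each bond as two directed edges is an assumption that follows the usual PyG convention for undirected graphs.

```python
# Hypothetical vocabularies; the real AQSOL dictionaries are not shown here.
ATOM_TYPES = {'C': 0, 'O': 1, 'N': 2}
BOND_TYPES = {'SINGLE': 0, 'DOUBLE': 1}


def encode_molecule(atoms, bonds, log_s):
    """Encode heavy atoms and bonds as integer features plus a LogS target."""
    x = [ATOM_TYPES[a] for a in atoms]
    sources, targets, edge_attr = [], [], []
    for i, j, bond in bonds:
        # Store each undirected bond as two directed edges (assumption).
        sources += [i, j]
        targets += [j, i]
        edge_attr += [BOND_TYPES[bond]] * 2
    return {
        'x': x,
        'edge_index': [sources, targets],
        'edge_attr': edge_attr,
        'y': log_s,
    }


def keep_graph(graph):
    """Drop graphs with no edges or with missing node feature values."""
    has_edges = len(graph['edge_attr']) > 0
    has_features = all(v is not None for v in graph['x'])
    return has_edges and has_features


# Ethanol has heavy atoms C-C-O; the target 0.0 is a placeholder, not a measurement.
ethanol = encode_molecule(['C', 'C', 'O'],
                          [(0, 1, 'SINGLE'), (1, 2, 'SINGLE')], 0.0)
# A single heavy atom has no bonds, so this graph would be filtered out.
methane = encode_molecule(['C'], [], 0.0)
kept = [g for g in (ethanol, methane) if keep_graph(g)]
```

In this toy run only the ethanol graph survives the filter, since the single-atom graph has no edges.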
Dataset overview: