AQSOL dataset #4626
Merged
14 commits
42bb44d  AQSOL dataset  (vijaydwivedi75)
bccb130  minor edit  (vijaydwivedi75)
d5e2a64  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot])
277298e  changelog  (vijaydwivedi75)
0c5a4ff  formatting  (vijaydwivedi75)
1822d34  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot])
3cc93d6  minor doc formatting  (vijaydwivedi75)
cd9e804  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot])
3bd43ab  doc formatting  (vijaydwivedi75)
fb01c7b  update  (rusty1s)
69a2c2e  Merge branch 'master' into aqsol  (rusty1s)
b0d9d9e  update  (rusty1s)
edd3fd7  Merge branch 'aqsol' of github.com:vijaydwivedi75/pytorch_geometric i…  (rusty1s)
48562fa  [pre-commit.ci] auto fixes from pre-commit.com hooks  (pre-commit-ci[bot])
@@ -0,0 +1,122 @@
import os
import os.path as osp
import pickle
import shutil
from typing import Callable, List, Optional

import torch

from torch_geometric.data import (
    Data,
    InMemoryDataset,
    download_url,
    extract_zip,
)


class AQSOL(InMemoryDataset):
    r"""The AQSOL dataset from the `Benchmarking Graph Neural Networks
    <http://arxiv.org/abs/2003.00982>`_ paper based on
    `AqSolDB <https://www.nature.com/articles/s41597-019-0151-1>`_, a
    standardized database of 9,982 molecular graphs with their aqueous
    solubility values, collected from 9 different data sources.

    The aqueous solubility targets are collected from experimental measurements
    and standardized to LogS units in AqSolDB. These final values denote the
    property to regress in the :class:`AQSOL` dataset. After filtering out a
    few graphs with no bonds/edges, the total number of molecular graphs is
    9,833. For each molecular graph, the node features are the types of heavy
    atoms and the edge features are the types of bonds between them, similar
    to the :class:`~torch_geometric.datasets.ZINC` dataset.

    Args:
        root (string): Root directory where the dataset should be saved.
        split (string, optional): If :obj:`"train"`, loads the training
            dataset.
            If :obj:`"val"`, loads the validation dataset.
            If :obj:`"test"`, loads the test dataset.
            (default: :obj:`"train"`)
        transform (callable, optional): A function/transform that takes in an
            :obj:`torch_geometric.data.Data` object and returns a transformed
            version. The data object will be transformed before every access.
            (default: :obj:`None`)
        pre_transform (callable, optional): A function/transform that takes in
            an :obj:`torch_geometric.data.Data` object and returns a
            transformed version. The data object will be transformed before
            being saved to disk. (default: :obj:`None`)
        pre_filter (callable, optional): A function that takes in an
            :obj:`torch_geometric.data.Data` object and returns a boolean
            value, indicating whether the data object should be included in
            the final dataset. (default: :obj:`None`)
    """
    url = 'https://www.dropbox.com/s/lzu9lmukwov12kt/aqsol_graph_raw.zip?dl=1'

    def __init__(self, root: str, split: str = 'train',
                 transform: Optional[Callable] = None,
                 pre_transform: Optional[Callable] = None,
                 pre_filter: Optional[Callable] = None):
        assert split in ['train', 'val', 'test']
        super().__init__(root, transform, pre_transform, pre_filter)
        path = osp.join(self.processed_dir, f'{split}.pt')
        self.data, self.slices = torch.load(path)

    @property
    def raw_file_names(self) -> List[str]:
        return [
            'train.pickle', 'val.pickle', 'test.pickle', 'atom_dict.pickle',
            'bond_dict.pickle'
        ]

    @property
    def processed_file_names(self) -> List[str]:
        return ['train.pt', 'val.pt', 'test.pt']

    def download(self):
        shutil.rmtree(self.raw_dir)
        path = download_url(self.url, self.root)
        extract_zip(path, self.root)
        os.rename(osp.join(self.root, 'asqol_graph_raw'), self.raw_dir)
        os.unlink(path)

    def process(self):
        for raw_path, path in zip(self.raw_paths, self.processed_paths):
            with open(raw_path, 'rb') as f:
                graphs = pickle.load(f)

            data_list: List[Data] = []
            for graph in graphs:
                x, edge_attr, edge_index, y = graph

                x = torch.from_numpy(x)
                edge_attr = torch.from_numpy(edge_attr)
                edge_index = torch.from_numpy(edge_index)
                y = torch.tensor([y]).float()

                if edge_index.numel() == 0:
                    continue  # Skip graphs with no bonds/edges.

                data = Data(x=x, edge_index=edge_index, edge_attr=edge_attr,
                            y=y)

                if self.pre_filter is not None and not self.pre_filter(data):
                    continue

                if self.pre_transform is not None:
                    data = self.pre_transform(data)

                data_list.append(data)

            torch.save(self.collate(data_list), path)

    def atoms(self) -> List[str]:
        return [
            'Br', 'C', 'N', 'O', 'Cl', 'Zn', 'F', 'P', 'S', 'Na', 'Al', 'Si',
            'Mo', 'Ca', 'W', 'Pb', 'B', 'V', 'Co', 'Mg', 'Bi', 'Fe', 'Ba', 'K',
            'Ti', 'Sn', 'Cd', 'I', 'Re', 'Sr', 'H', 'Cu', 'Ni', 'Lu', 'Pr',
            'Te', 'Ce', 'Nd', 'Gd', 'Zr', 'Mn', 'As', 'Hg', 'Sb', 'Cr', 'Se',
            'La', 'Dy', 'Y', 'Pd', 'Ag', 'In', 'Li', 'Rh', 'Nb', 'Hf', 'Cs',
            'Ru', 'Au', 'Sm', 'Ta', 'Pt', 'Ir', 'Be', 'Ge'
        ]

    def bonds(self) -> List[str]:
        return ['NONE', 'SINGLE', 'DOUBLE', 'AROMATIC', 'TRIPLE']
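
The `atoms()` and `bonds()` lists in the diff look like lookup tables for the integer node and edge features. The sketch below shows how such a lookup might work, under the assumption (not confirmed by this diff) that each feature value is an index into the corresponding list. The `decode` helper, the shortened `ATOMS` list, and the three-atom example are all hypothetical and use plain Python lists rather than tensors.

```python
from typing import List, Sequence

# Shortened copy of the first entries returned by AQSOL.atoms() in the diff.
ATOMS: List[str] = ['Br', 'C', 'N', 'O', 'Cl']
# Copy of the list returned by AQSOL.bonds() in the diff.
BONDS: List[str] = ['NONE', 'SINGLE', 'DOUBLE', 'AROMATIC', 'TRIPLE']


def decode(indices: Sequence[int], vocabulary: Sequence[str]) -> List[str]:
    """Map integer feature values to names (hypothetical helper).

    Assumes every value is a valid index into `vocabulary`.
    """
    return [vocabulary[i] for i in indices]


# Hypothetical three-atom graph: two carbons and one oxygen, with each
# bond stored once per direction (four directed edges in total).
x = [1, 1, 3]
edge_attr = [1, 1, 1, 1]

atom_names = decode(x, ATOMS)
bond_names = decode(edge_attr, BONDS)
print(atom_names)
print(bond_names)
```

If the assumption holds, the same lookup would apply to `data.x.tolist()` and `data.edge_attr.tolist()` of a loaded graph.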
I feel loading pickles is a bit dangerous in general (it doesn't work well across Python versions). Do you know if the data is available in a non-pickle format?
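
If no non-pickle release of the data exists, one option would be to re-export the raw graphs once into a text format. The sketch below round-trips graphs through JSON, assuming each graph is a 4-tuple `(x, edge_attr, edge_index, y)` as unpacked in `process()`. Plain lists stand in for the NumPy arrays, and the function names and sample values are hypothetical, not part of this PR.

```python
import json
from typing import List, Tuple

# One graph: node types, bond types, 2 x num_edges edge index, target value.
Graph = Tuple[List[int], List[int], List[List[int]], float]


def graphs_to_json(graphs: List[Graph]) -> str:
    """Serialize graphs to a JSON string (hypothetical helper)."""
    records = [
        {'x': x, 'edge_attr': edge_attr, 'edge_index': edge_index, 'y': y}
        for x, edge_attr, edge_index, y in graphs
    ]
    return json.dumps(records)


def graphs_from_json(text: str) -> List[Graph]:
    """Rebuild the list of 4-tuples from a JSON string (hypothetical helper)."""
    return [
        (r['x'], r['edge_attr'], r['edge_index'], r['y'])
        for r in json.loads(text)
    ]


# Made-up example graph; the target value is illustrative only.
graphs = [([1, 1, 3], [1, 1, 1, 1], [[0, 1, 1, 2], [1, 0, 2, 1]], -0.77)]
restored = graphs_from_json(graphs_to_json(graphs))
```

Unlike pickle, JSON does not depend on the Python version and does not execute code on load. It would be slower to parse for large datasets, and real NumPy arrays would first need converting with `.tolist()`.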