Strengthen CI and add support for rdkit>=2022.09.1 (#58)
As discussed in #57, `rdkit` introduced a breaking change in `2022.09.1`
that changes the return type of `GetSSSR` from an integer (number of
rings) to the rings themselves, which broke the featurization in
`chem/topology_features.py`. This PR makes several changes around this:
- Patching `GetSSSR` to behave differently depending on `rdkit` version,
so that the downstream code works across all versions (fixes #57).
- Adding `test_cli.py`, which downloads the pretrained MoLeR checkpoint
and runs it through the CLI to verify that its behaviour (i.e. the
encodings of particular molecules and the samples produced through
random sampling) has not changed (fixes #12).
- Extending CI to test under a range of dependency versions, starting
with the original versions used when developing MoLeR (`python==3.7.7`,
`tensorflow==2.1.0` and `rdkit==2020.09.1.0`) and finishing with modern
ones (`python=3.10` and latest versions of `tensorflow` and `rdkit`).
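
The version check in this commit is one way to handle the changed return type. A minimal alternative sketch, assuming only what is stated above (older `rdkit` returns an integer ring count, newer versions return the rings themselves), is to normalise whatever `GetSSSR` returned by inspecting its type. The helper name `sssr_size_from_result` is hypothetical and is not part of `molecule_generation`.

```python
from typing import Sized, Union


def sssr_size_from_result(result: Union[int, Sized]) -> int:
    """Return the number of rings, given either return style of `GetSSSR`.

    Hypothetical helper (not part of `molecule_generation`): older `rdkit`
    versions return the ring count directly, newer ones return the rings.
    """
    if isinstance(result, int):
        # Old behaviour: the value already is the number of rings.
        return result
    # New behaviour: a sequence-like container with one entry per ring.
    return len(result)


# Stand-ins for the two return styles; no `rdkit` import is needed here.
old_style_result = 2
new_style_result = ((0, 1, 2, 3, 4, 5), (5, 6, 7, 8, 9, 4))

print(sssr_size_from_result(old_style_result))
print(sssr_size_from_result(new_style_result))
```

Branching on `rdkit.__version__`, as the commit does, states the intent more explicitly; the type check is merely a shorter option if the version string is not readily available.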

With extended CI and the new CLI test, we can now be reasonably sure
MoLeR works reliably across different dependency versions, and that the
checkpoint trained under old versions continues to work in exactly the
same way under modern ones.
kmaziarz committed Jun 15, 2023
1 parent 92c9233 commit bd97b51
Showing 13 changed files with 163 additions and 45 deletions.
12 changes: 8 additions & 4 deletions .github/workflows/ci.yml
@@ -15,13 +15,17 @@ jobs:
       matrix:
         include:
           - environment-file: environment-py37.yml
-            build-name: "Python 3.7.7, TF 2.1.0"
+            build-name: "python 3.7.7, tf 2.1.0, rdkit 2020.09.1"
+          - environment-file: environment-py38.yml
+            build-name: "python 3.8.16, tf 2.6.2, rdkit 2021.09.1"
+          - environment-file: environment-py39.yml
+            build-name: "python 3.9.16, tf 2.9.1, rdkit 2022.09.1"
           - environment-file: environment.yml
-            build-name: "Python 3.9.13, TF 2.9.1"
+            build-name: "python 3.10, tf latest, rdkit latest"
     defaults:
       run:
         shell: bash -l {0}
-    name: build (${{ matrix.build-name }})
+    name: ${{ matrix.build-name }}
     steps:
       - uses: actions/checkout@v3
       - uses: conda-incubator/setup-miniconda@v2
@@ -43,6 +47,6 @@ jobs:
       - name: Run unit tests
         run: |
           pytest --ignore=./molecule_generation/test/integration/
-      - name: Run integration test
+      - name: Run integration tests
         run: |
           pytest ./molecule_generation/test/integration/
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Fixed
 - Removed deprecated `numpy` types to make `molecule_generation` work with `numpy>=1.24.0` ([#49](https://github.com/microsoft/molecule-generation/pull/49))
+- Patched `GetSSSR` for compatibility with `rdkit>=2022.09.1` ([#58](https://github.com/microsoft/molecule-generation/pull/58))

## [0.3.0] - 2022-10-18

29 changes: 22 additions & 7 deletions README.md
@@ -3,35 +3,50 @@
 [![CI](https://github.com/microsoft/molecule-generation/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/microsoft/molecule-generation/actions/workflows/ci.yml)
 [![license](https://img.shields.io/github/license/microsoft/molecule-generation.svg)](https://github.com/microsoft/molecule-generation/blob/main/LICENSE)
 [![pypi](https://img.shields.io/pypi/v/molecule-generation.svg)](https://pypi.org/project/molecule-generation/)
+[![python](https://img.shields.io/pypi/pyversions/molecule_generation)](https://www.python.org/downloads/)
 [![code style](https://img.shields.io/badge/code%20style-black-202020.svg)](https://github.com/ambv/black)

-This repository contains training and inference code for the MoLeR model introduced in [Learning to Extend Molecular Scaffolds with Structural Motifs](https://arxiv.org/abs/2103.03864). We also include our implementation of CGVAE, but it currently lacks integration with the high-level model interface, and is provided mostly for reference.
+This repository contains training and inference code for the MoLeR model introduced in [Learning to Extend Molecular Scaffolds with Structural Motifs](https://arxiv.org/abs/2103.03864). We also include our implementation of CGVAE, but without integration with the high-level model interface.

 ## Quick start

-The `molecule_generation` package can be installed via `pip`, but it additionally depends on `rdkit` and (if one wants to use a GPU) on correctly setting up CUDA libraries. One approach to get both is through our minimalistic `conda` environment:
+`molecule_generation` can be installed via `pip`, but it additionally depends on `rdkit` and (if one wants to use a GPU) on setting up CUDA libraries. One can get both through `conda`:

 ```bash
 conda env create -f environment.yml
 conda activate moler-env
 ```

-This environment pins the versions of `python`, `rdkit` and `tensorflow` for reproducibility, but `molecule_generation` is compatible with a range of versions of these dependencies. If `tensorflow` installation doesn't work out-of-the-box for your particular system, you may need to refer to [the tensorflow website](https://www.tensorflow.org/install) for guidelines.
+Our package was tested with `python>=3.7`, `tensorflow>=2.1.0` and `rdkit>=2020.09.1`; see the `environment*.yml` files for the exact configurations tested in CI.

-To then install the latest release of `molecule_generation`, simply run
+To then install the latest release of `molecule_generation`, run
 ```bash
 pip install molecule-generation
 ```

-Alternatively, running `pip install -e .` within the root folder installs the latest state of the code, including changes that were merged into `main` but not yet released.
+Alternatively, `pip install -e .` within the root folder installs the latest state of the code, including changes that were merged into `main` but not yet released.

-A MoLeR checkpoint trained using the default hyperparameters is available [here](https://figshare.com/ndownloader/files/34642724) (or [here](https://pan.baidu.com/s/1lkiWK9-d5MvNyzqRrusGXA?pwd=4hij) if you're in China and figshare doesn't work for you). This file needs to be saved in a fresh folder `MODEL_DIR` (e.g., `/tmp/MoLeR_checkpoint`) and be renamed to have the `.pkl` ending (e.g., to `GNN_Edge_MLP_MoLeR__2022-02-24_07-16-23_best.pkl`). Then you can sample 10 molecules by running
+A MoLeR checkpoint trained using the default hyperparameters is available [here](https://figshare.com/ndownloader/files/34642724). This file needs to be saved in a fresh folder `MODEL_DIR` (e.g., `/tmp/MoLeR_checkpoint`) and be renamed to have the `.pkl` ending (e.g., to `GNN_Edge_MLP_MoLeR__2022-02-24_07-16-23_best.pkl`). Then you can sample 10 molecules by running

 ```bash
 molecule_generation sample MODEL_DIR 10
 ```

-See the next sections for how to train your own model and run more advanced inference.
+See below for how to train your own model and run more advanced inference.

+### Troubleshooting
+
+> Q: I am in China and so the figshare checkpoint link does not work for me.
+>
+> A: You can try [this link](https://pan.baidu.com/s/1lkiWK9-d5MvNyzqRrusGXA?pwd=4hij) instead.
+
+> Q: My particular combination of dependency versions does not work.
+>
+> A: Please submit an issue and default to using one of the pinned configurations from `environment-py*.yml` in the meantime.
+
+> Q: Installing `tensorflow` on my system does not work.
+>
+> A: Please refer to [the tensorflow website](https://www.tensorflow.org/install) for guidelines.
+
 ## Workflow

1 change: 1 addition & 0 deletions environment-py37.yml
@@ -3,6 +3,7 @@ channels:
   - rdkit
   - conda-forge
 dependencies:
+  - pip==23.1.2
   - python==3.7.7
   - rdkit==2020.09.1.0
   - tensorflow==2.1.0
11 changes: 11 additions & 0 deletions environment-py38.yml
@@ -0,0 +1,11 @@
name: moler-env
channels:
  - rdkit
  - conda-forge
dependencies:
  - pip==23.1.2
  - python==3.8.16
  - rdkit==2021.09.1
  - tensorflow==2.6.2
  - pip:
    - numpy==1.22.4
11 changes: 11 additions & 0 deletions environment-py39.yml
@@ -0,0 +1,11 @@
name: moler-env
channels:
  - rdkit
  - conda-forge
dependencies:
  - pip==23.1.2
  - python==3.9.16
  - rdkit==2022.09.1
  - tensorflow==2.9.1
  - pip:
    - numpy==1.24.3
9 changes: 5 additions & 4 deletions environment.yml
@@ -3,8 +3,9 @@ channels:
   - rdkit
   - conda-forge
 dependencies:
-  - python==3.9.13
-  - rdkit==2020.09.1.0
-  - tensorflow==2.9.1
+  - pip
+  - python=3.10
+  - rdkit
+  - tensorflow
   - pip:
-    - numpy==1.23.1
+    - numpy
16 changes: 14 additions & 2 deletions molecule_generation/chem/topology_features.py
@@ -3,11 +3,23 @@
 from typing import Collection

 import numpy as np
+import rdkit
+from packaging.version import parse as parse_version
 from rdkit.Chem import BondType, GetSSSR, RWMol

 logger = logging.getLogger(__name__)


+def get_SSSR_size(mol: RWMol) -> int:
+    """Get the size of the smallest set of smallest rings as computed by `rdkit`."""
+    # Starting with version 2022.09.1 `GetSSSR` returns the SSSR itself, prior to that it returns
+    # the *size* of the SSSR.
+    if parse_version(rdkit.__version__) >= parse_version("2022.09.1"):
+        return len(GetSSSR(mol))
+    else:
+        return GetSSSR(mol)
+
+
 def calculate_topology_features(edges: Collection, mol: RWMol) -> np.ndarray:
     """Constrain edges based on how many loops would be in a resulting molecule.
@@ -32,13 +44,13 @@ def calculate_topology_features(edges: Collection, mol: RWMol) -> np.ndarray:
     mol_copy = RWMol(mol)
     try:
         # Must be calculated before GetRingInfo is called, to ensure ring info is initialised.
-        num_rings_in_base_mol = GetSSSR(mol_copy)
+        num_rings_in_base_mol = get_SSSR_size(mol_copy)
         num_base_tri_ring_edges = _calculate_num_tri_rings(mol_copy)

         for edge_idx, edge in enumerate(edges):
             test_mol = RWMol(mol)
             test_mol.AddBond(int(edge[0]), int(edge[1]), BondType.SINGLE)
-            num_rings_with_new_edge = GetSSSR(test_mol)
+            num_rings_with_new_edge = get_SSSR_size(test_mol)
             num_tri_ring_edges = _calculate_num_tri_rings(test_mol)

             num_tri_ring_edges_created_by_edge[edge_idx] = (
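
The version gate in `get_SSSR_size` relies on `packaging.version.parse` to compare `rdkit.__version__` against `2022.09.1`. As a rough illustration of what such a comparison involves, the sketch below compares release strings as integer tuples using only the standard library. The names `release_tuple` and `returns_rings` are hypothetical, and this simplification ignores pre-release suffixes, which `packaging` handles properly.

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading dotted-integer part of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def returns_rings(rdkit_version: str) -> bool:
    """Guess whether `GetSSSR` returns the rings (True) or their count (False).

    Assumes the 2022.09.1 threshold stated in the commit message.
    """
    return release_tuple(rdkit_version) >= (2022, 9, 1)


# The pinned versions from the CI environment files in this commit.
for pinned in ["2020.09.1.0", "2021.09.1", "2022.09.1"]:
    print(pinned, returns_rings(pinned))
```

Comparing tuples of integers avoids relying on zero-padded strings sorting correctly, which is the same reason the commit parses versions instead of comparing raw strings.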
16 changes: 16 additions & 0 deletions molecule_generation/test/conftest.py
@@ -0,0 +1,16 @@
from pathlib import Path
from typing import List

import pytest


@pytest.fixture
def test_smiles_path() -> Path:
    return Path(__file__).resolve().parent / "test_datasets" / "10_test_smiles.smiles"


@pytest.fixture
def test_smiles(test_smiles_path: Path) -> List[str]:
    with open(test_smiles_path) as f:
        data = f.readlines()
    return data
Original file line number Diff line number Diff line change
@@ -14,16 +14,6 @@
 from molecule_generation.utils.preprocessing_utils import save_data, write_jsonl_gz_data


-@pytest.fixture
-def smiles_list():
-    smiles_file = os.path.join(
-        os.path.dirname(__file__), "..", "test_datasets", "10_test_smiles.smiles"
-    )
-    with open(smiles_file) as f:
-        data = f.readlines()
-    return data
-
-
 @pytest.fixture
 def interrim_dir():
     save_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), "tmp")
@@ -34,8 +24,8 @@ def interrim_dir():
     shutil.rmtree(save_dir)


-def test_write_smiles_to_jsonl(smiles_list, interrim_dir):
-    smiles_dict = [{"SMILES": x.strip()} for x in smiles_list]
+def test_write_smiles_to_jsonl(test_smiles, interrim_dir):
+    smiles_dict = [{"SMILES": x.strip()} for x in test_smiles]
     data = featurise_smiles_datapoints(
         train_data=smiles_dict,
         valid_data=smiles_dict,
@@ -48,9 +38,9 @@ def test_write_smiles_to_jsonl(smiles_list, interrim_dir):
     assert num_written == 10


-def test_read_and_write_jsonl_files(smiles_list, interrim_dir):
+def test_read_and_write_jsonl_files(test_smiles, interrim_dir):
     # Prepare the jsonl.gz files
-    smiles_dict = [{"SMILES": x.strip()} for x in smiles_list]
+    smiles_dict = [{"SMILES": x.strip()} for x in test_smiles]
     data = featurise_smiles_datapoints(
         train_data=smiles_dict,
         valid_data=smiles_dict,
Original file line number Diff line number Diff line change
@@ -14,16 +14,6 @@
 from molecule_generation.utils.preprocessing_utils import save_data, write_jsonl_gz_data


-@pytest.fixture
-def smiles_list():
-    smiles_file = os.path.join(
-        os.path.dirname(__file__), "..", "test_datasets", "10_test_smiles.smiles"
-    )
-    with open(smiles_file) as f:
-        data = f.readlines()
-    return data
-
-
 @pytest.fixture
 def interrim_dir():
     save_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), "tmp")
@@ -34,8 +24,8 @@ def interrim_dir():
     shutil.rmtree(save_dir)


-def test_write_smiles_to_jsonl(smiles_list, interrim_dir):
-    smiles_dict = [{"SMILES": x.strip()} for x in smiles_list]
+def test_write_smiles_to_jsonl(test_smiles, interrim_dir):
+    smiles_dict = [{"SMILES": x.strip()} for x in test_smiles]
     data = featurise_smiles_datapoints(
         train_data=smiles_dict,
         valid_data=smiles_dict,
@@ -48,9 +38,9 @@ def test_write_smiles_to_jsonl(smiles_list, interrim_dir):
     assert num_written == 10


-def test_read_and_write_jsonl_files(smiles_list, interrim_dir):
+def test_read_and_write_jsonl_files(test_smiles, interrim_dir):
     # Prepare the jsonl.gz files
-    smiles_dict = [{"SMILES": x.strip()} for x in smiles_list]
+    smiles_dict = [{"SMILES": x.strip()} for x in test_smiles]
     data = featurise_smiles_datapoints(
         train_data=smiles_dict,
         valid_data=smiles_dict,
65 changes: 65 additions & 0 deletions molecule_generation/test/integration/test_cli.py
@@ -0,0 +1,65 @@
import hashlib
import urllib.request
import pickle
import subprocess
import tempfile
from pathlib import Path
from typing import List

import numpy as np
import pytest


@pytest.fixture(scope="module")
def pretrained_checkpoint_dir() -> Path:
    """Download a pretrained MoLeR checkpoint for testing and remove it afterwards."""
    with tempfile.TemporaryDirectory() as temp_dir:
        temp_dir_path = Path(temp_dir)
        urllib.request.urlretrieve(
            "https://figshare.com/ndownloader/files/34642724", temp_dir_path / "model.pkl"
        )
        yield temp_dir_path


def run_cli(args: List[str]) -> str:
    return subprocess.run(
        ["molecule_generation"] + args, stdout=subprocess.PIPE, check=True, text=True
    ).stdout


def test_encode(pretrained_checkpoint_dir: Path, test_smiles_path: Path) -> None:
    output_path = pretrained_checkpoint_dir / "embeddings.pkl"
    run_cli(["encode", pretrained_checkpoint_dir, str(test_smiles_path), output_path])

    with open(output_path, "rb") as f:
        embeddings = np.stack(pickle.load(f))
    output_path.unlink()

    # There should be one encoding per SMILES in `test_smiles_path`.
    assert embeddings.shape == (10, 512)

    # Compress encodings into their norms and compare with precomputed values.
    expected_norms = np.asarray(
        [4.09043, 3.56717, 5.40588, 5.60358, 5.41453, 5.55465, 3.48990, 4.50119, 4.33559, 5.36916]
    )
    assert np.allclose(np.linalg.norm(embeddings, axis=-1), expected_norms)


def test_sample(pretrained_checkpoint_dir: Path) -> None:
    num_samples = 100
    output = run_cli(["sample", pretrained_checkpoint_dir, str(num_samples)])

    samples = [smiles for smiles in output.split("\n")[1:-1]]
    assert len(samples) == num_samples

    # Check the first three outputs verbatim for easier debugging.
    expected_first_samples = [
        "O=C1C2=CC=C(C3=CC=CC=C3)C=C=C2OC2=CC=CC=C12",
        "CC(=O)NC1=NC2=CC(OCC3=CC=CN(CC4=CC=C(Cl)C=C4)C3=O)=CC=C2N1",
        "CCN1C(=O)C2=CC=CC=C2N=C1NC(C)C(=O)NCC(=O)N=[N+]=[N-]",
    ]
    assert samples[: len(expected_first_samples)] == expected_first_samples

    # Check all samples by comparing to a precomputed hash.
    samples_hash = hashlib.shake_256("\n".join(samples).encode()).hexdigest(16)
    assert samples_hash == "366d78fd2c71c6754a4fd9d403ad8276"
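
The sampling test compares all generated SMILES against a precomputed digest instead of storing every expected string. A minimal sketch of that pattern follows, using made-up SMILES rather than real MoLeR output; `fingerprint` is a hypothetical helper name, not something defined in this commit.

```python
import hashlib
from typing import List


def fingerprint(samples: List[str]) -> str:
    """Hash newline-joined samples into a 16-byte (32 hex character) digest."""
    return hashlib.shake_256("\n".join(samples).encode()).hexdigest(16)


# Made-up inputs for illustration; these are not MoLeR samples.
reference = fingerprint(["CCO", "c1ccccc1", "CC(=O)O"])

# Identical samples in identical order reproduce the digest...
assert fingerprint(["CCO", "c1ccccc1", "CC(=O)O"]) == reference
# ...while any change, including a reordering, yields a different one.
assert fingerprint(["c1ccccc1", "CCO", "CC(=O)O"]) != reference
```

A digest only reveals that something changed, not what, which is presumably why the test also checks the first few samples verbatim.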
1 change: 1 addition & 0 deletions setup.py
@@ -37,6 +37,7 @@
         "Programming Language :: Python :: 3.7",
         "Programming Language :: Python :: 3.8",
         "Programming Language :: Python :: 3.9",
+        "Programming Language :: Python :: 3.10",
         "Topic :: Scientific/Engineering :: Artificial Intelligence",
     ],
 )
