Strengthen CI and add support for rdkit>=2022.09.1 (#58)

As discussed in #57, `rdkit` introduced a breaking change in `2022.09.1`: `GetSSSR` now returns the rings themselves instead of an integer (the number of rings), which broke the featurization in `chem/topology_features.py`. This PR makes several changes around this:

- Patching `GetSSSR` to behave differently depending on `rdkit` version, so that the downstream code works across all versions (fixes #57).
- Adding `test_cli.py`, which downloads the pretrained MoLeR checkpoint and runs it through the CLI to verify that the behaviour (i.e. encodings of particular molecules or samples produced through random sampling) did not change (fixes #12).
- Extending CI to test under a range of dependency versions, starting with the original versions used when developing MoLeR (`python==3.7.7`, `tensorflow==2.1.0` and `rdkit==2020.09.1.0`) and finishing with modern ones (`python=3.10` and the latest versions of `tensorflow` and `rdkit`).

With extended CI and the new CLI test, we can now be reasonably sure MoLeR works reliably across different dependency versions, and that the checkpoint trained under old versions continues to work in exactly the same way under modern ones.
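For illustration only, here is a minimal sketch of an alternative to the version check used in this PR: instead of comparing `rdkit.__version__`, one could inspect the value that `GetSSSR` returned. The helper name `sssr_size_from_result` is hypothetical and not part of the repository; the sketch assumes only what the PR states, namely that older versions return an integer and newer versions return a collection of rings that supports `len()`.

```python
from numbers import Integral
from typing import Any


def sssr_size_from_result(result: Any) -> int:
    """Normalise a `GetSSSR` return value to a ring count (hypothetical helper)."""
    # Older rdkit: the call already returns the number of rings.
    if isinstance(result, Integral):
        return int(result)
    # rdkit>=2022.09.1: the call returns the rings themselves, so count them.
    return len(result)
```

It would be used as `sssr_size_from_result(GetSSSR(mol))`. The explicit version check in the PR has the advantage of documenting exactly when the behaviour changed, while this variant does not depend on `packaging`.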
microsoft · Jun 15, 2023 · bd97b51
1 parent 92c9233
Showing 13 changed files with 163 additions and 45 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -15,13 +15,17 @@ jobs:
       matrix:
         include:
           - environment-file: environment-py37.yml
-            build-name: "Python 3.7.7, TF 2.1.0"
+            build-name: "python 3.7.7, tf 2.1.0, rdkit 2020.09.1"
+          - environment-file: environment-py38.yml
+            build-name: "python 3.8.16, tf 2.6.2, rdkit 2021.09.1"
+          - environment-file: environment-py39.yml
+            build-name: "python 3.9.16, tf 2.9.1, rdkit 2022.09.1"
           - environment-file: environment.yml
-            build-name: "Python 3.9.13, TF 2.9.1"
+            build-name: "python 3.10, tf latest, rdkit latest"
     defaults:
       run:
         shell: bash -l {0}
-    name: build (${{ matrix.build-name }})
+    name: ${{ matrix.build-name }}
     steps:
     - uses: actions/checkout@v3
     - uses: conda-incubator/setup-miniconda@v2
@@ -43,6 +47,6 @@ jobs:
     - name: Run unit tests
       run: |
         pytest --ignore=./molecule_generation/test/integration/
-    - name: Run integration test
+    - name: Run integration tests
       run: |
         pytest ./molecule_generation/test/integration/
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 - Removed deprecated `numpy` types to make `molecule_generation` work with `numpy>=1.24.0` ([#49](https://github.com/microsoft/molecule-generation/pull/49))
+- Patched `GetSSSR` for compatibility with `rdkit>=2022.09.1` ([#58](https://github.com/microsoft/molecule-generation/pull/58))
 
 ## [0.3.0] - 2022-10-18
 

diff --git a/README.md b/README.md
@@ -3,35 +3,50 @@
 [![CI](https://github.com/microsoft/molecule-generation/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/microsoft/molecule-generation/actions/workflows/ci.yml)
 [![license](https://img.shields.io/github/license/microsoft/molecule-generation.svg)](https://github.com/microsoft/molecule-generation/blob/main/LICENSE)
 [![pypi](https://img.shields.io/pypi/v/molecule-generation.svg)](https://pypi.org/project/molecule-generation/)
+[![python](https://img.shields.io/pypi/pyversions/molecule_generation)](https://www.python.org/downloads/)
 [![code style](https://img.shields.io/badge/code%20style-black-202020.svg)](https://github.com/ambv/black)
 
-This repository contains training and inference code for the MoLeR model introduced in [Learning to Extend Molecular Scaffolds with Structural Motifs](https://arxiv.org/abs/2103.03864). We also include our implementation of CGVAE, but it currently lacks integration with the high-level model interface, and is provided mostly for reference.
+This repository contains training and inference code for the MoLeR model introduced in [Learning to Extend Molecular Scaffolds with Structural Motifs](https://arxiv.org/abs/2103.03864). We also include our implementation of CGVAE, but without integration with the high-level model interface.
 
 ## Quick start
 
-The `molecule_generation` package can be installed via `pip`, but it additionally depends on `rdkit` and (if one wants to use a GPU) on correctly setting up CUDA libraries. One approach to get both is through our minimalistic `conda` environment:
+`molecule_generation` can be installed via `pip`, but it additionally depends on `rdkit` and (if one wants to use a GPU) on setting up CUDA libraries. One can get both through `conda`:
 
 ```bash
 conda env create -f environment.yml
 conda activate moler-env
 ```
 
-This environment pins the versions of `python`, `rdkit` and `tensorflow` for reproducibility, but `molecule_generation` is compatible with a range of versions of these dependencies. If `tensorflow` installation doesn't work out-of-the-box for your particular system, you may need to refer to [the tensorflow website](https://www.tensorflow.org/install) for guidelines.
+Our package was tested with `python>=3.7`, `tensorflow>=2.1.0` and `rdkit>=2020.09.1`; see the `environment*.yml` files for the exact configurations tested in CI.
 
-To then install the latest release of `molecule_generation`, simply run
+To then install the latest release of `molecule_generation`, run
 ```bash
 pip install molecule-generation
 ```
 
-Alternatively, running `pip install -e .` within the root folder installs the latest state of the code, including changes that were merged into `main` but not yet released.
+Alternatively, `pip install -e .` within the root folder installs the latest state of the code, including changes that were merged into `main` but not yet released.
 
-A MoLeR checkpoint trained using the default hyperparameters is available [here](https://figshare.com/ndownloader/files/34642724) (or [here](https://pan.baidu.com/s/1lkiWK9-d5MvNyzqRrusGXA?pwd=4hij) if you're in China and figshare doesn't work for you). This file needs to be saved in a fresh folder `MODEL_DIR` (e.g., `/tmp/MoLeR_checkpoint`) and be renamed to have the `.pkl` ending (e.g., to `GNN_Edge_MLP_MoLeR__2022-02-24_07-16-23_best.pkl`). Then you can sample 10 molecules by running
+A MoLeR checkpoint trained using the default hyperparameters is available [here](https://figshare.com/ndownloader/files/34642724). This file needs to be saved in a fresh folder `MODEL_DIR` (e.g., `/tmp/MoLeR_checkpoint`) and be renamed to have the `.pkl` ending (e.g., to `GNN_Edge_MLP_MoLeR__2022-02-24_07-16-23_best.pkl`). Then you can sample 10 molecules by running
 
 ```bash
 molecule_generation sample MODEL_DIR 10
 ```
 
-See the next sections for how to train your own model and run more advanced inference.
+See below for how to train your own model and run more advanced inference.
+
+### Troubleshooting
+
+> Q: I am in China and so the figshare checkpoint link does not work for me.
+>
+> A: You can try [this link](https://pan.baidu.com/s/1lkiWK9-d5MvNyzqRrusGXA?pwd=4hij) instead.
+
+> Q: My particular combination of dependency versions does not work.
+>
+> A: Please submit an issue and default to using one of the pinned configurations from `environment-py*.yml` in the meantime.
+
+> Q: Installing `tensorflow` on my system does not work.
+>
+> A: Please refer to [the tensorflow website](https://www.tensorflow.org/install) for guidelines.
 
 ## Workflow
 

diff --git a/environment-py37.yml b/environment-py37.yml
@@ -3,6 +3,7 @@ channels:
   - rdkit
   - conda-forge
 dependencies:
+  - pip==23.1.2
   - python==3.7.7
   - rdkit==2020.09.1.0
   - tensorflow==2.1.0

diff --git a/environment-py38.yml b/environment-py38.yml
@@ -0,0 +1,11 @@
+name: moler-env
+channels:
+  - rdkit
+  - conda-forge
+dependencies:
+  - pip==23.1.2
+  - python==3.8.16
+  - rdkit==2021.09.1
+  - tensorflow==2.6.2
+  - pip:
+    - numpy==1.22.4
diff --git a/environment-py39.yml b/environment-py39.yml
@@ -0,0 +1,11 @@
+name: moler-env
+channels:
+  - rdkit
+  - conda-forge
+dependencies:
+  - pip==23.1.2
+  - python==3.9.16
+  - rdkit==2022.09.1
+  - tensorflow==2.9.1
+  - pip:
+    - numpy==1.24.3
diff --git a/environment.yml b/environment.yml
@@ -3,8 +3,9 @@ channels:
   - rdkit
   - conda-forge
 dependencies:
-  - python==3.9.13
-  - rdkit==2020.09.1.0
-  - tensorflow==2.9.1
+  - pip
+  - python=3.10
+  - rdkit
+  - tensorflow
   - pip:
-    - numpy==1.23.1
+    - numpy
diff --git a/molecule_generation/chem/topology_features.py b/molecule_generation/chem/topology_features.py
@@ -3,11 +3,23 @@
 from typing import Collection
 
 import numpy as np
+import rdkit
+from packaging.version import parse as parse_version
 from rdkit.Chem import BondType, GetSSSR, RWMol
 
 logger = logging.getLogger(__name__)
 
 
+def get_SSSR_size(mol: RWMol) -> int:
+    """Get the size of the smallest set of smallest rings as computed by `rdkit`."""
+    # Starting with version 2022.09.1 `GetSSSR` returns the SSSR itself, prior to that it returns
+    # the *size* of the SSSR.
+    if parse_version(rdkit.__version__) >= parse_version("2022.09.1"):
+        return len(GetSSSR(mol))
+    else:
+        return GetSSSR(mol)
+
+
 def calculate_topology_features(edges: Collection, mol: RWMol) -> np.ndarray:
     """Constrain edges based on how many loops would be in a resulting molecule.
 
@@ -32,13 +44,13 @@ def calculate_topology_features(edges: Collection, mol: RWMol) -> np.ndarray:
     mol_copy = RWMol(mol)
     try:
         # Must be calculated before GetRingInfo is called, to ensure ring info is initialised.
-        num_rings_in_base_mol = GetSSSR(mol_copy)
+        num_rings_in_base_mol = get_SSSR_size(mol_copy)
         num_base_tri_ring_edges = _calculate_num_tri_rings(mol_copy)
 
         for edge_idx, edge in enumerate(edges):
             test_mol = RWMol(mol)
             test_mol.AddBond(int(edge[0]), int(edge[1]), BondType.SINGLE)
-            num_rings_with_new_edge = GetSSSR(test_mol)
+            num_rings_with_new_edge = get_SSSR_size(test_mol)
             num_tri_ring_edges = _calculate_num_tri_rings(test_mol)
 
             num_tri_ring_edges_created_by_edge[edge_idx] = (

diff --git a/molecule_generation/test/conftest.py b/molecule_generation/test/conftest.py
@@ -0,0 +1,16 @@
+from pathlib import Path
+from typing import List
+
+import pytest
+
+
+@pytest.fixture
+def test_smiles_path() -> Path:
+    return Path(__file__).resolve().parent / "test_datasets" / "10_test_smiles.smiles"
+
+
+@pytest.fixture
+def test_smiles(test_smiles_path: Path) -> List[str]:
+    with open(test_smiles_path) as f:
+        data = f.readlines()
+    return data
diff --git a/molecule_generation/test/dataset/test_jsonl_cgvae_trace_dataset_preprocessing.py b/molecule_generation/test/dataset/test_jsonl_cgvae_trace_dataset_preprocessing.py
@@ -14,16 +14,6 @@
 from molecule_generation.utils.preprocessing_utils import save_data, write_jsonl_gz_data
 
 
-@pytest.fixture
-def smiles_list():
-    smiles_file = os.path.join(
-        os.path.dirname(__file__), "..", "test_datasets", "10_test_smiles.smiles"
-    )
-    with open(smiles_file) as f:
-        data = f.readlines()
-    return data
-
-
 @pytest.fixture
 def interrim_dir():
     save_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), "tmp")
@@ -34,8 +24,8 @@ def interrim_dir():
         shutil.rmtree(save_dir)
 
 
-def test_write_smiles_to_jsonl(smiles_list, interrim_dir):
-    smiles_dict = [{"SMILES": x.strip()} for x in smiles_list]
+def test_write_smiles_to_jsonl(test_smiles, interrim_dir):
+    smiles_dict = [{"SMILES": x.strip()} for x in test_smiles]
     data = featurise_smiles_datapoints(
         train_data=smiles_dict,
         valid_data=smiles_dict,
@@ -48,9 +38,9 @@ def test_write_smiles_to_jsonl(smiles_list, interrim_dir):
     assert num_written == 10
 
 
-def test_read_and_write_jsonl_files(smiles_list, interrim_dir):
+def test_read_and_write_jsonl_files(test_smiles, interrim_dir):
     # Prepare the jsonl.gz files
-    smiles_dict = [{"SMILES": x.strip()} for x in smiles_list]
+    smiles_dict = [{"SMILES": x.strip()} for x in test_smiles]
     data = featurise_smiles_datapoints(
         train_data=smiles_dict,
         valid_data=smiles_dict,

diff --git a/molecule_generation/test/dataset/test_jsonl_moler_trace_dataset_preprocessing.py b/molecule_generation/test/dataset/test_jsonl_moler_trace_dataset_preprocessing.py
@@ -14,16 +14,6 @@
 from molecule_generation.utils.preprocessing_utils import save_data, write_jsonl_gz_data
 
 
-@pytest.fixture
-def smiles_list():
-    smiles_file = os.path.join(
-        os.path.dirname(__file__), "..", "test_datasets", "10_test_smiles.smiles"
-    )
-    with open(smiles_file) as f:
-        data = f.readlines()
-    return data
-
-
 @pytest.fixture
 def interrim_dir():
     save_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), "tmp")
@@ -34,8 +24,8 @@ def interrim_dir():
         shutil.rmtree(save_dir)
 
 
-def test_write_smiles_to_jsonl(smiles_list, interrim_dir):
-    smiles_dict = [{"SMILES": x.strip()} for x in smiles_list]
+def test_write_smiles_to_jsonl(test_smiles, interrim_dir):
+    smiles_dict = [{"SMILES": x.strip()} for x in test_smiles]
     data = featurise_smiles_datapoints(
         train_data=smiles_dict,
         valid_data=smiles_dict,
@@ -48,9 +38,9 @@ def test_write_smiles_to_jsonl(smiles_list, interrim_dir):
     assert num_written == 10
 
 
-def test_read_and_write_jsonl_files(smiles_list, interrim_dir):
+def test_read_and_write_jsonl_files(test_smiles, interrim_dir):
     # Prepare the jsonl.gz files
-    smiles_dict = [{"SMILES": x.strip()} for x in smiles_list]
+    smiles_dict = [{"SMILES": x.strip()} for x in test_smiles]
     data = featurise_smiles_datapoints(
         train_data=smiles_dict,
         valid_data=smiles_dict,

diff --git a/molecule_generation/test/integration/test_cli.py b/molecule_generation/test/integration/test_cli.py
@@ -0,0 +1,65 @@
+import hashlib
+import urllib.request
+import pickle
+import subprocess
+import tempfile
+from pathlib import Path
+from typing import List
+
+import numpy as np
+import pytest
+
+
+@pytest.fixture(scope="module")
+def pretrained_checkpoint_dir() -> Path:
+    """Download a pretrained MoLeR checkpoint for testing and remove it afterwards."""
+    with tempfile.TemporaryDirectory() as temp_dir:
+        temp_dir_path = Path(temp_dir)
+        urllib.request.urlretrieve(
+            "https://figshare.com/ndownloader/files/34642724", temp_dir_path / "model.pkl"
+        )
+        yield temp_dir_path
+
+
+def run_cli(args: List[str]) -> str:
+    return subprocess.run(
+        ["molecule_generation"] + args, stdout=subprocess.PIPE, check=True, text=True
+    ).stdout
+
+
+def test_encode(pretrained_checkpoint_dir: Path, test_smiles_path: Path) -> None:
+    output_path = pretrained_checkpoint_dir / "embeddings.pkl"
+    run_cli(["encode", pretrained_checkpoint_dir, str(test_smiles_path), output_path])
+
+    with open(output_path, "rb") as f:
+        embeddings = np.stack(pickle.load(f))
+    output_path.unlink()
+
+    # There should be one encoding per SMILES in `test_smiles_path`.
+    assert embeddings.shape == (10, 512)
+
+    # Compress encodings into their norms and compare with precomputed values.
+    expected_norms = np.asarray(
+        [4.09043, 3.56717, 5.40588, 5.60358, 5.41453, 5.55465, 3.48990, 4.50119, 4.33559, 5.36916]
+    )
+    assert np.allclose(np.linalg.norm(embeddings, axis=-1), expected_norms)
+
+
+def test_sample(pretrained_checkpoint_dir: Path) -> None:
+    num_samples = 100
+    output = run_cli(["sample", pretrained_checkpoint_dir, str(num_samples)])
+
+    samples = [smiles for smiles in output.split("\n")[1:-1]]
+    assert len(samples) == num_samples
+
+    # Check the first three outputs verbatim for easier debugging.
+    expected_first_samples = [
+        "O=C1C2=CC=C(C3=CC=CC=C3)C=C=C2OC2=CC=CC=C12",
+        "CC(=O)NC1=NC2=CC(OCC3=CC=CN(CC4=CC=C(Cl)C=C4)C3=O)=CC=C2N1",
+        "CCN1C(=O)C2=CC=CC=C2N=C1NC(C)C(=O)NCC(=O)N=[N+]=[N-]",
+    ]
+    assert samples[: len(expected_first_samples)] == expected_first_samples
+
+    # Check all samples by comparing to a precomputed hash.
+    samples_hash = hashlib.shake_256("\n".join(samples).encode()).hexdigest(16)
+    assert samples_hash == "366d78fd2c71c6754a4fd9d403ad8276"
diff --git a/setup.py b/setup.py
@@ -37,6 +37,7 @@
         "Programming Language :: Python :: 3.7",
         "Programming Language :: Python :: 3.8",
         "Programming Language :: Python :: 3.9",
+        "Programming Language :: Python :: 3.10",
         "Topic :: Scientific/Engineering :: Artificial Intelligence",
     ],
 )
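
The hash check in `test_sample` condenses all sampled SMILES into one short fingerprint, so a change anywhere in the output is caught without storing 100 expected strings in the test. A minimal sketch of that step in isolation follows; the function name `samples_fingerprint` and the `num_bytes` parameter are illustrative and do not appear in the repository.

```python
import hashlib
from typing import List


def samples_fingerprint(samples: List[str], num_bytes: int = 16) -> str:
    """Return a hex digest summarising an ordered list of SMILES strings."""
    # Join with newlines so that both the content and the order of samples matter.
    joined = "\n".join(samples).encode()
    # SHAKE-256 has a variable-length output; `num_bytes` bytes give 2 * num_bytes hex chars.
    return hashlib.shake_256(joined).hexdigest(num_bytes)
```

Because the fingerprint depends on order as well as content, it only stays stable if sampling is deterministic for the given checkpoint, which is the property the test is meant to guard.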