This repo contains the processing scripts, code and evaluation methods for the paper "Microdroplet screening rapidly profiles a biocatalyst to enable its AI-assisted engineering".
Via GitHub:
To ensure you have a compatible environment to run the code in, we recommend using the `environment.yaml` file to create a conda environment using `conda` or `mamba`. If you are curious, the detailed dependencies are listed in `pyproject.toml`.
git clone https://github.com/Hollfelder-Lab/lrDMS-IRED.git
cd lrDMS-IRED
conda env create -f environment.yaml # NOTE: This will auto-install all required dependencies and the `lrdms` package
conda activate lrdms
Alternatively, if you have an existing environment that you want to use, you can install `lrdms` and the required dependencies using pip.
git clone https://github.com/Hollfelder-Lab/lrDMS-IRED.git
cd lrDMS-IRED
pip install -e .
Via pip:
This will install `lrdms` in your environment's `site-packages` and only include the main code and data, not the notebooks. This means you'll be able to import the functions and data in `lrdms` for use in your own work. If you want to modify some of the code or try out the notebooks locally, you should prefer one of the options above (or use the Colab links provided in the `notebooks` folder).
pip install git+https://github.com/Hollfelder-Lab/lrDMS-IRED.git
The UMIC-seq2 pipeline is adapted from Zurek et al. 2020 to allow for (i) the processing of larger datasets by using MMseqs2 for clustering and (ii) the incorporation of the UMI sequence in polished reads for use in long-read deep mutational scanning (lrDMS). Following the steps outlined in `scripts`, raw Oxford Nanopore data can be used to generate a variant identifier file (VIF) that can be fed into the DiMSum pipeline. The provided scripts and outlined pipeline can also be used to analyse amplicon Oxford Nanopore data without downstream use of the file for lrDMS.
If lrDMS is conducted, next-generation sequencing reads of the UMI region before and after screening are used to calculate fitness scores, and the VIF is used to link each UMI to its variant identity. The output of the processing pipeline is a `.csv` file containing the fitness scores for individual sequences. For convenience, the processed data is also provided in the `data` folder as `srired_active_data.csv`. This data can then be used for combinability and mutability analysis and machine learning.
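The actual fitness calculation is handled by DiMSum. Purely to illustrate the idea described above, the following minimal sketch links UMI read counts to variants through a VIF-like mapping and computes a log enrichment ratio relative to wild type. Everything in it is an assumption made for illustration: the function names (`aggregate_by_variant`, `fitness_scores`), the dictionary-based inputs, the `"WT"` label, the mutation names, and the pseudocount. It does not reflect the real file formats or the exact estimator used by DiMSum.

```python
import math
from collections import defaultdict


def aggregate_by_variant(umi_counts, vif):
    """Sum UMI read counts per variant, skipping UMIs absent from the mapping."""
    totals = defaultdict(int)
    for umi, count in umi_counts.items():
        variant = vif.get(umi)
        if variant is not None:
            totals[variant] += count
    return dict(totals)


def fitness_scores(counts_before, counts_after, vif, wild_type="WT", pseudocount=0.5):
    """Log enrichment of each variant after screening, relative to wild type.

    Simplified illustration only; the pseudocount avoids log(0) for variants
    that are not observed after screening.
    """
    before = aggregate_by_variant(counts_before, vif)
    after = aggregate_by_variant(counts_after, vif)

    def log_ratio(variant):
        return math.log(
            (after.get(variant, 0) + pseudocount) / (before[variant] + pseudocount)
        )

    wt_ratio = log_ratio(wild_type)
    return {variant: log_ratio(variant) - wt_ratio for variant in before}


# Hypothetical toy data: UMI -> variant mapping and UMI read counts.
vif = {"AAAC": "WT", "AAAG": "WT", "CCGT": "A123V", "GGTA": "L45P"}
counts_before = {"AAAC": 50, "AAAG": 50, "CCGT": 100, "GGTA": 100, "TTTT": 7}
counts_after = {"AAAC": 60, "AAAG": 40, "CCGT": 300, "GGTA": 10}

for variant, score in fitness_scores(counts_before, counts_after, vif).items():
    print(f"{variant}: {score:+.2f}")
```

In this toy example the UMI `TTTT` has no entry in the mapping and is therefore ignored, while variants enriched after screening receive a positive score and depleted variants a negative one.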
A convenient summary profile of the combinability and mutability of the obtained data, which may be used to inform rational engineering campaigns, can be generated as demonstrated in `notebooks/data_analysis.ipynb`.
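As a rough illustration of what a per-position mutability summary can look like, the sketch below averages the fitness of single mutants at each sequence position. This is only one plausible summary and is not necessarily the definition used in the data analysis notebook. The mutation string format (for example `A10V`), the function name `position_mutability`, and the toy values are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed mutation format: wild-type residue, 1-based position, mutant residue.
MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z*])$")


def position_mutability(single_mutant_fitness):
    """Return the mean fitness of all single mutants observed at each position."""
    by_position = defaultdict(list)
    for mutation, fitness in single_mutant_fitness.items():
        match = MUTATION_PATTERN.match(mutation)
        if match is None:
            continue  # skip wild type and anything not matching the assumed format
        by_position[int(match.group(2))].append(fitness)
    return {pos: mean(values) for pos, values in sorted(by_position.items())}


# Hypothetical toy fitness values for a handful of single mutants.
toy_fitness = {"A10V": 1.0, "A10G": 3.0, "L45P": -2.0, "WT": 0.0}
print(position_mutability(toy_fitness))
```

Positions with a high mean tolerate substitutions well in this simplified view, whereas positions with a strongly negative mean are sensitive to mutation.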
The documentation for the analysis of epistasis is in the `epistasis` folder.
For a detailed analysis of the single and double mutant models (replications of the models in the paper, ablations, learning curves, feature importances, top predictions), please refer to the `notebooks` folder, which contains analysis notebooks for each of the models.
Our data is available in the `data` folder and on Zenodo.
This code is licensed under the MIT License - see the LICENSE file for details.
Please cite our paper if you use this code or data in your own work:
@article{Gantz2024,
author = {Gantz, Maximilian and Mathis, Simon V. and Nintzel, Friederike E. H. and Zurek, Paul J. and Knaus, Tanja and Patel, Elie and Boros, Daniel and Weberling, Friedrich-Maximilian and Kenneth, Matthew R. A. and Klein, Oskar J. and Medcalf, Elliot J. and Moss, Jacob and Herger, Michael and Kaminski, Tomasz S. and Mutti, Francesco G. and Lio, Pietro and Hollfelder, Florian},
title = {Microdroplet screening rapidly profiles a biocatalyst to enable its AI-assisted engineering},
elocation-id = {2024.04.08.588565},
year = {2024},
doi = {10.1101/2024.04.08.588565},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/04/08/2024.04.08.588565},
eprint = {https://www.biorxiv.org/content/early/2024/04/08/2024.04.08.588565.full.pdf},
journal = {bioRxiv}
}
- Hollfelder Lab, Department of Biochemistry, University of Cambridge, UK
- Lio Lab, Department of Computer Science and Technology, University of Cambridge, UK
For questions, please contact:
- fh111(at)cam.ac.uk
- mg985(at)cam.ac.uk
- simon.mathis(at)cl.cam.ac.uk
- fmw37(at)cam.ac.uk
We welcome contributions to this repository. To set up the development environment, please follow the instructions below:
git clone https://github.com/Hollfelder-Lab/lrDMS-IRED.git
cd lrDMS-IRED
chmod +x contribute.sh
./contribute.sh