GenePert: Leveraging GenePT Embeddings for Gene Perturbation Prediction
Predicting how perturbation of a target gene affects the expression of other genes is a critical component of understanding cell biology. This is a challenging prediction problem as the model must capture complex gene-gene relationships and the output is high-dimensional and sparse.
To address this challenge, we present GenePert, a simple approach that leverages GenePT embeddings, which are derived using ChatGPT from text descriptions of individual genes, to predict gene expression changes due to perturbations via regularized regression models. Benchmarked on multiple CRISPR perturbation screen datasets across multiple cell types, GenePert consistently outperforms state-of-the-art prediction models measured in both Pearson correlation and mean squared error metrics. Even with limited training data, GenePert generalizes effectively, offering a scalable solution for predicting perturbation outcomes.
This repository contains example code (`genepert-k562-demo.ipynb`) to run GenePert on example CRISPR Perturb-seq datasets. It uses the K562 essential-gene dataset from Replogle et al. 2022 as an example. By default it reports aggregated test-fold results from five-fold cross-validation, but the `GenePertExperiment` class also has a method called `run_experiment_with_adata` that lets users supply their own training and testing AnnData objects for a user-specified train/test split.
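The default evaluation described above (five-fold cross-validation over perturbations, scored with Pearson correlation and mean squared error) can be sketched as follows. This is a minimal illustration on synthetic data, not the repository's code: `kfold_indices`, `pearson`, `cross_validate`, and the train-mean baseline predictor are hypothetical names introduced here, and the baseline merely stands in for a real model.

```python
import numpy as np

def kfold_indices(n_items, n_splits=5, seed=0):
    """Shuffle item indices and split them into n_splits disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_items), n_splits)

def pearson(a, b):
    """Pearson correlation between two 1-D vectors (NaN if either is constant)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return float(a @ b / denom) if denom > 0 else float("nan")

def cross_validate(effects, predict_fn, n_splits=5, seed=0):
    """Hold out each fold of perturbations once and aggregate test-fold metrics.

    effects:    (n_perturbations, n_genes) array of expression changes.
    predict_fn: callable(train_idx, test_idx) -> (len(test_idx), n_genes) predictions.
    """
    n = effects.shape[0]
    preds = np.empty_like(effects)
    for test_idx in kfold_indices(n, n_splits, seed):
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        preds[test_idx] = predict_fn(train_idx, test_idx)
    mse = float(np.mean((preds - effects) ** 2))
    corr = float(np.mean([pearson(p, y) for p, y in zip(preds, effects)]))
    return {"mse": mse, "pearson": corr}

# Synthetic stand-in data: 50 perturbations x 20 genes sharing a common response.
rng = np.random.default_rng(0)
shared = rng.normal(size=20)
effects = shared + 0.1 * rng.normal(size=(50, 20))

# Placeholder baseline: predict the mean training response for every held-out perturbation.
def mean_baseline(train_idx, test_idx):
    return np.tile(effects[train_idx].mean(axis=0), (len(test_idx), 1))

results = cross_validate(effects, mean_baseline)
print(results)
```

Splitting by perturbation rather than by cell means every test perturbation is unseen during training, which matches the generalization setting the README describes.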
The complete list of embeddings and datasets used can be obtained from the following places:
- Perturb-seq Datasets:
  - K562 Essential (Replogle et al. 2022)
  - RPE1 Essential (Replogle et al. 2022)
  - Tian et al. CRISPRa
  - Adamson et al. CRISPRi
  - Norman et al. CRISPRi
  - Dixit et al. CRISPRi
  - Wessels et al. CRISPR-Cas13d
  - Xu et al. CRISPR-dCas9
- Different pre-trained embedding models:
  - GenePert (GPT-4)
  - GenePert (GPT-3.5)
  - Pretrained ESM2 embeddings
  - Pretrained Geneformer embeddings
  - Pretrained scGPT embeddings
Overview of the GenePert process: (a) Data from pooled perturbation screens with high-throughput readouts (for instance, single-cell RNA sequencing) serve as input for the GenePert model. (b) For each gene being perturbed, we use its GenePT embedding as its feature: that is, we first extract its corresponding gene information summary from NCBI and, if available, its protein summary from UniProt, and use OpenAI's Model 3 text embedding of the summary as its representation. (c) For training, GenePert uses ridge regression, which posits a linear relationship between the average regulatory effects (left matrix X) of perturbations (rows) and the gene features (right matrix G). (d) During testing, for an unseen perturbation, GenePert uses the fitted coefficients on the corresponding GenePT embedding to generate predictions.
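Steps (c) and (d) can be illustrated with a small closed-form ridge regression. This is a sketch under stated assumptions rather than the repository's implementation: the embeddings and effects are random synthetic arrays, the dimensions are arbitrary, and `fit_ridge`, `predict`, and the penalty `alpha` are hypothetical names chosen for this example. Following the caption's notation, `X` holds the average perturbation effects (rows are perturbations) and `G` holds the corresponding gene embeddings.

```python
import numpy as np

def fit_ridge(G, X, alpha=1.0):
    """Fit X ~ G @ W + b with an L2 penalty on W (intercept left unpenalized).

    G: (n_perturbations, d) embedding of each perturbed gene.
    X: (n_perturbations, n_genes) average expression change per perturbation.
    """
    g_mean = G.mean(axis=0)
    x_mean = X.mean(axis=0)
    Gc = G - g_mean
    Xc = X - x_mean
    d = G.shape[1]
    # Closed-form ridge solution: W = (Gc^T Gc + alpha * I)^(-1) Gc^T Xc
    W = np.linalg.solve(Gc.T @ Gc + alpha * np.eye(d), Gc.T @ Xc)
    b = x_mean - g_mean @ W
    return W, b

def predict(G_new, W, b):
    """Apply the fitted coefficients to embeddings of unseen perturbations."""
    return G_new @ W + b

# Synthetic stand-ins (assumed sizes): 100 perturbations, 16-dim embeddings, 30 genes.
rng = np.random.default_rng(0)
G_all = rng.normal(size=(100, 16))
W_true = rng.normal(size=(16, 30))
X_all = G_all @ W_true + 0.01 * rng.normal(size=(100, 30))

# Train on 80 perturbations, then predict the 20 held-out ones from embeddings alone.
G_train, G_test = G_all[:80], G_all[80:]
X_train, X_test = X_all[:80], X_all[80:]

W, b = fit_ridge(G_train, X_train, alpha=1e-3)
X_pred = predict(G_test, W, b)
test_mse = float(np.mean((X_pred - X_test) ** 2))
print(X_pred.shape, test_mse)
```

Because prediction needs only the embedding of the perturbed gene, the same fitted `W` and `b` apply to any gene that has an embedding, including genes never perturbed in the training screen.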