GenePert

GenePert: Leveraging GenePT Embeddings for Gene Perturbation Prediction

Predicting how perturbation of a target gene affects the expression of other genes is a critical component of understanding cell biology. This is a challenging prediction problem as the model must capture complex gene-gene relationships and the output is high-dimensional and sparse.

To address this challenge, we present GenePert, a simple approach that leverages GenePT embeddings, which are derived using ChatGPT from text descriptions of individual genes, to predict gene expression changes due to perturbations via regularized regression models. Benchmarked on multiple CRISPR perturbation screen datasets across multiple cell types, GenePert consistently outperforms state-of-the-art prediction models as measured by both Pearson correlation and mean squared error. Even with limited training data, GenePert generalizes effectively, offering a scalable solution for predicting perturbation outcomes.

This repository contains example code (genepert-k562-demo.ipynb) to run GenePert on example CRISPR Perturb-seq datasets. It uses the essential genes from the K562 cell line dataset generated by Replogle et al. 2022 as an example. By default it reports aggregated test-fold results from five-fold cross-validation, but the GenePertExperiment class also has a method called run_experiment_with_adata that lets users pass their own training and testing AnnData objects for a user-specified train/test split.
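To make the five-fold evaluation concrete, here is a minimal sketch of cross-validating over perturbations and scoring with mean squared error and Pearson correlation. It does not use the GenePertExperiment class or reproduce its exact aggregation; the function names (ridge_fit_predict, pearson, five_fold_cv), the regularization strength, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def ridge_fit_predict(G_tr, X_tr, G_te, alpha=1.0):
    """Fit closed-form ridge regression on training folds and predict the test fold."""
    g_mean, x_mean = G_tr.mean(axis=0), X_tr.mean(axis=0)
    Gc = G_tr - g_mean
    W = np.linalg.solve(Gc.T @ Gc + alpha * np.eye(G_tr.shape[1]),
                        Gc.T @ (X_tr - x_mean))
    return (G_te - g_mean) @ W + x_mean

def pearson(a, b):
    """Pearson correlation between two 1-D vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def five_fold_cv(G, X, alpha=1.0, n_splits=5, seed=0):
    """Hold out each fold of perturbations once; score the pooled held-out predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(G)), n_splits)
    X_pred = np.empty_like(X)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        X_pred[test_idx] = ridge_fit_predict(G[train_idx], X[train_idx],
                                             G[test_idx], alpha)
    mse = float(np.mean((X_pred - X) ** 2))
    # Average the per-perturbation correlation between predicted and observed profiles.
    r = float(np.mean([pearson(p, t) for p, t in zip(X_pred, X)]))
    return mse, r

# Synthetic stand-ins (assumption): 50 perturbations, 8-dim embeddings, 20 genes.
rng = np.random.default_rng(1)
G = rng.normal(size=(50, 8))
X = G @ rng.normal(size=(8, 20)) + 0.05 * rng.normal(size=(50, 20))
mse, r = five_fold_cv(G, X, alpha=0.1)
print(f"MSE={mse:.4f}  Pearson={r:.3f}")
```

Splitting by perturbation rather than by cell means that every test-fold perturbation is unseen during fitting, which matches the setting the method is meant to handle.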

The complete list of embeddings and datasets used can be obtained from the following places:

Perturb-seq Datasets:
- K562 Essential (Replogle et al. 2022)
- RPE1 Essential (Replogle et al. 2022)
- Tian et al. CRISPRa
- Adamson et al. CRISPRi
- Norman et al. CRISPRi
- Dixit et al. CRISPRi
- Wessels et al. CRISPR-Cas13d
- Xu et al. CRISPR-dCas9

Different pre-trained embedding models:
- GenePert (GPT-4)
- GenePert (GPT-3.5)
- Pretrained ESM2 embeddings
- Pretrained Geneformer embeddings
- Pretrained scGPT embeddings

Overview of the GenePert process: (a) Data from pooled perturbation screens with high-throughput readouts (for instance, single-cell RNA sequencing) serve as input for the GenePert model. (b) For each gene being perturbed, we use its GenePT embedding as its feature: that is, we first extract its corresponding gene information summary from NCBI and, if available, its protein summary from UniProt, and use OpenAI's Model 3 text embedding of the summary as its representation. (c) For training, GenePert uses ridge regression, which posits a linear relationship between the average regulatory effects (left matrix X) of perturbations (rows) and the gene features (right matrix G). (d) During testing, for an unseen perturbation, GenePert uses the fitted coefficients on the corresponding GenePT embedding to generate predictions.
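Steps (c) and (d) can be sketched with a closed-form ridge regression. This is an illustrative sketch under stated assumptions, not the code in GenePertExperiment.py: the helper names (fit_ridge, predict), the regularization strength, and the random arrays standing in for GenePT embeddings (G) and average expression changes (X) are all hypothetical.

```python
import numpy as np

def fit_ridge(G, X, alpha=1.0):
    """Closed-form ridge on centered data: W = (G^T G + alpha * I)^{-1} G^T X."""
    g_mean = G.mean(axis=0)
    x_mean = X.mean(axis=0)
    Gc = G - g_mean
    Xc = X - x_mean
    W = np.linalg.solve(Gc.T @ Gc + alpha * np.eye(G.shape[1]), Gc.T @ Xc)
    b = x_mean - g_mean @ W  # intercept that restores the training means
    return W, b

def predict(G_new, W, b):
    """Apply the fitted coefficients to embeddings of unseen perturbations."""
    return G_new @ W + b

# Synthetic stand-ins (assumption): rows are perturbed genes.
rng = np.random.default_rng(0)
n_train, n_test, emb_dim, n_genes = 40, 5, 16, 30
G_train = rng.normal(size=(n_train, emb_dim))   # embeddings of training perturbations
G_test = rng.normal(size=(n_test, emb_dim))     # embeddings of unseen perturbations
true_W = rng.normal(size=(emb_dim, n_genes))
X_train = G_train @ true_W + 0.01 * rng.normal(size=(n_train, n_genes))

W, b = fit_ridge(G_train, X_train, alpha=1e-3)  # step (c): training
X_pred = predict(G_test, W, b)                  # step (d): unseen perturbations
print(X_pred.shape)  # (5, 30)
```

Because the model is linear in the embedding, predicting for a new perturbation needs only that gene's embedding, with no expression data for the perturbation itself.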
