scFoundation 💡🔍 |
Large-scale foundation model on single-cell transcriptomics. Minsheng Hao et al. Nature Methods (2024) |
GitHub Repository |
Transformer encoder, Performer decoder |
Foundation model for single-cell analysis, built on the xTrimoGene architecture with read-depth-aware (RDA) pretraining across 50 million profiles |
50M |
7 |
Mean square error loss |
Cell clustering; Cell type annotation; Perturbation prediction; Drug response prediction |
scGPT 💡🔍 |
scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Haotian Cui et al. Nature Methods (2024) |
GitHub Repository |
Transformer |
A foundation model for single-cell multi-omics that aims to deepen the understanding of biological data and improve performance in tasks such as cell type annotation and integration. |
33M |
441 |
Mean square error; Cosine similarity; Cross entropy loss |
Cell type annotation; Perturbation response prediction; Multi-batch integration; Multi-omic integration; Gene regulatory network inference |
MarsGT 🔍 |
MarsGT: Multi-omics analysis for rare population inference using single-cell graph transformer. Xiaoying Wang et al. Nature Communications (2024) |
GitHub Repository |
Graph Transformer |
Identifying rare cell populations in single-cell multi-omics, with superior performance and insights for early detection and therapeutic intervention strategies |
750K |
550 |
KL divergence, cosine similarity, and regression loss |
Construct enhancer gene regulatory networks |
scGREAT 🔍 |
scGREAT: Transformer-based deep-language model for gene regulatory network inference from single-cell transcriptomics. Yuchen Wang et al. iScience (2024) |
GitHub Repository |
Transformer |
Infers gene regulatory networks (GRNs) from single-cell transcriptomics data and textual information about genes using a transformer-based model |
4K |
7 |
Cross entropy loss |
Gene Regulatory Network Inference |
scMulan 💡🔍 |
scMulan: A Multitask Generative Pre-Trained Language Model for Single-Cell Analysis. Haiyang Bian et al. Research in Computational Molecular Biology (RECOMB) (2024) |
GitHub Repository |
Transformer |
Generative multitask model for single-cell analysis, trained on 10 million cells. |
10M |
5 |
Cross entropy loss |
Cell type annotation; Batch integration; Conditional cell generation |
CellPLM 💡🔍 |
CellPLM: Pre-training of Cell Language Model Beyond Single Cells. Hongzhi Wen et al. International Conference on Learning Representations (ICLR) (2024) |
GitHub Repository |
Transformer |
The first framework of its kind to encode inter-cell relations, harness spatially resolved transcriptomic data, and adopt a reasonable prior distribution. |
9M scRNA-seq + 2M spatial |
3 |
Masked language modeling with mean squared error loss |
Zero-shot clustering; scRNA-seq denoising; Spatial transcriptomic imputation; Cell type annotation; Perturbation prediction |
tGPT 💡🔍 |
Generative pretraining from large-scale transcriptomes for single-cell deciphering. Hongru Shen et al. iScience (2023) |
GitHub Repository |
Transformer |
Generative pretraining on 22.3 million single-cell transcriptomes yields features that align with established cell labels and states and are suitable for single-cell and bulk analysis. |
22.3M |
4 |
Cross entropy loss |
Single-cell clustering; Inference of developmental lineage; Feature representation analysis of bulk tissues |
TOSICA 🔍 |
Transformer for one stop interpretable cell type annotation. Jiawei Chen et al. Nature Communications (2023) |
GitHub Repository |
Transformer |
An efficient cell type annotator trained on scRNA-seq data that shows high accuracy across diverse datasets and enables new cell type discovery. |
536K |
6 |
Cross entropy loss |
Cell type annotation; Data integration; Cell differentiation trajectory inference |
Geneformer 💡🔍 |
Transfer learning enables predictions in network biology. Christina V. Theodoris et al. Nature (2023) |
Hugging Face Repository; GitHub Repository |
Transformer |
Pre-trained on 30 million single-cell transcriptomes to enable context-specific predictions and identify therapeutic targets in network biology with limited data. |
30M |
561 |
Cross entropy loss |
Chromatin dynamics prediction; Network dynamics prediction; Cell type annotation; Gene network analysis |
STGRNS 🔍 |
STGRNS: an interpretable transformer-based method for inferring gene regulatory networks from single-cell transcriptomic data. Jing Xu et al. Bioinformatics (2023) |
GitHub Repository |
Transformer |
Focused on enhancing gene regulatory network inference from single-cell transcriptomic data using a proposed gene expression motif technique, applicable across various scRNA-seq data types. |
154K+ |
48 |
Cross entropy loss |
Gene regulatory networks inference |
DeepMAPS 🔍 |
Single-cell biological network inference using a heterogeneous graph transformer. Anjun Ma et al. Nature Communications (2023) |
GitHub Repository |
Graph Transformer |
Infers biological networks from single-cell multi-omics data via a heterogeneous graph and a multi-head graph transformer, enhancing local and global context learning. |
199K |
17 |
Mean squared error and KL divergence |
Dimensionality reduction and cell clustering; Biological network construction |
scBERT 💡🔍 |
scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Fan Yang et al. Nature Machine Intelligence (2022) |
GitHub Repository |
Transformer (BERT-based model) |
A BERT-based model was pre-trained on large amounts of unlabeled scRNA-seq data for cell type annotation, demonstrating superior performance. |
1M |
10 |
Cross entropy loss |
Cell type annotation; Novel cell type prediction |
scCLIP 💡🔍 |
scCLIP: Multi-modal Single-cell Contrastive Learning Integration Pre-training. Lei Xiong et al. Conference on Neural Information Processing Systems (NeurIPS) AI for Science Workshop (2023) |
GitHub Repository |
Transformer |
Introduced a multi-modal Transformer model with contrastive learning, optimized for single-cell ATAC-seq data by tokenizing genomic peaks |
377K |
2 |
Cross entropy loss |
Modality alignment |
scMVP 🔍 |
A deep generative model for multi-view profiling of single-cell RNA-seq and ATAC-seq data. Gaoyang Li et al. Genome Biology (2022) |
GitHub Repository |
Transformer + VAE |
Introduces scMVP, a multi-modal deep generative model for processing single-cell RNA-seq and ATAC-seq data, addressing data sparsity and integration challenges. |
100K |
5 |
Clustering consistency loss – similar to CycleGAN |
Clustering; Imputation; Trajectory Inference |
Enformer 🔍 |
Effective gene expression prediction from sequence by integrating long-range interactions. Žiga Avsec et al. Nature Methods (2021) |
Hugging Face Repository; GitHub Repository |
Transformer with attention layers |
Improves gene expression prediction from DNA sequences by integrating long-range interactions with a transformer architecture. |
254K |
2 |
Poisson negative log-likelihood loss |
Gene expression prediction; Variant effect prediction; Epigenetic state prediction |
CIForm 🔍 |
CIForm as a Transformer-based model for cell-type annotation of large-scale single-cell RNA-seq data. Jing Xu et al. Briefings in Bioinformatics (2023) |
GitHub Repository |
Transformer |
Developed for cell-type annotation of large-scale single-cell RNA-seq data, aiming to overcome batch effects and efficiently process large datasets |
12M |
16 |
Cross entropy loss |
Cell type annotation |
TransCluster 🔍 |
TransCluster: A Cell-Type Identification Method for single-cell RNA-Seq data using deep learning based on transformer. Tao Song et al. Frontiers in Genetics (2022) |
GitHub Repository |
Transformer |
Proposes TransCluster, combining linear discriminant analysis and a modified Transformer to enhance cell-type identification accuracy and robustness across various human tissue datasets |
51K |
2 |
Cross entropy loss |
Cell type annotation |
iSEEEK 💡🔍 |
A universal approach for integrating super large-scale single-cell transcriptomes by exploring gene rankings. Hongru Shen et al. Briefings in Bioinformatics (2022) |
GitHub Repository |
Transformer |
Introduces iSEEEK, an approach for integrating super large-scale single-cell RNA sequencing data by exploring gene rankings of top-expressing genes and states suitable for single-cell and bulk analysis |
11.9M |
60 |
Cross entropy loss |
Cell cluster delineation; Marker gene identification; Cell developmental trajectory exploration; Cluster-specific gene-gene interaction module exploration; Analysis of bulk tissues |
Exceiver 💡 |
A single-cell gene expression language model. Connell et al. arXiv (2022) |
GitHub Repository |
Transformer |
Introduced discrete noise masking for self-supervised learning on unlabeled datasets and developed a framework using scRNA-seq to enhance downstream tasks in gene regulation and phenotype prediction |
500K |
1 |
Cross entropy loss + Mean square error |
Drug response prediction |
xTrimoGene 💡🔍 |
xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data. Jing Gong et al. Conference on Neural Information Processing Systems (NeurIPS) (2023) |
Unpublished |
Asymmetric encoder-decoder transformer |
Introduced a transformer variant for scRNA-seq data, significantly reducing computational and memory usage while preserving accuracy, and developed tailored pre-trained models for single-cell data |
5M |
- |
Mean square error |
Cell type annotation; Perturbation response prediction; Synergistic drug combination prediction |
Cell2Sentence 💡🔍 |
Cell2Sentence: Teaching Large Language Models the Language of Biology. Daniel Levine et al. International Conference on Machine Learning (ICML) (2024) |
GitHub Repository |
Transformer (GPT) |
A single and flexible framework for seamlessly integrating Large Language Models (LLMs), specifically GPT-2, into transcriptomics, leveraging widely-used LLM libraries |
40K |
2 |
Cross entropy loss |
Unconditional cell generation; Conditional cell generation; Cell type prediction |
GenePT 💡 |
GenePT: A Simple But Effective Foundation Model for Genes and Cells Built From ChatGPT. Yiqun T. Chen & James Zou. bioRxiv (2023) |
GitHub Repository |
Transformer (GPT) |
Uses NCBI text descriptions of individual genes with GPT-3.5 to generate gene embeddings, which are then applied to downstream tasks |
21K |
10 |
Cross entropy loss |
Gene property prediction; Batch integration; Cell type annotation |
CellLM 💡 |
Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning. Suyuan Zhao et al. arXiv (2023) |
GitHub Repository |
Performer Transformer |
Presented a novel divide-and-conquer contrastive learning strategy designed to decouple the batch size from GPU memory constraints in cell representation learning |
2M |
2 |
Masked language modeling with cross-entropy loss, cell type discrimination with binary cross-entropy loss, and divide-and-conquer contrastive loss |
Cell type annotation; Drug sensitivity prediction |
scELMo 💡 |
scELMo: Embeddings from Language Models are Good Learners for Single-cell Data Analysis. Tianyu Liu et al. bioRxiv (2023) |
GitHub Repository |
Transformer (GPT) |
Extended the concept from GenePT and proposed a novel approach to leverage the advantages from Large Language Models (LLMs) to formalize a foundation model for single-cell data analysis |
69K |
5 |
Cross entropy loss |
Cell clustering; Batch effect correction; Cell type annotation; Perturbation analysis |
UCE 💡 |
Universal Cell Embeddings: A Foundation Model for Cell Biology. Yanay Rosen et al. bioRxiv (2023) |
GitHub Repository |
Transformer |
Trained in a self-supervised learning fashion on a diverse corpus of cell atlas data encompassing humans and other species, this model offers a cohesive biological latent space capable of representing cells from any tissue or species, all without the need for manual data annotations |
36M |
300 |
Cross entropy loss |
Zero-shot embedding quality and clustering; Cell type organization; Zero-shot cell type alignment to Integrated Mega-scale Atlas (IMA) |
CellFM 💡 |
CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells. Yuansong Zeng et al. bioRxiv (2024) |
GitHub Repository |
Transformer |
An 800-million-parameter single-cell model trained on ~100 million human cells, outperforming existing models in applications such as cell annotation and gene function prediction |
100M |
20 |
Mean square error loss |
Cell type annotation; Perturbation prediction; Gene function prediction |
Nicheformer 💡 |
Nicheformer: A Foundation Model for Single-Cell and Spatial Omics. Anna C. Schaar et al. bioRxiv (2024) |
GitHub Repository |
Transformer |
Transformer-based model that integrates over 110 million human and mouse cells, learning unified representations from dissociated and spatial transcriptomics for advanced analysis of cellular interactions and environments. |
110M |
180+ |
Masked language modeling loss, Cross entropy loss |
Spatial cell type, niche region label prediction; Neighborhood cell density prediction |
CELLama 💡 |
CELLama: Foundation Model for Single Cell and Spatial Transcriptomics by Cell Embedding Leveraging Language Model Abilities. Hongyoon Choi et al. bioRxiv (2024) |
GitHub Repository |
Transformer |
CELLama leverages language models to transform scRNA-seq and spatial transcriptomics data into gene expression 'sentences,' facilitating advanced cellular analysis across diverse datasets |
536K+ |
4 |
Cosine similarity |
Cell typing; Integration |
LangCell 💡 |
LangCell: Language-Cell Pre-training for Cell Identity Understanding. Suyuan Zhao et al. arXiv (2024) |
GitHub Repository |
Transformer |
LangCell integrates single-cell data with natural language during pre-training enabling effective zero-shot, few-shot, and fine-tuning performance in cell identity understanding tasks |
27.5M |
4 |
Masked gene modeling, Cell-cell contrastive, Cell-text contrastive, and Cell-text matching losses |
Novel cell type identification; Cell type annotation; Batch integration |
GeneCompass 💡 |
GeneCompass: Deciphering Universal Gene Regulatory Mechanisms with Knowledge-Informed Cross-Species Foundation Model. Xiaodong Yang et al. bioRxiv (2023) |
GitHub Repository |
Transformer |
Cross-species foundation model pre-trained on over 120 million single-cell transcriptomes from humans and mice, integrating biological prior knowledge |
120M |
13 |
Mean square error, Cross entropy loss |
Cell type annotation; Gene regulatory network prediction; Drug dose response prediction |
scTranslator 💡 |
A pre-trained large generative model for translating single-cell transcriptome to proteome. Linjing Liu et al. bioRxiv (2023) |
GitHub Repository |
Transformer |
scTranslator, a pre-trained generative model inspired by NLP and genetic translation, enhances single-cell proteomics by generating multi-omics data from the transcriptome |
239K |
76 |
Mean square error |
Interaction inference; Cell clustering |
scMoFormer 🔍 |
Single-Cell Multimodal Prediction via Transformers. Wenzhuo Tang et al. ACM International Conference on Information and Knowledge Management (CIKM) (2023) |
GitHub Repository |
Transformer |
Transformer-based framework designed to leverage and model the interactions of multimodal single-cell data, incorporating external domain knowledge for enhanced performance |
146K |
3 |
Mean square error loss |
Multimodal prediction |
scTransSort 💡🔍 |
scTransSort: Transformers for Intelligent Annotation of Cell Types by Gene Embeddings. Linfang Jiao et al. Biomolecules (2023) |
GitHub Repository |
Transformer |
Cell-type annotation using transformers, pre-trained on single-cell transcriptomics data |
185K |
47 |
Sparse Categorical Cross entropy |
Cell type annotation |
BioFormers |
BioFormers: A Scalable Framework for Exploring Biostates using Transformers. Siham Amara-Belgadi et al. bioRxiv (2023) |
GitHub Repository |
Transformer |
Transformer-based unsupervised learning to model biological systems, defining a 'biostate' as a comprehensive vector of genomic, proteomic, and other biological markers |
8K |
3 |
Cross entropy loss |
Genetic perturbation prediction; Gene network inference |
MuSe-GNN 🔍 |
MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data. Tian Yu et al. Conference on Neural Information Processing Systems (NeurIPS) (2023) |
GitHub Repository |
Graph-Transformer |
Multimodal Similarity Learning Graph Neural Network, for integrating multimodal biological data to uncover gene function similarities across diverse datasets |
- |
82 |
Binary cross entropy, Cosine similarity, Noise contrastive estimation loss |
Cell cluster delineation; Marker gene identification; Cell developmental trajectory exploration; Cluster-specific gene-gene interaction module exploration; Analysis of bulk tissues |
scFormer |
scFormer: A Universal Representation Learning Approach for Single-Cell Data Using Transformers. Haotian Cui et al. bioRxiv (2022) |
GitHub Repository |
Transformer |
Transformer-based deep learning framework employing self-attention to jointly optimize unsupervised cell and gene embeddings |
27K |
3 |
Cross entropy loss |
Integration; Perturbation prediction |
scTT 🔍 |
Representation Learning and Translation between the Mouse and Human Brain using a Deep Transformer Architecture. Minxing Pang & Jesper Tegnér. International Conference on Machine Learning (ICML) Workshop on Computational Biology (2020) |
Unpublished |
Transformer |
Transformer-based architecture translates single-cell genomic data between mouse and human, with enhanced clustering accuracy |
170K |
2 |
Mean square error |
Clustering; Alignment |
scmFormer 💡🔍 |
scmFormer Integrates Large-Scale Single-Cell Proteomics and Transcriptomics Data by Multi-Task Transformer. Jing Xu et al. Advanced Science (2024) |
Unpublished |
Transformer decoder |
Transformer-based model that integrates single-cell multi-omics data, outperforming existing methods in label transfer and in handling large-scale datasets. It also improves modality generation and spatial multi-omic analysis. |
1.48M |
24 |
Mean square error |
Missing modality generation; Missing features generation; Cell type label transfer; Clustering; Dimensionality reduction |
scPRINT 💡 |
scPRINT: pre-training on 50 million cells allows robust gene network predictions. Jérémie Kalfon et al. bioRxiv (2024) |
GitHub Repository |
Transformer |
A large transformer-based cell model pre-trained on over 50 million cells and designed to infer gene networks and uncover complex cellular biology. |
50M+ |
800+ |
A combination of negative log-likelihood loss and contrastive loss |
Gene network inference |
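
The loss column above keeps returning to a small set of objectives: mean square error, cross entropy, cosine similarity, Poisson negative log-likelihood, and masked variants of these. As a rough reference, the sketch below writes out their textbook definitions in plain Python. It is not the implementation used by any model in the table; all function names are illustrative, and each model applies its own weighting, masking scheme, and tokenization.

```python
import math

def mean_squared_error(pred, target):
    # Average squared difference between predicted and observed expression values.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def masked_mse(pred, target, mask):
    # MSE restricted to masked positions, as in masked-expression pretraining.
    # `mask` holds True where a value was hidden from the model (assumed non-empty).
    pairs = [(p, t) for p, t, m in zip(pred, target, mask) if m]
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)

def cross_entropy(logits, label):
    # Negative log-probability of the true class under a softmax over `logits`,
    # computed with the max-subtraction trick for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors (assumed non-zero).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def poisson_nll(rate, count):
    # Poisson negative log-likelihood averaged over positions, dropping the
    # log(count!) term, which does not depend on the predicted rate.
    return sum(r - c * math.log(r) for r, c in zip(rate, count)) / len(count)
```

For instance, `masked_mse([1, 5, 3], [1, 2, 0], [False, True, False])` scores only the middle position, so the two unmasked values do not contribute to the loss.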