A highly scalable and accurate inference of gene expression and structure for single-cell transcriptomes using semi-supervised deep learning.
- Free software: Apache License 2.0
- Python >=3.6
- TensorFlow >=1.13.1,<2.0.0
- numpy >=1.14.0
- pandas >=0.21.0
- h5py >=2.9.0
- matplotlib >=3.0.0
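The version bounds above can be checked programmatically. The following is a minimal sketch, not part of DISC: REQUIREMENTS, parse_version, satisfies and check_installed are hypothetical helpers introduced here, and the check assumes each package exposes a __version__ attribute.

```python
import importlib
import re

# Bounds copied from the requirement list above: (minimum, exclusive upper bound).
REQUIREMENTS = {
    "tensorflow": ("1.13.1", "2.0.0"),
    "numpy": ("1.14.0", None),
    "pandas": ("0.21.0", None),
    "h5py": ("2.9.0", None),
    "matplotlib": ("3.0.0", None),
}


def parse_version(text):
    """Turn '1.13.1' or '1.13.1rc0' into a tuple of its leading integers.

    Pre-release suffixes such as 'rc0' are ignored in this sketch.
    """
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("not a version string: {!r}".format(text))
    numbers = [int(part) for part in match.group(0).split(".")]
    # Pad to three components so that '1.14' compares equal to '1.14.0'.
    numbers += [0] * max(0, 3 - len(numbers))
    return tuple(numbers)


def satisfies(installed, minimum, below=None):
    """Return True if minimum <= installed, and installed < below when given."""
    version = parse_version(installed)
    if version < parse_version(minimum):
        return False
    if below is not None and version >= parse_version(below):
        return False
    return True


def check_installed(requirements=REQUIREMENTS):
    """Import each package and report whether its version is within bounds."""
    report = {}
    for name, (minimum, below) in requirements.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = "missing"
            continue
        version = getattr(module, "__version__", None)
        if version is None:
            report[name] = "unknown version"
        elif satisfies(version, minimum, below):
            report[name] = "ok ({})".format(version)
        else:
            report[name] = "unsupported ({})".format(version)
    return report
```

Calling check_installed() returns a dictionary keyed by package name, so a missing or out-of-range dependency can be spotted before installing DISC.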
Install TensorFlow
If you have an Nvidia GPU, be sure to install a version of TensorFlow that supports it first -- DISC runs much faster with GPU:
pip install "tensorflow-gpu>=1.13.1,<2.0.0"
We typically use tensorflow-gpu==1.13.1.
Here are the requirements for the GPU version of TensorFlow:
- Hardware
  - NVIDIA GPU card with CUDA Compute Capability 3.5 or higher.
- Software
  - NVIDIA GPU drivers: CUDA 10.0 requires 410.x or higher.
  - CUDA Toolkit: TensorFlow supports CUDA 10.0 (TensorFlow >= 1.13.0).
  - CUPTI, which ships with the CUDA Toolkit.
  - cuDNN SDK (>= 7.4.1).
See this for further information.
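After installing, it can be useful to confirm that TensorFlow actually sees the GPU. The sketch below assumes a TensorFlow 1.x installation and uses tf.test.is_built_with_cuda and tf.test.is_gpu_available; the function name tensorflow_gpu_status is a hypothetical helper and not part of DISC.

```python
def tensorflow_gpu_status():
    """Describe whether the installed TensorFlow can use a GPU."""
    try:
        # Imported lazily so the check also works when TensorFlow is absent.
        import tensorflow as tf
    except ImportError:
        return "TensorFlow is not installed"
    if not tf.test.is_built_with_cuda():
        return "TensorFlow {} was built without CUDA support".format(tf.__version__)
    if tf.test.is_gpu_available():
        return "TensorFlow {} can use a GPU".format(tf.__version__)
    return "TensorFlow {} has CUDA support but found no usable GPU".format(tf.__version__)


print(tensorflow_gpu_status())
```

If the last message is printed, the driver, CUDA Toolkit or cuDNN versions listed above are the first things to check.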
Install DISC with pip
To install with pip, run the following from a terminal:
pip install disc
Install DISC from GitHub
To clone the repository and install manually, run the following from a terminal:
git clone git://github.com/iyhaoo/DISC.git
cd DISC
python setup.py install
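Either installation route should make the disc command used in the Quick Start available. A small sketch for checking this from Python, assuming the installed entry point is named disc as in the commands of this README (find_disc_executable is a hypothetical helper):

```python
import shutil


def find_disc_executable():
    """Return the full path of the 'disc' command, or None if it is not on PATH."""
    return shutil.which("disc")


location = find_disc_executable()
if location is None:
    print("disc was not found on PATH; check that the installation succeeded")
else:
    print("disc is installed at", location)
```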
Quick Start
(1). How to run DISC:
disc \
    --dataset=matrix.loom \
    --out-dir=out_dir
where matrix.loom is a loom-formatted raw count matrix with genes in rows and cells in columns, and out_dir is the target path for the output folder.
(2). What DISC outputs:
- log.tsv: records DISC training information.
- summary.pdf: shows the fitting line and optimal point, and is updated in real time while DISC is running.
- summary.tsv: records the raw data shown in summary.pdf.
- result: imputation result folder, which contains:
  - imputation.loom: the imputed matrix with genes in rows and cells in columns.
  - feature.loom: the feature matrix with features in rows and cells in columns.
  - running_info.hdf5: an HDF5-formatted file containing useful information about matrix.loom (e.g. library size, the expressed counts and cells for each gene, imputed genes, etc.).
- models: at every save interval, DISC freezes its parameters into this folder (in pb format).
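The Quick Start can also be driven from a Python script. The sketch below is an illustration under stated assumptions, not DISC's own API: it uses only the two command-line options shown above, assumes the output files sit directly under out_dir as the list suggests, and reads the imputed matrix from the matrix dataset that the Loom format stores at the top level of the file. All function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_disc_command(dataset, out_dir):
    """Build the command line from step (1) as an argument list."""
    return ["disc", "--dataset={}".format(dataset), "--out-dir={}".format(out_dir)]


def expected_outputs(out_dir):
    """Map each output described in step (2) to its assumed location."""
    out = Path(out_dir)
    result = out / "result"
    return {
        "log": out / "log.tsv",
        "summary_pdf": out / "summary.pdf",
        "summary_tsv": out / "summary.tsv",
        "imputation": result / "imputation.loom",
        "feature": result / "feature.loom",
        "running_info": result / "running_info.hdf5",
        "models": out / "models",
    }


def run_disc(dataset, out_dir):
    """Run DISC and return the names of expected outputs that are missing."""
    subprocess.run(build_disc_command(dataset, out_dir), check=True)
    outputs = expected_outputs(out_dir)
    return [name for name, path in outputs.items() if not path.exists()]


def load_imputed_matrix(out_dir):
    """Load the imputed matrix (genes in rows, cells in columns) into memory."""
    import h5py  # listed among the DISC requirements

    path = expected_outputs(out_dir)["imputation"]
    with h5py.File(str(path), "r") as handle:
        # Reads the whole matrix; slice the dataset instead for large inputs.
        return handle["matrix"][...]
```

For example, run_disc("matrix.loom", "out_dir") followed by load_imputed_matrix("out_dir") would run the imputation and return the result as a NumPy array, provided the assumed layout holds.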
Data availability
The sources of our data are listed here.
- SSCORTEX :
- The somatosensory cortex of CD-1 mice at ages p28 and p29 was profiled by 10X, where 7,477 cells were detected (scRNA-seq). In addition, an osmFISH experiment on 4,839 cells from the somatosensory cortex, hippocampus and ventricle of a CD-1 mouse at age p22 was conducted, in which 33 genes were detected (FISH).
- CBMC :
- Cord blood mononuclear cells were profiled by CITE-seq, where 8,005 human cells were detected in total (scRNA-seq).
- PBMC :
- 2,700 freeze-thaw peripheral blood mononuclear cells (PBMC) from a healthy donor were profiled by 10X, where 32,738 genes were detected (scRNA-seq).
- JURKAT_293T :
- 3,258 Jurkat cells (scRNA-seq) and 2,885 293T cells (scRNA-seq) were profiled by 10X separately. This dataset has bulk RNA-seq data (bulk RNA-seq).
- 10X_5CL :
- 5,001 cells from 5 human lung adenocarcinoma cell lines H2228, H1975, A549, H838 and HCC827 were profiled by 10X (scRNA-seq). This dataset has bulk RNA-seq data (bulk RNA-seq).
- BONE_MARROW :
- 6,941 human bone marrow cells from sample MantonBM6 were profiled by 10X. The original single-cell RNA sequencing data provided by HCA was aligned to hg19, and 6,939 cells remained after cell filtering (scRNA-seq). This dataset has bulk RNA-seq data (bulk RNA-seq).
- RETINA :
- Retinas of mice at age p14 were profiled in 7 different replicates by Drop-seq, where 6,600, 9,000, 6,120, 7,650, 7,650, 8,280, and 4,000 (49,300 in total) STAMPs (single-cell transcriptomes attached to micro-particles) were collected (scRNA-seq). The dataset has cell annotation.
- BRAIN_SPLiT :
- 156,049 mouse nuclei from the developing brain and spinal cord of mice at age p2 or p11 were profiled by SPLiT-seq (scRNA-seq). The cell annotation of this dataset is included in the file GSM3017261_150000_CNS_nuclei.mat.gz on the same GEO page.
- BRAIN_1.3M :
- 1,306,127 cells from combined cortex, hippocampus, and subventricular zone of 2 E18 C57BL/6 mice were profiled by 10X (scRNA-seq).
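The dataset sizes quoted above span almost three orders of magnitude. The sketch below simply collects the scRNA-seq cell counts from the descriptions into a dictionary and checks the RETINA total; the variable and function names are hypothetical, and the JURKAT_293T entry is the sum of the two separately profiled cell lines.

```python
# STAMP counts of the seven RETINA replicates, as listed above.
RETINA_REPLICATES = [6600, 9000, 6120, 7650, 7650, 8280, 4000]

# scRNA-seq cell (or nucleus) counts taken from the dataset descriptions.
CELL_COUNTS = {
    "SSCORTEX": 7477,
    "CBMC": 8005,
    "PBMC": 2700,
    "JURKAT_293T": 3258 + 2885,
    "10X_5CL": 5001,
    "BONE_MARROW": 6939,  # after cell filtering
    "RETINA": sum(RETINA_REPLICATES),
    "BRAIN_SPLiT": 156049,
    "BRAIN_1.3M": 1306127,
}


def datasets_by_size(counts=CELL_COUNTS):
    """Return (name, cell count) pairs ordered from smallest to largest."""
    return sorted(counts.items(), key=lambda item: item[1])


for name, cells in datasets_by_size():
    print("{:<12} {:>9,}".format(name, cells))
```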
We provide our pre-processed data here.
Table columns: Dataset, Raw Data, DS Data, FISH Data, Bulk Data, Cell Type Annotation.
Evaluations
Data Preparation, Imputation and Computational Resource Evaluation
- (1). Data Pre-processing
- (2). Imputation
- (3). Computational Resource Evaluation (Results, Test Program)
Data Structure Recovery Evaluation
- (1). Gene Expression Structures (FISH)
- Tutorial : MELANOMA
- (2). Gene and Cell Structures (Down-sampling)
- Tutorial : MELANOMA
- (S1). Spearman Correlation (Bulk)
- Tutorial : JURKAT_293T
- (S2). Identification of True Zeros (Down-sampling)
- Tutorial : MELANOMA, SSCORTEX, CBMC and PBMC
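Evaluation (S1) above names Spearman correlation against bulk data. As a reminder of what that statistic computes, here is a minimal pure-Python sketch: it ranks both inputs (ties receive the mean of their positions) and takes the Pearson correlation of the ranks. It is not the evaluation code from the tutorials, and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length of at least 2")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x = sum(rank_x) / len(rank_x)
    mean_y = sum(rank_y) / len(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = sum((a - mean_x) ** 2 for a in rank_x)
    spread_y = sum((b - mean_y) ** 2 for b in rank_y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / sqrt(spread_x * spread_y)
```

In a bulk comparison, x and y might be per-gene expression values from the imputed single-cell data and from bulk RNA-seq; that pairing is an assumption here, and the tutorial defines the actual procedure.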
Down-stream Analysis Improvement
- (1). Cell Type Identification (Down-sampling)
- Tutorial : PBMC
- (2). DEG Identification (Bulk)
- Tutorial : JURKAT_293T
- (3). Solution for Large Dataset Analysis
- Tutorial : PBMC
- (S1). Trajectory Analysis
- Tutorial : BONE_MARROW
Other Utility Scripts
Script
Output
Yao He#, Hao Yuan#, Cheng Wu#, Zhi Xie*. DISC: a highly scalable and accurate inference of gene expression and structure for single-cell transcriptomes using semi-supervised deep learning. Genome Biology 21, 170 (2020). https://doi.org/10.1186/s13059-020-02083-3
- Update CLI.
- First release on PyPI.