Skip to content

Latest commit

 

History

History
executable file
·
421 lines (318 loc) · 27.3 KB

index.md

File metadata and controls

executable file
·
421 lines (318 loc) · 27.3 KB

RNA-seq 2G manual {.tabset}

Zhe Zhang

2016-10-02

**RNA-seq 2G** is a web portal with >20 statistical methods that perform two-group analysis of differential gene expression. It uses read count data from RNA-seq or similar data matrix as input and generates test statistics in consistent format as output.

 

Introduction {.tabset}

Two-group comparison of differential expression (DE) is the most common analysis of transcriptome data. For RNA-seq data, the comparison is usually performed on a gene-level matrix of read counts, with the read counts corresponding to the number of sequencing reads mapped to each gene in each RNA-seq sample.
 

DE methods

Statistical methods that have been applied to two-group DE of RNA-seq data are widely different in terms of their assumptions on data distribution, input/output format, performance, sensitivity, and user-friendliness, which are summarized in the table below:
**Table 1** Analysis methods for differential expression.

Column description:

  • Name: Method names to be showed on the web portal.
  • Call: The names of the re-implemented functions in the DEGandMore package.
  • Default: An arbitrarily selected popular DE methods.
  • Speed: How much time the DE method takes to finish. Typically, fast methods take a few seconds, medium methods take up to a minute, and slow methods take a few minutes or more. Therefore, the waiting time will be long if any slow methods are selected.
  • Paired: Whether the method supports paired test.
  • Logged: Wehther the original method is performed on transformed data in logged scale.
  • Normalization: Whether the method has its own internal normalization procedure.
  • Distribution: Which data distribution the original method is based upon.
  • Test: The statistical test used by the method.
  • Function: The main function and package within which the method was originally implemented.
Name Call Default Speed Paired Logged Normalization Distribution Test Function
StudentsT DeT Yes Fast Yes Yes No Normal Student's t test, equal variance TDist {stats}
limma DeLimma Yes Fast Yes Yes No Normal Empirical Bayes moderation ebayes {limma}
edgeR DeEdgeR Yes Fast Yes No Yes Negative binomial Exact/Likelihood ratio exactTest {edgeR}
DESeq2 DeDeSeq Yes Fast Yes No Yes Negative binomial Generalized linear model DESeq {DESeq2}
ABSSeq DeAbsSeq No Fast Yes No Yes Negative binomial Sum of counts callDEs {ABSSeq}
BGmix DeBGmix No Fast Yes Yes No Normal Bayesian mixture model BGmix {BGmix}
PoissonSeq DePoissonSeq No Fast Yes No Yes Poisson log-linear Poisson goodness-of-fit PS.Main {PoissonSeq}
RBM DeRBM No Fast No Yes No Normal Empirical Bayes & resampling RBM_T {RBM}
voom DeVoomLimma No Fast Yes No Yes Log-normal Empirical Bayes moderation voom {limma}
WelchsT DeWelch No Fast Yes Yes No Normal Welch's t test, unequal variance TDist {stats}
DEGseq DeDegSeq No Medium No No No Binomial/Poisson Likelihood Ratio Test DEGexp {DEGseq}
EBSeq DeEbSeq No Medium No No Yes Negative Binomial Empirical Bayesian EBTest {EBSeq}
NOISeq DeNoiSeq No Medium No No Yes Nonparametric Empirical Bayes noiseqbio {NOISeq}
PLGEM DePlgem No Medium Yes No No Normal Power Law Global Error Model run.plgem {plgem}
RankProd DeRankP No Medium Yes Yes No Nonparametric Rank product RP {RankProd}
SAM DeSam No Medium Yes Yes No Normal Alternative t test with permutation samr {samr}
SAMSeq DeSamSeq No Medium Yes No No Nonparametric Wilcoxon with resampling SAMseq {samr}
sSeq DeSSeq No Medium No No No Negative Binomial Shrinkage Approach of Dispersion Estimation nbTestSH {sSeq}
Wilcoxon DeWilcoxon No Medium Yes Yes No Nonparametric Wilcoxon signed-rank test wilcox.test {stats}
baySeq DeBaySeq No Slow Yes No No Negative binomial Empirical Bayesian getLikelihoods {baySeq}
bridge DeBridge No Slow No Yes No T/Gaussian Bayesian hierarchical model bridge.2samples {bridge}
LMGene DeLMGene No Slow Yes No No Normal Linear model & glog transformation genediff {LMGene}
ALDEx2 DeAldex2 No Slower Yes No Yes Dirichlet Welch's t/Wilcoxon/Kruskal Wallace aldex {ALDEx2}
BADER DeBader No Slower No No Yes Overdispersed poisson Bayesian BADER {BADER}
edgeRun DeEdgeRun No Slower Yes No Yes Negative binomial Exact unconditional UCexactTest {edgeRun}
tweeDEseq DeTweeDeSeq No Slower No No Yes Poisson-Tweedie Poisson-like tweeDE {tweeDEseq}

Test statistics

All DE methods available through **RNAseq 2G** report a 6-column table of test statistics. All tables from the same analysis will have the same size and the same number and order of row and column names. The rows are the genes of read count matrix, after filtering for missing values and low read count.
**Table 2** Example of analysis output from all DE methods, a 6-column table:

Column description:

  • Mean_Control and Mean_Patient The group means of normalized read counts, using the internal normalization of the DE method or user-specified normalization if the method doesn't have one. If the DE method uses log-transformed data, the final values in these columns will be un-logged.
  • Patient-Control: The different between the first 2 columns.
  • LogFC: The log2-ratio of group means, which is often refered to as fold change. For example, when the average read counts of the two groups is 1:4, the log2-ratio equals to 2.0; and when the ratio is 8:1, the log2-ratio will be -3.0. If a DE method has no its own algorithms to calculate log2-ratios, the values will be calculated based on normalized data. Missing values will be replaced with 0.
  • Pvalue: The statistical significance of group difference. This is the column that differs all DE methods as each method has its own assumption about data distribution and choice of statistical test. Missing values in the original result will be replaced by 1.0 in the final table. RNA-seq 2G provides online visualization to compare p values from different DE methods side-by-side.
  • FDR: False discovery rate calculated from the p value using the Benjamini-Hochberg method. Although some DE methods also calculate FDRs or other adjusted p values using their own algorithms, they will not be used in this table. Therefore, the values in this column is solely based on the Pvalue column and calculated the same way for all DE methods.
Mean_Control Mean_Patient Control-Patient LogFC Pvalue FDR
A1BG 2.4379 4.0794 1.6415 0.5684 0.44000 0.610
A1BG-AS1 7.8024 13.5533 5.7509 0.8885 0.34000 0.520
A1CF 0.7970 1.7488 0.9518 1.3458 0.35000 0.540
A2M 14.3158 3.0886 -11.2272 -2.1986 0.08400 0.230
A2M-AS1 24.6289 6.3006 -18.3283 -2.1011 0.08500 0.230
A4GALT 0.2714 5.0005 4.7291 2.3824 0.23000 0.410
AAAS 66.9266 34.9312 -31.9954 -0.9293 0.00650 0.076
AACS 20.0464 11.2508 -8.7957 -0.8621 0.03600 0.150
AAED1 100.2450 110.5209 10.2759 0.0725 0.85000 0.910
AAGAB 83.6005 152.3225 68.7220 0.8871 0.00780 0.082
AAK1 947.7917 173.8182 -773.9735 -2.3011 0.01400 0.100
AAMDC 5.5618 1.0051 -4.5567 -2.6119 0.01300 0.098
AAMP 46.4962 32.5321 -13.9641 -0.5125 0.05000 0.170
AANAT 2.4125 9.3400 6.9275 1.6817 0.10000 0.250
AAR2 17.4167 62.8776 45.4609 1.8744 0.00085 0.043
AARS 49.9801 59.8619 9.8819 0.2605 0.57000 0.720

Prepare inputs {.tabset}

For RNA-seq data the raw sequencing reads need to be aligned to the reference genome and transcriptome using any alignment program. Next, the aligned reads should be assigned to annotated genes or transcripts to generate a read count matrix. RNA-seq 2G accepts other types of data, such as those generated by the proteomics and Nanostring technologies, as long as the raw data was processed similarly to generate a integer matrix.
 

Read count matrix

Using gene-level data from RNA-seq as an example, the read count matrix should have rows corresponding to unique genes and columns corresponding to unique samples, and each cell should be the number of sequencing reads of a sample mapped to a gene. Missing values are allowed, and will be dealt with before the analysis starts. **The row and column names of the matrix must be unique gene and sample IDs respectively.** The type of gene IDs (official symbols, RefSeq, etc.) doesn't matter. **RNA-seq 2G** accepts the read count matrix in different types of files and examples can be downloaded [here](https://github.com/zhezhangsh/awsomics/raw/master/rnaseq_2g/data/count_example.zip).
  • R file(.rds, .rda, or .rdata) that saves the read count matrix as a matrix or data.frame object.
  • Tab-delimited text file(.txt) that has the gene IDs in the first column and sample IDs in the first row.
  • Comma-delimited text file(.csv) that has the gene IDs in the first column and sample IDs in the first row.
  • Excel file(.xls or .xlsx) that has the matrix in the first worksheet and the gene and sample IDs in the first column and row.
  • HTML file(.htm or .html) that has the matrix as the first HTML table within the table tag.

Sample grouping

Sample IDs should be grouped into two groups to be compared. The sample IDs must match the column names of the read count matrix. The default names of the groups are ***Control*** and ***Case*** if the they are not explicitly named. **For paired test**, both groups must have the same number of samples and the paired samples must be in the same order.** **RNA-seq 2G** allows users to type in the group names and sample IDs directly or upload a formmated file. Examples of acceptable file formats can be downloaded [here](https://github.com/zhezhangsh/awsomics/raw/master/rnaseq_2g/data/group_example.zip).
  • R file(.rds, .rda, or .rdata) that is a list of two named character vectors of sample IDs.
  • Tab-delimited text file(.txt) that includes two lines, both started with group name and followed by sample IDs.
  • Comma-delimited text file(.csv) that includes two lines, both started with group name and followed by sample IDs.
  • Excel file(.xls or .xlsx) that includes two rows, both started with group name and followed by sample IDs.

Other parameters

Users also need to specify a few parameters to run the DE analysis:
  • Paired: Whether the comparison is paired. Only applicable to DE methods supporting paired test. The compared groups must have the same number of samples for paired test and the paired samples need to be organized in the same order.

  • Missing value: What to do with missing values. The options are replacing missing values with 0 and removing the genes including any missing values. The imputation option is currently not available, so users need to do it themselves if they want their data imputed. Indeed, it's highly encouraged that users handle the missing values with their favorite strategy beforehand.

  • Minimal count: Genes with total read counts from all samples less than this number will not be included in the result.

  • Normalization: How to normalize the read count matrix if the DE method has no its own methods.

    • Normalization of read count: How to normalize read count data. The options are "DESeq", "TMM", "RLE", "QQ", "UpperQuantile", "Median", and "TotalCount".

      • DESeq: the default normalization method provided the DESeq2 package.
      • TMM: the "trimmed mean of M-values" method provided by the EdgeR package.
      • RLE: the "relative log expression" method provided by the EdgeR package.
      • QQ: the quantile-quantile normalization to make all samples have the exactly same distribution.
      • UpperQuantile: rescale data so all the samples have the same upper quantile.
      • Median: rescale data so all the samples have the same median.
      • TotalCount: rescale data so all the samples have the same total number of read counts.
    • Normalization of logged data: How to normalize data after it's log-transformed. The options are "Loess", "VST", "Rlog", "QQ", "UpperQuantile", and "Median".

      • Loess: fit pairs of samples to local regression.
      • VST: the "variance stabilizing transformation" method provided the DESeq2 package.
      • Rlog: the "regularized log transformation" method provided the DESeq2 package.
      • QQ: the quantile-quantile normalization to make all samples have the exactly same distribution.
      • UpperQuantile: rescale data so all the samples have the same median.
      • Median: rescale data so all the samples have the same median.

Run analysis online {.tabset}

**RNAseq 2G** provides a user-friendly web portal to run a DE analysis using any of the available methods. Each analysis will be assigned a random ID and its results can be re-visited by specifying the ID. To set an analysis go to http://rnaseq2g.awsomics.org and finish 3 simple steps described below within the on the **Analysis** page.
 

Step 1

![](figure/analysis_step1.png)
**Figure 1A.** Set up online DE analysis, Step 1.
  • A: upload a local file of read count matrix in one of the acceptable formats.
  • B: click file extension to download an example file in acceptable format.
  • C: once a read count matrix is successfully loaded, the first a few rows will be showed for visual confirmation.

Step 2

![](figure/analysis_step2.png)
**Figure 1B.** Set up online DE analysis, Step 2.
  • A: upload a local file of sample grouping in an acceptable formats; the group names and sample ID will show up in boxes on the right.
  • B: click file extension to download an example file in acceptable format.
  • C: group names; can be directly typed in or automatically added through loaded file.
  • D: sample IDs; can be directly typed in (separated by commas) or automatically added through loaded file; for paired test, both groups should have the same number of samples and paired samples should be in the same order.
  • E: paired or unpaired test; will be ignored if the DE method doesn't support paired test.
  • F: replace missing values with 0 or remove the genes including any missing value.
  • G: methods to normalize original read count data and log-transformed data; ignored if the DE method has its own normalization.
  • H: genes with total number of reads less than this number will not be included in the results.

Step 3

![](figure/analysis_step3.png)
**Figure 1C.** Set up online DE analysis, Step 3.
  • A: pick the DE methods to run.
  • B: quickly pick a subset of methods to run.
  • C: click to show detailed description of all DE methods.

Run analysis offline {.tabset}

An alternative to use **RNA-seq 2G** is to directly call the ***DeRNAseq {DEGandMore}*** function within R. This option is more suitable for DE analysis runs using any of the slow methods (see **Table 1**). When any slow DE methods are selected, the online waiting might be too long and users should run the DE analysis offline, but can later upload their results to **RNAseq 2G** for visualization. Running the offline analysis takes some basic R skills and a few simple steps.
 

Install packages

Open RStudio or R console, **make sure the base version of R is 3.2 or higher and have the _devtools_ package installed**, using the code below to find out.
version;
if(!require('devtools')) install.packages('devtools'); 

Install all required R packages from CRAN, Bioconductor, and GitHub and test if they can be loaded properly using the code below. Depending on how many packages need to be installed, this process might take a while. If everything goes well, you will get a message at the end of screen messages, saying "Congratulations! All required packages have been installed and loaded."

require(devtools);
source_url("https://raw.githubusercontent.com/zhezhangsh/DEGandMore/master/installer.r"); 

Perform analysis

Run DE analysis using the code below based on the **DEGandMore::DeRNAseq** function.
require(RoCA);
require(DEGandMore); 

# Import read count matrix and sample grouping from files; see above for acceptable file formats
read.count   <- ImportTable('my_read_count_file');
sample.group <- ImportList('my_sample_group_file');

# Set up parameters
pr <- FALSE;   # change to TRUE for paired test
md <- 0;       # change to 1-4 for other method groups or specific method names
mc <- 6;       # change to another non-negative integer for a different cutoff
nc <- 1;       # change to 2 or higher if the system supports parallel computing
n1 <- 'DESeq'; # change to "TMM", "RLE", "QQ", "UpperQuantile", "Median", or "TotalCount" to use other methods
n2 <- 'Loess'; # change to "VST", "Rlog", "QQ", "UpperQuantile", or "Median" to use other methods

####################################################################################################
# Run DE analysis
my.analysis.result <- DeRNAseq(ct = read.count, grps = sample.group, mthds = md, paired = pr,
                               min.count = mc, num.cluster = nc, just.stat = TRUE, 
                               norm.count = n1, norm.logged = n2, force.norm = TRUE);
# note that the parameters 'just.stat' and 'force.norm' should always be TRUE for RNAseq 2G
####################################################################################################

# save results to file
saveRDS(my.analysis.result, 'my_analysis_result.rds');
# the 'my_analysis_result.rds' file can now be uploaded to the RNAseq 2G web portal for visualizaiton

Alternatively, simply run DE analysis using just 2 specific methods and default parameters.

my.analysis.result <- DeRNAseq(read.count, sample.group, c('DeEdgeR', 'DeDeSeq'), force.norm = TRUE);

List methods

Available DE methods and method groups can be found using the code below.
DeRNAseqMethods(group=0); # Default methods
## [1] "DeT"     "DeLimma" "DeEdgeR" "DeDeSeq"
DeRNAseqMethods(group=1); # Fast methods
##  [1] "DeT"          "DeLimma"      "DeEdgeR"      "DeDeSeq"      "DeAbsSeq"     "DeBGmix"      "DePoissonSeq"
##  [8] "DeRBM"        "DeVoomLimma"  "DeWelch"
DeRNAseqMethods(group=2); # Fast+medium methods
##  [1] "DeT"          "DeLimma"      "DeEdgeR"      "DeDeSeq"      "DeAbsSeq"     "DeBGmix"      "DePoissonSeq"
##  [8] "DeRBM"        "DeVoomLimma"  "DeWelch"      "DeDegSeq"     "DeEbSeq"      "DeNoiSeq"     "DePlgem"     
## [15] "DeRankP"      "DeSam"        "DeSamSeq"     "DeSSeq"       "DeWilcoxon"
DeRNAseqMethods(group=3); # Fast+medium+slow methods
##  [1] "DeT"          "DeLimma"      "DeEdgeR"      "DeDeSeq"      "DeAbsSeq"     "DeBGmix"      "DePoissonSeq"
##  [8] "DeRBM"        "DeVoomLimma"  "DeWelch"      "DeDegSeq"     "DeEbSeq"      "DeNoiSeq"     "DePlgem"     
## [15] "DeRankP"      "DeSam"        "DeSamSeq"     "DeSSeq"       "DeWilcoxon"   "DeBaySeq"     "DeBridge"    
## [22] "DeLMGene"
DeRNAseqMethods(group=4); # All methods
##  [1] "DeT"          "DeLimma"      "DeEdgeR"      "DeDeSeq"      "DeAbsSeq"     "DeBGmix"      "DePoissonSeq"
##  [8] "DeRBM"        "DeVoomLimma"  "DeWelch"      "DeDegSeq"     "DeEbSeq"      "DeNoiSeq"     "DePlgem"     
## [15] "DeRankP"      "DeSam"        "DeSamSeq"     "DeSSeq"       "DeWilcoxon"   "DeBaySeq"     "DeBridge"    
## [22] "DeLMGene"     "DeAldex2"     "DeBader"      "DeEdgeRun"    "DeTweeDeSeq"

Browse results {.tabset}

***TO BE FINISHED***
 
Results of DE analysis can be visualized and compared online via **RNAseq 2G**.
 
![](figure/result_upload.png)
![](figure/result_table.png)
![](figure/result_plot.png)
![](figure/compare_both.png)
![](figure/compare_rank.png)
![](figure/compare_both.png)

END OF DOCUMENT