Manual
sgRSEA can run the whole pipeline from fastq to result with a single command (sgrsea run), or run specific steps individually:
- count Get sgRNA count matrix
- normalization Normalize sgRNA count matrix
- reformat Reformat count table for sgRSEA stat test
- stattest Identify significant genes from normalized count table
usage: sgrsea [-h] [--version] {count,reformat,run,stattest,normalization} ...
sgRSEA: identify significant genes in CRISPR-Cas9 experiment
positional arguments:
{count,reformat,run,stattest,normalization}
run Run the whole program from fastq to result
count Get sgRNA count matrix
normalization Normalize sgRNA count matrix
reformat Reformat count table for sgRSEA stat test
stattest Identify significant genes from normalized count table
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
For command line options of each command, type sgrsea COMMAND -h
sgrsea count works on a single fastq file (-i) or on multiple fastq files (-d). When counting multiple fastq files, provide a design file that describes each of them.
usage: sgrsea count [-h] [-i INFILE | -d DESIGNFILE] -o OUTFILE [-l LIBFILE]
[--sgstart SGSTART] [--sgstop SGSTOP] [--trim3 TRIM3]
optional arguments:
-h, --help show this help message and exit
-i INFILE, --input INFILE
input fastq
-d DESIGNFILE, --design DESIGNFILE
design file
-o OUTFILE, --output OUTFILE
output
-l LIBFILE, --library LIBFILE
Gene locus in bed format
--sgstart SGSTART The first nucleotide sgRNA starts. 1-index
--sgstop SGSTOP The last nucleotide sgRNA starts. 1-index
--trim3 TRIM3 The trimming pattern from 3'. This pattern and the
following sequence will be removed
usage: sgrsea normalization [-h] -i INFILE
[--normalize-method {total,median,upperquantile}]
-o OUTFILE [--split-lib]
optional arguments:
-h, --help show this help message and exit
-i INFILE, --input INFILE
input count file matrix
--normalize-method {total,median,upperquantile}
design file
-o OUTFILE, --output OUTFILE
output
--split-lib Lib A and B are sequenced separately
sgrsea stattest takes a data matrix with 4 columns: sgRNA, Gene, treatment, control. sgrsea reformat collapses replicates and, if the design file calls for more than one comparison, writes one stattest input file per comparison.
usage: sgrsea reformat [-h] -i INFILE -o OUTFILE [-d DESIGNFILE] -t TREAT -c
CTRL [--collapse-replicates {auto,stack,mean}]
optional arguments:
-h, --help show this help message and exit
-i INFILE, --input INFILE
input BAM file
-o OUTFILE, --output OUTFILE
output
-d DESIGNFILE, --design DESIGNFILE
output
-t TREAT, --treatment TREAT
columns/name of treatment samples
-c CTRL, --control CTRL
columns/name of control samples
--collapse-replicates {auto,stack,mean}
Way to collapse replicates
usage: sgrsea stattest [-h] -i INFILE -o OUTFILE --multiplier MULTIPLIER
optional arguments:
-h, --help show this help message and exit
-i INFILE, --input INFILE
sgRSEA input file, 4 columns
-o OUTFILE, --output OUTFILE
output file name
--multiplier MULTIPLIER
Multiplier to generate background
usage: sgrsea run [-h] [-i INFILE | -d DESIGNFILE] -o OUTFILE [-l LIBFILE]
[--sgstart SGSTART] [--sgstop SGSTOP] [--trim3 TRIM3]
[--normalize_method {total,median,upperquantile}]
[--split-lib] -t TREAT -c CTRL [--t-lable TREATLABEL]
[--c-label CTRLLABEL] [--multiplier MULTIPLIER]
[--random-seed RANDOMSEED]
[--collapse-replicates {auto,stack,mean}] [--no-count]
Run the whole sgRSEA suite
optional arguments:
-h, --help show this help message and exit
-i INFILE, --input INFILE
input fastq
-d DESIGNFILE, --design DESIGNFILE
design file
-o OUTFILE, --output OUTFILE
output
-l LIBFILE, --library LIBFILE
Gene locus in bed format
--sgstart SGSTART The first nucleotide sgRNA starts. 1-index
--sgstop SGSTOP The last nucleotide sgRNA starts. 1-index
--trim3 TRIM3 The trimming pattern from 3'. This pattern and the
following sequence will be removed
--normalize_method {total,median,upperquantile}
design file
--split-lib Lib A and B are sequenced separately
-t TREAT, --treatment TREAT
columns/name of treatment samples
-c CTRL, --control CTRL
columns/name of control samples
--t-lable TREATLABEL label of treatment samples
--c-label CTRLLABEL label of control samples
--multiplier MULTIPLIER
Multiplier to generate background
--random-seed RANDOMSEED
Random seed to control permutation process
--collapse-replicates {auto,stack,mean}
Way to collapse replicates
--no-count Skip counting step. Uses output contents as input
For command line options of each command, type sgrsea COMMAND -h
Run the suite:
#Run the suite from fastq to final results
sgrsea run -d design_file -o my_experiment
# Get the count for single fastq file
sgrsea count -i my_fastq -o output -l sgRNA_lib
# Get the counts for multiple fastq files and combine them into one matrix
sgrsea count -d design.txt -o output_prefix
# Normalize the count matrix
sgrsea normalization -i count_matrix --normalize-method total -o outputfile
# Normalize the count matrix when sub-libraries are sequenced separately
sgrsea normalization -i count_matrix --normalize-method total -o outputfile --split-lib
# Reformat the matrix into sgRSEA 4-column matrix.
# -t and -c values MUST match group content of the design file
sgrsea reformat -i normalized_matrix -d design.txt -t Heat,Cold,Dry -c Ctrl,Ctrl,Ctrl -o output_prefix --collapse-replicates auto
# If you only have two groups, you can use column indexes (the first data column is 0) for treatment and control, and you don't need a design file
sgrsea reformat -i normalized_matrix -t 0,1 -c 2,3 -o output_prefix --collapse-replicates auto
# Stattest on normalized and formatted matrix
sgrsea stattest -i matrix -o output --multiplier MULTIPLIER
To run the whole suite on your experiment, you need to prepare a design file (individual commands may not need one). The design file must include all of the columns below. Use exactly these column names; their order does not matter. Use tab as the delimiter.
- filepath: The absolute path to the fastq file
- lib: The absolute path to the library file. If there is no sublib, this column should have the same value across all rows
- sublib: The name of the sublibrary, e.g. GeCKO_libA, GeCKO_libB. Use this when sublibraries are sequenced separately
- sample: Name of each sample. Do not include sublib information here
- group: This will be used as the output prefix for each sample. Do not use spaces (" ") or hyphens ("-") in the values
- sgstart: The first nucleotide of sgRNA. 1-index
- sgstop: The last nucleotide of sgRNA. 1-index
- trim3: Sequence pattern of the 3' adaptor, usually 5 to 7 nt. If provided, the program looks for a perfect match of this pattern in the fastq sequence. The last match and all nucleotides after it are trimmed. If you don't need this, put "NA" in the design file
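The sgstart, sgstop and trim3 rules above can be sketched in Python. This is a minimal illustration, not sgRSEA's actual code: the function name `extract_sgrna`, the example read, and the adaptor `GTTTTAG` are hypothetical, and the order of operations (trim first, then slice) is an assumption.

```python
from typing import Optional


def extract_sgrna(read: str, sgstart: int, sgstop: int,
                  trim3: Optional[str] = None) -> str:
    """Cut the sgRNA out of one read using 1-indexed, inclusive positions."""
    if trim3:
        # Find the LAST perfect match of the 3' pattern and drop it,
        # together with everything that follows it.
        cut = read.rfind(trim3)
        if cut != -1:
            read = read[:cut]
    # Convert the 1-indexed start to Python's 0-indexed slicing.
    start = sgstart - 1
    # Assumption: sgstop == -1 means the 3' trim decides where the sgRNA ends.
    stop = None if sgstop == -1 else sgstop
    return read[start:stop]


# Hypothetical read: 4 nt prefix + 20 nt sgRNA + 7 nt adaptor + 3 nt tail.
read = "AAAA" + "GTCGCTGAGCTCCGATTCGA" + "GTTTTAG" + "CCC"
print(extract_sgrna(read, sgstart=5, sgstop=24))
print(extract_sgrna(read, sgstart=5, sgstop=-1, trim3="GTTTTAG"))
```

Both calls select the same 20 nt sgRNA: the first by fixed positions, the second by trimming at the adaptor.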
Example
filepath | sample | lib | sublib | group | sgstart | sgstop | trim3 |
---|---|---|---|---|---|---|---|
UA1.fastq | U | GeCKOv2_Library_A.txt | LibA | CONTROL | 34 | 53 | False |
UB1.fastq | U | GeCKOv2_Library_B.txt | LibB | CONTROL | 36 | 55 | False |
HA1.fastq | H | GeCKOv2_Library_A.txt | LibA | TREATMENT | 42 | 61 | False |
HB1.fastq | H | GeCKOv2_Library_B.txt | LibB | TREATMENT | 35 | 54 | False |
In this example, there is one treatment and one control. Libraries A and B are sequenced separately, and the sgRNA positions differ between files. No 3' adaptor sequence is provided, so the trim3 column is filled with "False". If the 3' adaptor should override sgstop, fill the sgstop column with "-1". You can download the example and open it in Excel as a template for your own design file.
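If you would rather generate the design file from a script than edit it in Excel, something like the following may help. It is only a sketch: `write_design` and `REQUIRED_COLUMNS` are hypothetical names, not part of sgRSEA, and the checks cover only the rules stated above (required columns, tab delimiter, no spaces or hyphens in group).

```python
import csv
import io

# Column names required by the design file (order does not matter).
REQUIRED_COLUMNS = ["filepath", "sample", "lib", "sublib",
                    "group", "sgstart", "sgstop", "trim3"]


def write_design(rows, handle):
    """Write design rows as tab-delimited text after basic checks."""
    writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        if " " in row["group"] or "-" in row["group"]:
            raise ValueError(f"invalid group name: {row['group']!r}")
        writer.writerow(row)


# Rows copied from the example table above.
EXAMPLE_ROWS = [
    {"filepath": "UA1.fastq", "sample": "U", "lib": "GeCKOv2_Library_A.txt",
     "sublib": "LibA", "group": "CONTROL", "sgstart": "34", "sgstop": "53",
     "trim3": "False"},
    {"filepath": "UB1.fastq", "sample": "U", "lib": "GeCKOv2_Library_B.txt",
     "sublib": "LibB", "group": "CONTROL", "sgstart": "36", "sgstop": "55",
     "trim3": "False"},
    {"filepath": "HA1.fastq", "sample": "H", "lib": "GeCKOv2_Library_A.txt",
     "sublib": "LibA", "group": "TREATMENT", "sgstart": "42", "sgstop": "61",
     "trim3": "False"},
    {"filepath": "HB1.fastq", "sample": "H", "lib": "GeCKOv2_Library_B.txt",
     "sublib": "LibB", "group": "TREATMENT", "sgstart": "35", "sgstop": "54",
     "trim3": "False"},
]

buffer = io.StringIO()
write_design(EXAMPLE_ROWS, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with `open("design.txt", "w", newline="")` instead of the in-memory buffer.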
The library file has 3 columns: sgRNA, Gene, Sequence. The order of the columns doesn't matter. Please make sure the column names are exactly as given here. Use tab as the delimiter.
Example
Gene | sgRNA | Sequence |
---|---|---|
A1BG | HGLibA_00001 | GTCGCTGAGCTCCGATTCGA |
A1BG | HGLibA_00002 | ACCTGTAGTTGCCGGCGTGC |
A1BG | HGLibA_00003 | CGTCAGCGTCACATTGGCCA |
A1CF | HGLibA_00004 | CGCGCACTGGTCCAGCGCAC |
A1CF | HGLibA_00005 | CCAAGCTATATCCTGTGCGC |
A1CF | HGLibA_00006 | AAGTTGCTTGATTGCATTCT |
A2M | HGLibA_00007 | CGCTTCTTAAATTCTTGGGT |
A2M | HGLibA_00008 | TCACAGCGAAGGCGACACAG |
A2M | HGLibA_00009 | CAAACTCCTTCATCCAAGTC |
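Before running the count step, it can be useful to check that a library file follows the format above. The sketch below is one way to do that; `read_library` is a hypothetical helper, not an sgRSEA function, and the sequence-to-guide lookup it returns is only an illustration of how such a file might be used.

```python
import csv
import io


def read_library(handle):
    """Read a tab-delimited library file into {sequence: (sgRNA, gene)}."""
    reader = csv.DictReader(handle, delimiter="\t")
    required = {"sgRNA", "Gene", "Sequence"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"library file is missing columns: {sorted(missing)}")
    lookup = {}
    for row in reader:
        lookup[row["Sequence"]] = (row["sgRNA"], row["Gene"])
    return lookup


# First rows of the example table above, as tab-delimited text.
example = io.StringIO(
    "Gene\tsgRNA\tSequence\n"
    "A1BG\tHGLibA_00001\tGTCGCTGAGCTCCGATTCGA\n"
    "A1BG\tHGLibA_00002\tACCTGTAGTTGCCGGCGTGC\n"
    "A1CF\tHGLibA_00004\tCGCGCACTGGTCCAGCGCAC\n"
)
library = read_library(example)
print(library["GTCGCTGAGCTCCGATTCGA"])
```

Because columns are looked up by name, the column order in the file does not matter, which matches the rule stated above.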
The count matrix will contain basic sgRNA, gene, sequence, sublib information with counts of each sample as an extra column. The normalization matrix will contain sgRNA, gene information, along with normalized counts of each sample as an extra column.
The reformatted matrix has four columns: sgRNA, Gene, treatment, control. If there are multiple comparisons, one file is generated per comparison.
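To make the four-column layout concrete, here is a rough sketch of building it from a normalized matrix using column indexes (first data column is 0, as in the `-t 0,1 -c 2,3` example). It assumes replicates are collapsed by taking their mean; how sgRSEA's auto, stack and mean modes actually behave is not described here, so treat `to_four_columns` and the numbers below as purely illustrative.

```python
def to_four_columns(rows, treat_idx, ctrl_idx):
    """Collapse replicate columns into one treatment and one control value.

    rows: iterable of (sgRNA, gene, [data values]) tuples.
    treat_idx / ctrl_idx: 0-based indexes into the data values.
    """
    out = []
    for sgrna, gene, values in rows:
        # Assumption: replicates are collapsed by their arithmetic mean.
        treat = sum(values[i] for i in treat_idx) / len(treat_idx)
        ctrl = sum(values[i] for i in ctrl_idx) / len(ctrl_idx)
        out.append((sgrna, gene, treat, ctrl))
    return out


# Made-up normalized counts for two guides; data columns 0-1 are treatment,
# columns 2-3 are control.
normalized = [
    ("HGLibA_00001", "A1BG", [10.0, 20.0, 30.0, 50.0]),
    ("HGLibA_00004", "A1CF", [4.0, 6.0, 1.0, 3.0]),
]
for row in to_four_columns(normalized, treat_idx=[0, 1], ctrl_idx=[2, 3]):
    print("\t".join(str(field) for field in row))
```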
For each comparison, there will be a result file. Columns are:
- Gene: name of the gene
- sgcount: number of sgRNA per gene
- NScore: normalized maxmean score
- pos_p: p value for positive selection
- pos_fdr: FDR for positive selection
- pos_rank: gene rank for positive selection
- neg_p: p value for negative selection
- neg_fdr: FDR for negative selection
- neg_rank: gene rank for negative selection
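A result file with these columns can be filtered with a few lines of Python. The sketch below assumes the result file is tab-delimited with a header row (an assumption, since the delimiter is not stated above); `significant_genes`, the FDR cutoff of 0.05, and the toy gene names and values are all hypothetical.

```python
import csv
import io


def significant_genes(handle, fdr_cutoff=0.05, direction="pos"):
    """Return gene names passing the FDR cutoff, ordered by rank.

    direction is "pos" (positive selection) or "neg" (negative selection).
    """
    fdr_col = f"{direction}_fdr"
    rank_col = f"{direction}_rank"
    reader = csv.DictReader(handle, delimiter="\t")
    hits = [row for row in reader if float(row[fdr_col]) <= fdr_cutoff]
    hits.sort(key=lambda row: float(row[rank_col]))
    return [row["Gene"] for row in hits]


# Toy result table with made-up values, only to show the column layout.
TOY_RESULT = (
    "Gene\tsgcount\tNScore\tpos_p\tpos_fdr\tpos_rank\tneg_p\tneg_fdr\tneg_rank\n"
    "GENE_A\t3\t2.5\t0.001\t0.01\t1\t0.9\t1.0\t3\n"
    "GENE_B\t3\t0.1\t0.5\t0.6\t2\t0.4\t0.6\t2\n"
    "GENE_C\t3\t-2.0\t0.95\t1.0\t3\t0.002\t0.02\t1\n"
)
print(significant_genes(io.StringIO(TOY_RESULT), direction="pos"))
print(significant_genes(io.StringIO(TOY_RESULT), direction="neg"))
```

For a real result file, pass a handle from `open(path, newline="")` in place of the in-memory string.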