Manual

Run the program

sgRSEA can be run one step at a time, or from fastq to final result with a single command.

count Get sgRNA count matrix
normalization Normalize sgRNA count matrix
reformat Reformat count table for sgRSEA stat test
stattest Identify significant genes from normalized count table

usage: sgrsea [-h] [--version] {count,reformat,run,stattest,normalization} ...

sgRSEA: identify significant genes in CRISPR-Cas9 experiment

positional arguments:
  {count,reformat,run,stattest,normalization}
    run                 Run the whole program from fastq to result
    count               Get sgRNA count matrix
    normalization       Normalize sgRNA count matrix
    reformat            Reformat count table for sgRSEA stat test
    stattest            Identify significant genes from normalized count table

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

For command line options of each command, type sgrsea COMMAND -h

Get count matrix from fastq files

sgrsea count works on a single fastq file or on multiple fastq files. When using multiple fastq files, provide a design file that describes each of them.

usage: sgrsea count [-h] [-i INFILE | -d DESIGNFILE] -o OUTFILE [-l LIBFILE]
                    [--sgstart SGSTART] [--sgstop SGSTOP] [--trim3 TRIM3]

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --input INFILE
                        input fastq
  -d DESIGNFILE, --design DESIGNFILE
                        design file
  -o OUTFILE, --output OUTFILE
                        output
  -l LIBFILE, --library LIBFILE
                        Gene locus in bed format
  --sgstart SGSTART     The first nucleotide sgRNA starts. 1-index
  --sgstop SGSTOP       The last nucleotide sgRNA starts. 1-index
  --trim3 TRIM3         The trimming pattern from 3'. This pattern and the
                        following sequence will be removed

Normalize sgRNA count matrix

usage: sgrsea normalization [-h] -i INFILE
                            [--normalize-method {total,median,upperquantile}]
                            -o OUTFILE [--split-lib]

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --input INFILE
                        input count file matrix
  --normalize-method {total,median,upperquantile}
                        design file
  -o OUTFILE, --output OUTFILE
                        output
  --split-lib           Lib A and B are sequenced separately

Convert data matrix to sgRSEA input

sgRSEA takes a data matrix with 4 columns: sgRNA, Gene, treatment, control. sgrsea reformat collapses replicates and, when the design file calls for more than one comparison, writes multiple input files for stattest.

usage: sgrsea reformat [-h] -i INFILE -o OUTFILE [-d DESIGNFILE] -t TREAT -c
                       CTRL [--collapse-replicates {auto,stack,mean}]

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --input INFILE
                        input BAM file
  -o OUTFILE, --output OUTFILE
                        output
  -d DESIGNFILE, --design DESIGNFILE
                        output
  -t TREAT, --treatment TREAT
                        columns/name of treatment samples
  -c CTRL, --control CTRL
                        columns/name of control samples
  --collapse-replicates {auto,stack,mean}
                        Way to collapse replicates
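
As an illustration of the 4-column layout, the following Python sketch collapses replicate columns by averaging them. It is not sgRSEA's implementation: the manual does not define what auto, stack, and mean do, so averaging is only an assumed reading of "mean". The function name to_four_columns and the toy counts are hypothetical. Column indexes follow the convention used in the examples in this manual, where the first data column is 0.

```python
def to_four_columns(rows, treat_idx, ctrl_idx):
    """Collapse replicate columns into one treatment and one control value.

    rows      -- iterable of (sgRNA, gene, [count per sample]) tuples
    treat_idx -- 0-based indexes of the treatment data columns
    ctrl_idx  -- 0-based indexes of the control data columns
    """
    out = []
    for sgrna, gene, counts in rows:
        # Assumption: "mean" averages the replicates of each group.
        treat = sum(counts[i] for i in treat_idx) / len(treat_idx)
        ctrl = sum(counts[i] for i in ctrl_idx) / len(ctrl_idx)
        out.append((sgrna, gene, treat, ctrl))
    return out


# Toy normalized matrix (invented values): two treatment and two control columns.
matrix = [
    ("HGLibA_00001", "A1BG", [10.0, 14.0, 20.0, 24.0]),
    ("HGLibA_00002", "A1BG", [0.0, 2.0, 5.0, 7.0]),
]

for row in to_four_columns(matrix, treat_idx=[0, 1], ctrl_idx=[2, 3]):
    print(*row, sep="\t")
```

The index lists mirror the -t 0,1 -c 2,3 form of the reformat command.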

Find significant genes for each comparison

usage: sgrsea stattest [-h] -i INFILE -o OUTFILE --multiplier MULTIPLIER

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --input INFILE
                        sgRSEA input file, 4 columns
  -o OUTFILE, --output OUTFILE
                        output file name
  --multiplier MULTIPLIER
                        Multiplier to generate background

Run the whole package

usage: sgrsea run [-h] [-i INFILE | -d DESIGNFILE] -o OUTFILE [-l LIBFILE]
                    [--sgstart SGSTART] [--sgstop SGSTOP] [--trim3 TRIM3]
                    [--normalize_method {total,median,upperquantile}]
                    [--split-lib] -t TREAT -c CTRL [--t-lable TREATLABEL]
                    [--c-label CTRLLABEL] [--multiplier MULTIPLIER]
                    [--random-seed RANDOMSEED]
                    [--collapse-replicates {auto,stack,mean}] [--no-count]

Run the whole sgRSEA suite

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --input INFILE
                        input fastq
  -d DESIGNFILE, --design DESIGNFILE
                        design file
  -o OUTFILE, --output OUTFILE
                        output
  -l LIBFILE, --library LIBFILE
                        Gene locus in bed format
  --sgstart SGSTART     The first nucleotide sgRNA starts. 1-index
  --sgstop SGSTOP       The last nucleotide sgRNA starts. 1-index
  --trim3 TRIM3         The trimming pattern from 3'. This pattern and the
                        following sequence will be removed
  --normalize_method {total,median,upperquantile}
                        design file
  --split-lib           Lib A and B are sequenced separately
  -t TREAT, --treatment TREAT
                        columns/name of treatment samples
  -c CTRL, --control CTRL
                        columns/name of control samples
  --t-lable TREATLABEL  label of treatment samples
  --c-label CTRLLABEL   label of control samples
  --multiplier MULTIPLIER
                        Multiplier to generate background
  --random-seed RANDOMSEED
                        Random seed to control permutation process
  --collapse-replicates {auto,stack,mean}
                        Way to collapse replicates
  --no-count            Skip counting step. Uses output contents as input

For command line options of each command, type sgrsea COMMAND -h

Examples

Run the suite:

# Run the suite from fastq to final results
sgrsea run -d design_file -o my_experiment

# Get the count for a single fastq file
sgrsea count -i my_fastq -o output -l sgRNA_lib

# Count multiple fastq files and generate a count matrix
sgrsea count -d design.txt -o output_prefix

# Normalize the count matrix
sgrsea normalization -i count_matrix --normalize-method total -o outputfile

# Normalize the count matrix when sub-libs are sequenced separately
sgrsea normalization -i count_matrix --normalize-method total -o outputfile --split-lib

# Reformat the matrix into sgRSEA 4-column matrix. 
# -t and -c values MUST match group content of the design file
sgrsea reformat -i normalized_matrix -d design.txt -t Heat,Cold,Dry -c Ctrl,Ctrl,Ctrl -o output_prefix --collapse-replicates auto
# If you only have two groups, you can use indexes (first data column is 0) for control and treatment, and you don't need a design file
sgrsea reformat -i normalized_matrix -t 0,1 -c 2,3 -o output_prefix --collapse-replicates auto

# Stattest on normalized and formatted matrix
sgrsea stattest -i matrix -o output

Input

Design file

When you run the whole suite on your experiment, you need to prepare a design file. (For individual commands, you may not need one.) The design file must include all of the essential columns listed below. Use exactly these column names; their order does not matter. Use tab as the delimiter.

filepath: The absolute path to the fastq file
lib: The absolute path of the library file. If there is no sublib, this column should have the same value across all rows
sublib: The name of the sublib, e.g. GeCKO_libA, GeCKO_libB. Use this when you sequence sublibs separately
sample: Name of each sample. Do not include sublib information here
group: This will be used as the output prefix for each sample. Please DON'T use spaces (" ") or hyphens ("-") in the content
sgstart: The first nucleotide of sgRNA. 1-index
sgstop: The last nucleotide of sgRNA. 1-index
trim3: Sequence pattern of the 3' adaptor, usually 5 to 7 nt. If provided, the program will look for a perfect match of this pattern in the fastq sequence. The last match and all nucleotides after it will be trimmed. If you don't need this, put "NA" in the design file
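
The sgstart, sgstop, and trim3 columns can be pictured with a short Python sketch. This is one plausible reading of the descriptions above, not sgRSEA's actual code: the function name extract_sgrna is hypothetical, the order of operations (trim first, then slice) is an assumption, and so is treating "NA" and "False" as "no pattern". The read is built from the first sgRNA of the library example further down plus made-up flanking bases.

```python
def extract_sgrna(read, sgstart, sgstop, trim3=None):
    """Return the sgRNA part of a read; sgstart and sgstop are 1-indexed, inclusive."""
    # Assumption: "NA", "False", or None all mean "no 3' pattern given".
    if trim3 and trim3 not in ("NA", "False"):
        cut = read.rfind(trim3)      # position of the last perfect match
        if cut != -1:
            read = read[:cut]        # drop the match and everything after it
    start = sgstart - 1              # 1-indexed start -> 0-indexed start
    if sgstop == -1:                 # -1: let the 3' trim define the end
        return read[start:]
    return read[start:sgstop]        # inclusive 1-indexed stop == exclusive 0-indexed stop


# Hypothetical read: 4 nt prefix + 20 nt sgRNA + 3' flanking sequence.
read = "ACGT" + "GTCGCTGAGCTCCGATTCGA" + "GTTTTAGA"

print(extract_sgrna(read, 5, 24))                    # fixed positions
print(extract_sgrna(read, 5, -1, trim3="GTTTTAG"))   # end set by the 3' pattern
```

Both calls return the same 20 nt sequence here, which shows how a fixed sgstop and a 3' pattern can describe the same boundary.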

Example

filepath	sample	lib	sublib	group	sgstart	sgstop	trim3
UA1.fastq	U	GeCKOv2_Library_A.txt	LibA	CONTROL	34	53	False
UB1.fastq	U	GeCKOv2_Library_B.txt	LibB	CONTROL	36	55	False
HA1.fastq	H	GeCKOv2_Library_A.txt	LibA	TREATMENT	42	61	False
HB1.fastq	H	GeCKOv2_Library_B.txt	LibB	TREATMENT	35	54	False

In this example, there is one treatment and one control. Library A and B are sequenced separately. The sgRNA positions differ from sample to sample. No 3' adaptor sequence is provided, so the trim3 column is filled with "False". If the 3' adaptor overrides sgstop, fill the sgstop column with "-1". You can download the example and open it in Excel as a template for your own design file.
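
Before running the suite, it can help to check a design file against the rules listed above. The Python sketch below is a hypothetical helper, not part of sgRSEA: check_design and REQUIRED are names chosen for illustration. It checks that every column is present, that group contains no space or hyphen, and that sgstart and sgstop are integers. The second data row uses a deliberately invalid group name to trigger a message.

```python
import csv
import io

# Column names taken from the design file description; order does not matter.
REQUIRED = ["filepath", "lib", "sublib", "sample", "group", "sgstart", "sgstop", "trim3"]


def check_design(text):
    """Return a list of problems found in tab-delimited design file content."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    for n, row in enumerate(reader, start=2):  # line 1 is the header
        if " " in row["group"] or "-" in row["group"]:
            problems.append(f"line {n}: group contains a space or hyphen")
        for col in ("sgstart", "sgstop"):
            try:
                int(row[col])
            except (TypeError, ValueError):
                problems.append(f"line {n}: {col} is not an integer")
    return problems


design = (
    "filepath\tsample\tlib\tsublib\tgroup\tsgstart\tsgstop\ttrim3\n"
    "UA1.fastq\tU\tGeCKOv2_Library_A.txt\tLibA\tCONTROL\t34\t53\tFalse\n"
    # Deliberately broken row: the hyphen in the group name is not allowed.
    "HA1.fastq\tH\tGeCKOv2_Library_A.txt\tLibA\tHEAT-SHOCK\t42\t61\tFalse\n"
)

print(check_design(design))
```

To check a file on disk, read its contents with open(path).read() and pass the text to check_design.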

Library file

The library file has 3 columns: sgRNA, Gene, Sequence. The order of the columns doesn't matter, but the column names must be exactly as listed. Use tab as the delimiter.

Example

Gene	sgRNA	Sequence
A1BG	HGLibA_00001	GTCGCTGAGCTCCGATTCGA
A1BG	HGLibA_00002	ACCTGTAGTTGCCGGCGTGC
A1BG	HGLibA_00003	CGTCAGCGTCACATTGGCCA
A1CF	HGLibA_00004	CGCGCACTGGTCCAGCGCAC
A1CF	HGLibA_00005	CCAAGCTATATCCTGTGCGC
A1CF	HGLibA_00006	AAGTTGCTTGATTGCATTCT
A2M	HGLibA_00007	CGCTTCTTAAATTCTTGGGT
A2M	HGLibA_00008	TCACAGCGAAGGCGACACAG
A2M	HGLibA_00009	CAAACTCCTTCATCCAAGTC
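
Because the columns are identified by name rather than by position, a library file can be read with a header-aware parser. The Python sketch below is only an illustration (load_library is a hypothetical helper, not an sgRSEA function). It builds a lookup from sequence to sgRNA and gene using two rows from the example above.

```python
import csv
import io


def load_library(text):
    """Map each sgRNA sequence to its (sgRNA id, gene) pair."""
    # DictReader keys each row by header name, so column order is irrelevant.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Sequence"]: (row["sgRNA"], row["Gene"]) for row in reader}


library = (
    "Gene\tsgRNA\tSequence\n"
    "A1BG\tHGLibA_00001\tGTCGCTGAGCTCCGATTCGA\n"
    "A1CF\tHGLibA_00004\tCGCGCACTGGTCCAGCGCAC\n"
)

lookup = load_library(library)
print(lookup["GTCGCTGAGCTCCGATTCGA"])
```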

Output

Count matrix (with or without normalization)

The count matrix contains the sgRNA, gene, sequence, and sublib information, plus one column of counts for each sample. The normalized matrix contains the sgRNA and gene information, plus one column of normalized counts for each sample.

sgRSEA formatted matrix

This matrix has four columns: sgRNA, Gene, treatment, control. If there are multiple comparisons, multiple files will be generated.

sgRSEA stattest output file

For each comparison, there will be a result file. Columns are:

Gene: name of the gene
sgcount: number of sgRNA per gene
NScore: normalized maxmean score
pos_p: p value for positive selection
pos_fdr: FDR for positive selection
pos_rank: gene rank for positive selection
neg_p: p value for negative selection
neg_fdr: FDR for negative selection
neg_rank: gene rank for negative selection
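
A result file with these columns can be filtered with a few lines of Python. The sketch below assumes the file is tab-delimited with a header row, which this manual does not state explicitly. The helper name significant_genes, the 0.05 cutoff, and all values in the toy table are invented for illustration.

```python
import csv
import io


def significant_genes(text, column="neg_fdr", cutoff=0.05):
    """Return the genes whose value in the given FDR column is below the cutoff."""
    # Assumption: tab-delimited text with the column names listed above as header.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row["Gene"] for row in reader if float(row[column]) < cutoff]


# Toy result table; every number here is made up.
result = (
    "Gene\tsgcount\tNScore\tpos_p\tpos_fdr\tpos_rank\tneg_p\tneg_fdr\tneg_rank\n"
    "GENE_A\t3\t-2.5\t0.99\t1.0\t3\t0.001\t0.01\t1\n"
    "GENE_B\t3\t0.1\t0.40\t0.8\t2\t0.50\t0.9\t2\n"
    "GENE_C\t3\t2.8\t0.002\t0.02\t1\t0.98\t1.0\t3\n"
)

print(significant_genes(result))                     # negative selection
print(significant_genes(result, column="pos_fdr"))   # positive selection
```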
