bchen4 edited this page Aug 29, 2017 · 17 revisions

Run the program

sgRSEA can run each step on its own, or run the whole pipeline from fastq files to final results with a single command.

  • run Run the whole program from fastq to result
  • count Get sgRNA count matrix
  • normalization Normalize sgRNA count matrix
  • reformat Reformat count table for sgRSEA stat test
  • stattest Identify significant genes from normalized count table
usage: sgrsea [-h] [--version] {count,reformat,run,stattest,normalization} ...

sgRSEA: identify significant genes in CRISPR-Cas9 experiment

positional arguments:
  {count,reformat,run,stattest,normalization}
    run                 Run the whole program from fastq to result
    count               Get sgRNA count matrix
    normalization       Normalize sgRNA count matrix
    reformat            Reformat count table for sgRSEA stat test
    stattest            Identify significant genes from normalized count table

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

For command line options of each command, type sgrsea COMMAND -h

Get count matrix from fastq files

sgrsea count accepts either a single fastq file (-i) or multiple fastq files. For multiple fastq files, provide a design file (-d) that describes each of them.

usage: sgrsea count [-h] [-i INFILE | -d DESIGNFILE] -o OUTFILE [-l LIBFILE]
                    [--sgstart SGSTART] [--sgstop SGSTOP] [--trim3 TRIM3]

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --input INFILE
                        input fastq
  -d DESIGNFILE, --design DESIGNFILE
                        design file
  -o OUTFILE, --output OUTFILE
                        output
  -l LIBFILE, --library LIBFILE
                        Gene locus in bed format
  --sgstart SGSTART     The first nucleotide sgRNA starts. 1-index
  --sgstop SGSTOP       The last nucleotide sgRNA starts. 1-index
  --trim3 TRIM3         The trimming pattern from 3'. This pattern and the
                        following sequence will be removed
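
The way --sgstart, --sgstop and --trim3 interact can be sketched in a few lines of Python. This is a minimal illustration, not sgRSEA's own code: the function name `extract_sgrna` and the example read are hypothetical, and the order of operations (trim first, then slice) is an assumption. It follows the behaviour described for the design file columns: positions are 1-indexed, the last perfect match of the trim3 pattern and everything after it is removed, and an sgstop of -1 means the sgRNA runs to the end of the trimmed read.

```python
def extract_sgrna(read, sgstart, sgstop, trim3=None):
    """Return the sgRNA portion of a read (illustrative sketch only).

    sgstart and sgstop are 1-indexed and inclusive. sgstop == -1 means
    "up to the end of the read after 3' trimming".
    """
    if trim3:
        cut = read.rfind(trim3)  # start of the LAST perfect match, or -1
        if cut != -1:
            read = read[:cut]  # drop the pattern and everything after it
    end = len(read) if sgstop == -1 else sgstop
    # Convert the 1-indexed inclusive range to a Python slice.
    return read[sgstart - 1:end]


# Hypothetical read: 4 nt prefix, 20 nt sgRNA, 3' adaptor, extra bases.
read = "ACGT" + "GTCGCTGAGCTCCGATTCGA" + "GTTTTAG" + "CCC"

print(extract_sgrna(read, 5, 24))                    # fixed positions
print(extract_sgrna(read, 5, -1, trim3="GTTTTAG"))   # adaptor defines the end
```

Both calls return the same 20 nt sequence here, because the adaptor starts immediately after position 24.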

Normalize sgRNA count matrix

usage: sgrsea normalization [-h] -i INFILE
                            [--normalize-method {total,median,upperquantile}]
                            -o OUTFILE [--split-lib]

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --input INFILE
                        input count file matrix
  --normalize-method {total,median,upperquantile}
                        design file
  -o OUTFILE, --output OUTFILE
                        output
  --split-lib           Lib A and B are sequenced separately
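
The help text does not spell out how each method scales the counts. As a rough illustration only, the sketch below shows one common reading of total normalization: every sample is rescaled so that all samples end up with the same total count (here, the mean of the original totals). This is an assumption, not sgRSEA's confirmed formula, and `normalize_total` and the toy counts are hypothetical.

```python
def normalize_total(counts):
    """Rescale each sample to a shared total count (illustrative sketch).

    counts maps a sample name to its list of sgRNA counts. Samples with
    a total of zero are not handled here.
    """
    totals = {sample: sum(values) for sample, values in counts.items()}
    target = sum(totals.values()) / len(totals)  # mean library size
    return {
        sample: [c * target / totals[sample] for c in values]
        for sample, values in counts.items()
    }


# Toy counts: the treatment library was sequenced twice as deeply.
counts = {"ctrl": [10, 30, 60], "treat": [20, 60, 120]}
normalized = normalize_total(counts)
print(normalized)
```

After rescaling, both samples have the same total, so remaining differences between columns reflect relative sgRNA abundance rather than sequencing depth.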

Convert data matrix to sgRSEA input

sgRSEA takes a data matrix with 4 columns: sgRNA, Gene, treatment, control. sgrsea reformat collapses replicates and, if the design file calls for more than one comparison, creates a separate input file for stattest for each comparison.

usage: sgrsea reformat [-h] -i INFILE -o OUTFILE [-d DESIGNFILE] -t TREAT -c
                       CTRL [--collapse-replicates {auto,stack,mean}]

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --input INFILE
                        input BAM file
  -o OUTFILE, --output OUTFILE
                        output
  -d DESIGNFILE, --design DESIGNFILE
                        output
  -t TREAT, --treatment TREAT
                        columns/name of treatment samples
  -c CTRL, --control CTRL
                        columns/name of control samples
  --collapse-replicates {auto,stack,mean}
                        Way to collapse replicates

Find significant genes for each comparison

usage: sgrsea stattest [-h] -i INFILE -o OUTFILE --multiplier MULTIPLIER

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --input INFILE
                        sgRSEA input file, 4 columns
  -o OUTFILE, --output OUTFILE
                        output file name
  --multiplier MULTIPLIER
                        Multiplier to generate background

Run the whole package

usage: sgrsea run [-h] [-i INFILE | -d DESIGNFILE] -o OUTFILE [-l LIBFILE]
                    [--sgstart SGSTART] [--sgstop SGSTOP] [--trim3 TRIM3]
                    [--normalize_method {total,median,upperquantile}]
                    [--split-lib] -t TREAT -c CTRL [--t-lable TREATLABEL]
                    [--c-label CTRLLABEL] [--multiplier MULTIPLIER]
                    [--random-seed RANDOMSEED]
                    [--collapse-replicates {auto,stack,mean}] [--no-count]

Run the whole sgRSEA suite

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --input INFILE
                        input fastq
  -d DESIGNFILE, --design DESIGNFILE
                        design file
  -o OUTFILE, --output OUTFILE
                        output
  -l LIBFILE, --library LIBFILE
                        Gene locus in bed format
  --sgstart SGSTART     The first nucleotide sgRNA starts. 1-index
  --sgstop SGSTOP       The last nucleotide sgRNA starts. 1-index
  --trim3 TRIM3         The trimming pattern from 3'. This pattern and the
                        following sequence will be removed
  --normalize_method {total,median,upperquantile}
                        design file
  --split-lib           Lib A and B are sequenced separately
  -t TREAT, --treatment TREAT
                        columns/name of treatment samples
  -c CTRL, --control CTRL
                        columns/name of control samples
  --t-lable TREATLABEL  label of treatment samples
  --c-label CTRLLABEL   label of control samples
  --multiplier MULTIPLIER
                        Multiplier to generate background
  --random-seed RANDOMSEED
                        Random seed to control permutation process
  --collapse-replicates {auto,stack,mean}
                        Way to collapse replicates
  --no-count            Skip counting step. Uses output contents as input

For command line options of each command, type sgrsea COMMAND -h

Examples

Run the suite:

# Run the suite from fastq to final results
sgrsea run -d design_file -o my_experiment

# Get the count for single fastq file
sgrsea count -i my_fastq -o output -l sgRNA_lib 

# Count multiple fastq files and combine the counts into one matrix
sgrsea count -d design.txt -o output_prefix

# Normalize the count matrix
sgrsea normalization -i count_matrix --normalize-method total -o outputfile

# Normalize the count matrix when sub-libraries are sequenced separately
sgrsea normalization -i count_matrix --normalize-method total -o outputfile --split-lib

# Reformat the matrix into the sgRSEA 4-column matrix.
# -t and -c values MUST match values in the group column of the design file
sgrsea reformat -i normalized_matrix -d design.txt -t Heat,Cold,Dry -c Ctrl,Ctrl,Ctrl -o output_prefix --collapse-replicates auto

# If you only have two groups, you can use column indexes (the first data column is 0)
# for control and treatment, and you don't need a design file
sgrsea reformat -i normalized_matrix -t 0,1 -c 2,3 -o output_prefix --collapse-replicates auto

# Stattest on normalized and formatted matrix
sgrsea stattest -i matrix -o output
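
When the steps are run one at a time from a script, it can help to assemble the command lines in one place. The sketch below only builds and prints the argument lists; it does not run sgrsea. The function name `build_step_commands` and the intermediate file names derived from the prefix are hypothetical, and the sketch assumes each step writes exactly the name given to -o, which should be checked against the files sgrsea really produces. Only flags shown in the usage text above are used.

```python
import shlex


def build_step_commands(design, prefix, treat, ctrl, method="total"):
    """Return argv lists for count, normalization and reformat.

    Intermediate file names are hypothetical placeholders; verify what
    each step writes before chaining them.
    """
    count_out = prefix + "_count"
    norm_out = prefix + "_norm"
    return [
        ["sgrsea", "count", "-d", design, "-o", count_out],
        ["sgrsea", "normalization", "-i", count_out,
         "--normalize-method", method, "-o", norm_out],
        ["sgrsea", "reformat", "-i", norm_out, "-d", design,
         "-t", ",".join(treat), "-c", ",".join(ctrl),
         "-o", prefix, "--collapse-replicates", "auto"],
    ]


commands = build_step_commands(
    "design.txt", "exp1", treat=["Heat", "Cold"], ctrl=["Ctrl", "Ctrl"]
)
for argv in commands:
    print(shlex.join(argv))  # shell-quoted form, for review or logging
```

Each list could then be passed to `subprocess.run(argv, check=True)` so that a failing step stops the script before the next one starts.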

Input

Design file

To run the whole suite on your experiment, you need to prepare a design file. (For individual steps, you may not need one.) The design file must include all of the columns below. Use exactly these column names; their order does not matter. Use tab as the delimiter.

  • filepath: The absolute path to the fastq file
  • lib: The absolute path to the library file. If there are no sublibs, this column should have the same value in every row
  • sublib: The name of the sublib, e.g. GeCKO_libA, GeCKO_libB. Use this when sublibs are sequenced separately
  • sample: Name of each sample. Do not include sublib information here
  • group: This will be used as the output prefix for each sample. Please DON'T use " " or "-" in the value
  • sgstart: The first nucleotide of the sgRNA. 1-indexed
  • sgstop: The last nucleotide of the sgRNA. 1-indexed
  • trim3: Sequence pattern of the 3' adaptor, usually 5~7 nt. If provided, the program looks for a perfect match of this pattern in the fastq sequence and trims the last match together with all nucleotides after it. If you don't need this, put "NA" in the design file

Example

filepath sample lib sublib group sgstart sgstop trim3
UA1.fastq U GeCKOv2_Library_A.txt LibA CONTROL 34 53 False
UB1.fastq U GeCKOv2_Library_B.txt LibB CONTROL 36 55 False
HA1.fastq H GeCKOv2_Library_A.txt LibA TREATMENT 42 61 False
HB1.fastq H GeCKOv2_Library_B.txt LibB TREATMENT 35 54 False

In this example, there is one treatment and one control. Libraries A and B are sequenced separately, and the sgRNA position differs for each sample. No 3' adaptor sequence is provided, so the trim3 column is filled with "False". If the 3' adaptor overrides sgstop, fill the 'sgstop' column with "-1". You can download the example and open it in Excel as a template for your own design file.
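
A design file can also be generated from a script instead of a spreadsheet. The sketch below uses Python's standard csv module to produce tab-delimited text with the column names listed above and the first two rows of the example. The helper name `format_design` is hypothetical, and the check on the group column simply mirrors the advice to avoid " " and "-".

```python
import csv
import io

COLUMNS = ["filepath", "sample", "lib", "sublib", "group",
           "sgstart", "sgstop", "trim3"]


def format_design(rows):
    """Return tab-delimited design-file text for a list of row dicts."""
    for row in rows:
        missing = [c for c in COLUMNS if c not in row]
        if missing:
            raise ValueError("missing columns: " + ", ".join(missing))
        if " " in row["group"] or "-" in row["group"]:
            raise ValueError("group must not contain ' ' or '-'")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"filepath": "UA1.fastq", "sample": "U", "lib": "GeCKOv2_Library_A.txt",
     "sublib": "LibA", "group": "CONTROL", "sgstart": 34, "sgstop": 53,
     "trim3": "False"},
    {"filepath": "UB1.fastq", "sample": "U", "lib": "GeCKOv2_Library_B.txt",
     "sublib": "LibB", "group": "CONTROL", "sgstart": 36, "sgstop": 55,
     "trim3": "False"},
]
design_text = format_design(rows)
print(design_text, end="")
```

Writing `design_text` to a file gives a design file in the layout shown in the example.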

Library file

The library file has 3 columns: sgRNA, Gene, Sequence. The order of the columns doesn't matter, but the column names must match exactly. Use tab as the delimiter.

Example

Gene sgRNA Sequence
A1BG HGLibA_00001 GTCGCTGAGCTCCGATTCGA
A1BG HGLibA_00002 ACCTGTAGTTGCCGGCGTGC
A1BG HGLibA_00003 CGTCAGCGTCACATTGGCCA
A1CF HGLibA_00004 CGCGCACTGGTCCAGCGCAC
A1CF HGLibA_00005 CCAAGCTATATCCTGTGCGC
A1CF HGLibA_00006 AAGTTGCTTGATTGCATTCT
A2M HGLibA_00007 CGCTTCTTAAATTCTTGGGT
A2M HGLibA_00008 TCACAGCGAAGGCGACACAG
A2M HGLibA_00009 CAAACTCCTTCATCCAAGTC
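
Before running a count, it can be worth checking that a library file has the three required column names. The sketch below is one way to do that with Python's standard csv module; the function name `read_library` is hypothetical, and the sample text reuses the first three rows of the example.

```python
import csv
import io

REQUIRED = {"sgRNA", "Gene", "Sequence"}


def read_library(handle):
    """Parse a tab-delimited library file into a list of row dicts.

    Raises ValueError if a required column name is missing. Column
    order is not checked, since it does not matter.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing columns: " + ", ".join(sorted(missing)))
    return list(reader)


example = (
    "Gene\tsgRNA\tSequence\n"
    "A1BG\tHGLibA_00001\tGTCGCTGAGCTCCGATTCGA\n"
    "A1BG\tHGLibA_00002\tACCTGTAGTTGCCGGCGTGC\n"
    "A1BG\tHGLibA_00003\tCGTCAGCGTCACATTGGCCA\n"
)
records = read_library(io.StringIO(example))
print(len(records), "sgRNAs for", {r["Gene"] for r in records})
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.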

Output

Count matrix (with or without normalization)

The count matrix contains the sgRNA, gene, sequence, and sublib information, plus one extra column of counts for each sample. The normalized matrix contains the sgRNA and gene information, plus one extra column of normalized counts for each sample.

sgRSEA formatted matrix

This matrix has four columns: sgRNA, Gene, treatment, control. If there are multiple comparisons, multiple files will be generated.

sgRSEA stattest output file

For each comparison, there will be a result file. Columns are:

  • Gene: name of the gene
  • sgcount: number of sgRNA per gene
  • NScore: normalized maxmean score
  • pos_p: p value for positive selection
  • pos_fdr: FDR for positive selection
  • pos_rank: gene rank for positive selection
  • neg_p: p value for negative selection
  • neg_fdr: FDR for negative selection
  • neg_rank: gene rank for negative selection
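
A result file with these columns can be filtered with a short script. The sketch below assumes the file is tab-delimited with a header row carrying the column names listed above; that layout is an assumption and should be checked against a real output file. The function name `significant_genes`, the 0.05 cutoff, and the demo rows are all made up for illustration and are not real sgRSEA output.

```python
import csv
import io


def significant_genes(handle, direction="pos", fdr_cutoff=0.05):
    """Return gene names with FDR below the cutoff, ordered by rank.

    direction is "pos" or "neg" and selects the pos_* or neg_* columns.
    """
    fdr_col = direction + "_fdr"
    rank_col = direction + "_rank"
    reader = csv.DictReader(handle, delimiter="\t")
    hits = [row for row in reader if float(row[fdr_col]) < fdr_cutoff]
    hits.sort(key=lambda row: float(row[rank_col]))
    return [row["Gene"] for row in hits]


# Made-up rows for illustration only.
header = ["Gene", "sgcount", "NScore", "pos_p", "pos_fdr", "pos_rank",
          "neg_p", "neg_fdr", "neg_rank"]
rows = [
    ["GENE_B", "6", "1.9", "0.002", "0.03", "2", "0.9", "1.0", "3"],
    ["GENE_A", "6", "2.5", "0.001", "0.01", "1", "0.95", "1.0", "2"],
    ["GENE_C", "5", "-2.1", "0.9", "1.0", "3", "0.001", "0.02", "1"],
]
demo = "\n".join("\t".join(r) for r in [header] + rows) + "\n"

positive = significant_genes(io.StringIO(demo))
negative = significant_genes(io.StringIO(demo), direction="neg")
print(positive, negative)
```

Keeping positive and negative selection behind one `direction` argument avoids duplicating the filter for the two sets of columns.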