EdiTyper

A Python program for aligning CRISPR-edited RNA-seq data. Written by Alexandre Yahi.

Installation

To install, run the following command

python setup.py install

Please note, to install system-wide, you must have sudo access. If you are not installing system-wide, please install within your $PATH and $PYTHONPATH.

Dependencies

EdiTyper is compatible with Python 2.7 and 3.3+; it also depends on the following Python modules:

Each of these modules is available on PyPi and installable using pip

EdiTyper also depends on R and Rscript for plotting. If R is not installed, then plots will be suppressed. EdiTyper will still run, assuming at least one other output (see Suppression Arguments).

Basic Usage

To get a basic help message, simply run

EdiTyper

EdiTyper -r reference.fasta -t template.fasta -i sample.fastq

Classification scheme

We classify the unique reads based on their alignment with the reference. HDR are reads with the right SNP edited to the target SNP as identified in the template sequence, and no deletion and no insertions but tolerate mismatches. If there are insertions or deletions then the read is classified as MIX, which is a HDR with indels. If the SNP of interest is not edited and there are no indels, then the read is unchanged (NO_EDIT). Otherwise, the read is a NHEJ.

Arguments

Alignment Arguments

Parameter	Definition	Required?	Default
`-m \| --analysis-mode`	Set which mismatch between the reference and template sequences gets analyzed (SNP) and which are treated as NHEJ (PAM). Specified as a 'SNP' and a series 'PAM's separated by '+' characters (eg. SNP+PAM, PAM+SNP)	No	SNP
`-p \| --pvalue-threshold`	Set the threshold for the alignment	No	1 *10-3
`-g \| --gap-opening`	Set the gap opening penalty	No	8
`-e \| --gap-extension`	Set the gap extension penalty	No	1
`--parallel`	Run EdiTyper in parallel, supports optional argumen to limit number of cores, otherwise uses all cores	No	None

Reference Arguments

Parameter	Definition	Required?
`-r \| --reference-sequence`	Reference FASTA for alignment	Yes
`-t \| --template-sequence`	Template sequence for editing	Yes
`-b \| --reference-bed`	A BED file with the genomic location of the reference sequence. Used to get the chromosome name and sequence start point. Only the first entry will be read; any line starting with 'browser', 'track' or '`#`' will be ignored. Please ensure this is a zero- based BED file	No

Input Arguments

NOTE: These arguments are mutually exclusive

NOTE: At least one of these arguments is required

Parameter	Definition
`-i \| --input-file`	Provide a FASTQ file for aligning, this option can have multiple FASTQs specified
`-l \| --sample-list`	Provide a list of FASTQ files for aligning; there should be one FASTQ per line
`-d \| --fastq-directory`	Provide a directory where FASTQ files are located; files must end in .fastq or .fq (.gz is allowed afterwards)

Output Arguments

NOTE: All of these arguments are optional

Parameter	Definition	Default
`-d \| --output-directory`	Choose where to put all output files	'output'
`-j \| --project`	Give this project a name, will be used as the basename for batch output files (summary, quality plot, etc)	'edityper'
`--bam`	Convert SAM files to BAM files, optionally use CSI indices instead of BAI indices with `--bam csi`; requires SAMtools	None

Read Group Arguments

NOTE: Information provided will be applied to all read groups in all FASTQ files

NOTE: All of these arguments are optional and will only be used if SAM/BAM output is not suppressed

Parameter	Definition	Default
`-rc \| --read-center`	Name of the sequencing center	None
`-rl \| --read-library`	Sequencing library	None
`-rp \| --read-platform`	Platform used for sequencing	None
`-rs \| --read-sample`	Sample being sequenced	None

Suppression Arguments

NOTE: Once cannot suppress all output

Parameter	Definition
`--suppress-sam`	Suppress SAM output
`--suppress-events`	Suppress events table output
`--suppress-classification`	Suppress read classification output
`--suppress-tables`	Suppress both events and read classification
`--suppress-plots`	Suppress locus and quality tables and plots

Output Files

For each output table, all lines starting with # are header lines. All lines starting with ## are extra information. See details below for specifics about each output.

Output file	Extension
Alignments in SAM/BAM format	`.sam \| .bam`
Table of events by base	`.events.txt`
Classification of reads in tabular format	`.classification.txt`
Locus events table and plots	`_locus.txt` and `_locus.pdf`
Quality scores table and plots	`_quality.txt` and `_quality.pdf`
Summary of read classifications per input FASTQ file	`.summary.txt`
Read assignments table	`.assignments.txt`

SAM/BAM Output

SAM alignments are standard SAM files. They have been presorted in coordinate order with read groups attached. If BAM output, BAM indices will also be output in either BAI or CSI format. For more details, see SAM/BAM format specification from HTSlib.

Events Table

The .events.txt table shows a locus-by-locus overview of indels and mismatches in each FASTQ file. One table is generated per FASTQ file.

Column	Meaning
`POS`	Position in reference sequence
`REF`	Nucleotide in reference sequence at this position
`COV`	Coverage in FASTQ at position
`DEL`	Number of deletions starting at this position
`AVG_DEL`	Average length of deletions starting at this position
`DCOUNT`	Number of times this position is deleted
`INS`	Number of insertions starting at this position
`AVG_INS`	Average length of insertions starting at this position
`A`	Count of mismatched A's at this position
`T`	Count of mismatched T's at this position
`C`	Count of mismatched C's at this position
`G`	Count of mismatched G's at this position

Read Classifications

The .classifications.txt table shows a breakdown of indels and mismatches per read category. One table is generated per FASTQ file. The read categories are HDR, MIX, NHEJ, NO_EDIT, and DISCARD.

Column	Meaning
`TAG`	Which classification category; one of HDR, MIX, NHEJ, NO_EDIT, or DISCARD
`COUNT`	How many reads fall in this classification category?
`PERC_COUNT`	What percentage of reads fall in this classification category; excludes discarded reads from calculation
`INS_EVENTS`	Total number of insertion events; reported for HDR, MIX, and NHEJ only
`AVG_INS`	Average number of insertion events per read; reported for HDR, MIX, and NHEJ only
`STD_DEV_INS`	Standard deviation of the distribution of insertions; reported for HDR, MIX, and NHEJ only
`DEL_EVENTS`	Total number of deletion events; reported for HDR, MIX, and NHEJ only
`AVG_DEL`	Average number of deletion events per read; reported for HDR, MIX, NHEJ only
`STD_DEV_DEL`	Standard deviation of the distribution of deletions; reported for HDR, MIX, and NHEJ only
`MISMATCH_EVENTS`	Total number of mismatch events; reported for HDR, MIX, and NHEJ only
`AVG_MIS`	Average number of mismatch events per read; reported for HDR, MIX, and NHEJ only
`STD_DEV_MIS`	Standard deviation of the distribution of mismatches; reported for HDR, MIX, and NHEJ only
`NO_INDELS`	Number of reads with no indels; reported for HDR and MIX only
`PERC_NO_INDELS`	Percentage of reads with no indels; reported for HDR and MIX only
`INS_ONLY`	Number of reads with with only insertions; reported for HDR and MIX only
`PERC_INS_ONLY`	Percentage of reads with only insertions; reported for HDR and MIX only
`DEL_ONLY`	Number of reads with only deletions;
`PERC_DEL_ONLY`	Percentage of reads with only deletions; reported for HDR and MIX only
`INDELS`	Number of reads with both insertions and deletions; reported for HDR and MIX only
`PERC_INDELS`	Percentage of reads with both insertions and deletions; reported for HDR and MIX only

Summary Table

The .summary.txt table shows total reads, unique reads, discarded reads, SNP information, no editing, HDR, NHEJ, and mismatch percentages by base. One table is generated for all FASTQ files.

Column	Meaning
`FASTQ`	Name of FASTQ file
`TOTAL_READS`	Total number of reads in this FASTQ file
`TOTAL_NON_DISC`	Total number of reads in this FASTQ file, excluding discarded reads
`UNIQ_READS`	Total number of unique, non-discarded reads
`DISCARDED`	Number of discarded reads
`SNP_POS`	SNP position
`REF_STATE`	Reference state
`TEMP_SNP`	Alternate state
`NO_EDIT`	Number of non-edited reads
`PERC_NO_EDIT`	Percentage of reads classified as not edited, excluding discarded reads
`HDR`	Number of clean HDR reads
`PERC_HDR`	Percentage of reads classified as HDR, excluding discarded reads
`MIX`	Number of MIX reads
`PERC_MIX`	Percentage of reads classified as MIX, excluding discarded reads
`NHEJ`	Number of NHEJ reads
`PERC_NHEJ`	Percentage of reads classified as NHEJ, excluding discarded reads
`PERC_MIS_A`	Percentage of mismatches with an A compared to total number of mismatches
`PERC_MIS_T`	Percentage of mismatches with an T compared to total number of mismatches
`PERC_MIS_C`	Percentage of mismatches with an C compared to total number of mismatches
`PERC_MIS_G`	Percentage of mismatches with an G compared to total number of mismatches

Assignments Table

The .assignments.txt table shows how each read was classified as well as the number of insertions, deletions, and mismatches for each read. One table is generated per FASTQ. This table is only generated when verbosity is set to 'debug' (-v debug | --verbosty debug) and --suppress-tables is not passed.

Column	Meaning
`ReadID`	Read ID from the FASTQ file
`Label`	Classification assigned to this read
`NumDel`	Number of deletions in this read
`NumIns`	Number of insertions in this read
`NumMis`	Number of mismatches in this read

Locus Events Table and Plot

The locus plot shows events at each base along the reference and the number of supporting reads for each event (insertions, deletions, and mismatches). One PDF is generated per FASTQ file. The first page is scaled to total number of reads in the FASTQ file and shows all events, while the remaining pages show a specific event and are scaled to maximum number of supporting reads for that event. Coverage at each base is also shown.

The locus events table is used to generate the locus plot and is designed to be read into R. It has four columns, Insertions, Deletions, Mismatches, and Coverage, with the counts of each read for a given event. Each row is a position relative to the reference sequence.

Quality Scores Table and Plot

The quality plot shows the distribution of alignment quality scores. One PDF is generated for all FASTQ files. The quality-score threshold for discarding reads is shown as a black bar.

The quality scores table is used to generate the quality plot and is designed to be read into R. Each row is a FASTQ and each column is an alignment quality score for a read within the FASTQ file. There are N columns, where N is the maximum number of reads across all FASTQ files. FASTQs with fewer than N reads have NaNs to fill the rows.

Name		Name	Last commit message	Last commit date
Latest commit History 285 Commits
edityper		edityper
src		src
.classification_scheme.svg		.classification_scheme.svg
.gitignore		.gitignore
.nygc.jpeg		.nygc.jpeg
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.rst		README.rst
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EdiTyper

Installation

Dependencies

Basic Usage

Classification scheme

Arguments

Alignment Arguments

Reference Arguments

Input Arguments

Output Arguments

Read Group Arguments

Suppression Arguments

Output Files

SAM/BAM Output

Events Table

Read Classifications

Summary Table

Assignments Table

Locus Events Table and Plot

Quality Scores Table and Plot

About

Languages

License

LappalainenLab/edityper

Folders and files

Latest commit

History

Repository files navigation

EdiTyper

Installation

Dependencies

Basic Usage

Classification scheme

Arguments

Alignment Arguments

Reference Arguments

Input Arguments

Output Arguments

Read Group Arguments

Suppression Arguments

Output Files

SAM/BAM Output

Events Table

Read Classifications

Summary Table

Assignments Table

Locus Events Table and Plot

Quality Scores Table and Plot

About

Resources

License

Stars

Watchers

Forks

Languages