Skip to content

Latest commit

 

History

History
146 lines (95 loc) · 7.92 KB

README.md

File metadata and controls

146 lines (95 loc) · 7.92 KB

ebwt2InDel

Overview

ebwt2InDel is the new version of ebwt2snp (https://github.com/nicolaprezza/ebwt2snp). The new version detect also InDels and works directly on the input BWT, without building the LCP and GSA arrays. This is about 8 times more space-efficient (but slightly slower) than version 1.

The ebwt2InDel suite can be used to discover SNPs/indels inside one or between two sets of reads (fasta/fastq) without aligning them to a reference genome (alignment-free, reference-free) by just analyzing the Burrows-Wheeler Transform of the dataset. The output is a fasta file (in KisSNP2 format) where sequences are the contexts surrounding the identified SNPs/indels. The suite finds its main use in applications where no reference genome is known (alignment-free, reference-free variation discovery), but could also lead to the discovery of variations not detected by standard alignment-based tools (which suffer from issues such as low-quality reference, alignment-dependent biases, and difficulty of aligning fragments containing gaps or inversions). Examples of use include:

  • Genotyping: given a read set of an individual, find Heterozygous sites.
  • Metagenomics: given a read set containing DNA sequenced from many individuals (possibly of unknown species and unknown number), find differences between their genomes. For example, when run on a bacteria sample, the tool can be used to isolate fragments covering antibiotic-resistent genes (since not all bacteria will be resistent, this event will appear as a SNP/INDEL and will be detected by the tool).
  • Rare-variant discovery: given the reads sequenced from different cells of an individual, output fragments covering low-frequency variants (e.g. cancer-driving mutations).
  • Pairwise variant discovery: find differences between the reads sequenced from two individuals.

Modes

ebwt2InDel supports three modes:

  1. Analyze one dataset. This mode finds SNPs/INDELs within a single read collection (e.g. genotyping, metagenomics, or rare-variant discovery).
ebwt2InDel -1 collection.ebwt -o output.snp
  1. Analyze two datasets. This mode finds SNPs/INDELs between two read collections. The difference with the previous mode is that differences within the same dataset are not reported.
ebwt2InDel -1 collection1.ebwt -2 collection2.ebwt -o output.snp
  1. Analyze two datasets merged in a single BWT. This mode operates as the previous one, but the input is provided differently: one BWT of the two merged collections, plus the binary Document Array telling which suffix belongs to which collection. This is twice as fast as the previous mode, because only one BWT is navigated (thus the number of cache misses is halved). The document array is a simple ASCII file filled with "0" and "1".
ebwt2InDel -1 merged_collection.ebwt -d DA.txt -o output.snp

Pipelines

IMPORTANT: the DNA fragments of the input read set must contain only symbols 'A', 'C', 'G', 'T'. It is suggested either to filter out reads containing unknown characters 'N', or to remove them, or to replace them with random nucleotides.

First, the BWT of the input dataset(s) must be built. To achieve this, you can use https://github.com/felipelouza/egsa, https://github.com/giovannarosone/BCR_LCP_GSA, or https://github.com/felipelouza/egap. Then, ebwt2InDel can be run on the computed BWT(s) using one of the three modes above described.

Upon termination of ebwt2InDel, the tool filter_snp can be used to filter the .snp file generated by ebwt2InDel so to keep only SNP/indels covered by at least m reads. Increasing m improves precision (but decreases sensitivity), and it is suggested when the input dataset is high-covered (e.g. a value m = 5 is suggested for coverages >= 25x).

If a reference is available, a VCF file can be generated from the output calls with the following pipeline:

  • seqtk seq -F 'I' calls.snp > calls.fastq (convert the .snp file to fastq with fake base qualities)
  • bwa mem: to align calls.fastq to the reference genome
  • sam2vcf converts the ".sam" file produced by the previous step into a ".vcf" file containing the variations.

We call snp2vcf the pipeline seqtk -> bwa-mem -> sam2vcf.

Finally, If also a ground-truth VCF file is available, one can validate the above VCF file using the executable vcf_vs_vcf.

Publications

  • *Nicola Prezza, Nadia Pisanti, Marinella Sciortino and Giovanna Rosone: Detecting mutations by eBWT. WABI 2018. Leibniz International Proceedings in Informatics, LIPIcs , 2018, 113, art. no. 3, Schloss Dagstuhl--Leibniz-Zentrum für Informatik.
  • Nicola Prezza, Nadia Pisanti, Marinella Sciortino, and Giovanna Rosone, 2019. SNPs detection by eBWT positional clustering. Algorithms for Molecular Biology, 14(1), p.3.
  • Nicola Prezza, Nadia Pisanti, Marinella Sciortino, Giovanna Rosone: Variable-order reference-free variant discovery with the Burrows-Wheeler Transform. BMC Bioinform (2020), 21-S(8): 260, 10.1186/s12859-020-03586-3.*

A pre-print version can be found here: https://arxiv.org/abs/1805.01876.

Funding

Supported by the project Italian MIUR-SIR CMACBioSeq ("Combinatorial methods for analysis and compression of biological sequences") grant n.~RBSI146R5L, PI: Giovanna Rosone. Link: http://pages.di.unipi.it/rosone/CMACBioSeq.html

Install

#download BCR (for BWT computation)
git clone https://github.com/giovannarosone/BCR\_LCP\_GSA

#build ebwt2InDel
cd ebwt2InDel
mkdir build
cd build
cmake ..
make

#build BCR
cd ../../BCR\_LCP\_GSA
make

Run - one sample

Enter the folder with the fasta file reads.fasta . We assume that executables 'BCR_LCP_GSA', 'ebwt2InDel' are global.

#Step 1: remove reads containing 'N', or replace 'N' with random nucleotides
#Step 2: optional, but considerably increases sensitivity of the tool. Insert in reads.fasta also the reverse-complement of the reads.

#Build the BWT of the sets of reads (using 1GB of RAM)
BCR\_LCP\_GSA reads.fasta reads.fasta.bwt 1024

#Call SNPs (do this in the same folder containing all other files)
ebwt2InDel -1 reads.fasta.bwt -o output.snp

#File output.snp now contains identified SNPs/indels.

#Filter only events supported by at least 5 reads
filter_snp output.snp 5 > output.5.snp

One sample - parallel mode (experimental)

One sample can also be analyzed using multithreading. The idea is to sort the input fasta file by context (using HARC). After that, the sorted fasta can be broken into pieces, each of which can be analyzed separately.

We provide a script, pebwt2InDel, that executes automatically the parallel pipeline.

Required pre-installed software:: HARC (https://github.com/shubhamchandak94/HARC) BCR_LCP_GSA (https://github.com/giovannarosone/BCR_LCP_GSA)

Usage

pebwt2InDel.sh input_fasta threads RAM read_len output_directory/ harc_folder/ 

NOTES:

  • use absolute paths ending with /
  • "read_len" must be shorter than or equal to the length of most reads; remainders are lost!
  • RAM is the RAM used by BCR (in MB)
  • RAM: at most n bytes, where n is the total number of nucleotides.
  • Disk usage: in addition to the input and output (snp file), the process uses 4n additional Bytes on disk.

Run - two samples

Enter the folder with the two fasta files reads1.fasta and reads2.fasta (i.e. the reads of the two samples). We assume that executables 'BCR_LCP_GSA', 'ebwt2InDel' are global.

#Step 1: remove reads containing 'N', or replace 'N' with random nucleotides
#Step 2: optional, but considerably increases sensitivity of the tool. Insert in reads1.fasta also the reverse-complement of the reads. Repeat with reads2.fasta

#Build the BWT of the sets of reads
BCR\_LCP\_GSA reads1.fasta reads1.fasta.bwt 1024
BCR\_LCP\_GSA reads2.fasta reads2.fasta.bwt 1024

#Call SNPs (do this in the same folder containing all other files)
ebwt2InDel -1 reads1.fasta.bwt -2 reads2.fasta.bwt -o output.snp


#Filter only events supported by at least 5 reads
filter_snp output.snp 5 > output.5.snp