IndelEnsembler is an ensemble method for identifying deletions (DELs), tandem duplications (DUPs) and insertions (INSs, either novel or due to transposition) from next-generation sequencing data. It merges calls from four callers: Lumpy, Manta, SurVIndel and TranSurVeyor.
The reference FASTA file should be indexed with both bwa and samtools. For example, assuming the file is hg19.fa, run:
$ bwa index hg19.fa
$ samtools faidx hg19.fa
Although not mandatory, SurVIndel will generally give higher quality results if a simple repeats file is provided. This can normally be downloaded from the simpleRepeats table in UCSC. The header must be removed and only the chromosome, the start, the end and the period columns must be retained, i.e.:
$ cat downloaded-file | grep -v "#" | cut -f2,3,4,6 > file-for-survindel.bed
Alternatively, you can run TRF and use the provided trf-to-bed.sh, i.e.:
# Generate the repeats file
$ cd /path/to/SurVIndel
$ cat trf-output.dat | ./trf-to-bed.sh > /path/to/simple/repeats/file
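Whichever route is used, it can be worth checking the repeats file before passing it to SurVIndel. The following is a minimal sketch, not part of SurVIndel or IndelEnsembler: the helper name count_bad_repeat_lines is hypothetical, and it assumes the file is tab-separated with exactly the four columns described above (chromosome, start, end, period).

```shell
# count_bad_repeat_lines FILE (hypothetical helper): print the number of lines
# in FILE that are not "chromosome <TAB> start <TAB> end <TAB> period" with
# integer coordinates and start < end. Assumes a tab-separated file.
count_bad_repeat_lines() {
    awk -F '\t' '
        NF != 4 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 { bad++ }
        END { print bad + 0 }
    ' "$1"
}

# Usage:
# count_bad_repeat_lines /path/to/simple/repeats/file
```

A result of 0 means every line has the expected shape; a leftover header line or an extra column shows up as a non-zero count.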
The BAM files should be coordinate sorted, indexed, and should contain the MC and MQ tags. MC and MQ tags can be added using Picard FixMateInformation (http://broadinstitute.github.io/picard/command-line-overview.html#FixMateInformation).
Supposing file.bam is the file resulting from the alignment:
# With no O= argument, Picard overwrites file.bam in place
$ java -jar picard.jar FixMateInformation I=file.bam
$ samtools sort -o sorted.bam file.bam
$ samtools index sorted.bam
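To confirm that the MC and MQ tags are actually present, the records can be inspected directly. This is a hedged sketch rather than an official check: count_missing_mate_tags is a hypothetical helper, it reads headerless SAM text on standard input, and it only looks at paired records whose mate is mapped, on the assumption that those are the records expected to carry both tags.

```shell
# count_missing_mate_tags (hypothetical helper): read SAM records (no header)
# on stdin and print how many paired records with a mapped mate lack an MC
# tag or an MQ tag.
count_missing_mate_tags() {
    awk -F '\t' '
        {
            flag = $2
            paired = flag % 2                  # bit 0x1: read is paired
            mate_unmapped = int(flag / 8) % 2  # bit 0x8: mate is unmapped
            if (!paired || mate_unmapped) next
            has_mc = 0; has_mq = 0
            for (i = 12; i <= NF; i++) {       # optional tags start at column 12
                if ($i ~ /^MC:Z:/) has_mc = 1
                if ($i ~ /^MQ:i:/) has_mq = 1
            }
            if (!has_mc || !has_mq) missing++
        }
        END { print missing + 0 }
    '
}

# Usage (assumes samtools is on PATH); inspect the first 100000 records:
# samtools view sorted.bam | head -n 100000 | count_missing_mate_tags
```

A count of 0 on a reasonably large sample of records suggests the tags were added correctly.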
- Manta
Manta was obtained from its GitHub repository at https://github.com/Illumina/manta. We used manta-1.6.0-0.
$ wget https://github.com/Illumina/manta/releases/download/v1.6.0/manta-1.6.0.centos6_x86_64.tar.bz2
$ tar -xjf manta-1.6.0.centos6_x86_64.tar.bz2
$ cd manta-1.6.0.centos6_x86_64
# Add Manta to your PATH
$ export PATH=/path/to/manta/bin:$PATH
# run Manta
$ python2 /path/to/manta/bin/configManta.py --bam /path/to/bamfile --referenceFasta /path/to/reference/fasta --runDir /an/empty/working/directory
$ /an/empty/working/directory/runWorkflow.py -m local -j 40
$ gunzip /an/empty/working/directory/results/variants/candidateSV.vcf.gz
- Lumpy
Lumpy's source code was cloned from the repository at https://github.com/arq5x/lumpy-sv. We used LUMPY-0.2.13.
$ git clone --recursive https://github.com/arq5x/lumpy-sv.git
$ cd lumpy-sv
# Build Lumpy
$ make
# Add Lumpy to your PATH
$ export PATH=/path/to/lumpy-sv/bin:$PATH
# run Lumpy
$ samtools view -b -F 1294 /path/to/bamfile > /path/to/discordants/bamfile
$ samtools view -h /path/to/bamfile | /path/to/lumpy-sv/scripts/extractSplitReads_BwaMem -i stdin | samtools view -Sb - > /path/to/splitters/bamfile
$ samtools sort /path/to/discordants/bamfile -o /path/to/sorted/discordants/bamfile
$ samtools sort /path/to/splitters/bamfile -o /path/to/sorted/splitters/bamfile
# -o is the output VCF file
$ /path/to/lumpy-sv/bin/lumpyexpress -B /path/to/bamfile -S /path/to/sorted/splitters/bamfile -D /path/to/sorted/discordants/bamfile -o /path/to/lumpy/output.vcf
- SurVIndel
SurVIndel's source code was cloned from the repository at https://github.com/Mesh89/SurVIndel.
$ git clone https://github.com/Mesh89/SurVIndel.git
$ cd SurVIndel
$ ./build_htslib.sh
$ cmake -DCMAKE_BUILD_TYPE=Release . && make
# run SurVIndel
$ python /path/to/SurVIndel/surveyor.py /path/to/bamfile /an/empty/working/directory /path/to/reference/fasta --threads 40 --samtools /path/to/samtools --bwa /path/to/bwa --simple-rep /path/to/simple/repeats/file
$ ./filter /path/to/working/directory alpha-value score-cutoff min-size simple-repeats
# For suitable values of alpha-value, score-cutoff and min-size, refer to https://github.com/Mesh89/SurVIndel
- TranSurVeyor
TranSurVeyor's source code was cloned from the repository at https://github.com/Mesh89/TranSurVeyor.
$ git clone https://github.com/Mesh89/TranSurVeyor.git
$ cd TranSurVeyor
$ ./build_htslib.sh
$ cmake -DCMAKE_BUILD_TYPE=Release . && make
# run TranSurVeyor
$ python surveyor.py /path/to/bamfile /an/empty/working/directory /path/to/reference/fasta --threads 40 --samtools /path/to/samtools --bwa /path/to/bwa --maxTRAsize 10000
$ ./filter /path/to/working/directory
To run clustering, all the calls from all the samples must be in a single file (order does not matter). For example, if we are running three samples with 1000 SVs each, we need a file with 3000 SVs. Each line should appear as in the original calls, with the sample name appended at the end.
For example, suppose we have 2 samples (S1 and S2), with one call each:
S1: SV_1 chr1 100 + chr1 200 - DEL NA
S2: SV_1 chr1 120 + chr1 210 - DEL NA
We need to create a file (e.g. all.sv) with two lines:
SV_1 chr1 100 + chr1 200 - DEL NA S1
SV_1 chr1 120 + chr1 210 - DEL NA S2
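The merged file described above could be assembled with a small helper such as the one below. This is only a sketch under stated assumptions: append_sample is a hypothetical function, not part of IndelEnsembler, and it assumes space-separated columns as in the example (use "\t" instead of " " if your call files are tab-separated). The one-file-per-sample layout in the usage comment is likewise an assumption.

```shell
# append_sample FILE SAMPLE (hypothetical helper): print every call in FILE
# with SAMPLE appended as a final, space-separated column. Blank lines are
# skipped.
append_sample() {
    awk -v sample="$2" 'NF > 0 { print $0 " " sample }' "$1"
}

# Usage, assuming one calls file per sample named <sample>.sv:
# for f in /path/to/calls/*.sv; do
#     append_sample "$f" "$(basename "$f" .sv)"
# done > all.sv
```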
After that, we can run clustering as follows:
$ ./clusterer all.sv MAX_DIST MIN_OVERLAP
MAX_DIST and MIN_OVERLAP are the maximum distance between breakpoints (in bp) and minimum overlap, respectively, for two SVs to be potentially part of the same cluster. 200 and 0 were used in the manuscript.
If the clustering process is too slow, the file can be divided by SV type and by chromosome, and each subfile can be run individually.
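One possible way to do this split is sketched below. The helper split_calls is hypothetical, and it assumes the column layout of the example above: the chromosome in column 2 and the SV type in column 8, separated by whitespace. The output directory should be empty, because the helper appends to any files already there.

```shell
# split_calls FILE OUTDIR (hypothetical helper): append each call in FILE to
# OUTDIR/<type>.<chromosome>.sv, taking the chromosome from column 2 and the
# SV type from column 8. OUTDIR should be empty beforehand.
split_calls() {
    mkdir -p "$2"
    awk -v dir="$2" '
        NF > 0 {
            out = dir "/" $8 "." $2 ".sv"
            print >> out
            close(out)   # avoid keeping many files open at once
        }
    ' "$1"
}

# Usage, with the MAX_DIST and MIN_OVERLAP values used in the manuscript:
# split_calls all.sv /an/empty/split/directory
# for f in /an/empty/split/directory/*.sv; do
#     ./clusterer "$f" 200 0
# done
```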
Installing IndelEnsembler is easy. Download the package with wget and unzip it, or clone it with git, then build it.
# Download IndelEnsembler
$ wget https://github.com/kensung-lab/IndelEnsembler/archive/refs/heads/main.zip
$ unzip main.zip
or
$ git clone https://github.com/kensung-lab/IndelEnsembler.git
# Build (the directory is named IndelEnsembler-main if you downloaded the zip)
$ cd IndelEnsembler
$ ./build.sh
# Usage
$ vi pipeline.sh
# Set the paths of ref_genome and repeats
$ bash pipeline.sh /path/to/manta/result /path/to/lumpy/result /path/to/survindel/result /path/to/transurveyor/result /an/empty/working/directory 200