
3. Preparing the data

Jay Ghurye edited this page Oct 7, 2018 · 3 revisions

Suppose you have paired-end reads for a metagenomic sample in two files named first.fastq.gz and second.fastq.gz. First, assemble these reads into contigs using a standard metagenomic assembler. We recommend MEGAHIT or metaSPAdes; refer to the assembler's manual for how to generate an assembly. After assembly, your contigs will be in a FASTA file, referred to here as contigs.fasta (the actual file name depends on which assembler you used).
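As one possible way to carry out the assembly step, the sketch below wraps a MEGAHIT run in a small shell function. It assumes that megahit is installed and on your PATH. The function name assemble_with_megahit and the output directory name megahit_out are illustrative choices, not part of MetaCarvel or MEGAHIT. MEGAHIT writes its contigs to final.contigs.fa inside the output directory, so the function copies that file to contigs.fasta to match the name used on this page.

```shell
#!/usr/bin/env bash
# Sketch only: assemble paired-end reads with MEGAHIT.
# Assumes `megahit` is on PATH; function and directory names are hypothetical.

assemble_with_megahit() {
  local r1="$1" r2="$2" out_dir="$3"

  # -1 and -2 take the two paired-end read files; -o names the output directory.
  megahit -1 "$r1" -2 "$r2" -o "$out_dir" || return 1

  # MEGAHIT places the assembled contigs in final.contigs.fa;
  # copy them to the file name used in the rest of this page.
  cp "$out_dir/final.contigs.fa" contigs.fasta
}

# Example call (uncomment to run):
# assemble_with_megahit first.fastq.gz second.fastq.gz megahit_out
```

If you use metaSPAdes or another assembler instead, only the command inside the function and the name of the contig file it copies would change.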

Next, align the reads to the assembled contigs; these alignments are essential for scaffolding. You can use any short-read aligner, but we recommend Bowtie2. Be sure to align each read file in single-end mode, even though the reads are paired-end, to avoid introducing any bias related to library size. You would run this as follows:

bowtie2-build contigs.fasta idx  # build the index
bowtie2 -x idx -U first.fastq.gz | samtools view -bS - | samtools sort - -o alignment_1.bam  # align first reads
bowtie2 -x idx -U second.fastq.gz | samtools view -bS - | samtools sort - -o alignment_2.bam  # align second reads
samtools merge alignment_total.bam alignment_1.bam alignment_2.bam  # merge the alignments
samtools sort -n alignment_total.bam -o alignment.bam  # sort by read names
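Because the last command sorts the merged alignments by read name, a quick sanity check is to confirm that the header of alignment.bam declares that order. The following is a minimal sketch, assuming samtools is on your PATH; the helper names is_name_sorted and check_alignment are hypothetical. It relies on `samtools sort -n` recording SO:queryname on the @HD header line.

```shell
#!/usr/bin/env bash
# Sketch only: confirm that a BAM header declares read-name (queryname) order.
# Assumes `samtools` is on PATH; the helper names are hypothetical.

is_name_sorted() {
  # Reads SAM header text on stdin; succeeds only if the @HD line has SO:queryname.
  grep '^@HD.*SO:queryname' > /dev/null
}

check_alignment() {
  local bam="$1"
  # `samtools view -H` prints only the header of the BAM file.
  if samtools view -H "$bam" | is_name_sorted; then
    echo "$bam: sorted by read name"
  else
    echo "$bam: not sorted by read name" >&2
    return 1
  fi
}

# Example call (uncomment to run):
# check_alignment alignment.bam
```

If the check fails, rerunning the final `samtools sort -n` step on the merged BAM should produce a file that passes.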

At this point, you have the contigs and the alignments of the reads to those contigs. That is all the input you need to run MetaCarvel.
