Assessing the draft assembly and correcting it, if necessary.
The assembly statistics are assessed using gaas and SeqKit. Other tools such as Quast and genometools can also be used for the purpose.
gaas_fasta_statistics.pl -f ~/path/to/dir/in/draft.assembly.fa
seqkit stat ~/path/to/dir/in/draft.assembly.fa
Assembly completeness in reference to single copy orthologs databse OrthoDB are checked using BUSCO. Two databases are used for the purpose: endopterygota_odb10
& insecta_odb10
.
busco -i ~/path/to/dir/in/draft.assembly.fa --out_path ~/path/to/dir/out -o $prefix -l ~/path/to/dir/out/$busco.db \
-m genome --offline -c $threads --download_path ./busco_downloads
Then, merqury checks the k-mer distribution in the assembly with respect to that in the sequence reads.
meryl count threads=$threads k=21 ~/path/to/dir/in/sr.corrected*.fq.gz output illumina.meryl
meryl histogram illumina.meryl > illumina.hist #Not required. This analysis gets performed in the next command by default.
$MERQURY/merqury.sh ../illumina.meryl ~/path/to/dir/in/draft.assembly.fa illumina.merqury
As the results from BUSCO and MERQURY indicate for the presence significant levels of duplicatin in the draft assembly, the next step is to purge those duplicate regions with the help of purge.dups. This is a multi-step process.
As a note of caution, prior to purging, the mitochondrial fragment may be removed from the draft assembly as its presence could potentially identify other nuclear fragmnet as duplicates as in the case of putative NUMTs.
Step 1: Splitting the draft assembly & removing the mito genome
#Using blast to locate the mito genome within assembly
mito=~/path/to/dir/mito.draft.fa
draft_asm=~/path/to/dir/draft.assembly.fa
blastn -query $mito -subject $draft_asm -out blast-mito-draft -outfmt 7
# Manually splitting the identified contig to separate mito fragment
grep -A1 "hits found" blast-mito-draft | awk 'NR==2 {print $2}' > one.contig.id
seqkit grep -f one.contig.id $draft_asm > one.contig.fa #under the conda package 'seqkit'
seqkit grep -v -f one.contig.id $draft_asm -o draft.oneless.fa
split_fa one.contig.fa > one.contig.split.fa #under the conda package 'purgedups'
# Renaming every split of the identified contig & retrieving the mito fragment
seqkit replace -p "\:" -r ',s' one.contig.split.fa | seqkit replace -p "\-" -r '_s' > one.contig.split.rename.fa
grep ">" one.contig.split.rename.fa | sed -n 2p | sed 's/>//g' > mito.id
seqkit grep -f mito.id one.contig.split.rename.fa > mito.contig.fa
# Replacing rest of the splits of the identified contigs to the draft assembly
seqkit grep -v -f mito.id one.contig.split.rename.fa -o nomito.contig.fa
seqkit concat draft.oneless.fa nomito.contig.fa --full > nuc.fa
Step 2: Identifying duplicates in nuclear assembly
# Indexing, splitting & self-mapping the draft assembly
minimap2 -t $threads -k 21 -d ./nuc.mm2.index ~/path/to/dir/in/nuc.fa
split_fa ~/path/to/dir/in/nuc.fa > nuc.split
minimap2 -x asm5 -DP -t $threads nuc.split nuc.split | gzip -c - > nuc.self.paf.gz
The resulting
nuc.self.paf.gz
will be used to prepare the.bed
file at the purging step.
# Mapping nanopore long reads to the draft assembly & checking the base coverage
minimap2 -t $threads -x map-ont ./nuc.mm2.index ~/path/to/dir/in/lr.corrected.fa.gz | gzip -c - > nuc.nano.paf.gz
pbcstat nuc.nano.paf.gz # outputs are PB.base.cov & PB.stat files
calcuts PB.stat > cutoffs.default 2> calcults.log
scripts/hist_plot.py -c cutoffs.default PB.stat PB.coverage.default.png
Step 3: Preparing for purging the assembly by checking appropriate parameter values.
Check how the determined cutoff values are positioned across the base coverage graph. This step is essential as the cutoff values assigned at this step will be used to purge the duplications from the draft assembly. To make sure that the right cutoff values are being used, different sets of cutoffs are tested for purging extension with an aim to prevent over-purging.
# Manually setting the cutoffs
lowcut=(2 1 1 1 1)
midcut=(21 21 17 21 21)
upcut=(220 220 220 500 501)
for i in {0..4}
do
j=$(($i+1))
calcuts -l ${lowcut[$i]} -m ${midcut[$i]} -u ${upcut[$i]} PB.stat > cutoffs.m$j
scripts/hist_plot.py -c cutoffs.m$j PB.stat PB.coverage.m$j.png
done
Step 4: Now, the draft assembly can be iteratively purged with the appropriately identified cutoff values.
# Purging the assembly
for i in {1..5}
do
purge_dups -2 -T cutoffs.m$i -c PB.base.cov ./draft.split.self.paf.gz > dups.m$i.bed 2> purge_dups.m$i.log
get_seqs -l 500 -e dups.m$i.bed ./nuc.fa -p nuc.m$i.endpurged
done
Assess the purged assemblies and compare. For instance, by using k-mer based statistics from MERQURY and orthology completeness from BUSCO.
# Using MERQURY
for i in {1..5}
do
$MERQURY/merqury.sh ../illumina.meryl ~/path/to/dir/in/nuc.m$i.endpurged.purged.fa \
~/path/to/dir/in/nuc.m$i.endpurged.hap.fa purge.m$i.merqury
done
# Using BUSCO
for i in {1..5}
do
busco -i ~/path/to/dir/in/nuc.m$i.endpurged.purged.fa -o nuc.m$i.endpurged -m genome -l ./endopterygota_odb10 \
--download_path ./busco_downloads --offline -c $threads
done
Now, the final purged assembly can be shortened for fasta identifier and subsequently renamed for easy downstream processes.
sed 's/,.*//g' nuc.m4.endpurged.purged.fa | sed 's/scaffold/scaff_/g' > nuc.purged.fa
Manual correction of the purged assembly by removing full Ns contigs & terminal Ns
seqkit grep -s -v -r -p "^[Nn]+$" nuc.purged.fa | seqkit replace -is -p "^N+|N+$" -r "" > nuc.clean.fa
Then, compare the draft and purged assembly using Quast.
seqkit concat --full nuc.clean.fa mito.contig.fa > purged.assembly.fa
quast draft.assembly.fa purged.assembly.fa -r reference.fasta.gz -g genes.gff -t $threads -o ~/path/to/dir/out \
--circos --labels Draft,Purged
Now, perform a contamination check with the recently developed NCBI-FCS tool set. This can preferably be performed as a part of NCBI submission or locally as well.
# Detecting adaptor & vector contaminations
./run_fcsadaptor.sh --fasta-input ~/path/to/dir/in/purged.assembly.fa --output-dir ~/path/to/dir/out --euk
# Detecting contaminations from foreign organisms (--tax-id 7041 is for beetle taxa)
python3 ./fcs.py screen genome --fasta ~/path/to/dir/in/purged.assembly.fa --out-dir ~/path/to/dir/out \
--gx-db "$GXDB_LOC/db" --tax-id 7041
Plot the final cleaned assembly for visual inspection using BlobTools.
# Using BlobTools
blobtools create --fasta ~/path/to/dir/in/purged.assembly.fa ~/path/to/dir/out/assembly-blob
blobtools add --busco ~/path/to/dir/in/full_table.tsv ~/path/to/dir/out/assembly-blob
blobtools view --remote ~/path/to/dir/out/assembly-blob
blobtools host `pwd`