Annotation of repeat and CDS in the assembled nuclear genome
As an essential step, prior to the de-novo CDS annotation, the repeat classes were identified and annotated in the genome. To prepare a comprehensive repeat library, a recently benchmarked pipeline that brings a brilliant compilation of analytical tools, The Extensive de novo TE Annotator EDTA, was used. Alternatively, one can try TransposonUltimate for repeat library preparation.
As an initial step, the repeat classes which are known to be poorly represented in EDTA analysis were compiled from existing public repositories.
Compiling data from SINEBase.
# Retrieving insect specific SINE repeats
wget http://sines.eimb.ru/banks/SINEs.bnk
seqkit seq SINEs.bnk -n | grep "Insecta" > sines.insect.id
# Renaming SINE repeats as per EDTA requirement
seqkit grep -n -f sines.insect.id SINEs.bnk -o sines.insect.seq
cat sines.insect.seq | seqkit seq -i | seqkit replace -p $ -r _insecta_sinebase#SINE/Unknown > sines.insect.lib
Compiling data from Dfam database.
# Retrieving insect specific repeat classes from Dfam database
wget https://www.dfam.org/releases/Dfam_3.7/families/Dfam_curatedonly.h5.gz
gunzip Dfam_curatedonly.h5.gz
famdb.py -i Dfam_curatedonly.h5 families -f fasta_name --include-class-in-name -d --class SINE 'Insecta' > dfam.insecta.sine.seq
famdb.py -i Dfam_curatedonly.h5 families -f fasta_name --include-class-in-name -d --class LINE 'Insecta' > dfam.insecta.line.seq
famdb.py -i Dfam_curatedonly.h5 families -f fasta_name --include-class-in-name -d --class Satellite 'Insecta' > dfam.insecta.sat.seq
famdb.py -i Dfam_curatedonly.h5 families -f fasta_name --include-class-in-name -d --class rRNA 'Insecta' > dfam.insecta.rdna.seq
# Renaming the repeat classes as per EDTA requirement
cat dfam.insecta.sine.seq | seqkit seq -i | seqkit replace -p "\#.+" -r "_insecta_dfam#SINE/Unknown" > dfam.insecta.sine.lib
cat dfam.insecta.line.seq | seqkit seq -i | seqkit replace -p "\#.+" -r "_insecta_dfam#LINE/Unknown" > dfam.insecta.line.lib
cat dfam.insecta.sat.seq | seqkit seq -i | seqkit replace -p "\#.+" -r "_insecta_dfam#Satellite/Satellite" > dfam.insecta.sat.lib
cat dfam.insecta.rdna.seq | seqkit seq -i | seqkit replace -p "\#.+" -r "_insecta_dfam#rDNA/5S" > dfam.insecta.rdna.lib
Now, combining the repeat data compiled from databases.
cat sines.insect.lib dfam.insecta.sine.lib dfam.insecta.line.lib dfam.insecta.sat.lib dfam.insecta.rdna.lib > repeat.combined.lib
grep -c ">" repeat.combined.lib
To prevent errornous annotation of coding sequences as repeats, curated CDS regions from a related beetle species Leptinotarsa decemlineata was retrieved and supplied to EDTA during analysis.
# CDS data from LepDec
wget https://data.nal.usda.gov/system/files/lepdec_OGSv1-2.tar.gz (On 2023-07-19)
tar -xvzf lepdec_OGSv1-2.tar.gz
mv OGSv1.2/ lepdec.OGSv1.2
cp ../lepdec.OGSv1.2/lepdec_OGSv1.2_GCF_000500325.1_cds.fa ./lepdec.cds.fa
Repeat library preparation using EDTA.
EDTA.pl --genome ~/path/to/dir/in/purged.assembly.fa --cds lepdec.cds.fa --curatedlib repeat.combined.lib \
--overwrite 1 --sensitive 1 --anno 1 --evaluate 0 --threads 100
# Rename resulting output
cp ~/path/to/dir/in/purged.assembly.fa.mod.EDTA.TElib.fa ./TElib.fa
The ouput of EDTA analysis is a de-novo repeat library
TElib.fa
.
Using the repeat library, the repeat positions and classes can now be annotated with RepeatMasker and the repeat regions get soft-masked at the same time. Soft-making the genome facilitates CDS/gene annotation in the subsequent steps.
# Running repeatmasker
RepeatMasker -pa $threads ~/path/to/dir/in/purged.assembly.fa -lib TElib.fa -gff -xsmall
As a result from this step, the nuclear genome is soft-masked for repeat regions in
purged.assembly.masked.fa
and is ready for gene annotation.
During the process of gene annotation, as implemented in BRAKER, the analytical tools get trained with existing gene sequences from related taxa for efficient gene prediction. Therefore, the publicly available datasets with curated gene sequences were retrieved. One was the Arthropoda specific gene set from the OrthoDB and the other was the community-curated official gene set of a related species Leptinotarsa decemlineata.
# Retrieving protein databases
wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Arthropoda.fa.gz
gunzip Arthropoda.fa.gz
wget https://data.nal.usda.gov/system/files/lepdec_OGSv1-2.tar.gz
tar -xvzf lepdec_OGSv1-2.tar.gz
cat Arthropoda.fa ./lepdec.OGSv1.2/lepdec_OGSv1.2_GCF_000500325.1_pep.fa > proteins.fa
grep -c ">" proteins.fa
# Replacing whitespace in header
cat proteins.fa | seqkit replace -p "\s" -r "_" > proteins.edited.fa
Now conducting the gene annotation with BRAKER.
# Ab-inito Gene prediction with BRAKER3
braker.pl --genome=~/path/to/dir/in/purged.assembly.masked.fa --prot_seq=~/path/to/dir/in/proteins.edited.fa \
--threads $threads --gff3 --workingdir=~/path/to/dir/out/
We can summarizin the output from the BRAKER analysis using the output file braker.gtf
.
# Checking braker output summary
#number of gene models, total length & avg gene length
cat braker.gtf | awk '{ if ($3 == "gene") print $0 }' | awk '{ sum += ($5 - $4) } END { print NR, sum, sum / NR }'
#number of exons and avg length
cat braker.gtf | awk '{ if ($3 == "exon") print $0 }' | awk '{ sum += ($5 - $4) } END { print NR, sum, sum / NR }'
#number of introns and avg length
cat braker.gtf | awk '{ if ($3 == "intron") print $0 }' | awk '{ sum += ($5 - $4) } END { print NR, sum, sum / NR }'
#max gene length
cat braker.gtf | awk '$3 ~ /gene/ {print $5 - $4}' | sort -nr | head
#number of gene models
cat braker.gtf | grep -w gene | wc -l
Assessing the gene annotation by checking the identified CDS regions against the BUSCO datasets. The coding sequences are stored in braker.codingseq
in the BRAKER output directory.
# Annotation quality assessment
busco -i ~/path/to/dir/in/braker.codingseq --out_path ~/path/to/dir/out -o $prefix -l ~/path/to/dir/out/$busco.db \
-m protein --offline -c $threads --augustus_parameters='--progress=true' --download_path ~/path/to/dir/in/busco_downloads
Checking the annotations to find if there exist any potential overlap among the repeat regions (output from RepeatMasker) and the protein-coding regions (BRAKER output).
bedtools intersect -a ~/path/to/dir/in/braker.gff3 -b ~/path/to/dir/in/purged.assembly.fa.out.gff -wo > intersect.txt
# no of overlapping repeats with exon region & the size of total overlap
cat intersect.txt | awk '{ if ($3 == "exon") print $0 }' | awk '{ sum += ($22) } END { print NR, sum}'
# no of overlapping repeats with intron region & the size of total overlap
cat intersect.txt | awk '{ if ($3 == "intron") print $0 }' | awk '{ sum += ($22) } END { print NR, sum}'
cat intersect.txt | awk '{ if ($3 == "intron") print $9 }' | sort | uniq