This script follows some of the main procedures set forth in Coghlan, A., Tsai, I.J., Berriman, M. 2018. Creation of a comprehensive repeat library for a newly sequenced parasitic worm genome. Protocol Exchange. DOI: 10.1038/protex.2018.054
This is a simple wrapper script that runs multiple repeat-finding programs, including RepeatModeler, TransposonPSI, LTR_finder, and LTR_harvest. LTR_harvest is coupled with LTR_digest and an hmmsearch against Pfam domains associated with LTRs to limit false-positive identifications. The constructed libraries are run through RepeatClassifier to classify the LTRs. USEARCH is then run on the concatenated library to remove redundant LTRs at an 80% identity threshold. The non-redundant library is then used with RepeatMasker to soft-mask the assembly.
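To illustrate what "soft-mask" means in the last step, here is a minimal sketch, not the script's own code. Soft masking converts repeat regions to lowercase instead of replacing them with N. The function name `soft_mask` and the 0-based, end-exclusive interval convention are assumptions made for this example only.

```python
def soft_mask(sequence, repeat_intervals):
    """Lowercase the bases inside each (start, end) interval.

    Intervals are 0-based and end-exclusive (an assumption of this sketch).
    The sequence is uppercased first, so any existing masking is discarded.
    """
    bases = list(sequence.upper())
    for start, end in repeat_intervals:
        # Clamp the interval to the sequence bounds before masking.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


masked = soft_mask("ACGTACGTACGT", [(2, 5), (8, 10)])
print(masked)  # ACgtaCGTacGT
```

Downstream tools that honor soft masking can skip the lowercase regions while the underlying sequence is still preserved.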
Currently, all programs are run with default settings, and few of those settings can be changed through flags. Additional options may be added in future versions if there is a need.
It is recommended to provide additional curated libraries, such as those from RepBase. Select an appropriate taxonomic level, download the file in FASTA format, and provide it with the -rb flag on the command line.
Dependencies should be callable from the command line. If they are not, add the parent directory of each executable to $PATH. If all else fails, paths to the executables can be passed to the script through flags.
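Before running the script, it may help to check which dependencies are visible on $PATH. The sketch below uses Python's standard `shutil.which` for this. The executable names in `DEPENDENCIES` are assumptions for illustration; the actual names on your system may differ (for example, in capitalization).

```python
import shutil

# Hypothetical executable names; adjust them to match your installation.
DEPENDENCIES = [
    "RepeatModeler",
    "RepeatMasker",
    "BuildDatabase",
    "RepeatClassifier",
    "ltr_finder",
    "gt",
    "usearch",
    "transposonPSI.pl",
    "cnv_ltrfinder2gff.pl",
]


def find_missing(executables):
    """Return the names that cannot be located on $PATH."""
    return [name for name in executables if shutil.which(name) is None]


missing = find_missing(DEPENDENCIES)
if missing:
    print("Not found on $PATH:", ", ".join(missing))
else:
    print("All dependencies found on $PATH")
```

Any executable reported as missing can either be added to $PATH or supplied through the corresponding path flag.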
usage: ./TransposableELMT.py [options] -in genome_assembly.fasta -o output_basename
optional arguments:
-h, --help show this help message and exit
-in , --input Genome assembly in FASTA format
-o , --out Basename of output directory and file
--cpus Number of cores to use [default: 2]
-id , --identity Cutoff value for percent identity in USEARCH [default: 0.80]
-en , --engine Search engine used in RepeatModeler [abblast|wublast|ncbi] [default: ncbi]
-rb , --repbase_lib RepBase library of TEs or additional curated library in FASTA format
-rl , --repeatmodeler_lib Pre-computed RepeatModeler library
--hmms Path to directory of TE pfam domain files in HMMER3 format [Default: TransposableELMT/te_hmms]
--REPEATMODELER_PATH Path to RepeatModeler exe if not set in $PATH
--REPEATMASKER_PATH Path to RepeatMasker exe if not set in $PATH
--BUILDDATABASE_PATH Path to BuildDatabase exe if not set in $PATH
--REPEATCLASSIFIER_PATH Path to RepeatClassifier exe if not set in $PATH
--LTRFINDER_PATH Path to LTR_Finder exe if not set in $PATH
--GENOMETOOLS_PATH Path to genometools exe if not set in $PATH
--USEARCH_PATH Path to USEARCH exe if not set in $PATH
--TRANSPOSONPSI_PATH Path to transposonPSI.pl if not set in $PATH
--CNV_LTRFINDER2GFF_PATH Path to cnv_ltrfinder2gff.pl if not set in $PATH
- Soft-masked genome assembly in FASTA format
- RepeatMasker Table file
- RepeatMasker Out file
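As a rough illustration of how the documented flags fit together, the sketch below assembles a command line from Python. It uses only flags listed in the help text above. The helper `build_command` and the file name `repbase_subset.fasta` are hypothetical and exist only for this example.

```python
def build_command(assembly, out_basename, cpus=2, identity=0.80, repbase_lib=None):
    """Assemble an argument list for TransposableELMT.py from documented flags."""
    cmd = [
        "./TransposableELMT.py",
        "-in", assembly,
        "-o", out_basename,
        "--cpus", str(cpus),
        "-id", str(identity),
    ]
    # The curated library is optional, so -rb is added only when one is given.
    if repbase_lib is not None:
        cmd += ["-rb", repbase_lib]
    return cmd


cmd = build_command(
    "genome_assembly.fasta",
    "output_basename",
    cpus=8,
    repbase_lib="repbase_subset.fasta",  # hypothetical file name
)
print(" ".join(cmd))
# To launch the run, pass the list to subprocess.run(cmd, check=True).
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.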