Skip to content

bvaldebenitom/SoloTE

Repository files navigation

SoloTE

DOI

SoloTE README

0. DEPENDENCIES

SoloTE requires the following tools to be installed, and available in your PATH environment variable:

Also, Python3.9.5 or higher should be available in your computer, along with the following modules:

1. SETTING UP NECESSARY FILES

RepeatMasker files (*rm.out) for several genomes, can be found at the UCSC webpage. The helper utility, SoloTE_RepeatMasker_to_BED.py, is packed with SoloTE, and it streamlines the download of RepeatMasker file from UCSC, and conversion to BED format. It can be run like this:

python SoloTE_RepeatMasker_to_BED.py -g GenomeVersion

where

  • GenomeVersion: genome version identifier. For example, hg38 for human, and mm10 for mouse.

Additionally, the utility can be called with the -l option, and it will list identifiers that can be supplied as the -g parameter, of all available genomes at UCSC (first 5 lines shown next):

python SoloTE_RepeatMasker_to_BED.py -l
ailMel1	| Panda [Ailuropoda melanoleuca, Dec. 2009 (BGI-Shenzhen 1.0/ailMel1)]
allMis1	| American alligator [Alligator mississippiensis, Aug. 2012 (allMis0.2/allMis1)]
anoCar1	| Lizard [Anolis carolinensis, Feb. 2007 (Broad/anoCar1)]
anoCar2	| Lizard [Anolis carolinensis, May 2010 (Broad AnoCar2.0/anoCar2)]
anoGam1	| A. gambiae [Anopheles gambiae, Feb. 2003 (IAGEC MOZ2/anoGam1)]

If you have your own RepeatMasker file and/or a file corresponding to Transposable Elements obtained from another tool, make sure to adapt it to the following format for SoloTE:

sequenceName	startPosition	endPosition	sequenceName|startPosition|endPosition|TE_Subfamily:TE_Family:TE_Class|strand	score(optional)	.

So, column 4, the ID, is a concatenation of the locus of the TE and its identifiers at the Subfamily, Family and Class level. The file should look like this:

chr1	11505	11675	chr1|11505|11675|L1MC5a:L1:LINE|25.1|-	25.1	-
chr1	11678	11780	chr1|11678|11780|MER5B:hAT-Charlie:DNA|29.4|-	29.4	-
chr1	15265	15355	chr1|15265|15355|MIR3:MIR:SINE|23.0|-	23.0	-
chr1	18907	19048	chr1|18907|19048|L2a:L2:LINE|33.8|+	33.8	+
chr1	19972	20405	chr1|19972|20405|L3:CR1:LINE|31.2|+	31.2	+

2. RUNNING SOLOTE

Once everything is set up, you can run the SoloTE script:

python SoloTE_pipeline.py --threads NumberOfThreads --bam BAMfile --teannotation BEDfile --outputprefix Prefix --outputdir OutputDirectory

where

--threads: Number of threads to use

--bam: BAM file

--teannotation: TE annotation in BED format

--outputprefix: Output files prefix

--outputdir: Output directory (if it doesn't exist, it will be created)