
Releases: oushujun/EDTA

Big update to v2.2.0

12 Jan 17:20
Pre-release
replace local AnnoSINEv2 with the conda version

panEDTA for consistent pan-genome TE annotation

11 Oct 04:36
9d7f12a

Release notes and usage

This is the serial version of panEDTA. Each genome is annotated sequentially, and the annotations are then combined by panEDTA. Existing EDTA annotations of genomes (--anno 1) are recognized and reused. To accelerate pan-genome annotation, run the EDTA annotation of each genome separately and in parallel, then run panEDTA to finish the remaining steps. Save the GFF files and the sum file of the EDTA results beforehand, because panEDTA will overwrite them. The toy example in the ./test folder is a good way to get familiar with the workflow.
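Because panEDTA overwrites the per-genome GFF and sum files, a backup step can be sketched as follows. The function name `backup_edta_outputs` and the file name patterns (`*.mod.EDTA.TEanno.gff3`, `*.mod.EDTA.TEanno.sum`) are assumptions for illustration; adjust them to the files actually present in your working directory.

```shell
# Copy EDTA GFF3 and summary files into a backup directory.
# Usage: backup_edta_outputs <workdir> <backupdir>
backup_edta_outputs() {
    workdir=$1
    backupdir=$2
    mkdir -p "$backupdir" || return 1
    # File name patterns are assumptions; extend the list as needed.
    for f in "$workdir"/*.mod.EDTA.TEanno.gff3 "$workdir"/*.mod.EDTA.TEanno.sum; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$f" ] || continue
        cp -p "$f" "$backupdir"/ || return 1
    done
}
```

For example, `backup_edta_outputs . ./EDTA_backup` run from the working directory keeps a copy of the current results before panEDTA is started.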

sh panEDTA.sh -genomes genome_list.txt -cds cds.fasta -threads 10
    -g	A list of genome files with paths accessible from the working directory.
                Required: You can provide only a list of genomes in this file (one column, one genome each row).
                Optional: You can also provide both genomes and CDS files in this file (two columns, one genome and one CDS each row).
                    CDS files may be missing for some or all genomes.
    -c	Optional. Coding sequence files in fasta format.
                The CDS file provided via this parameter will fill in the missing CDS files in the genome list.
                If no CDS files are provided in the genome list, then this CDS file will be used on all genomes.
    -l	Optional. A manually curated, non-redundant library following the RepeatMasker naming format.
    -f	Minimum number of full-length TE copies in individual genomes to be kept as candidate TEs for the pangenome.
                Lower is more inclusive, and will ↑ library size, ↑ sensitivity, and ↑ inconsistency.
                Higher is more stringent, and will ↓ library size, ↓ sensitivity, and ↓ inconsistency.
                Default: 3.
    -t	Number of CPUs to run panEDTA. Default: 10.
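The genome list format and the CDS fallback described above can be illustrated with a small dry-run helper. It reads a one- or two-column list, fills in a shared CDS file where a genome has none, and prints one EDTA command per genome without executing anything. The function name `print_edta_jobs` is hypothetical, and the exact EDTA options that panEDTA expects from a prior run are not stated in these notes, so treat the printed commands as a starting point rather than the definitive invocation. Paths containing whitespace are not handled.

```shell
# Print one EDTA command per genome listed in a panEDTA-style genome list.
# Usage: print_edta_jobs <genome_list> <default_cds> <threads>
# Commands are only printed, so they can be reviewed or handed to a scheduler.
print_edta_jobs() {
    list=$1
    default_cds=$2
    threads=$3
    # Column 1: genome file; column 2 (optional): CDS file for that genome.
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while read -r genome cds rest || [ -n "$genome" ]; do
        [ -n "$genome" ] || continue      # skip blank lines
        cds=${cds:-$default_cds}          # fall back to the shared CDS file
        if [ -n "$cds" ]; then
            echo "EDTA.pl --genome $genome --cds $cds --anno 1 --threads $threads"
        else
            echo "EDTA.pl --genome $genome --anno 1 --threads $threads"
        fi
    done < "$list"
}
```

The printed lines can be submitted as independent jobs; once they finish, panEDTA can be run on the same genome list so that the existing annotations are reused.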

Reference:

Ou S., Collins T., Qiu Y., Seetharam A., Menard C., Manchanda N., Gent J., Schatz M., Anderson S., Hufford M.✉, Hirsch C.✉ (2022). Differences in activity and stability drive transposable element variation in tropical and temperate maize. bioRxiv

New features and bug fix

23 Jun 00:33

New features

  1. added the --u parameter to allow user-specified mutation rate #271
  2. allow users to use the count_base.pl genome stats to replace the -genome_size and -seq_count parameters in util/buildSummary.pl.
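To show why the mutation rate matters, here is a minimal sketch of an LTR insertion-time calculation. It assumes the commonly used estimate T = K / (2u), where K is the Jukes-Cantor corrected divergence between the two LTRs of an element and u is the mutation rate per bp per year. The function name `ltr_age_years` and the fallback rate of 1.3e-8 are assumptions for illustration, and this is not EDTA's own code.

```shell
# Estimate LTR insertion time in years from LTR divergence and mutation rate.
# Usage: ltr_age_years <divergence (0 < d < 0.75)> [mutation_rate]
ltr_age_years() {
    awk -v d="$1" -v u="${2:-1.3e-8}" 'BEGIN {
        k = -0.75 * log(1 - 4 * d / 3)    # Jukes-Cantor corrected divergence
        printf "%.0f\n", k / (2 * u)      # T = K / (2u)
    }'
}
```

Halving the rate doubles the estimated age, so a rate suited to the species under study changes the reported ages proportionally.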

Bug fix and enhancements

  1. check RepeatMasker results in intermediate steps to accommodate situations when no repeat is found.
  2. add more aliases to the Sequence Ontology list and partially solve #151 and #178.
  3. resolve the Illegal division by zero error when flanking sequences of candidate TEs are all N/X. #259

EDTA v2.0.0 - faster, better, and nicer!

26 Nov 02:22

Performance improvements

  1. Use the original LTRharvest and LTR_FINDER when --threads 1 is set. This is much faster for highly fragmented genomes (> 5,000 sequences) because fewer files are created (#225). Users may run EDTA_raw.pl for each TE type with --threads 1, then run EDTA.pl with multiple threads and --overwrite 0.
  2. Improve the filtering scheme for TE flanking sequences that are highly repetitive. If both flanking sequences are repetitive, filter out those with copy number > 50k on either side (Based on feedback from Zhigui Bao @baozg). This will avoid program suspension due to the long stretch of tandem repeats that exist in high-quality genomes.
  3. Improve and polish the filtering scheme suggested by Sergei Ryazansky @DrHogart (#136).
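The strategy suggested in item 1 (raw runs per TE type with a single thread, followed by a multi-threaded EDTA.pl run that reuses them) can be sketched as a dry-run helper that only prints the commands. The function name `print_raw_then_final` is hypothetical, and the TE types passed to `--type` are an assumption; check the EDTA_raw.pl help for the values your version accepts.

```shell
# Print the commands for the "raw per TE type, then combine" strategy.
# Usage: print_raw_then_final <genome> <threads>
print_raw_then_final() {
    genome=$1
    threads=$2
    # TE types passed to --type are an assumption; check the EDTA_raw.pl help.
    for type in ltr tir helitron; do
        echo "EDTA_raw.pl --genome $genome --type $type --threads 1"
    done
    # --overwrite 0 asks EDTA.pl to keep existing raw results instead of rerunning.
    echo "EDTA.pl --genome $genome --overwrite 0 --threads $threads"
}
```

Printing rather than executing keeps the sketch safe to try and makes it easy to submit each raw run as its own job.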

New features

  1. change the longest sequence ID limit from 15 to 13 characters to allow sequences > 100 Mb (#239).
  2. support renaming LTR sequences that RepeatModeler reports via --sensitive 1 (#184).
  3. support renaming TEsorter libraries (#184).
  4. cleanup_nested.pl: added the -clean option to allow for cleaning or not cleaning nested sequences.
  5. get_consistent_TE.pl: a new script that helps find TEs that are consistently annotated in a genome.
  6. add more specific guides for EDTA usage installed via conda (#208).
  7. rename and save the existing.EDTA.intact.fa.out file when using the parameter --overwrite 0.
  8. Updated EDTA_processI.pl and TE_purifier.pl: redirect RepeatMasker error messages to STDERR, as suggested by Nathalie de Vries.
  9. make_panTElib.pl: a matured script that helps to create a pan-genome TE library for pan-genome TE annotations. A documented usage example (with great details) can be found here: https://github.com/HuffordLab/NAM-genomes/tree/master/te-annotation
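Given the 13-character limit on sequence IDs in item 1, it can help to check a genome file before starting a run. The sketch below lists IDs that exceed a limit. The function name `long_seq_ids` is hypothetical, and treating the ID as the header text up to the first whitespace is an assumption.

```shell
# List FASTA sequence IDs longer than a given limit (default 13 characters).
# Usage: long_seq_ids <fasta> [max_len]
long_seq_ids() {
    awk -v max="${2:-13}" '
        /^>/ {
            id = substr($1, 2)    # drop ">" and keep text up to the first whitespace
            if (length(id) > max) print id
        }' "$1"
}
```

An empty result means every ID fits within the limit; any printed ID would need to be shortened before annotation.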

Issues fixed

  1. Resolve classification inconsistency when --curatedlib is provided
    1. Added new entries and aliases to the TE SO database (#219).
    2. Format sequence IDs for library files provided via --curatedlib to use the TE SO system (#220).
    3. check TIR classification discrepancy between candidate seq and lib seq with TE_SO name conversion.
  2. Resolve singularity warnings by adding "LC_ALL=C" and author info to the Dockerfile (#122).
  3. Fix #150 when flanking sequence is empty.
  4. Fixed typos in EDTA.pl and EDTA_processI.pl reported by Nathalie de Vries.

Note

If your run was successful with version 1.9.4+ and you didn't notice any particular errors, you may not need to rerun it with 2.0.0. The core filtering algorithms are not very different between these versions.

More (easy) ways to install EDTA

14 Jan 17:13

Make installation easier and quicker

Installation of EDTA has been troublesome for some users (#137, #140, #146, etc.). Here I add a couple more ways to install it across all platforms.

  1. The default and recommended way is changed to use the EDTA.yml file, which freezes all dependency versions. If it works for me, it should also work for you.
  2. Provide new docker/singularity containers that work for the current version (v1.9.x) and hopefully future versions.
  3. Provide the docker container for users to build their own container.

Other improvements

  1. Tidy up the output of --evaluation.
  2. Detect and remove short tandem repeats when removing redundancies. Contributed by Sergei Ryazansky (#136).
  3. Other small improvements that make EDTA better and better!

New Docker image

04 Dec 15:08

As suggested by @eburgueno (#122, #125), the Docker version of EDTA is switched to the Biocontainers' Quay.io version with a couple of fixes contributed by @Juke34 and @philippbayer (#121, #122). I think this version of the Docker image should run OK, and this release will help me confirm that.

Faster and Better

29 Oct 15:28

Major updates

  1. parallelize LTRharvest. The code was adapted from LTR_FINDER_parallel and provided by @wild-joker on LTR_HARVEST_parallel. I made some slight modifications to it, and the modified version is also available.
  2. fix a number of bugs for processing input CDS files.
  3. add a 1-MB toy genome for testing purposes.

Formatting standard GFF3 output and more.

28 Aug 23:41

Major updates

  1. Format the GFF3 output following the standard specifications.
    1.1. Add common TEs to the Sequence Ontology database.
    1.2. Create an alias file to convert different TE naming systems to the Sequence Ontology names.
  2. Improve the TE summary (*.mod.EDTA.TEanno.sum) by splitting overlapping TEs and forcing each bp to be annotated only once. Splitting rule (retaining preference): 1. Structural > homology; 2. Longer > shorter; 3. Nested inner > outer. (i.e., #98)
    The split GFF3 file is located here if you want to replace the default one: *mod.EDTA.anno/*.mod.EDTA.TEanno.split.gff3.
  3. Add a script (make_panTElib.pl) to construct a pan-genome TE library from a list of TE libraries. This is a beta function.
    Usage: perl make_panTElib.pl -liblist TElib.list [options]
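Since the split GFF3 file in item 2 is meant to annotate each bp only once, a quick sanity check is to count features that overlap an earlier feature on the same sequence. The sketch below does only that check; it does not implement the splitting rules themselves. The function name `count_gff3_overlaps` is hypothetical.

```shell
# Count features that overlap an earlier feature on the same sequence.
# Usage: count_gff3_overlaps <gff3>
count_gff3_overlaps() {
    tab=$(printf '\t')
    # Drop comment lines, then sort by sequence ID and numeric start coordinate.
    grep -v '^#' "$1" | sort -t "$tab" -k1,1 -k4,4n | awk -F'\t' '
        NF >= 5 {
            if ($1 == seq && $4 <= end) n++         # starts inside covered region
            if ($1 != seq) { seq = $1; end = 0 }    # new sequence: reset coverage
            if ($5 > end) end = $5                  # extend the covered region
        }
        END { print n + 0 }'
}
```

A count of 0 on the split file is consistent with each bp being annotated once, while the default GFF3 may report a higher number because overlapping annotations are retained there.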

Minor updates

  1. Detect SSRs in flanking sequences and label candidates as false. This can significantly accelerate the TIR and Helitron identification when SSRs are rich in the genome (i.e., #93 #96).
  2. Recover structurally intact Helitrons from the negative strand.
  3. Allow users to provide the path to dependencies.

How to

How to update old annotations to the current version?

  1. Backup old results, because the update will overwrite existing results (.gff3, .sum, etc).
  2. Navigate to the root of the working directory that contains EDTA working folders (i.e., .raw, combine, final, anno).
  3. Execute the patch script by providing the genome name (e.g., genome.fa):
    perl ..../EDTA/util/patch_1.8.3_to_1.9.0.pl genome.fa [threads]
  4. Check out the updated gff3 and summary results in the working directory.
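Before running the patch in step 3, it may be worth confirming that the current directory really contains the working folders mentioned in step 2. The sketch below prints any that are missing. The function name `missing_edta_dirs` is hypothetical, and the folder naming scheme (`<genome>.mod.EDTA.<stage>`) is an assumption based on the `*mod.EDTA.anno` path mentioned in these notes.

```shell
# Report which expected EDTA working folders are missing for a genome.
# Usage: missing_edta_dirs <genome_file_name> [workdir]
missing_edta_dirs() {
    genome=$1
    workdir=${2:-.}
    # Folder naming (<genome>.mod.EDTA.<stage>) is an assumption.
    for stage in raw combine final anno; do
        dir="$workdir/$genome.mod.EDTA.$stage"
        [ -d "$dir" ] || echo "$dir"
    done
}
```

If `missing_edta_dirs genome.fa` prints nothing, all four folders were found and the patch script can be run from this directory.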

Many updates

04 Apr 23:01
07c90c2

Bugfix

  1. Correct genome sequence number in the TEanno.sum file #73
  2. Replace RepeatClassifier with TEsorter for RepeatModeler result classification #72 #58

Improvement

  1. Remove excessive TE fragments in intact TEs #76
  2. Add identity info for homology-based annotation
  3. Improve --rmout functionality
  4. Update README for installations and usages #64

Reporting status

  1. Report finishing time for raw/TIR #77
  2. Add warnings when certain TE classes are lacking #75

v1.8.2

28 Feb 22:44

Update usage and installation instructions; fix a couple of minor bugs.