Querying genomes for evidence of Programmed DNA Elimination

bricoletc/delfies

License: MIT

delfies is a tool for the detection of DNA breakpoints with de-novo telomere addition.

It identifies genomic locations where double-strand breaks have occurred followed by telomere addition. It was initially designed and validated for studying the process of Programmed DNA Elimination in nematodes, but should work for other clades and applications too.

Getting started

delfies takes as input a genome FASTA file (gzip-compressed files are supported) and an indexed SAM/BAM of sequencing reads aligned to that genome.

delfies --help
samtools index <aligned_reads>.bam
delfies <genome>.fa.gz <aligned_reads>.bam <output_dir>
cat <output_dir>/breakpoint_locations.bed

For how to obtain a suitable SAM/BAM, see the "Input data" section. For downloading a real genome and BAMs for a test run of delfies, see the "Test run with real data" section.

Installation

Using pip (or equivalent - poetry, etc.):

# Install latest release from PyPI
pip install delfies

# Or install a specific release from PyPI:
pip install delfies==0.7.0

# Or clone and install tip of main
git clone https://github.com/bricoletc/delfies/
pip install ./delfies

Input data

Sequencing technologies

delfies is designed to work with both Illumina short reads and ONT or PacBio long reads. Long reads are better for finding breakpoints in more repetitive regions of the genome. A high fraction of sequenced bases with a quality >Q20 is desirable (e.g. >70%). I found delfies worked on recent data from all three sequencing technologies: see test run below.
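
To check the quality guideline above on your own reads, here is a minimal sketch that computes the fraction of bases at or above Q20 from FASTQ quality strings. It is not part of delfies. It assumes Phred+33 encoding and four-line FASTQ records, and it treats Q20 as an inclusive threshold; the function names are illustrative only.

```python
def fraction_at_or_above_q(quality_string, threshold=20, offset=33):
    """Fraction of bases in one quality string with Phred score >= threshold.

    Assumes Phred+33 encoding (offset=33), as used by current Illumina,
    ONT and PacBio FASTQ files.
    """
    if not quality_string:
        return 0.0
    passing = sum((ord(ch) - offset) >= threshold for ch in quality_string)
    return passing / len(quality_string)


def fastq_fraction_at_or_above_q(lines, threshold=20, offset=33):
    """Aggregate the same fraction over a whole FASTQ file.

    `lines` is any iterable of lines (e.g. an open file handle).
    Assumes strict four-line records: header, sequence, '+', qualities.
    """
    total = 0
    passing = 0
    for index, line in enumerate(lines):
        if index % 4 == 3:  # the fourth line of each record holds qualities
            qualities = line.rstrip("\n")
            total += len(qualities)
            passing += sum((ord(ch) - offset) >= threshold for ch in qualities)
    return passing / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical two-read FASTQ, for illustration only
    example = [
        "@read1\n", "ACGT\n", "+\n", "IIII\n",  # 'I' encodes Q40
        "@read2\n", "ACGT\n", "+\n", "!!!!\n",  # '!' encodes Q0
    ]
    print(fastq_fraction_at_or_above_q(example))
```

For a gzipped FASTQ, the same function can be fed `gzip.open(path, "rt")` instead of a list of lines.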

Aligners

To produce a SAM/BAM with which you can find breakpoints, you need to use a read aligner that reports soft clips (parts of a read that are not aligned to the reference). Both bowtie2 (in --local mode) and minimap2 (by default) do this. Use minimap2 for long reads (>300bp), with the appropriate preset (e.g. -x map-ont for Nanopore data).
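
To make the notion of soft clips concrete, the sketch below reads a SAM CIGAR string and reports how many bases are soft-clipped at each end of a read. It is an illustration of the SAM format, not code from delfies, and the function name is hypothetical. A quick way to check that your aligner reports soft clips is to confirm that some reads in your alignment have a non-zero result.

```python
import re
from typing import Tuple

# One CIGAR operation: a length followed by an operation code (SAM specification)
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar: str) -> Tuple[int, int]:
    """Return (left, right) soft-clip lengths for a CIGAR string.

    Hard clips ('H') may flank soft clips, so they are dropped first.
    An unavailable CIGAR ('*') yields (0, 0).
    """
    operations = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    operations = [(length, op) for length, op in operations if op != "H"]
    if not operations:
        return (0, 0)
    left = operations[0][0] if operations[0][1] == "S" else 0
    # Only count a right clip if it is a different operation from the left one
    right = 0
    if len(operations) > 1 and operations[-1][1] == "S":
        right = operations[-1][0]
    return (left, right)


if __name__ == "__main__":
    for cigar in ["100M", "12S88M", "80M20S", "3H7S90M"]:
        print(cigar, soft_clip_lengths(cigar))
```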

Test run with real data

I provide a processed subset of publicly-available data here: https://doi.org/10.5281/zenodo.14101797.

The data consist of a 2kbp region of the assembled genome of Oscheius onirici and three alignment BAMs from sequencing data produced using Illumina, ONT and PacBio. The data were aligned to the 2kbp region using minimap2. See the Zenodo link for details on the sequencing data (read lengths, error rates) and public links to the raw data.

You can run delfies on the inputs in this archive to make sure it is properly installed and produces the expected outputs:

wget https://zenodo.org/records/14101798/files/delfies_zenodo_test_data.tar.gz
tar xf delfies_zenodo_test_data.tar.gz
# Run delfies here
# Compare with the expected outputs:
find delfies_zenodo_test_data -name "*breakpoint_locations.bed" | xargs cat

User Manual

CLI options

delfies --help
  • Do use the --threads option if you have multiple cores/CPUs available.
  • [Breakpoints]
    • There are two types of breakpoints: see detailed docs.
    • Nearby breakpoints can be clustered together to account for variability in breakpoint location (--clustering_threshold).
  • [Region selection]: You can select a specific region to focus on, specified as a string or as a BED file.
  • [Telomeres]
    • Specify the telomere sequence for your organism using --telo_forward_seq. If you're unsure, I recommend the tool telomeric-identifier for finding out.
  • [Aligned reads]
    • To analyse confidently-aligned reads only, you can filter reads by MAPQ (--min_mapq) and by bitwise flag (--read_filter_flag).
    • You can tolerate more or fewer mutations in the assembly telomeres (and in the sequencing reads) using --telo_max_edit_distance and --telo_array_size.
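
To illustrate what clustering nearby breakpoints means, here is a minimal sketch that groups sorted positions whenever consecutive positions lie within a threshold of each other. This is a generic single-linkage approach and only an assumption about how a threshold like --clustering_threshold could behave; the actual delfies implementation may differ, and the function name is hypothetical.

```python
def cluster_positions(positions, threshold):
    """Group positions so that consecutive members are at most `threshold` apart.

    Positions are sorted first; a new cluster starts whenever the gap to the
    previous position exceeds the threshold.
    """
    clusters = []
    for position in sorted(positions):
        if clusters and position - clusters[-1][-1] <= threshold:
            clusters[-1].append(position)
        else:
            clusters.append([position])
    return clusters


if __name__ == "__main__":
    # Hypothetical breakpoint positions on a single contig
    candidate_positions = [105, 100, 300, 102, 301]
    for cluster in cluster_positions(candidate_positions, threshold=5):
        print(min(cluster), max(cluster), len(cluster))
```

With a larger threshold, more positions merge into a single reported breakpoint; with a threshold of 0, only identical positions are grouped.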

Outputs

The two main outputs of delfies are:

  • breakpoint_locations.bed: a BED-formatted file containing the location of identified elimination breakpoints.
  • breakpoint_sequences.fasta: a FASTA-formatted file containing the sequences of identified elimination breakpoints.
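
For downstream scripting, the BED output can be read with a few lines of Python. The sketch below parses generic BED lines (tab-separated chrom, start, end, then optional name, score and strand). Which optional columns delfies writes, and what they contain, is not assumed here: fields that are absent are returned as None, and the example lines are made up for illustration.

```python
from typing import List, NamedTuple, Optional


class BedRecord(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # 0-based, exclusive (BED convention)
    name: Optional[str]
    score: Optional[str]
    strand: Optional[str]


def parse_bed_line(line: str) -> BedRecord:
    """Parse one tab-separated BED line with at least three columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"Expected at least 3 BED columns, got {len(fields)}")
    optional = fields[3:6]
    optional += [None] * (3 - len(optional))  # pad missing optional columns
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), *optional)


def read_bed(lines) -> List[BedRecord]:
    """Parse an iterable of lines, skipping blank, comment and header lines."""
    records = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        records.append(parse_bed_line(line))
    return records


if __name__ == "__main__":
    # Hypothetical BED content, not real delfies output
    example = [
        "# comment line\n",
        "contig_1\t1000\t1001\tbreakpoint_a\t12\t+\n",
        "contig_2\t50\t51\n",
    ]
    for record in read_bed(example):
        print(record)
```

To use it on real output, pass an open file handle, e.g. `read_bed(open("<output_dir>/breakpoint_locations.bed"))`.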

Validating breakpoints

I highly recommend visualising your results! E.g., by loading your input fasta and BAM, together with the delfies output breakpoint_locations.bed, in IGV.

Confident/true breakpoints will typically have:

  • Good read support. Note that breakpoints are ordered by read support in the delfies output file breakpoint_locations.bed, and you can require a minimum number of supporting reads using the CLI option --min_supporting_reads.
  • A difference in read coverage before and after the breakpoint. The nature of this difference depends on the ratio between cells with and without the breakpoint. As an example, in organisms that eliminate parts of their genome in the soma, if most sequenced cells are from the soma, expect more reads before the breakpoint than after it ('before' and 'after' defined relative to the reported breakpoint strand).
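
The coverage criterion above can be checked numerically as well as visually. The sketch below compares mean read depth in a window before and after a breakpoint, given per-base depths for one contig (for example, as parsed from `samtools depth -a` output). It is an illustration rather than part of delfies. It assumes that, for a '+' strand breakpoint, 'before' means lower coordinates and 'after' means higher coordinates, with the reverse for '-'; check this against the detailed docs before relying on it. The function name and window size are hypothetical.

```python
import math


def coverage_ratio(depths, breakpoint, strand="+", window=100):
    """Ratio of mean depth before vs after a breakpoint.

    depths: list of per-base read depths for one contig (0-based index).
    breakpoint: 0-based index of the first base on the higher-coordinate side.
    strand: '+' or '-'; assumed to set which side counts as 'before'.
    Returns math.inf if the 'after' window has zero mean depth.
    """
    lower_side = depths[max(0, breakpoint - window):breakpoint]
    higher_side = depths[breakpoint:breakpoint + window]
    if strand == "-":
        before, after = higher_side, lower_side
    else:
        before, after = lower_side, higher_side

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    mean_after = mean(after)
    if mean_after == 0:
        return math.inf
    return mean(before) / mean_after


if __name__ == "__main__":
    # Hypothetical depths: 30x before the breakpoint, 10x after it
    depths = [30] * 100 + [10] * 100
    print(coverage_ratio(depths, breakpoint=100, strand="+"))
```

A ratio well above 1 matches the soma-dominated scenario described above; a ratio close to 1 suggests the breakpoint deserves a closer look.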

Ultimately though, only biological experiments can truly validate identified breakpoints.

Applications

  • The fasta output enables looking for sequence motifs that occur at breakpoints, e.g. using MEME.
  • The BED output enables classifying a genome into retained and eliminated regions. The 'strand' of breakpoints is especially useful for this: see detailed docs.
  • The BED output also enables assembling past somatic telomeres: for how to do this, see detailed docs.

Detailed documentation

For more details on delfies, including outputs and applications, see detailed_docs.

Contributing

Contributions are always welcome!

Please see CONTRIBUTING.md for how (reporting issues, requesting features, contributing code).