TreeValGal workflow

May 2: Core TreeValGal components

See the main README.md for the latest updates to the workflow and subworkflows.

The TreeValGal workflow is complex and overwhelming at first glance, so the major components, including 3 subworkflows, are described here.

The JBrowse2 tool on the right side is the most prominent workflow element. It produces an integrated, interactive JBrowse2 browser configured with 15 tracks. Some tracks (the paf, gaps and telomeres) are turned off in the default view to reduce clutter when the browser first opens. All tracks can be turned on and off using the JBrowse2 track menu, so the user can focus on tracks of specific interest without distraction.

1. Workflow components reporting on the fasta reference.

These use no external data other than the PacBio reads (fastqsanger input), and they produce four JBrowse2 tracks for display:

  • Repeatmasker repeats as a GFF3 track
  • Windowmasker repeats as a bigwig track
  • Gaps as a bigwig track
  • PacBio depth of coverage as a bigwig track

image

The top row of tools and data flows runs gfastats on the reference fasta to report sequence gaps as a bigwig track. The second row prepares a file of window start/end coordinates, used to count features for the gaps, coverage and repeats bigwig tracks. The third row runs the model-free Windowmasker repeat finder, and the fourth uses Minimap2 to map PacBio reads to the reference for coverage. The bamcoverage tool creates a bigwig directly from the mapped bam file.
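As a rough illustration of what a gaps track contains, the sketch below reports runs of N bases in a sequence as BED-style intervals. It is not how gfastats works internally; the function name `find_gaps` and the toy sequence are assumptions made only for this example.

```python
import re
from typing import Iterator, Tuple


def find_gaps(name: str, sequence: str) -> Iterator[Tuple[str, int, int]]:
    """Yield (name, start, end) for each run of N or n.

    Coordinates are 0-based and half-open, as in BED.
    """
    for match in re.finditer(r"[Nn]+", sequence):
        yield name, match.start(), match.end()


# Toy contig (hypothetical data) with two gaps.
gaps = list(find_gaps("contig1", "ACGTNNNNACGTnnACGT"))
for contig, start, end in gaps:
    print(f"{contig}\t{start}\t{end}")
```

Intervals like these are what the windowing step later summarises into a bigwig track.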

The repeatmasker tool, seen at the bottom of the workflow section shown above, creates a GFF3 track of repeats. A DFam taxonId can be provided to override the default (Homo sapiens) taxon. The masked fasta output is retained as input for mashmap when creating sequence similarity paf files, to minimise noise from uninformative low-complexity sequence matches.

Two bigwig tracks (gaps and windowmasker repeats) use a specialised subworkflow to count depth of coverage over the generated window start/end coordinates, based on the user-supplied window size (default 100bp).
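The windowing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the subworkflow's actual tools: the function names `make_windows` and `covered_bases`, the toy contig length and the feature intervals are all hypothetical, and features are assumed to be already merged so they do not overlap each other.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 0-based half-open


def make_windows(lengths: Dict[str, int], window: int = 100) -> List[Interval]:
    """Tile each contig with fixed-size windows; the last one may be shorter."""
    windows = []
    for name, length in lengths.items():
        for start in range(0, length, window):
            windows.append((name, start, min(start + window, length)))
    return windows


def covered_bases(
    windows: List[Interval], features: List[Interval]
) -> List[Tuple[str, int, int, int]]:
    """Return bedGraph-like rows: window plus the number of feature bases in it."""
    rows = []
    for name, start, end in windows:
        total = 0
        for f_name, f_start, f_end in features:
            if f_name != name:
                continue
            overlap = min(end, f_end) - max(start, f_start)
            if overlap > 0:
                total += overlap
        rows.append((name, start, end, total))
    return rows


# Hypothetical 250bp contig with two repeat features.
windows = make_windows({"contig1": 250}, window=100)
rows = covered_bases(windows, [("contig1", 90, 120), ("contig1", 210, 220)])
for row in rows:
    print("\t".join(str(value) for value in row))
```

Rows of this shape (contig, start, end, value) are what a bedGraph to bigwig conversion expects. A smaller window gives finer resolution at the cost of a larger track.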

TreeValGal bed to bigwig subworkflow

2. Workflow components reporting external fasta annotation data.

The middle segment of the main TreeValGal workflow canvas is responsible for mapping NCBI or other external annotation fasta files to create JBrowse2 GFF3 and bed tracks.

image

The subworkflow for these tracks is described in detail here

3. Workflow components preparing pairwise mapping format (PAF) tracks for haplotype self-comparison and closely related species sequence synteny or similarity.

image

This process runs the mashmap tool several times with a range of settings in a third subworkflow, giving the user a choice of outputs depending on how closely related the provided genomes are.
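For readers who want to inspect the paf outputs outside JBrowse2, the sketch below parses the 12 mandatory tab-separated columns of the standard PAF layout. Whether the mashmap output in this workflow matches that layout exactly is an assumption, and the names `PafRecord`, `parse_paf_line` and `identity`, as well as the example line, are hypothetical.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int
    block_length: int
    mapq: int


def parse_paf_line(line: str) -> PafRecord:
    """Parse the 12 mandatory PAF columns; optional tags after them are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 columns, found {len(fields)}")
    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
        fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]),
    )


def identity(record: PafRecord) -> float:
    """Fraction of matching bases over the alignment block length."""
    if record.block_length == 0:
        return 0.0
    return record.matches / record.block_length


# Made-up record for illustration only.
example = parse_paf_line("hapA\t1000\t0\t500\t+\thapB\t1200\t100\t600\t450\t500\t60\n")
print(example.query_name, example.target_name, identity(example))
```

Filtering records by `identity` or block length is one simple way to compare the outputs of the different mashmap settings.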

image

Like the gene_alignment equivalent, this workflow uses pick workflow logic components that make all the mashmap steps optional, since the mashmap tool does not natively allow optional inputs.

April 23 update

The latest TreeValGal workflow is on the EU server at https://usegalaxy.eu/published/workflow?id=2e93613c688ea52e

February 21 update

The TreeValGal workflow depends on JBrowse2, so it is currently only available on usegalaxy.eu for testing.

Hummingbird sample output and Amphioxus fish sample outputs are available.

These JBrowse configurations now include repeatmasker GFF tracks from the latest Feb_11 revision, which has only a couple of small subworkflows: for making wiggles, and for the optional hic and paf tracks.

image

The wiggle maker is the most complicated subworkflow and is used for 3 tracks.

image

The optional synteny and hic track subworkflows are relatively trivial...

January 2024 update

This workflow integrates tracks from prototype TreeVal subworkflows into a single JBrowse2 configuration, ready to view, share and download.

Warning

The compressed archives can be very big, because they contain the reference sequences and tracks, albeit compressed and indexed.

Testing results, suggestions and contributions are very welcome.

Using the hummingbird test data from Anna Syme on the EU server, it produces this live JBrowse2 instance. Nadolina's Lancet fish shown below was chosen as the synteny genome.

Only a solitary telomere. Does the bird need a different telomere repeat?

Fixes for the uninformative wiggle and paf track names will be live on EU shortly....

Sample image after manually adding a dot plot: image

Clicking on a syntenic feature shows the details of the match and the fish sequence if wanted:

image

Nadolina's Lancet fish - same workflow:

image

This time, the detailed view of a syntenic region shows a syntenic segment of bird sequence: image

The main treevalgal workflow and the subworkflows it calls are shared on EU as treevalgal_jan27 from fubar:

image

December 15 prototype

With thanks to Bjoern Gruening and Anna Syme for help with testing and tools, and the support of Galaxy Australia, the current (December 15) version combines the two gap tracks, the repeats track and the coverage track for the TreeVal small sample test data.

Note that there are only 3 gaps in the entire test reference, so they are hard to find. The pacbio sample only covers part of a single contig, so select ENA|OV656687|OV656687.1 for display; otherwise there will be no coverage to see in this tiny sample demonstration.

A demonstration history with the viewable preconfigured JBrowse is shared here on usegalaxy.eu, and so is the prototype workflow, so please try it on your own pacbio/refseq data.

Sample images show how JBrowse does all the work of density and other displays based on the zoom level. All tracks are also in the history as bed files if the user wants them for downstream analyses.

Zoomed out to show windowed bar charts:

image

Zoomed midway to show individual features:

image

Zoomed in to base level:

image