Development: Proposal: partition by MDA

TL;DR

Proposal: partition starting clusters by MDA (multi-domain architecture) and run two rounds of GeMMA / FunFHMMER (lots of smaller, parallel processes rather than one big, sequential one):

create starting clusters of sequences within the same MDA
run GeMMA / FunFHMMER to create mini MDA-alignments
use these MDA-alignments as starting clusters to...
run GeMMA / FunFHMMER to create FunFam alignments

The following are notes from a meeting with Ian, Christine and Sayoni (25/09/2018):

Background:

we only have 38 v4.2 superfamilies left to run through GeMMA
however, these 38 sfams have >= 2600 starting clusters
it looks like this may take a very, very long time
we were planning on using structure to partition these starting clusters
we're not yet sure how well this will work
there are also potentially some issues with poor quality (gappy) alignments
we might have an alternative that addresses both...

Plan:

partition initial starting clusters according to MDA
run GeMMA to get full tree within each MDA partition
run FunFHMMER to cut tree and get alignments
use these MDA-based alignments as new starting clusters...
run GeMMA to get full tree
run FunFHMMER to get full FunFam alignments

So, we run two rounds of GeMMA / FunFHMMER: the results of the first round provide the starting clusters for the second.
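
As a rough illustration of the plan above, here is a minimal Python sketch of the two-round flow. It is not the GeMMA or FunFHMMER code: `cluster_fn` is a hypothetical stand-in for one "GeMMA tree + FunFHMMER cut" run, the function names are invented for this example, and the MDA strings (CATH superfamily ids joined by hyphens) are only an assumed format.

```python
from collections import defaultdict


def partition_by_mda(cluster_to_mda):
    """Group starting-cluster ids by their MDA string."""
    partitions = defaultdict(list)
    for cluster_id, mda in sorted(cluster_to_mda.items()):
        partitions[mda].append(cluster_id)
    return dict(partitions)


def run_two_rounds(cluster_to_mda, cluster_fn):
    """Run cluster_fn once per MDA partition, then once on the pooled results.

    cluster_fn is a placeholder for a GeMMA / FunFHMMER run: it takes a list
    of starting clusters and returns a list of merged clusters (alignments).
    """
    round1_results = []
    for _mda, clusters in sorted(partition_by_mda(cluster_to_mda).items()):
        # Round 1: each partition is independent, so these could run in parallel.
        round1_results.extend(cluster_fn(clusters))
    # Round 2: the round 1 alignments become the new starting clusters.
    return round1_results, cluster_fn(round1_results)


# Hypothetical input: starting cluster id -> MDA string.
cluster_to_mda = {
    "c1": "3.40.50.300",
    "c2": "3.40.50.300",
    "c3": "3.40.50.300-1.10.8.10",
}


def merge_all(clusters):
    """Toy stand-in that merges everything it is given into one cluster."""
    return ["+".join(clusters)]


partitions = partition_by_mda(cluster_to_mda)
round1, round2 = run_two_rounds(cluster_to_mda, merge_all)
print(partitions)
print(round1, round2)
```

The point of the sketch is that the second round needs no special handling: it is the same call as the first, applied to the first round's output.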

Pros:

since the GeMMA code is deliberately agnostic, this is more a data management task than a research task (shouldn't require a huge amount of new code)
there's a good chance that adding MDA information to the initial clustering steps might actually improve the quality of the alignments
this takes advantage of the fact that GeMMA already works really well in parallel

Cons:

there might still be some large sets of starting clusters after the first round
can't think of anything else

Proposed work:

brief analysis of MDA partitions in large superfamilies:

largest number of starting clusters within a single MDA?
do existing CDHit-96 clusters contain more than one MDA? (if so, is that a problem?)
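
Both questions could be answered with a small script along the following lines. This is only a sketch: it assumes we can produce one `(sequence_id, cluster_id, mda)` record per sequence, and the record layout, function name and example ids are all made up for illustration.

```python
from collections import defaultdict


def summarise_mda_partitions(records):
    """Summarise (sequence_id, cluster_id, mda) records.

    Returns the number of starting clusters per MDA, the MDA with the most
    starting clusters, and any clusters that contain more than one MDA.
    """
    clusters_per_mda = defaultdict(set)
    mdas_per_cluster = defaultdict(set)
    for _seq_id, cluster_id, mda in records:
        clusters_per_mda[mda].add(cluster_id)
        mdas_per_cluster[cluster_id].add(mda)

    counts = {mda: len(clusters) for mda, clusters in clusters_per_mda.items()}
    largest = max(counts, key=counts.get, default=None)
    mixed = {c: sorted(m) for c, m in mdas_per_cluster.items() if len(m) > 1}
    return {"clusters_per_mda": counts, "largest_mda": largest, "mixed_clusters": mixed}


# Hypothetical records: cluster c2 contains sequences with two different MDAs.
records = [
    ("seq1", "c1", "3.40.50.300"),
    ("seq2", "c1", "3.40.50.300"),
    ("seq3", "c2", "3.40.50.300"),
    ("seq4", "c2", "3.40.50.300-1.10.8.10"),
    ("seq5", "c3", "3.40.50.300"),
]

report = summarise_mda_partitions(records)
print(report)
```

Note that a mixed cluster is counted under each of its MDAs here; whether such clusters should be split or assigned to a single MDA is exactly the open question above.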

generate starting clusters based on MDA string (or use MDA to partition existing CDHit-95 clusters)
update existing scripts and docs to take into account MDA partitions
incorporate FunFHMMER code as an HPC step? (optional)
write script to manage different project layout (round1, round2)

for a given project dir: how many MDA partitions? how many starting clusters? how many alignments left to run?
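
A sketch of what such a status script might look like is below. The directory layout (`round1/<mda>/starting_clusters/*.faa` and `round1/<mda>/alignments/*.aln`) is purely hypothetical, since the real layout has not been decided, and "left to run" is approximated as "partitions with no alignment files yet".

```python
import tempfile
from pathlib import Path


def count_files(directory, pattern):
    """Count files matching pattern; a missing directory counts as zero."""
    return len(list(directory.glob(pattern))) if directory.is_dir() else 0


def summarise_round(round_dir):
    """Summarise one round of a (hypothetical) project layout."""
    partitions = sorted(p for p in Path(round_dir).iterdir() if p.is_dir())
    n_clusters = 0
    pending = []
    for mda_dir in partitions:
        n_clusters += count_files(mda_dir / "starting_clusters", "*.faa")
        # Assumption: a partition with no alignment files has not been run yet.
        if count_files(mda_dir / "alignments", "*.aln") == 0:
            pending.append(mda_dir.name)
    return {
        "mda_partitions": len(partitions),
        "starting_clusters": n_clusters,
        "partitions_without_alignments": pending,
    }


# Build a tiny throwaway project to show the shape of the output.
with tempfile.TemporaryDirectory() as tmp:
    round1 = Path(tmp) / "round1"
    for mda, n_start, n_aln in [("mda_a", 2, 1), ("mda_b", 1, 0)]:
        (round1 / mda / "starting_clusters").mkdir(parents=True)
        (round1 / mda / "alignments").mkdir()
        for i in range(n_start):
            (round1 / mda / "starting_clusters" / f"{i}.faa").touch()
        for i in range(n_aln):
            (round1 / mda / "alignments" / f"{i}.aln").touch()
    summary = summarise_round(round1)

print(summary)
```

The same function could be pointed at a `round2` directory, provided both rounds share the same layout.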

Other thoughts:

for the first round of GeMMA, it might be worth stopping the tree-building process if we know that FunFHMMER will cut anyway (evalue cutoff? running FunFHMMER after each merge?)
