Development: Proposal: partition by MDA
Ian Sillitoe edited this page Sep 25, 2018 · 1 revision
TL;DR
Proposal: partition starting clusters by MDA (multi-domain architecture) and run two rounds of GeMMA / FunFHMMER (lots of smaller, parallel processes rather than one big, sequential one):
- create starting clusters of sequences within the same MDA
- run GeMMA / FunFHMMER to create mini MDA-alignments
- use these MDA-alignments as starting clusters to...
- run GeMMA / FunFHMMER to create FunFam alignments
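The first step above (grouping sequences by MDA) could be sketched roughly as follows. This is only an illustration: the function name `partition_by_mda`, the input mapping and the example MDA labels are all assumptions, not part of the existing GeMMA code.

```python
from collections import defaultdict
from typing import Dict, List


def partition_by_mda(seq_to_mda: Dict[str, str]) -> Dict[str, List[str]]:
    """Group sequence ids by their MDA string.

    `seq_to_mda` is a hypothetical mapping of sequence id -> MDA string;
    the real format of the MDA string is not assumed here.
    """
    partitions: Dict[str, List[str]] = defaultdict(list)
    # Sort so the output is deterministic regardless of input order.
    for seq_id, mda in sorted(seq_to_mda.items()):
        partitions[mda].append(seq_id)
    return dict(partitions)


# Made-up example data: labels "MDA_A" / "MDA_B" are placeholders.
example = {"seq3": "MDA_B", "seq1": "MDA_A", "seq2": "MDA_A"}
partitions = partition_by_mda(example)
```

Each value in `partitions` would then be the input for one independent first-round run.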
The following are notes from a meeting with Ian, Christine and Sayoni (25/09/2018):
- we only have 38 v4.2 superfamilies left to run through GeMMA
- however, these 38 sfams have >= 2600 starting clusters
- it looks like this may take a very, very long time
- we were planning on using structure to partition these starting clusters
- we're not yet sure how well this will work
- there are also potentially some issues with poor quality (gappy) alignments
- we might have an alternative that addresses both...
- partition initial starting clusters according to MDA
- run GeMMA to get full tree within each MDA partition
- run FunFHMMER to cut tree and get alignments
- use these MDA-based alignments as new starting clusters...
- run GeMMA to get full tree
- run FunFHMMER to get full FunFam alignments
So, we run two rounds of GeMMA / FunFHMMER: the results of the first round provide the starting clusters for the second.
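A minimal sketch of how the two rounds fit together, assuming the "GeMMA tree + FunFHMMER cut" step can be wrapped as a single function that takes starting clusters and returns new clusters. The names `run_two_rounds` and `cluster_fn` are hypothetical, `merge_all` is a toy stand-in rather than the real tools, and the thread pool is only a placeholder for however the jobs would actually be submitted in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Cluster = List[str]  # a cluster is a list of sequence ids
ClusterFn = Callable[[List[Cluster]], List[Cluster]]


def run_two_rounds(
    partitions: Dict[str, List[Cluster]],
    cluster_fn: ClusterFn,
    max_workers: int = 4,
) -> List[Cluster]:
    """Run `cluster_fn` per MDA partition, then once on the pooled results."""
    # Round 1: each MDA partition is processed independently (in parallel).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        round1 = list(pool.map(cluster_fn, partitions.values()))
    # The round-1 outputs become the starting clusters for round 2.
    round2_input = [cluster for result in round1 for cluster in result]
    # Round 2: a single run across all MDA partitions.
    return cluster_fn(round2_input)


def merge_all(clusters: List[Cluster]) -> List[Cluster]:
    """Toy stand-in (assumption): merge everything into one cluster."""
    return [sorted(seq for cluster in clusters for seq in cluster)]


starting = {"MDA_A": [["s1"], ["s2"]], "MDA_B": [["s3"]]}
final_clusters = run_two_rounds(starting, merge_all)
```

Because the clustering step is passed in as a function, the same driver works for both rounds without knowing anything about how the starting clusters were made.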
- since the GeMMA code is deliberately agnostic about where its starting clusters come from, this is more a data management task than a research task (shouldn't require a huge amount of new code)
- there's a good chance that adding MDA information to the initial clustering steps might actually improve the quality of the alignments
- this takes advantage of the fact that GeMMA already works really well in parallel
- it's possible that there might still be some large sets of starting clusters after the first round
- can't think of anything else
- brief analysis of MDA partitions in large superfamilies:
- largest number of starting clusters within a single MDA?
- do existing CDHit-96 clusters contain more than one MDA? (if so, is that a problem?)
- generate starting clusters based on MDA string (or use MDA to partition existing CDHit-95 clusters)
- update existing scripts and docs to take into account MDA partitions
- incorporate FunFHMMER code as an HPC step? (optional)
- write script to manage different project layout (round1, round2)
- for a given project dir: how many MDA partitions? how many starting clusters? how many alignments left to run?
- for the first round of GeMMA, it might be worth stopping the tree-building process early if we know that FunFHMMER will cut the tree anyway (E-value cutoff? running FunFHMMER after each merge?)
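The "brief analysis" questions above (largest number of starting clusters within a single MDA, and whether existing clusters contain more than one MDA) could be answered with something like the sketch below. The function name, the two input mappings and the example labels are assumptions for illustration; how cluster membership and MDA strings are actually stored is not specified here.

```python
from collections import Counter
from typing import Dict, List


def summarise_mda_partitions(
    cluster_members: Dict[str, List[str]],
    seq_to_mda: Dict[str, str],
) -> dict:
    """Summarise how starting clusters are distributed across MDAs.

    cluster_members: hypothetical mapping of cluster id -> sequence ids
    seq_to_mda:      hypothetical mapping of sequence id -> MDA string
    """
    # Set of distinct MDAs seen in each cluster.
    cluster_mdas = {
        cid: {seq_to_mda[seq] for seq in members}
        for cid, members in cluster_members.items()
    }
    # Clusters whose members span more than one MDA.
    mixed = {
        cid: sorted(mdas) for cid, mdas in cluster_mdas.items() if len(mdas) > 1
    }
    # Number of single-MDA clusters per MDA.
    counts = Counter(
        next(iter(mdas)) for mdas in cluster_mdas.values() if len(mdas) == 1
    )
    largest = counts.most_common(1)[0] if counts else None
    return {
        "clusters_per_mda": dict(counts),
        "largest_partition": largest,
        "mixed_clusters": mixed,
    }


# Made-up example data.
members = {"c1": ["s1", "s2"], "c2": ["s3"], "c3": ["s4", "s5"]}
mdas = {"s1": "MDA_A", "s2": "MDA_A", "s3": "MDA_A", "s4": "MDA_A", "s5": "MDA_B"}
summary = summarise_mda_partitions(members, mdas)
```

Mixed clusters are reported separately rather than assigned to a partition, since whether they are a problem is still an open question.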