
Development: Proposal: partition by MDA

Ian Sillitoe edited this page Sep 25, 2018 · 1 revision

TL;DR

Proposal: partition starting clusters by running two rounds of GeMMA / FunFHMMER (lots of smaller, parallel processes rather than one big, sequential one):

  1. create starting clusters of sequences within the same MDA
  2. run GeMMA / FunFHMMER to create mini MDA-alignments
  3. use these MDA-alignments as starting clusters to...
  4. run GeMMA / FunFHMMER to create FunFam alignments

The following are notes from a meeting with Ian, Christine and Sayoni (25/09/2018):

Background:

  • we only have 38 v4.2 superfamilies left to run through GeMMA
  • however, these 38 sfams have >= 2600 starting clusters
  • it looks like this may take a very, very long time
  • we were planning on using structure to partition these starting clusters
  • we're not yet sure how well this will work
  • there are also potentially some issues with poor quality (gappy) alignments
  • we might have an alternative that addresses both...

Plan:

  • partition initial starting clusters according to MDA
  • run GeMMA to get full tree within each MDA partition
  • run FunFHMMER to cut tree and get alignments
  • use these MDA-based alignments as new starting clusters...
  • run GeMMA to get full tree
  • run FunFHMMER to get full FunFam alignments

So, we run two rounds of GeMMA / FunFHMMER: the results of the first round provide the starting clusters for the second.
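As a rough illustration of the first step of the plan, the sketch below groups starting clusters by MDA string so that each group can be run through GeMMA independently. The input format (pairs of cluster id and MDA string), the function name `partition_by_mda` and the example MDA strings are all assumptions made for illustration; they do not come from the existing GeMMA or FunFHMMER code.

```python
from collections import defaultdict


def partition_by_mda(cluster_mda):
    """Group starting-cluster ids by MDA string.

    cluster_mda: iterable of (cluster_id, mda_string) pairs
                 (hypothetical input format, not an existing file layout).
    Returns a dict mapping each MDA string to a sorted list of cluster ids.
    """
    partitions = defaultdict(list)
    for cluster_id, mda in cluster_mda:
        partitions[mda].append(cluster_id)
    return {mda: sorted(ids) for mda, ids in partitions.items()}


# Hypothetical example data: cluster ids and MDA strings are made up.
example = [
    ("c1", "1.10.8.10-3.40.50.300"),
    ("c2", "3.40.50.300"),
    ("c3", "1.10.8.10-3.40.50.300"),
]

partitions = partition_by_mda(example)

# Each partition would then be a separate, independent round-1 GeMMA job.
for mda, cluster_ids in sorted(partitions.items()):
    print(mda, cluster_ids)
```

Each key in `partitions` corresponds to one round-1 job; the alignments produced from those jobs would then be pooled as the starting clusters for round 2.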

Pros:

  • since the GeMMA code is deliberately agnostic, this is more a data management task than a research task (shouldn't require a huge amount of new code)
  • there's a good chance that adding MDA information to the initial clustering steps might actually improve the quality of the alignments
  • this takes advantage of the fact that GeMMA already works really well in parallel

Cons:

  • it's possible that there might still be some large sets of starting clusters after the first round
  • can't think of anything else

Proposed work:

  1. brief analysis of MDA partitions in large superfamilies:
     • largest number of starting clusters within a single MDA?
     • do existing CDHit-96 clusters contain more than one MDA? (if so, is that a problem?)
  2. generate starting clusters based on MDA string (or use MDA to partition existing CDHit-95 clusters)
  3. update existing scripts and docs to take into account MDA partitions
  4. incorporate FunFHMMER code as an HPC step? (optional)
  5. write script to manage different project layout (round1, round2)
     • for a given project dir: how many MDA partitions? how many starting clusters? how many alignments left to run?
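The brief analysis in the first item could be sketched along the following lines. This is only a minimal sketch: it assumes we can get one row per sequence giving its starting cluster and its MDA string, and the function name `summarise_mda_partitions` and the example ids are hypothetical.

```python
from collections import Counter, defaultdict


def summarise_mda_partitions(rows):
    """Summarise how starting clusters are distributed across MDAs.

    rows: iterable of (sequence_id, cluster_id, mda_string) tuples
          (hypothetical input format).
    Returns (mixed, clusters_per_mda, largest) where:
      mixed            maps cluster ids containing more than one MDA
                       to the sorted list of those MDAs
      clusters_per_mda counts the starting clusters seen for each MDA
      largest          is the (mda, count) pair with the most clusters,
                       or None if there were no rows
    """
    mdas_per_cluster = defaultdict(set)
    for _seq_id, cluster_id, mda in rows:
        mdas_per_cluster[cluster_id].add(mda)

    # Clusters whose member sequences do not all share the same MDA.
    mixed = {c: sorted(m) for c, m in mdas_per_cluster.items() if len(m) > 1}

    # A mixed cluster is counted once for every MDA it contains.
    clusters_per_mda = Counter()
    for mdas in mdas_per_cluster.values():
        for mda in mdas:
            clusters_per_mda[mda] += 1

    largest = clusters_per_mda.most_common(1)[0] if clusters_per_mda else None
    return mixed, clusters_per_mda, largest


# Hypothetical example data: "A" and "A-B" stand in for real MDA strings.
rows = [
    ("s1", "c1", "A-B"),
    ("s2", "c1", "A-B"),
    ("s3", "c2", "A"),
    ("s4", "c2", "A-B"),
    ("s5", "c3", "A"),
    ("s6", "c4", "A-B"),
]

mixed, clusters_per_mda, largest = summarise_mda_partitions(rows)
print("clusters with more than one MDA:", mixed)
print("largest MDA partition:", largest)
```

If `mixed` turns out to be non-empty for real data, that would suggest generating starting clusters directly from the MDA string rather than partitioning the existing clusters.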

Other thoughts:

  • for the first round of GeMMA, it might be worth stopping the tree-building process early if we know that FunFHMMER will cut at that point anyway (E-value cutoff? running FunFHMMER after each merge?)