
Combinatorial explosion in heteromeric assemblies with many entities #48

Closed
josemduarte opened this issue Jul 17, 2015 · 5 comments

@josemduarte
Contributor

With the current exhaustive enumeration of assemblies, we run into a combinatorial explosion when there are many entities.

Some examples:

  • 4d10: 8 entities, 58 interfaces -> out of memory
  • 3ulu: 7 entities, 18 interfaces -> completes, but produces 25194 assemblies!
  • 1eo8: only 4 entities, 15 interfaces, but the list of assemblies is already huge (2649) and breaks the WUI with an out-of-memory error in Jetty
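
For a sense of scale: if the enumeration roughly amounts to considering subsets of the interface types before any validity checks, then 58 interfaces give up to 2^58 ≈ 2.9 × 10^17 candidates, so running out of memory is not surprising.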

A partial solution that would reduce the number of assemblies would be not to enumerate the assemblies that are not fully covering. At the same time, it is useful to keep those non-fully-covering assemblies in cases where there seems to be co-crystallisation and the right prediction would be the separate monomers.

A better long-term solution could be graph contraction at an early stage, although there we would also face a combinatorial issue in deciding which edges to contract.

@lafita
Member

lafita commented Apr 12, 2016

I think we need to use the interface quality information when generating assemblies from the graph. We should exploit the basic requirements that relevant assemblies must satisfy, in terms of their interfaces, to reduce the search space.

One requirement I see is that a relevant assembly must have at least one relevant interface per subunit. The issue, then, is how to determine whether an interface is relevant. We could use a score proportional to a combination of the three EPPIC scores, for example. The score threshold should be set very low, so that we reject very few relevant assemblies (false negatives) while discarding the majority of irrelevant ones (enough to make the problem computable).

We can then build a second graph containing only the relevant interfaces and run the same brute-force algorithm on it. The assemblies generated will satisfy, in addition to the topological requirements already built into the algorithm, the condition of having at least one relevant interface between subunits. At this stage, for each assembly we can add back all the other interfaces that occur between its subunits in the original graph, even though they were not deemed relevant (for completeness).
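
A minimal sketch of the filtering stage (all names here are hypothetical placeholders, not the actual EPPIC API):

```java
import java.util.List;
import java.util.stream.Collectors;

/**
 * Sketch of the filtering stage of the proposed two-stage enumeration.
 * InterfaceEdge, the score fields and the threshold are hypothetical
 * stand-ins for the real graph types.
 */
public class RelevantInterfaceFilter {

    /** Minimal stand-in for an interface edge in the assembly graph. */
    record InterfaceEdge(int id, double geomScore, double csScore, double crScore) {
        /** Combined relevance from the three EPPIC indicators; the weighting is a free choice. */
        double relevance() {
            return (geomScore + csScore + crScore) / 3.0;
        }
    }

    /**
     * Keep only edges above a deliberately low threshold, so that very few
     * relevant interfaces are lost (few false negatives) while most of the
     * clearly irrelevant ones are dropped before enumeration.
     */
    static List<InterfaceEdge> relevantEdges(List<InterfaceEdge> all, double threshold) {
        return all.stream()
                  .filter(e -> e.relevance() >= threshold)
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<InterfaceEdge> edges = List.of(
                new InterfaceEdge(1, 0.9, 0.8, 0.7),
                new InterfaceEdge(2, 0.2, 0.1, 0.1));
        // The brute-force enumeration would then run on the reduced graph only;
        // non-relevant edges between the same subunits can be re-added to each
        // resulting assembly afterwards, for completeness.
        System.out.println(relevantEdges(edges, 0.3));
    }
}
```

The important design choice is that the threshold only needs to be strict enough to make the enumeration tractable; it does not need to classify interfaces correctly on its own.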

@lafita
Member

lafita commented Apr 12, 2016

To illustrate this point with an example:

2hda
A homomeric D3 example with 7 interfaces.
Interfaces 1 to 3 each have more than 300 Å² of interface area, while interfaces 4 to 7 have less than 150 Å².
The current code returns 12 assemblies: one monomer, four dimers (C2), three trimers (C3) and four hexamers (D3).

If we apply an area cutoff of 200 Å² to the interfaces included in the graph, we obtain 5 assemblies: one monomer, one dimer (C2), two trimers (C3) and one hexamer (D3).

@sbliven
Member

sbliven commented Apr 12, 2016

@lafita, you're describing a heuristic for reducing the search space. I agree that applying a cutoff so that we don't bother engaging particularly poor interfaces (judged either by interface area, as you suggest, or by some combination of the geometry/core-surface/core-rim scores) is likely to be the best solution.

There are a few reasons why we have held off on implementing such heuristics until we find a good assembly score (eppic-science#83):

  1. Given a linear score, you can turn this heuristic into a provably correct branch-and-bound algorithm that will always return the best-scoring assembly (see the sketch after this list)
  2. I'm somewhat concerned about the possibility that two borderline interfaces could combine to make a nice assembly. For instance, you sometimes have 3-way "interfaces" where a tail of one chain bridges a weak interface between two other chains (1pmo is an OK example of this).
  3. I still have the sense that we could ameliorate the full-enumeration problem through clever edge reduction, but I need to spend a few mornings at the whiteboard to demonstrate or refute this. If it works out, we could avoid using heuristics in all but the largest cases.
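
To make point 1 concrete, here is a toy branch-and-bound sketch, assuming a purely linear assembly score and eliding all the validity checks the real enumeration would need; the input and all names are hypothetical:

```java
/**
 * Toy branch-and-bound over subsets of interface edges, assuming a linear
 * assembly score (a sum of per-edge contributions, which may be negative).
 * Everything here is a hypothetical stand-in: the real algorithm would also
 * have to enforce the topological validity constraints on assemblies.
 */
public class AssemblyBranchAndBound {

    static double best = Double.NEGATIVE_INFINITY;

    /**
     * @param contrib    per-edge score contributions
     * @param i          index of the next edge to decide on
     * @param score      score of the edges engaged so far
     * @param optimistic score plus the sum of the remaining positive contributions
     */
    static void search(double[] contrib, int i, double score, double optimistic) {
        // Skipping all remaining edges is always allowed, so the current
        // partial score is the score of a complete candidate assembly.
        if (score > best) best = score;
        if (i == contrib.length) return;
        // Bound: even engaging every remaining positive edge cannot beat
        // the best assembly found so far, so this whole branch is pruned.
        if (optimistic <= best) return;
        search(contrib, i + 1, score + contrib[i],
               optimistic + Math.min(contrib[i], 0));    // engage edge i
        search(contrib, i + 1, score,
               optimistic - Math.max(contrib[i], 0));    // skip edge i
    }

    public static void main(String[] args) {
        double[] contrib = {2.5, -1.0, 0.7, -0.2, 1.3};
        double optimistic = 0;
        for (double c : contrib) optimistic += Math.max(c, 0);
        search(contrib, 0, 0, optimistic);
        System.out.println("best score = " + best); // 4.5 for this toy input
    }
}
```

With a linear score the bound is exact, which is what makes the pruning provably safe; a non-linear assembly score would need a different, possibly looser, upper bound.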

@lafita
Member

lafita commented Apr 14, 2016

Although the branch-and-bound algorithm will return the optimal solution, it returns only a single solution, while the heuristic would reduce the search space enough that you can find a set of reasonable solutions (not guaranteed to contain the optimal one, I agree).

I responded more extensively in eppic-science#83.

@josemduarte
Contributor Author

We have a solution for this now. There are alternative ideas discussed above, but let's close this for now and discuss those when necessary in new issues.
