Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ModelCIF output #284

Closed
josemduarte opened this issue Mar 2, 2023 · 5 comments
Closed

ModelCIF output #284

josemduarte opened this issue Mar 2, 2023 · 5 comments

Comments

@josemduarte
Copy link
Collaborator

I think outputting natively ModelCIF instead of PDB format would be a very nice feature for openfold.

I've been working on it for some time, see draft PR: #283

I'm opening the issue to see what's the interest on this and to see what metadata people would deem important to be written into the ModelCIF file. Currently the metadata I write is global pLDDT and local pLDDT per residue (using _ma_qa_metric ModelCIF categories)

What's missing in my PR is writing out the relaxed protein as ModelCIF. There's a few conversions between the protein object and PDB format strings during that process. I'm trying to find a solution that minimises the number of round trips protein -> pdb string -> protein.

Also one related question: the docs say that protein.from_pdb_string() is not able to read multi-chain files. However, the amber relaxation process does seem to use protein.from_pdb_string(): see https://github.com/aqlaboratory/openfold/blob/main/openfold/np/relax/amber_minimize.py#L551

That would imply that multi-chain predictions are lost in the relaxation process. Is that intended? Note I couldn't test it because I'm having trouble building flash attention.

@josemduarte
Copy link
Collaborator Author

Also one related question: the docs say that protein.from_pdb_string() is not able to read multi-chain files. However, the amber relaxation process does seem to use protein.from_pdb_string(): see https://github.com/aqlaboratory/openfold/blob/main/openfold/np/relax/amber_minimize.py#L551
That would imply that multi-chain predictions are lost in the relaxation process. Is that intended? Note I couldn't test it because I'm having trouble building flash attention.

Replying to myself: it looks like the docs of protein.from_pdb_string() are actually not right and multi-chain reading does work fine. Multi-chain predictions are not lost in the relaxation process. I'll fix the docs in my PR too.

@josemduarte
Copy link
Collaborator Author

I'll close this as complete now that the PR is merged.

As mentioned in #287 some headers are missing in ModelCIF. But they can be brought in later if that's deemed as important.

@vaclavhanzl
Copy link
Contributor

vaclavhanzl commented Apr 10, 2023

I fixed the Colab notebook in #285 which got merged the same day as #287 and it seems that #287 immediately broke the Colab notebook again. The problem is that Colab notebook has its own list of dependencies which was not updated to include modelcif. I'll try to fix the Colab again. But it is of couse bad to have dependencies specified at two places, this underlying problem should be fixed as well.

@josemduarte
Copy link
Collaborator Author

Yes, ideally dependencies should be defined in environment.yml only. Does Colab not use the dependencies from there?

@vaclavhanzl
Copy link
Contributor

vaclavhanzl commented Apr 10, 2023

No, it has its own (sub)set of dependencies and a bit tricky install, likely trying to speed things up doing only the minimum needed. But this complicates things of course. I am just running test of an immediate fix (adding one dependency) but unifying dependencies is worth some thought.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants