Structuremsa consumes RAM until killed by OS #3

Open
hughhigin opened this issue Feb 15, 2024 · 3 comments

@hughhigin

Expected Behavior

Running structuremsa on a large local database should complete, whether with default parameters or with --filter-msa 0. Instead it fails as described below; without filtering, the failure occurs above ~2000 sequences.

Current Behavior

During the merging step of the alignment, all available memory is consumed (even on a workstation with 100 GB of RAM) until the OS kills the process and the output "Killed" appears.

Steps to Reproduce (for bugs)

Run structuremsa on any sufficiently large database of sequences with the default --max-seq-len parameter.

Foldseek Output (for bugs)

During merging step this is how it terminates (example):
0 0 A0A7G8BFY0.pdb A0A7I8DMP5.pdb 689 (TM-align)
0 0 A0A554IY00.pdb Q2S4W8.pdb 824 (TM-align)
0 0 A0A1Q7IQS0.pdb A0A7K0WNT6.pdb 733
Killed
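
For anyone trying to reproduce this, it may help to record the exit status and peak memory of the run alongside the log. Below is a minimal sketch, assuming Linux and Python 3; `run_and_report_peak` is a hypothetical helper and not part of Foldseek. A return code of -9 means the child was ended by SIGKILL, which is the signal the Linux out-of-memory killer sends and which the shell reports as "Killed".

```python
import resource
import subprocess
import sys


def run_and_report_peak(cmd):
    """Run cmd and return (returncode, peak_rss_kib).

    Hypothetical helper for bug reports. Assumes Linux, where ru_maxrss
    is expressed in kibibytes.
    """
    proc = subprocess.run(cmd)
    # For RUSAGE_CHILDREN, ru_maxrss is the largest peak resident set size
    # among all child processes this process has waited for so far.
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return proc.returncode, peak_kib


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the command to measure as arguments to this script.
    code, peak = run_and_report_peak(sys.argv[1:])
    print(f"return code: {code}, peak RSS: {peak / 1024:.1f} MiB")
```

The figure covers only processes that were waited for, so it is a rough lower bound if the tool spawns helpers that outlive it.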

Context

My current workaround is to lower the --max-seq-len parameter from the default of 65K to something on the order of several thousand, which keeps the memory blowup within my hardware limits for the number of sequences I'm using. Most of my sequences are a few hundred amino acids long, so the total alignment length is under 2000.
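
One way to pick a value for this workaround is to measure the inputs first. The sketch below is only an illustration under stated assumptions: the structures are in standard fixed-column PDB format, one model per file, and both `count_residues` and `suggest_max_seq_len` are hypothetical helpers, not Foldseek functions. The headroom factor is arbitrary; the final alignment can be longer than any single input, so the result should be treated as a starting point and not as a guaranteed safe setting.

```python
from typing import Iterable


def count_residues(pdb_lines: Iterable[str]) -> int:
    """Count residues in PDB-format text by collecting unique C-alpha atoms."""
    seen = set()
    for line in pdb_lines:
        # ATOM records: atom name in columns 13-16, chain ID in column 22,
        # residue number plus insertion code in columns 23-27.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            # A set keeps alternate locations of the same residue from
            # being counted twice.
            seen.add((line[21], line[22:27]))
    return len(seen)


def suggest_max_seq_len(lengths: Iterable[int], headroom: float = 4.0) -> int:
    """Return the longest input length scaled by an arbitrary headroom factor."""
    return int(max(lengths) * headroom)
```

For a directory of structures, `count_residues(open(path))` can be applied to each file and the resulting lengths passed to `suggest_max_seq_len`; the number it returns is what would go after --max-seq-len.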

I wonder if structuremsa could be changed so that it stops storing everything in RAM while merging. A simpler improvement might be to trim the length of stored sequences based on the input.

Your Environment

I've primarily been running on a personal computer with 32 GB of RAM on WSL 2, but the bug occurs similarly on a workstation with 128 GB of RAM running Ubuntu.

@gamcil
Collaborator

gamcil commented Feb 21, 2024

Hi @hughhigin, I just pushed some changes which should significantly improve memory usage. Could you try again with the latest version and see if it works?

@hughhigin
Author

@gamcil yes, it looks like the changes allowed structuremsa to complete, even on the full set of 23000 proteins!

I did get an error running msa2lddt afterwards (part of easy-msa) that might be useful to know about, so I copied the output here. For my purposes it's not an issue at the moment, since I'm focused on analyzing the 3Di alignment, but let me know if you'd like me to dig into this part more.


msa2lddt bigtmp/18050383238977581431/structures smPGTs_3di_align_all.fa --lddt-html smPGTs_3di_align_all.html --guide-tree smPGTs_3di_align_all.nw --pair-threshold 0 --threads 20 -v 3 --report-command '--match-ratio 0.51 --filter-msa 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --output-mode 1 '

terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted
Error: msa2lddt died


Regardless, thanks for the quick work addressing the issue! I really appreciate it.

Best,
Hugh

@gamcil
Collaborator

gamcil commented Feb 22, 2024

Great! Thanks for testing that. I'm not sure about the msa2lddt issue; I'll have to look into it.
