"Multi-query input not implemented" #10

Closed
DaRinker opened this issue Sep 8, 2023 · 15 comments

@DaRinker

DaRinker commented Sep 8, 2023

What is the fastest way to handle multiple sequences (e.g. without loading the database into memory each time)?

I know there are some sample scripts, but I'm having a hard time seeing how they relate to the simple, step-by-step example (that is only good for a single fasta file).

@staszekdh
Member

@DaRinker Thank you for your message. Currently, multi-query input is not implemented, but this is a feature that we will make available very soon.

@staszekdh staszekdh added the enhancement New feature or request label Sep 11, 2023
@DaRinker
Author

> @DaRinker Thank you for your message. Currently, multi-query input is not implemented, but this is a feature that we will make available very soon.

In the meantime, can you help me with the question I asked? I'm finding the single query process to run very slowly (much more slowly than the MPI web portal) and I think it's because the database is having to load into memory each time.

What is the (current) best practice for running 1000s of sequences locally?

Thank you

@staszekdh
Member

Can you contact me at stanislaw.dunin-horkawicz@tuebingen.mpg.de and we will figure out what the problem is?

@DaRinker
Author

Thanks for your help. Using the examples.sh script cut the processing time by about 60%.

@Argusmocny Argusmocny linked a pull request Sep 26, 2023 that will close this issue
@Citugulia40

Hi, I have 2 million query sequences and I want to run a homology search of them against a database of 250 sequences. Currently there is no multi-query option, and running each of the 2 million queries individually would take a very long time. Is there any way to speed up pLM-BLAST?

@staszekdh
Member

staszekdh commented Oct 6, 2023

@Citugulia40 Hi! The multi-query option is already implemented, but needs some testing before merging with the main branch. We expect to release it very soon along with other updates. cc @DaRinker

@Citugulia40

Thank you so much.
Sorry to ask, but do you expect the multi-query option to be merged into the main branch this month, or later?

@staszekdh
Member

Definitely this month, maybe even next week.

@Argusmocny Argusmocny self-assigned this Oct 7, 2023
@Argusmocny Argusmocny pinned this issue Oct 7, 2023
@Citugulia40

Hi, I'm curious to know if you have an estimate of how long pLM-BLAST would take to run 2 million query sequences against a database of 250 sequences once the multi-query option is implemented?

@staszekdh
Member

We will provide a speed benchmark along with the multi-query support. As a rough estimate, the ECOD benchmark (all-versus-all comparison of 1500 sequences) took about 30 minutes on a 20-core CPU. Running times will depend heavily on the cosine similarity cutoff and the length of the sequences. You may also want to consider clustering your 2M sequences to 40-50% identity at a high coverage cutoff (e.g. with MMseqs2). Given the sensitivity of pLM-based methods, searching with 1-2 examples per cluster should be sufficient.
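The clustering suggestion above could be sketched roughly as follows. This is only an illustration, not part of pLM-BLAST: it assumes the sequences were already clustered with something like `mmseqs easy-cluster queries.fasta clu tmp --min-seq-id 0.5 -c 0.8` (the file names and exact thresholds are placeholders), and that the resulting `clu_cluster.tsv` has two tab-separated columns, representative ID and member ID. The helper names `pick_cluster_examples` and `filter_fasta` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set


def pick_cluster_examples(tsv_lines: Iterable[str], per_cluster: int = 2) -> Dict[str, List[str]]:
    """Keep up to `per_cluster` member IDs for each cluster representative.

    Assumes each line is "<representative>\t<member>", as in the
    cluster TSV written by MMseqs2 easy-cluster.
    """
    chosen: Dict[str, List[str]] = defaultdict(list)
    for line in tsv_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        representative, member = line.split("\t")[:2]
        if len(chosen[representative]) < per_cluster:
            chosen[representative].append(member)
    return dict(chosen)


def filter_fasta(fasta_lines: Iterable[str], keep_ids: Set[str]) -> Iterator[str]:
    """Yield only the FASTA records whose ID (first word of the header) is in keep_ids."""
    keep = False
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in keep_ids
        if keep and line:
            yield line


if __name__ == "__main__":
    # Placeholder paths; adjust to your own clustering output.
    with open("clu_cluster.tsv") as tsv:
        examples = pick_cluster_examples(tsv, per_cluster=2)
    keep_ids = {member for members in examples.values() for member in members}
    with open("queries.fasta") as fasta, open("queries_reduced.fasta", "w") as out:
        for record_line in filter_fasta(fasta, keep_ids):
            out.write(record_line + "\n")
```

The reduced FASTA file can then be used as the query set, which shrinks the search roughly in proportion to the number of clusters.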

@Citugulia40

Thank you so much for your kind support. Eagerly waiting for the multi-query option.

@Citugulia40

> Definitely this month, maybe even next week.

Hi, I just want to ask whether there is any update regarding the multi-query option.

Thanks in advance

@Argusmocny
Contributor

Hi, all changes are in: https://github.com/labstructbioinf/pLM-BLAST/tree/multi_query_feature. I will merge them on Thursday. There is still some work to do.

@Citugulia40

OK, thank you very much.

@Argusmocny
Contributor

Changes are now live, looking forward to your feedback :)
