-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
anvi-gen-contigs-database is slow #1431
Comments
Hey, I was just taking a look at this....I'm wondering, what data set were you using to generate those profile results? To make a quick test data set that was a bit larger than the ones in the anvio test directory, I catted together Basically, I'm asking to see what kind of inputs you're trying to optimize for....large number of short contigs, small number of large contigs, or something else. Thanks! |
Ryan! Thank you very much for your response :) I think the most realistic test data would be the Infant Gut Dataset. After downloading it, you can use migrate databases using master, and export contig sequences as a FASTA file from the contigs database:
Then you can do your tests with the resulting FASTA file.
It will be both. We sometime deal with a eukaryotic genome of 30 Mbp, or 100,000 contigs that are shorter than 5 kbp. I think the k-mer calculation is the most critical improvement and the low-hanging fruit. Best, |
So I ran the infant gut contigs set through pyinstrument and the kmer counting/frequency still isn't the thing taking up most of the time. It did take up about 1/2 of the CPU time, but only about 1/5 of the wall time. A majority of the wall time (~55%) is just spent in python's subprocess wait (waiting on prodigal). Check out the profile output here: So maybe the most impactful thing would be to start with parallelizing the Prodigal gene calling? (I'm wondering... @ekiefl, when you ran the profiling was it just taking into account CPU cycles that python was actually using, or was it taking into account all walltime of the program? Or, maybe the difference could all be down to different type of test data....) |
Fair enough :) Parallelizing Prodigal is a good start after all, perhaps. We have examples in the codebase to parallelize HMMER. Maybe the solution there can be generalized to use the same code for Prodigal. Perhaps we can also parallelize kmer counting, and tadaa. Thank you for thinking about this, @mooreryan! Best, |
Hey @mooreryan, I don't remember exactly, however I was using
Sounds great. If you parallelize Prodigal, I will numba-ize kmer frequency calculations. |
Ahh
I've got the parallel Prodigal mostly working now...just need to clean it up a bit. I'm pretty interested to see how numba's JIT does with the kmer counting! |
Hey quick question... in this comment (#1443 (comment)) @ekiefl mentioned getting an error when running all tests. Probably a silly question, but how do you run the full suite of tests? For the infant gut dataset, I got the runtime for the Thanks! |
Nice, how did you do it?! In the root directory, cd to the tests dir cd anvio/tests Here there are a bunch of tests you can run, e.g. ./run_all_tests.sh |
<3 Are you essentially getting the same split start/stop positions or are you getting ball park numbers? You can generate the necessary information to compare splits int the original contigs database and the one you have generated by running this:
|
<3 |
Hi!
anvi-gen-contigs-database
, one of the key programs in anvi'o, takes a lot of time running, but its speed can be increased greatly with some of the following suggestions (ordered based on potential gain):1. Refactor
get_kmer_frequency
.This function currently spends 76% of
anvi-gen-contigs-database
's time.One issue is that it appears to be called twice. Finding out more about the call structure is important. If second call is unnecessary, we can see a 2-fold speed gain. More importantly, the code can be re-written using
numba
(see here and below inbamops.py
fornumba
-written functions) or in Cython. I estimate these speed gains would be 40-200x faster, makingget_kmer_frequency
, which currently has a very naive Python implementation, almost instant.2. Parallelize "The South Loop"
"The South Loop" in
dbops.py
takes up most of the time inanvi-gen-contigs-database
and processes contigs independently, and therefore could be parallelized for nearly linear speed gains as a function of number of threads.3. Parallelize Prodigal
Prodigal is a gene caller that does not support multithreading. But one can implement a workaround to split input FASTA files into multiple pieces and running Prodigal in parallel just to combine results later. This has been suggested before (#1344), although it seems like the most amount of work for the least amount of gain (but you're the boss).
If you are interested in addressing any of these suggestions, you can install anvi'o to follow the active codebase and start playing with some FASTA files using this command:
Let us know here if you have any questions.
Thanks!
The text was updated successfully, but these errors were encountered: