
anvi-gen-contigs-database is slow #1431 (Closed)

ekiefl opened this issue May 26, 2020 · 10 comments

@ekiefl (Contributor) commented May 26, 2020

Hi!

anvi-gen-contigs-database, one of the key programs in anvi'o, takes a long time to run, but it could be sped up considerably with some of the following suggestions (ordered by potential gain):

1. Refactor get_kmer_frequency.

This function currently accounts for 76% of anvi-gen-contigs-database's runtime.

[attached image: pprofile call graph showing get_kmer_frequency dominating the runtime]

One issue is that it appears to be called twice. Finding out more about the call structure is important: if the second call is unnecessary, that alone is a 2-fold speed gain. More importantly, the code can be rewritten with numba (see the numba-written functions in bamops.py) or in Cython. I estimate these changes would make it 40-200x faster, rendering get_kmer_frequency, which currently has a very naive Python implementation, almost instant.
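To make this concrete, below is a minimal sketch of the numba approach. It is not anvi'o's actual implementation: the function names, the 2-bit rolling encoding, and the handling of ambiguous bases are all assumptions about how it could look.

# A sketch only, not anvi'o's get_kmer_frequency: names, the 2-bit
# encoding, and the treatment of ambiguous bases are illustrative.
import numpy as np
from numba import njit

@njit(cache=True)
def count_kmers(seq_codes, k):
    # seq_codes: uint8 array with A,C,G,T encoded as 0..3 and anything
    # else (e.g. N) as 4. Returns an array of length 4**k, indexed by
    # the 2-bit encoding of each k-mer.
    counts = np.zeros(4 ** k, dtype=np.uint32)
    mask = (1 << (2 * k)) - 1  # keep only the last k bases in the rolling index
    index = 0
    valid = 0                  # consecutive unambiguous bases seen so far
    for i in range(seq_codes.shape[0]):
        c = seq_codes[i]
        if c > 3:              # ambiguous base: reset the rolling window
            index = 0
            valid = 0
            continue
        index = ((index << 2) | c) & mask
        valid += 1
        if valid >= k:
            counts[index] += 1
    return counts

def encode(sequence):
    # Plain-Python helper mapping a DNA string to the 0..4 code array.
    lookup = np.full(256, 4, dtype=np.uint8)
    for code, base in enumerate(b"ACGT"):
        lookup[base] = code        # uppercase
        lookup[base + 32] = code   # lowercase
    return lookup[np.frombuffer(sequence.encode(), dtype=np.uint8)]

After the one-time JIT compilation cost, count_kmers(encode(sequence), 4) should run at close to C speed on long contigs, which is the kind of gain the 40-200x estimate is based on.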

2. Parallelize "The South Loop"

"The South Loop" in dbops.py takes up most of the time in anvi-gen-contigs-database and processes contigs independently, so it could be parallelized for nearly linear speed gains as a function of the number of threads.

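As a rough illustration of the shape this could take (process_contig here is a placeholder for the real per-contig body in dbops.py, not actual anvi'o code):

# Illustrative only: process_contig stands in for the real per-contig work.
import multiprocessing as mp

def process_contig(item):
    name, sequence = item
    # ... k-mer counts, split computation, etc. for one contig go here ...
    return name, len(sequence)

def process_all_contigs(contigs, num_threads=4):
    # contigs: an iterable of (name, sequence) pairs. imap preserves input
    # order, so the parent process can consume results deterministically.
    with mp.Pool(processes=num_threads) as pool:
        yield from pool.imap(process_contig, contigs, chunksize=25)

Keeping the database writes in the parent process (a single consumer of this generator) avoids concurrent-writer headaches with SQLite.
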
3. Parallelize Prodigal

Prodigal is a gene caller that does not support multithreading, but one can work around this by splitting the input FASTA file into multiple pieces, running Prodigal on each piece in parallel, and combining the results afterwards. This has been suggested before (#1344), although it seems like the most amount of work for the least amount of gain (but you're the boss).
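A rough sketch of that workaround, where the paths, the chunk count, and the final merge step are all placeholders. One caveat: in single-genome mode Prodigal trains its model on the whole input, so splitting the FASTA only gives equivalent results in metagenome mode (or with a training file shared across chunks).

# A sketch only: file layout and the merge step are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def split_fasta(fasta_path, num_chunks, workdir="."):
    # Deal records round-robin into num_chunks smaller FASTA files.
    chunks = [open(Path(workdir) / f"chunk_{i:02d}.fa", "w") for i in range(num_chunks)]
    record = -1
    with open(fasta_path) as f:
        for line in f:
            if line.startswith(">"):
                record += 1
            chunks[record % num_chunks].write(line)
    for c in chunks:
        c.close()
    return [c.name for c in chunks]

def run_prodigal(chunk_path):
    out_path = chunk_path + ".gff"
    # -p meta keeps gene calling independent of how the input was split
    subprocess.run(["prodigal", "-i", chunk_path, "-o", out_path,
                    "-f", "gff", "-p", "meta"], check=True)
    return out_path

def parallel_prodigal(fasta_path, num_threads):
    chunk_paths = split_fasta(fasta_path, num_threads)
    # Prodigal does its work in separate processes, so threads suffice here.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        gff_paths = list(pool.map(run_prodigal, chunk_paths))
    return gff_paths  # gene calls still need to be merged and renumbered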


If you are interested in addressing any of these suggestions, you can install anvi'o to follow the active codebase and start playing with some FASTA files using this command:

anvi-gen-contigs-database -f FASTA.fa -o CONTIGS.db

Let us know here if you have any questions.

Thanks!

@mooreryan (Contributor) commented Jun 5, 2020

Hey, I was just taking a look at this... I'm wondering, what data set were you using to generate those profile results?

To make a quick test data set a bit larger than the ones in the anvi'o tests directory, I catted together ./anvio/tests/sandbox/workflows/metagenomics/three_samples_example/G02-contigs.fa until I had 1620 contigs with ~11 million nucleotides total, and pyinstrument put about 75% of the time in the Prodigal gene calls (rather than k-mer counting).

Basically, I'm asking what kind of inputs you're trying to optimize for... a large number of short contigs, a small number of large contigs, or something else?

Thanks!

@meren (Member) commented Jun 5, 2020

Ryan! Thank you very much for your response :)

I think the most realistic test data would be the Infant Gut Dataset. After downloading it, you can migrate the databases using master, and export contig sequences as a FASTA file from the contigs database:

anvi-export-contigs -c CONTIGS.db -o contigs.fa

Then you can do your tests with the resulting FASTA file.

> Basically, I'm asking what kind of inputs you're trying to optimize for... a large number of short contigs, a small number of large contigs, or something else?

It will be both. We sometimes deal with a eukaryotic genome of 30 Mbp, or with 100,000 contigs that are shorter than 5 kbp.

I think the k-mer calculation is the most critical improvement and the low-hanging fruit.

Best,

@mooreryan (Contributor) commented Jun 5, 2020

So I ran the infant gut contigs set through pyinstrument, and the k-mer counting/frequency still isn't the thing taking up most of the time. It did take up about half of the CPU time, but only about a fifth of the wall time. The majority of the wall time (~55%) is just spent in Python's subprocess wait (waiting on Prodigal).

Check out the profile output:
[attached image: pyinstrument profile output]

So maybe the most impactful thing would be to start with parallelizing the Prodigal gene calling?

(I'm wondering... @ekiefl, when you ran the profiling, was it just taking into account CPU cycles that Python was actually using, or was it taking into account all wall time of the program? Or maybe the difference could all be down to different types of test data...)
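(In case anyone wants to reproduce this: pyinstrument can wrap the command directly; something like the following should work, though your exact paths and flags may differ.)

pyinstrument $(which anvi-gen-contigs-database) -f contigs.fa -o CONTIGS.db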

@meren (Member) commented Jun 5, 2020

Fair enough :) Parallelizing Prodigal is a good start after all, perhaps.

We have examples in the codebase that parallelize HMMER. Maybe the solution there can be generalized so the same code works for Prodigal.

Perhaps we can also parallelize kmer counting, and tadaa.

Thank you for thinking about this, @mooreryan!

Best,

@ekiefl (Contributor, Author) commented Jun 7, 2020

> (I'm wondering... @ekiefl, when you ran the profiling, was it just taking into account CPU cycles that Python was actually using, or was it taking into account all wall time of the program? Or maybe the difference could all be down to different types of test data...)

Hey @mooreryan, I don't remember exactly; however, I was using pprofile. IIRC it is able to profile subprocesses. I was also using a different data set, so maybe that is the difference.

> So maybe the most impactful thing would be to start with parallelizing the Prodigal gene calling?

Sounds great. If you parallelize Prodigal, I will numba-ize kmer frequency calculations.

@mooreryan (Contributor) commented:

Ahh, pprofile... I thought the output you used looked like kcachegrind graphs; just looked it up and saw that pprofile can output callgrind format... cool!
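(For anyone else who wants that callgrind output, pprofile's README suggests an invocation along these lines; I haven't double-checked the exact flags:)

pprofile --format callgrind --out cachegrind.out.contigs $(which anvi-gen-contigs-database) -f contigs.fa -o CONTIGS.db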

> Sounds great. If you parallelize Prodigal, I will numba-ize kmer frequency calculations.

I've got the parallel Prodigal mostly working now...just need to clean it up a bit. I'm pretty interested to see how numba's JIT does with the kmer counting!

@mooreryan (Contributor) commented:

Hey, quick question... in this comment (#1443 (comment)) @ekiefl mentioned getting an error when running all tests. Probably a silly question, but how do you run the full suite of tests?

For the infant gut dataset, I got the runtime of the get_split_start_stops_with_gene_calls function from ~50 seconds down to ~3 seconds with just a couple of little switches, but I wasn't sure how to run the full test suite to make sure everything still works.

Thanks!

[attached image: more profiling info]

@ekiefl (Contributor, Author) commented Jun 18, 2020

Nice, how did you do it?!

From the root directory, cd into the tests directory:

cd anvio/tests

From there you can run a bunch of tests, e.g.:

./run_all_tests.sh

@meren (Member) commented Jun 18, 2020

> For the infant gut dataset, I got the runtime of the get_split_start_stops_with_gene_calls function from ~50 seconds down to ~3 seconds with just a couple of little switches

<3

Are you essentially getting the same split start/stop positions, or are you getting ballpark numbers?

You can generate the necessary information to compare splits in the original contigs database and the one you have generated by running this:

sqlite3 CONTIGS.db 'select split, start, end from splits_basic_info;' -separator $'\t' -header
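Running that against both databases and diffing the outputs should make any disagreement obvious; for instance (the database names here are placeholders):

sqlite3 -separator $'\t' CONTIGS-OLD.db 'select split, start, end from splits_basic_info;' > old.tsv
sqlite3 -separator $'\t' CONTIGS-NEW.db 'select split, start, end from splits_basic_info;' > new.tsv
diff old.tsv new.tsv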

@meren (Member) commented Jan 2, 2021

<3

@meren closed this as completed Jan 2, 2021