
anvi-gen-contigs-database is slow #1431 (Closed)

ekiefl opened this issue May 26, 2020 · 10 comments

@ekiefl (Contributor) commented May 26, 2020

Hi!

anvi-gen-contigs-database, one of the key programs in anvi'o, takes a long time to run, but it could be sped up considerably with some of the following suggestions (ordered by potential gain):

1. Refactor get_kmer_frequency.

This function currently accounts for 76% of anvi-gen-contigs-database's runtime.

[attached image: pprofile call graph showing get_kmer_frequency dominating the runtime]

One issue is that it appears to be called twice. Finding out more about the call structure is important: if the second call is unnecessary, that alone is a 2-fold speed gain. More importantly, the code can be rewritten with numba (see the numba-written functions in bamops.py) or in Cython. I estimate these changes would make it 40-200x faster, rendering get_kmer_frequency, which currently has a very naive Python implementation, almost instant.
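To make this concrete, below is a minimal sketch of the numba approach. It is not anvi'o's actual implementation: the function names, the 2-bit rolling encoding, and the handling of ambiguous bases are all assumptions about how it could look.

# A sketch only, not anvi'o's get_kmer_frequency: names, the 2-bit
# encoding, and the treatment of ambiguous bases are illustrative.
import numpy as np
from numba import njit

@njit(cache=True)
def count_kmers(seq_codes, k):
    # seq_codes: uint8 array with A,C,G,T encoded as 0..3 and anything
    # else (e.g. N) as 4. Returns an array of length 4**k, indexed by
    # the 2-bit encoding of each k-mer.
    counts = np.zeros(4 ** k, dtype=np.uint32)
    mask = (1 << (2 * k)) - 1  # keep only the last k bases in the rolling index
    index = 0
    valid = 0                  # consecutive unambiguous bases seen so far
    for i in range(seq_codes.shape[0]):
        c = seq_codes[i]
        if c > 3:              # ambiguous base: reset the rolling window
            index = 0
            valid = 0
            continue
        index = ((index << 2) | c) & mask
        valid += 1
        if valid >= k:
            counts[index] += 1
    return counts

def encode(sequence):
    # Plain-Python helper mapping a DNA string to the 0..4 code array.
    lookup = np.full(256, 4, dtype=np.uint8)
    for code, base in enumerate(b"ACGT"):
        lookup[base] = code        # uppercase
        lookup[base + 32] = code   # lowercase
    return lookup[np.frombuffer(sequence.encode(), dtype=np.uint8)]

After the one-time JIT compilation cost, count_kmers(encode(sequence), 4) should run at close to C speed on long contigs, which is the kind of gain the 40-200x estimate is based on.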

2. Parallelize "The South Loop"

"The South Loop" in dbops.py takes up most of the time in anvi-gen-contigs-database and processes contigs independently, so it could be parallelized for nearly linear speed gains as a function of the number of threads.

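As a rough illustration of the shape this could take (process_contig here is a placeholder for the real per-contig body in dbops.py, not actual anvi'o code):

# Illustrative only: process_contig stands in for the real per-contig work.
import multiprocessing as mp

def process_contig(item):
    name, sequence = item
    # ... k-mer counts, split computation, etc. for one contig go here ...
    return name, len(sequence)

def process_all_contigs(contigs, num_threads=4):
    # contigs: an iterable of (name, sequence) pairs. imap preserves input
    # order, so the parent process can consume results deterministically.
    with mp.Pool(processes=num_threads) as pool:
        yield from pool.imap(process_contig, contigs, chunksize=25)

Keeping the database writes in the parent process (a single consumer of this generator) avoids concurrent-writer headaches with SQLite.
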
3. Parallelize Prodigal

Prodigal is a gene caller that does not support multithreading, but one can work around this by splitting the input FASTA file into multiple pieces, running Prodigal on each piece in parallel, and combining the results afterwards. This has been suggested before (#1344), although it seems like the most amount of work for the least amount of gain (but you're the boss).
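A rough sketch of that workaround, where the paths, the chunk count, and the final merge step are all placeholders. One caveat: in single-genome mode Prodigal trains its model on the whole input, so splitting the FASTA only gives equivalent results in metagenome mode (or with a training file shared across chunks).

# A sketch only: file layout and the merge step are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def split_fasta(fasta_path, num_chunks, workdir="."):
    # Deal records round-robin into num_chunks smaller FASTA files.
    chunks = [open(Path(workdir) / f"chunk_{i:02d}.fa", "w") for i in range(num_chunks)]
    record = -1
    with open(fasta_path) as f:
        for line in f:
            if line.startswith(">"):
                record += 1
            chunks[record % num_chunks].write(line)
    for c in chunks:
        c.close()
    return [c.name for c in chunks]

def run_prodigal(chunk_path):
    out_path = chunk_path + ".gff"
    # -p meta keeps gene calling independent of how the input was split
    subprocess.run(["prodigal", "-i", chunk_path, "-o", out_path,
                    "-f", "gff", "-p", "meta"], check=True)
    return out_path

def parallel_prodigal(fasta_path, num_threads):
    chunk_paths = split_fasta(fasta_path, num_threads)
    # Prodigal does its work in separate processes, so threads suffice here.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        gff_paths = list(pool.map(run_prodigal, chunk_paths))
    return gff_paths  # gene calls still need to be merged and renumbered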


If you are interested in addressing any of these suggestions, you can install anvi'o to follow the active codebase and start playing with some FASTA files using this command:

anvi-gen-contigs-database -f FASTA.fa -o CONTIGS.db

Let us know here if you have any questions.

Thanks!

@mooreryan (Contributor) commented Jun 5, 2020

Hey, I was just taking a look at this... I'm wondering, what data set were you using to generate those profile results?

To make a quick test data set a bit larger than the ones in the anvi'o tests directory, I catted together ./anvio/tests/sandbox/workflows/metagenomics/three_samples_example/G02-contigs.fa until I had 1620 contigs with ~11 million nucleotides total, and pyinstrument put about 75% of the time in the Prodigal gene calls (rather than k-mer counting).

Basically, I'm asking what kind of inputs you're trying to optimize for... a large number of short contigs, a small number of large contigs, or something else?

Thanks!

@meren (Member) commented Jun 5, 2020

Ryan! Thank you very much for your response :)

I think the most realistic test data would be the Infant Gut Dataset. After downloading it, you can migrate the databases using master, and export contig sequences as a FASTA file from the contigs database:

anvi-export-contigs -c CONTIGS.db -o contigs.fa

Then you can do your tests with the resulting FASTA file.

> Basically, I'm asking what kind of inputs you're trying to optimize for... a large number of short contigs, a small number of large contigs, or something else?

It will be both. We sometimes deal with a eukaryotic genome of 30 Mbp, or with 100,000 contigs that are shorter than 5 kbp.

I think the k-mer calculation is the most critical improvement and the low-hanging fruit.

Best,

@mooreryan (Contributor) commented Jun 5, 2020

So I ran the infant gut contigs set through pyinstrument, and the k-mer counting/frequency still isn't the thing taking up most of the time. It did take up about half of the CPU time, but only about a fifth of the wall time. The majority of the wall time (~55%) is just spent in Python's subprocess wait (waiting on Prodigal).

Check out the profile output:
[attached image: pyinstrument profile output]

So maybe the most impactful thing would be to start with parallelizing the Prodigal gene calling?

(I'm wondering... @ekiefl, when you ran the profiling, was it just taking into account CPU cycles that Python was actually using, or was it taking into account all wall time of the program? Or maybe the difference could all be down to different types of test data...)
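(In case anyone wants to reproduce this: pyinstrument can wrap the command directly; something like the following should work, though your exact paths and flags may differ.)

pyinstrument $(which anvi-gen-contigs-database) -f contigs.fa -o CONTIGS.db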

@meren (Member) commented Jun 5, 2020

Fair enough :) Parallelizing Prodigal is a good start after all, perhaps.

We have examples in the codebase that parallelize HMMER. Maybe the solution there can be generalized so the same code works for Prodigal.

Perhaps we can also parallelize kmer counting, and tadaa.

Thank you for thinking about this, @mooreryan!

Best,

@ekiefl (Contributor, Author) commented Jun 7, 2020

> (I'm wondering... @ekiefl, when you ran the profiling, was it just taking into account CPU cycles that Python was actually using, or was it taking into account all wall time of the program? Or maybe the difference could all be down to different types of test data...)

Hey @mooreryan, I don't remember exactly; however, I was using pprofile. IIRC it is able to profile subprocesses. I was also using a different data set, so maybe that is the difference.

> So maybe the most impactful thing would be to start with parallelizing the Prodigal gene calling?

Sounds great. If you parallelize Prodigal, I will numba-ize kmer frequency calculations.

@mooreryan (Contributor) commented:

Ahh, pprofile... I thought the output you used looked like kcachegrind graphs; just looked it up and saw that pprofile can output callgrind format... cool!
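(For anyone else who wants that callgrind output, pprofile's README suggests an invocation along these lines; I haven't double-checked the exact flags:)

pprofile --format callgrind --out cachegrind.out.contigs $(which anvi-gen-contigs-database) -f contigs.fa -o CONTIGS.db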

> Sounds great. If you parallelize Prodigal, I will numba-ize kmer frequency calculations.

I've got the parallel Prodigal mostly working now...just need to clean it up a bit. I'm pretty interested to see how numba's JIT does with the kmer counting!

@mooreryan (Contributor) commented:

Hey, quick question... in this comment (#1443 (comment)) @ekiefl mentioned getting an error when running all tests. Probably a silly question, but how do you run the full suite of tests?

For the infant gut dataset, I got the runtime of the get_split_start_stops_with_gene_calls function from ~50 seconds down to ~3 seconds with just a couple of little switches, but I wasn't sure how to run the full test suite to make sure everything still works.

Thanks!

[attached image: more profiling info]

@ekiefl (Contributor, Author) commented Jun 18, 2020

Nice, how did you do it?!

From the root directory, cd into the tests directory:

cd anvio/tests

From there you can run a bunch of tests, e.g.:

./run_all_tests.sh

@meren (Member) commented Jun 18, 2020

> For the infant gut dataset, I got the runtime of the get_split_start_stops_with_gene_calls function from ~50 seconds down to ~3 seconds with just a couple of little switches

<3

Are you essentially getting the same split start/stop positions, or are you getting ballpark numbers?

You can generate the necessary information to compare splits in the original contigs database and the one you have generated by running this:

sqlite3 CONTIGS.db 'select split, start, end from splits_basic_info;' -separator $'\t' -header
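Running that against both databases and diffing the outputs should make any disagreement obvious; for instance (the database names here are placeholders):

sqlite3 -separator $'\t' CONTIGS-OLD.db 'select split, start, end from splits_basic_info;' > old.tsv
sqlite3 -separator $'\t' CONTIGS-NEW.db 'select split, start, end from splits_basic_info;' > new.tsv
diff old.tsv new.tsv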

@meren (Member) commented Jan 2, 2021

<3

@meren closed this as completed Jan 2, 2021