-
Notifications
You must be signed in to change notification settings - Fork 421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
split-index mapping in separate processes #787
base: master
Are you sure you want to change the base?
Conversation
Hi @mlin The --split-prefix was implemented for enabling mapping to large genomes in systems with limited RAM and was tested mostly for the human genome. Metagenomics was a possible use case as well, but I did not get to do any tests on the nt database. However, I anticipate the results on the nt using a split prefix would be close to running with a single index. I am also quite interested in seeing how the results look like if you managed to execute. The answers to your questions are as follows. It was a while I implemented this and minimap2 perhaps have had changes in the versions after, so my answers may not be entirely accurate.
Yes. You are correct. Infact, the script I wrote to balance the size of the indexes for the human genome separately indexes and concatenates them. So having them in separate files does not do any difference.
I think this is all right to let the user take care of the options - that makes the implementation simple. Step three does not see the index, but relies on command line parameters passed. For instance, here.
I am not sure how minimap2 with a single index behaves on the nt database. The MAPQ from the split method is anticipated to be closer estimates to ones from the single index. If minimap2 with a single index gives MAPQ 0, split index also likely to give the same. A potential pitfall would be if the architectures of the cluster nodes are different (different endianness or 64-bit vs 32-bit) which is unlikely. |
This reverts commit bebba68.
@lh3 @hasindu2008 Promoting this out of draft now that we (mainly @rzlim08) have built up some experience using the proposed distributed mapping for our metagenomics use case (where the reference index is much larger than memory). Added a FAQ item discussing how to use it too. We found one significant caveat to both the existing
This seems like an effect users ought to be made aware of, perhaps in the FAQ as well? But let us know if this sounds correct and/or if you can help fill in the long read explanation as well. |
@mlin When I checked for the human genome up to 16 partitions, the difference in the sensitivity was marginal (sometimes less, but sometimes higher). As I benchmarked on the human genome and these were very slight, I did not investigate the parameters anymore. However, as those discrepancies were mostly observed for mappings that mapped with very low MAPQ and as your metagenomics use case results in a larger amount of low MAPQ mappings, this may have now become prominent. Are you mapping to the nt database? |
@hasindu2008 I think we were exploring comparable numbers of partitions but @rzlim08 can confirm. I assume the preset thresholds (for short reads) of mid_occ and max_occ are tuned to minimize the sensitivity impact on human-sized references (while speeding up mapping). So it makes sense that the effect would be minimal no matter how much we partition the human genome, yet more pronounced when we try to scale up to much larger references. We have indeed been testing with nt so far (although that's not necessarily the final strategy, for the reasons we've both alluded to on this thread!). |
Hi @lh3 @hasindu2008
My colleagues @morsecodist @katrinakalantar @rzlim08 @tfrcarvalho and I hacked on the existing split-index mode so that we can map against each index part in a separate process (in general on a cluster of machines), then merge the intermediates in a final gather process. It's still a bit rough, but we'd appreciate any early feedback. This is for a metagenomics use case (e.g. mapping to all of nt) which I gather was contemplated in the design of the split index.
minimap2 --split-map part7.intermediate part7.idx reads_1.fq reads_2.fq
on each index part to generate the intermediate files.minimap2 --split-merge reads_1.fq reads_2.fq . part1.intermediate part2.intermediate ...
to merge the intermediates and generate PAF/SAM. This uses the existing merge subroutine unchanged, we're just restructuring the intermediate file I/O.We have a proof-of-concept workflow of this running (still to be tested at full scale).
The new approach coexists with the single-process
--split-prefix
mode -- the diff factors out its logic for formulating temporary filenames and deleting them afterwards.One wart we have right now is that the
mm_mapopt_t
must get initialized consistently in all the invocations for steps 2 and 3. I think it's okay if the user takes care to set the same command-line options to them, but I'm not completely sure, given that step 3 never sees any part of the index -- advice appreciated.Do you know any big pitfalls awaiting us in the merging subroutine if we try to use this on full-size metagenomics databases? (One we're anticipating is the mapQ being zeroed by all the redundancy in nt)