-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dedup dir-adjacency too slow/does not complete #31
Comments
Hi Roy, I've not used qualimap bamqc but from a quick read of the documentation it appears it detects duplicates using the start position. If you're using an aligner which can "clip" the reads it may miss-estimate the duplication rate. Would you mind running Picard MarkDuplicates as well to compare the output? Regarding the time taken, this is indeed very slow. Would you be OK to share with me the one chromosome 314MB undeduped BAM and I can see whether I get the same performance and if so where the bottleneck is. |
Hi Tom, |
|
Quick side note. You have lots of multimapped reads in your BAM.
26598 NH:i:10 As it stands, for a read which maps to 10 locations, each mapping location will be considered independently. Thus, the final output will contain between 0-N alignments for each read, where N is the number of alignments (NH tag). I'm going to add an option to discount secondary and supplementary alignments which you might want to consider. |
On to the real issue of the time taken. This looks to be solely due to the time taken to find the paired end mates. You can test this yourself by dropping the You have a relatively low duplication rate so most positions contains just a single read. The outputting therefore requires many calls to the pysam.AlignmentFile.mate() function which is not an efficient way to retrieve the paired end read. Ideally, when dealing with paired reads, the BAM is sorted by read name. However, we need a coordinate sorted BAM in order to perform the duplicate detection. It looks like we might need to implement so sort of caching to speed up the retrieval of the paired ends. @AndreasHeger Do you have any suggestions for how to best speed up the paired end retrieval? Is caching the right way to go? |
This looks to be solely due to the time taken to find the paired end mates. You can test this yourself by dropping the --paired option. I don't have the profile stats yet because it's still not finished 2 hours later(!). You have a relatively low duplication rate so most positions contains just a single read. The outputting therefore requires many calls to the pysam.AlignmentFile.mate() function which is not an efficient way to retrieve the paired end read. Ideally, when dealing with paired reads, the BAM is sorted by read name. However, we need a coordinate sorted BAM in order to perform the duplicate detection. It looks like we might need to implement so sort of caching to speed up the retrieval of the paired ends. @AndreasHeger Do you have any suggestions for how to best speed up the paired end retrieval? Is caching the right way to go? —You are receiving this because you were mentioned.Reply to this email directly or view it on GitHub |
Thanks for that @AndreasHeger. The output is already non-sorted, so perhaps we should consider caching. |
A very simplistic "two-pass" approach would be to run through the bamfile one chromosome at a time and store the reads as a ordered dictionary mapping reads ids to their paired reads, then re-pass through the keys to perform the dedup, emptying the dictionary after each chromosome. I'll test this out as an immediate solution. Alternatively, we could cache as we go but this will be a little more complicated within our current workflow? I'll have a think about how to best perform the caching. Either way, this would have to be done on a per-chromosome basis to avoid potentially huge memory requirements for chimeric read pairs. Perhaps we should fall back on the .mate() function for these read pairs? What do you think? |
@TomSmithCGAT I think falling back on the .mate() for chimeric reads is the way to go. A dynamic cache, that saved read2's in a dict, and then throw them away, or outputted them as their mates were processed would probably be more memory efficient, but harder to do. Most reads would not spend long in the cache because their mate would be nearby. Might also be slower though because for every read processed, you'd have to check if its mate was in a the cache, and add it to a cache if not. |
@IanSudbery I had to change things around a bit to allow the 2 pass option to store the read pairs on a per-contig basis. Note the new class "Writer" which takes the reads, passes them to the "ClusterAndReducer" functor and has method to write out the resultant reads and stats. I've tested it on the above single chromosome BAM (5M reads, 2M unique read pairs). The original code took 19 hours! With "--2pass" it takes 22 mins. The maximum Virtual Memory Size according to "ps" was 2.1 Gb. |
@TomSmithCGAT: A couple of points
I'm working on a suggestion. |
|
RE: multimappers - I think the cleanest way to deal with multimappers,
is to say its not our problem. The user should deal with them
independently. If they want to leave them, in, thats their decision -i
think we should leave them in if they are there. The only correct
alternative would be to consider all positions simultaneously. Thats
beyond our scope at the moment i reckon.
Im less sure about how to deal with chimeric alignments.
|
I agree this should be up to the user. On reflection perhaps the secondary/supp. skipping should be optional rather than enforced. I think we should mention multi-mapping in the documentation. I think we should take a similar approach to chimeric reads - leave them in (pull out the paired end with .mate() as we discussed) and have an option to skip them entirely at the users choosing. Out of interest, what TLEN do chimeric reads get in the BAM? |
I'd go further than that. I'd say that if they want to filter out secondary/supplimentary alignments, let them do it themselves; in the way they see fit. I'm not really sure how mate pairing works in the presence of supplementary alignments. Is the mate of the primary read1, the primary read2? or just the next on in the chain? I suspect this vaires from mapper to mapper. Could be the same is true of TLEN. Checkout my proposal for the 2 pass pairing in bbbe020. Its a bit more complicated than I'd anticipated because it has to deal with the chimeric alignment problem, but I think its still represents less radical surgery to the execution path. I profiled using:
(I didn't add a seperate option for 2pass, but made it how paired retrieval works) Its about 25% slower than your solution (5mins 30sec rather than 4mins 20 for the test BAM), but uses a quarter as much RAM. (resident set 500Mb vs 1.8GB). |
@IanSudbery I like the simplicity and cleanness of re-defining the outfile object and it's write and close methods for paired end reads. Nice! I beleive lines 959-964 need to be removed so that read pairs are only written once when using the --ignore-umi option. I'm surprised it's any slower to be honest but the gain in memory efficiency more than outweighs this small difference. |
Regarding secondary/supp. alignments in paired end, STAR appears to output a proper pair of alignments for so each alignment1 has it's own alginment2, even where the alignment1 is the same in both cases. I expect this will indeed vary for other mappers. Let's do as you say and leave it to the user to generate a suitable BAM but include some more information in the documentation regarding multimapping. |
@TomSmithCGAT Thanks for catching that - i've sorted it and submitted a PR. WRT secondary/supp - I was more worried about supplementary than secondary - i.e. where different parts of read1 are mapped two several locations, as are several parts of read2. I think STAR outputs these to a separate file, but I think BWA puts them all in the same file. |
@royfrancis Could you update UMI-Tools to the latest version (v.0.2.0). This should speed up the de-duplication stage considerably. |
Hi Tom, |
Hi @royfrancis, I don't suppose its much help, but you are probably getting hammered by the stats computation: running your chr1 BAM on my machine takes 2 mins 30 without stats and 25 mins with stats. This is probably due us not finding an efficient way to randomly sample reads from a BAM file. |
I ran my full bam (unique filtered, 7gb) yesterday for 18 hours and it still failed. Is there some way to output the progress, so as to estimate how much of the bam file was completed? |
Hi Roy, If you turn the verbosity up to 5 with -v5, a message should be printed to the log every 10,000 alignments processed. |
Hi @royfrancis, |
It still did not work for me after a 10hr run time. I am not sure what stats is, but I excluded
|
Yes, exactly. |
Hi Roy, was there anything in the error or log file? |
Here is the log file after running the full bam for 5 hours (3 cores, 12gb ram) without stats. The run was incomplete and timed out. A 1.5gb dedup bam file was produced. The error file was empty. |
I'll have a play with the breath_first_search and see if we can reduce the run time by removing some of the set operations. I downloaded the sample you linked to but there doesn't appear to be any UMIs in this sample?
|
Woops, forgot adjust the barcode from out in-house location to yours. For some reason years ago I decided that the UMI should go in front of the read name, AAACCCCACC is the UMI in this case. Either way, I've uploaded a fixed version of the bam file to the same location, should work out of the box now. I also did what I said would happen, got bored and wrote the fix :). It worked well on my test case that previously finished running, and now I'm testing it on the files that were issues in the past. I'll let you know if those worked / finished soon. |
Hi @TomSmithCGAT and @IanSudbery Congratulations on your paper! Have been following since the blog post. I've just loaded up UMI-tools for the first time this weekend as we finally have our data back, and have also been struck by the incredibly slow speed. Found this issue and thought I would contribute some data. Perhaps it is a misconfiguration on my end? Installed with Anyway, using ca. 1.5M reads of paired-end targeted resequencing data (where an individual locus could have high [>1000X] depth) I am seeing 90 minute runtimes. Unfortunately worsening is exponential -- latest one took 9 hrs for 2.5M reads. I wrote a script to downsample a ~1.5M read BAM file to 10%, 20%, 40%, and 80% and check times both with and without stats. FWIW, the proportional contribution of stats generation to overall runtime seems to be less the more reads one has, so I don't think in my case the suggestion earlier in the thread that this could be stats-related is entirely accurate. Anyway, here are the data. Values are in seconds. Yes, the last row is downsampled to 99% (not 100%) as script was easier that way =) Please let me know if you have suggestions to speed this up as it is kind of a blocker ...
|
Try installing from the master branch rather than from pip. My version of
this issue was fixed a week or two ago in master, but I'm guessing the
change hasn't been pushed to conda / pip yet.
Gabriel Pratt
Bioinformatics Graduate Student, Yeo Lab
University of California San Diego
…On Sun, Apr 2, 2017 at 8:10 PM, James S Blachly, MD < ***@***.***> wrote:
Hi @TomSmithCGAT <https://github.com/TomSmithCGAT> and @IanSudbery
<https://github.com/IanSudbery>
Congratulations on your paper! Have been following since the blog post.
I've just loaded up UMI-tools for the first time this weekend as we finally
have our data back, and have also been struck by the incredibly slow speed.
Found this issue and thought I would contribute some data. Perhaps it is a
misconfiguration on my end? Installed with sudo pip and have also run it
once using sudo umi_tools dedup as instructed. There is no CPU or memory
pressure on this system. Is it possible cython is not being engaged
properly?
Anyway, using ca. 1.5M reads of paired-end targeted resequencing data
(where an individual locus could have high [>1000X] depth) I am seeing 90
minute runtimes. Unfortunately worsening is exponential -- latest one took *9
hrs for 2.5M reads*. I wrote a script to downsample a ~1.5M read BAM file
to 10%, 20%, 40%, and 80% and check times both with and without stats.
FWIW, the proportional contribution of stats generation to overall runtime
seems to be less the more reads one has, so I don't think in my case the
suggestion earlier in the thread that this could be stats-related is
entirely accurate.
Anyway, here are the data. Values are in seconds. Yes, the last row is
downsampled to 99% (not 100%) as script was easier that way =) Please let
me know if you have suggestions to speed this up as it is kind of a blocker
...
reads no stats with stats
149749 69 98
298075 237 307
593073 837 1031
1174608 3232 3664
1446610 4766 5312
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#31 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABNUW4AH_w9OzXJDksagVNDmlqnvtW_0ks5rsGMzgaJpZM4IlqZd>
.
|
@gpratt I have commit latest commit from master branch, version 0.4.3, installed through pip. They certainly are on the ball with pip. |
Hi @jblachly. Thanks for the detailed break down of the run time. From what you say, it sounds like the issue is the network building stage which is O(n^2) since perform an all vs. all comparison between the UMIs. Would you mind sharing your BAM so we can profile the script and see what the bottleneck is? |
Hi @TomSmithCGAT , expect that I don't think the stats/no stats difference tallies with that: When building stats, we do the all vs all comparisons with the randomly sampled reads, as well as the actual reads at the locus. Thus should stats not at the very least double the amount of time spent building edit-distance matricies? This suggests that once again it is the connected component finding, rather than the edit distance calculation that is to blame as we don't actually build the components for the stats? Interestingly, the increase we are seeing in the those numbers is sub-quadratic (every doubling in reads is leading to a three fold increase in time, where as quadratic would predict a four-fold increase). To me, the problem here is not the increase in time with increasing depth, but the large amount of time taken to process even the smallest sample. However the jump 1.5hrs -> 9hrs for 1.5M->2.5M reads doesn't seem to fit that pattern. @jblachly Most of our users are doing RNA seq and often see their run times dominated by the single highest depth locus. They are seeing runtimes of days caused by deduplication problems with single locations with 10s or even 100s of thousands of reads - but they only have a single locus like this. By having 1000s of reads per locus, but many loci like this, you may have hit on a particularly pathological case. None-the-less, 90 seconds for 100,000 reads feels wrong. You signature states you are an MD, and you are doing resequencing, so I realise that this might be patient data that you are unable to share. However, one solution to this that I can see is that you strip out the sequence from the BAM files - BAM files are still valid if the sequence is replaced with '*'. Two ways we can do this: the tool If you are worried about us knowing the identity of the genes you are resequencing, then I suggest adding a secret number to each read position (the same number to every read, or even a different number for every gene - as long as it was the same number within a gene). |
Dear @IanSudbery , thanks for your willingness to help with this. I have sent you a note under separate cover with a link to the BAM files. |
Okay, I have located the problem here... You have many reads where the read 1 is on a different chromosome to its mate. This causes some issues as the end of each chromosome we output all the mates from a particular chromosome. If the mates aren't found, they accumulate in a buffer and we go searching for them at the end. This exhaustive search at the end is slow, and only there as a last resort. It is particularly slow in your case because to do the search, all reads that overlap the supposed location are retrieved and searched one by one for the mate. As you have may reads at one location, this is slow. I'm working on something better. |
@IanSudbery -- thank you so much for tracking this down. We are working with cancer samples so the mate mapping to different contig is going to be a common theme for us (and others in this field). Let me know if I can help in any way with the improvement (I can test other branches etc.). |
Dear @jblachly, I've push a branch "getting_mates" (see commit c96dd45). It definitely makes things faster for you. On my machine your samples finish in 2 minutes and 3 minutes respectively without stats. The only question is what they do to other people. @TomSmithCGAT Instead of trying to fetch all the reads at the mate location and then sorting through them looking for the read in question, this commit just reopens the input BAM file, and scans through the whole thing looking for mates. For the files at hand it takes seconds (1.5M and 2.5M reads respectively). I don't know what it would do to someone with 100M reads where the mates were not all concentrated in one location. |
@IanSudbery. At worst this will only ever mean passing through the whole BAM file once more which is tolerable compared to the pathologically poor performance in this case when using the index to try and specifically retrieve the missing pairs. This will also obviously only ever be triggered when the user has chimeric reads since otherwise the It looks like this method of outputting the orphan read pairs was implemented in the first instance of |
Could be.
…On Tue, 4 Apr 2017, 4:48 pm Tom Smith, ***@***.***> wrote:
@IanSudbery <https://github.com/IanSudbery>. At worst this will only ever
mean passing through the whole BAM file once more which is tolerable
compared to the pathologically poor performance in this case when using the
index to try and specifically retrieve the missing pairs. This will also
obviously only ever be triggered when the user has chimeric reads since
otherwise the read1s set will be emptied with each call to write_mates.
I've made a couple of comments on the commit. I'm very happy to merge this
into master when it's ready.
It looks like this method of outputting the orphan read pairs was
implemented in the first instance of TwoPassPairWriter so it could well
explain some of the other slow run times?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#31 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFJFjmG_Ea4yz0Pl_M1fQItKDOj011qSks5rsmZkgaJpZM4IlqZd>
.
|
P.s Amazing improvement in run time by the way @IanSudbery. That's >1 hour to <5 mins, right?! Thanks @jblachly for bringing this to our attention and sharing your data with us. Hopefully, we've covered most of the unusual BAMs now and the run time issues have been largely resolved 🤞 |
Adding my own experiences. I have been assessing this collection of tools for use in one of our processing pipelines. Using the 0.4.3 release, I am running a paired-end analysis off of a 36mil line BAM produced from amplicon alignments. I am doing one run only on chromosome 21, and the other is for the whole file. The chr21 run looked to be done in less than a minute, but then has been running for over 2 hours on the unmatched mates stage (~3500 unmatched). The other run claims 110k unmatched mates, and took maybe 30 min to make the first pass through the file. Pulling the gather_mates branch seems to improve the situation dramatically. I am now seeing no more than a minute spent on the finding mates step, at least when run on the full data set. When running on the single chromosome, I see the following error:
Not a big deal, it doesn't look like I will need to scatter gather the dedup step now that the runtime is more reasonable, but I thought it was worth bringing up. Last thing, not completely related to the thread. I generally see entries in the final output that are not completely sorted. Up until this point, I have had to sort before and after dedup so that the next tool in the pipeline won't break. Is this something that you already know about? Thanks for all the work on this tool! |
Thanks for your input. We'll certainly investigate your unbound error.
We're aware of the sort issue: the problem is that the tool outputs reads
in read.start position order rather than alignment start position order.
(E.g. the 3' end of reads on the reverse strand) we need to think about how
to deal with this without having to write out and then sort.
…On Thu, 4 May 2017, 9:30 pm jhl667, ***@***.***> wrote:
Adding my own experiences. I have been assessing this collection of tools
for use in one of our processing pipelines. Using the 0.4.3 release, I am
running a paired-end analysis off of a 36mil line BAM produced from
amplicon alignments. I am doing one run only on chromosome 21, and the
other is for the whole file. The chr21 run looked to be done in less than a
minute, but then has been running for over 2 hours on the unmatched mates
stage (~3500 unmatched). The other run claims 110k unmatched mates, and
took maybe 30 min to make the first pass through the file.
Pulling the gather_mates branch seems to improve the situation
dramatically. I am now seeing no more than a minute spent on the finding
mates step, at least when run on the full data set. When running on the
single chromosome, I see the following error:
UnboundLocalError: local variable 'gene_tag' referenced before assignment
Not a big deal, it doesn't look like I will need to scatter gather the
dedup step now that the runtime is more reasonable, but I thought it was
worth bringing up.
Last thing, not completely related to the thread. I generally see entries
in the final output that are not completely sorted. Up until this point, I
have had to sort before and after dedup so that the next tool in the
pipeline won't break. Is this something that you already know about?
Thanks for all the work on this tool!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#31 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFJFjkwRccUtcPouGLIXga-UZbFvudWWks5r2jVzgaJpZM4IlqZd>
.
|
Hi jhl667. Thanks for mentioning the error with the --chrom option. It turns out this option was not covered by testing so this bug was missed when it was introduced. The debug will be included in v0.4.4. |
@IanSudbery I'm not sure how we can output in alignment start position order, unless we cache all reads per contig which could be very memory intensive. We might be able to reduce the memory usage somewhat with a sort of 'rolling cache' where we outputted reads from the cache when they were sufficiently upstream of the reads being read in. Even then, chimeric reads pairs would cause problems, unless we retrieve chimeric read pairs with calls to |
No, I think we would have to output to a temp file and then sort to the output file. |
Ah OK. Yeah I see no reason not to add this as an option. Perhaps make clear to the user than this is how its going to work so they don't try and pipe the output, not that I think anyone is probably doing this? |
@TomSmithCGAT @IanSudbery Ok, so just to be clear, the sensible thing to do in a workflow utilizing umi_tools dedup is to sort before and after? I haven't actually assessed speed differences between sorting or not before dedup, but I am assuming it would make significant difference? Honestly, for me, including an extra sort step doesn't make that much of a difference, though I still like to keep things as optimized as possible. |
I'm closing this issue now as the run time issues seem to have been dealt with. Any future issues with run time can be discussed in a separate issue |
I started with a 12gb BAM (77 mil reads) using 3 cores and 24GB RAM. It did not complete in 2.5 hours or even close. Output BAM after 2.5 hours was 2.3MB. Then I extracted one chromosome to create a 314MB BAM. Dedup was run on this small BAM for 1 hour and it was still incomplete. A truncated BAM file of 12.4MB was exported. I ran
qualimap bamqc
on both files (1 chromosome bam before dedup and after dedup).So, I suppose the deduplication must be working.
Extrapolating from this run time, my 12gb bam would take lot more than 39 hours. Would this be an expected run time? I was wondering if I was doing something wrong or some setting/switches are not optimal.
This is my script:
The text was updated successfully, but these errors were encountered: