excessive dedup memory usage #173
I am guessing that the extreme depth is the problem. We have noticed that memory usage (and time) explode once you saturate the available UMIs and the networks get very complex. You are also going to have difficulty accurately de-duplicating at that point. The important thing is not how many reads there are at a locus, but how many UMIs. If these are 10bp UMIs, then 1 million UMIs is effectively all the available UMIs, and all sorts of problems start to manifest themselves. With 1 million UMIs (I'm assuming a 10bp UMI, so that's 100% saturation of the UMI space), there is not much we can do, other than maybe introduce a bailout where UMI-tools refuses to dedup that location.
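For a sense of the scale being discussed, here is a small back-of-the-envelope calculation (editorial illustration, not from the thread): a 10bp UMI space holds 4^10 = 1,048,576 sequences, so 1 million observed UMIs is indeed essentially full saturation.

```python
# Rough illustration of UMI-space saturation (not UMI-tools code).
# A UMI of length L over {A, C, G, T} has 4**L possible sequences.
def umi_space_size(length):
    return 4 ** length

def saturation(observed_umis, length):
    """Fraction of the theoretical UMI space that is occupied."""
    return observed_umis / umi_space_size(length)

print(umi_space_size(10))                   # 1,048,576 possible 10bp UMIs
print(f"{saturation(1_000_000, 10):.2f}")   # ~0.95, i.e. essentially saturated
```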
My thought, though admittedly I don't have all of the details, is that you could do some sort of pre-diagnosis of the range of mapping qualities. Meaning, if you already had an idea of the high-quality read that would be representative, you wouldn't need to include the potentially tens of thousands of duplicates in a situation like mine. This might allow you to selectively exclude a bunch of them from graph creation? I was thinking about using the unique method as well; in fact, I was doing some testing on this right now. It may actually be good enough for us, though it's hard to tell without additional testing. I suppose I could always do some sort of very trivial merging of identical reads up front (while keeping track of counts, of course), and then perform a proper directional deduplication. FWIW, we are using 12bp UMIs. That gives us 16.7 million possibilities, I think? I generally don't see more than 11 million of those actually in play... One other quick note: it looks like the run I started last night with 0.5.0 did actually manage to finish processing in about 4 hours, using the entirety of 128GB for that tough region. However, the run I had going simultaneously with 0.4.4 is still running at around 19 hours now on the exact same data with the exact same parameters. So, that's a good sign at least that your recent improvements even made it possible to process this type of data!
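As an aside, 4^12 = 16,777,216, which is consistent with the ~16.7 million figure above. The "trivial merging of identical reads up front" idea might look something like the following minimal sketch (hypothetical, not part of UMI-tools; the `(position, umi, read)` tuples are an assumed input format):

```python
# Minimal sketch of collapsing identical reads before deduplication:
# reads sharing the same mapping position and UMI are merged into one
# representative read plus a count.
from collections import defaultdict

def collapse_reads(reads):
    """reads: iterable of (position, umi, read) tuples."""
    counts = defaultdict(int)
    representative = {}
    for pos, umi, read in reads:
        key = (pos, umi)
        counts[key] += 1
        # keep the first read seen; a fuller version might instead keep
        # the read with the highest mapping quality
        representative.setdefault(key, read)
    return counts, representative
```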
Glad to hear that there is an improvement! We already collapse identical reads down onto a single representative read using mapping qualities prior to network formation: we form networks between distinct UMIs, not distinct reads. The memory is used building the graph representation of the UMIs. We use a sparse representation of the graph, but even so, if you had 1 million reads carrying 100,000 distinct UMIs, each of which was connected to 10 other UMIs, that's still a pretty complex object. The reliance on UMI number rather than read number is why I asked you to run in unique mode - not because this might be a longer-term solution (although it might be), but because it lets us check whether UMI depth really is the problem.
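For readers unfamiliar with what "graph representation of the UMIs" means here, a simplified sketch is shown below. It assumes the "directional" criterion described in the UMI-tools publication (an edge A → B when two UMIs differ at a single base and count(A) ≥ 2·count(B) − 1); the real implementation is more optimised than this naive O(n²) pairwise loop, but the sketch gives an idea of why 100,000+ UMIs at one position produces a large object.

```python
# Simplified sketch of a per-position UMI graph (not the actual
# UMI-tools implementation).
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_directional_graph(counts):
    """counts: dict mapping UMI sequence -> read count at one position."""
    adjacency = {umi: [] for umi in counts}
    umis = list(counts)
    for i, a in enumerate(umis):
        for b in umis[i + 1:]:
            if hamming(a, b) > 1:
                continue
            # directional criterion: the more abundant UMI "absorbs" the rarer one
            if counts[a] >= 2 * counts[b] - 1:
                adjacency[a].append(b)
            elif counts[b] >= 2 * counts[a] - 1:
                adjacency[b].append(a)
    return adjacency  # with ~100,000 UMIs this object grows large quickly
```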
I went ahead and ran this in unique mode; it is certainly faster, at around 19 min, and I never saw memory usage above 4GB. I guess that all makes sense based on how I understand the method. The maximum number of unique UMIs at any position was around 550k. I didn't output stats for this run, as in the past it has caused a serious increase in run time, though for the unique method it's probably fine?
Hmmm... the stats output does two things: 1) it samples reads at random from the BAM, and 2) it measures the pairwise edit distances between them. Neither of these steps is influenced by the dedup method, I don't think. The stats would tell you how necessary using directional was, although I'm pretty sure that at 550k UMIs it will see a difference at that particular location. So the good news is that at 550k UMIs out of 16M, you are not exhausting your UMI space! You might, though, be getting to the point where the networks are getting more complex. Working with 10bp UMIs, I saw memory inflation happening at around 50,000 UMIs (or 5% of the UMI space). The distribution of the UMIs makes a difference as well: picking UMIs at random leads to different memory usage than UMIs from real experiments. Try running stats on a selection of the loci and see what the difference is.
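The idea behind the null comparison described above can be sketched roughly as follows (editorial sketch, not the actual `--output-stats` code; it uses Hamming distance as a stand-in for the edit-distance measure):

```python
# Rough sketch of a null edit-distance distribution: sample UMIs at
# random and record their pairwise distances, for comparison against
# the distances observed between real, deduplicated UMIs.
import random
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def null_edit_distances(umis, n_samples=100, seed=0):
    """umis: list of UMI sequences; returns pairwise distances of a random sample."""
    random.seed(seed)
    sample = random.sample(umis, min(n_samples, len(umis)))
    return [hamming(a, b) for a, b in combinations(sample, 2)]
```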
It also happened to me, and it is very slow. We have 100M reads, but the progress after a day is only at: INFO Written out 30720000 reads
@brianpenghe - Are you running with stats too? If so, we have a modification due very shortly which will dramatically improve the run time (#184).
Yes, I did. How can I get the latest version installed?
We haven't officially released the latest version (0.5.1) yet, but if you've installed from GitHub you can pull the up-to-date code on the master branch and re-run.
OK, then I will wait for the official release.
Hi there, I went ahead and pulled down master today to look at the changes in stats generation. Just for comparison, the previous version ran for 3 days and was not done. I'm using the 'unique' method for this test. Without stats generation, memory usage is low; I never see it get above 2GB.
With stats generation, memory usage hits around 30GB at several points, corresponding to the intervals below where you see the most time pass.
It does seem that the run is proceeding faster than it was before with stats requested.
Hi @jhl667 - Thanks for testing this out. The memory usage is considerably higher than I would expect. For comparison, I tested on a file with more reads but far fewer UMIs per position (max ~25, from memory) and the max memory was ~0.4GB. It would be interesting to know what is taking up so much memory.
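One generic way to investigate what is holding the memory (editorial suggestion, not a UMI-tools feature) is Python's standard-library `tracemalloc`, which attributes allocations to source lines:

```python
# Generic memory-attribution sketch using the standard library.
import tracemalloc

tracemalloc.start()
# ... run the code of interest here, e.g. processing of one high-depth position ...
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)  # top allocation sites by size
```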
I'm using the new code. It seems slightly faster but still time-consuming. 2017-09-20 10:58:48,980 INFO Parsed 189000000 input reads
Hi,
I'm working with the latest v0.5.0. My command line is something like:
umi_tools dedup -I input.bam -S output.bam -L output.log
My BAM file is around 1.3GB, and it is the product of an amplicon-based RNA-Seq panel, so there are certainly some regions at very high coverage. As I progress through parsing/writing, I see times when the memory usage hits around 30GB or so, which is certainly manageable, though I'm not sure this is intended? About halfway through the file, I get a really nice RAM spike to pretty much whatever I have available. Looking at htop right now, I am up to about 115GB, and it appears we are still parsing some number of lines. Performing this same process on a 64GB machine, I end up quickly consuming the available RAM and then the swap right behind it.
So, I guess I'm wondering what some potential issues here might be. I have run umi_tools on other samples and have never seen anything quite like this. Does it seem likely that the culprit is simply extreme read depth? I can see counts around 1 million for the most highly represented genes/regions. Do you have any suggestions for a workaround?
Thanks!