
excessive dedup memory usage #173

Closed
crutching opened this issue Aug 23, 2017 · 13 comments

@crutching

Hi,

I'm working with the latest v0.5.0. My command line is something like:
umi_tools dedup -I input.bam -S output.bam -L output.log

My BAM file is around 1.3G, and is the product of an amplicon-based RNA-Seq panel, so there are certainly some regions at very high coverage. As parsing/writing progresses, I see times when memory usage hits around 30G, which is certainly manageable, though I'm not sure it's intended? About halfway through the file, I get a huge RAM spike to pretty much whatever I have available. Looking at htop right now, I am up to about 115G, and it appears we are still parsing. Performing the same process on a 64G machine, I quickly consume the available RAM and then the swap right behind it.

So, I guess I'm wondering what the potential issues here are. I have run umi_tools on other samples and have never seen anything quite like this. Does it seem likely that the culprit is simply extreme read depth? I can see counts around 1 million for the most highly represented genes/regions. Do you have any suggestions for a workaround?

Thanks!

@IanSudbery
Member

I am guessing that the extreme depth is the problem. We have noticed that memory usage (and time) explode once you saturate the available UMIs and the networks get very complex. You are also going to have difficulties accurately de-duplicating at this point.

The important thing is not how many reads there are at a locus, but how many UMIs. If these are 10bp UMIs, then 1 million UMIs is effectively all the available UMIs, and all sorts of problems start to manifest themselves.

If you were to run with --method=unique, you could then look at the de-duplicated read depth to find how many unique UMIs there were at each position/amplicon. This should be pretty quick.
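To illustrate the idea, the unique-UMI count per position can be approximated directly from read names, assuming the umi_tools convention of appending the UMI to the read name after a final underscore (the read names and positions below are made up for the example):

```python
from collections import defaultdict

# Hypothetical (position, read name) pairs; the UMI is the part of the
# name after the last underscore, per the umi_tools extract convention.
reads = [
    ("chr1:100", "READ1_AACGTT"),
    ("chr1:100", "READ2_AACGTT"),  # duplicate UMI at this position
    ("chr1:100", "READ3_AACGTA"),
]

# Collect the distinct UMIs seen at each position.
umis_per_pos = defaultdict(set)
for pos, name in reads:
    umis_per_pos[pos].add(name.rsplit("_", 1)[-1])

print({pos: len(umis) for pos, umis in umis_per_pos.items()})
# {'chr1:100': 2}
```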

With 1 million UMIs (I'm assuming a 10bp UMI, so that's ~100% saturation of the UMI space), there is not much we can do, other than maybe introduce a bailout where UMI tools refuses to dedup that location.
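The saturation arithmetic behind this is simple: a UMI of length L has 4^L possible sequences. A quick sketch (the helper names here are just for illustration):

```python
def umi_space(length: int) -> int:
    """Number of possible UMI sequences for a given length (4 bases)."""
    return 4 ** length

def saturation(n_observed: int, length: int) -> float:
    """Fraction of the UMI space used by n_observed distinct UMIs."""
    return n_observed / umi_space(length)

print(umi_space(10))  # 1048576 -> ~1.05M possible 10bp UMIs
print(umi_space(12))  # 16777216 -> ~16.7M possible 12bp UMIs

# 1 million distinct UMIs at one locus with 10bp UMIs:
print(round(saturation(1_000_000, 10), 2))  # 0.95
```

So 1 million observed UMIs is ~95% of the 10bp space, which is effectively total saturation once sequencing errors are factored in.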

@crutching
Author

crutching commented Aug 23, 2017

My thought, though I admit I don't have all of the details, is that you could do some sort of pre-diagnosis of the range of mapping qualities. That is, if you already had an idea of the high-quality sort of read that would be representative, you wouldn't need to include the potentially tens of thousands of duplicates in a situation like mine. This might allow you to selectively exclude a bunch of them from graph creation?

I was thinking about using the unique method too; in fact, I was doing some testing on it just now. It may actually be good enough for us; it's hard to tell without additional testing. I suppose I could always do some very trivial merging of identical reads up front (while keeping track of counts, of course), and then perform a proper directional deduplication.
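The trivial up-front merge described here could be sketched with a `Counter` keyed on whatever makes two reads "identical" (position and UMI in this toy representation; a real implementation would key on more fields):

```python
from collections import Counter

# Hypothetical pre-collapse: merge exact-duplicate (position, UMI) reads
# up front, keeping a count, before running a proper directional dedup.
reads = [
    ("chr1:100", "AACGTT"),
    ("chr1:100", "AACGTT"),  # exact duplicate, collapses to count 2
    ("chr1:100", "AACGTA"),
    ("chr1:200", "AACGTT"),
]
collapsed = Counter(reads)
for (pos, umi), count in sorted(collapsed.items()):
    print(pos, umi, count)
```

The directional method would then operate on the three collapsed records instead of the four raw reads, with the counts carried along.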

FWIW, we are using 12bp UMIs. This gives us 16.7 million possibilities, I think? I generally don't see more than 11 million of those actually in play...

One other quick note. It looks like the run I started last night with 0.5.0 did actually manage to finish in about 4 hours, using the entirety of 128GB for that tough region. However, the run I had going simultaneously with 0.4.4 is still going at around 19 hours on the exact same data with the exact same parameters. So that's a good sign, at least, that your recent improvements even made it possible to process this type of data!

@IanSudbery
Member

Glad to hear that there is an improvement! We already collapse identical reads down onto a representative single read using mapping qualities prior to network formation: we form networks between distinct UMIs, not distinct reads. The memory is used building the graph representation of the UMIs. We use a sparse representation of the graph, but even so, if you had 1 million reads with 100,000 distinct UMIs, each of which was connected to 10 other UMIs, that's still a pretty complex object.
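A minimal stand-in for such a sparse UMI network (not the actual umi_tools data structure) shows where the size comes from: every node carries a container, and every edge an entry, so node and edge counts multiply quickly:

```python
# Simplified adjacency-dict stand-in for the UMI network: each UMI maps
# to the set of UMIs it is connected to. Here: 1,000 nodes, 10 edges each.
adjacency = {
    f"umi{i}": {f"umi{(i + j) % 1000}" for j in range(1, 11)}
    for i in range(1000)
}

n_nodes = len(adjacency)
n_edges = sum(len(neighbours) for neighbours in adjacency.values())
print(n_nodes, n_edges)  # 1000 10000
```

Scaling those numbers to 100,000 UMIs with 10 edges each gives a million edge entries, each with per-object Python overhead, before any of the surrounding bookkeeping.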

But the reliance on UMI number rather than read number is why I asked you to run with unique: not because it might be a longer-term solution (although it might), but because we could check whether UMI depth really was the problem.

@crutching
Author

I went ahead and ran this in unique mode; it's certainly faster, at around 19 minutes, and I never saw memory usage above 4GB. I guess that all makes sense based on how I understand the method. The maximum number of unique UMIs at any position was around 550k. I didn't output stats on this run, as in the past it has caused a serious increase in run time, though for the unique method it's probably fine?

@IanSudbery
Member

Hmmm... the stats output does two things: 1) it samples reads at random from the BAM, and 2) it measures the pairwise edit distances between them. Neither of these steps is influenced by the dedup method, I don't think. The stats would tell you how necessary using directional was, although I'm pretty sure that at 550K UMIs it will see a difference at that particular location.
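For equal-length UMIs, the pairwise edit-distance step reduces to Hamming distance (substitutions only); a minimal sketch of that calculation, with made-up sampled UMIs:

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Substitution-only edit distance between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical UMIs sampled at random from the BAM.
sampled = ["AACGTT", "AACGTA", "TTCGTA"]
for u, v in combinations(sampled, 2):
    print(u, v, hamming(u, v))
# AACGTT AACGTA 1
# AACGTT TTCGTA 3
# AACGTA TTCGTA 2
```

Note this pairwise loop is quadratic in the number of sampled UMIs, which is one reason the stats step was historically so slow at high UMI counts.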

So the good news is that at 550K UMIs out of 16M, you are not exhausting your UMI space! Although you might be getting to the point where the networks get more complex. Working with 10bp UMIs, I saw memory inflation happening at around 50,000 UMIs (or 5% of the UMI space). The distribution of the UMIs makes a difference as well: just picking UMIs at random leads to different memory usage than UMIs from real experiments.
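The distribution effect can be illustrated with a toy simulation (6bp UMIs rather than 10bp, to keep the space small): uniform sampling spreads reads thinly over the space, while a skewed distribution concentrates them into dense neighbourhoods, which is what drives network complexity.

```python
import random
from itertools import product

random.seed(0)

# All 4^6 = 4096 possible 6bp UMIs.
space = ["".join(bases) for bases in product("ACGT", repeat=6)]

# Uniform sampling: 5,000 draws spread across the whole space.
uniform = {random.choice(space) for _ in range(5000)}

# Skewed sampling: the same 5,000 draws confined (artificially) to 1%
# of the space, mimicking a biased real-world UMI distribution.
skewed = {random.choice(space[: len(space) // 100]) for _ in range(5000)}

print(len(uniform), len(skewed))
```

The uniform draw yields thousands of distinct UMIs, the skewed one at most 40; at equal read depth the two produce very differently shaped (and differently sized) UMI networks.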

Try running stats on a selection of the loci and see what the difference is.

@brianpenghe

It also happened to me, and it is very slow. We have 100M reads, but the progress after a day is only at: INFO Written out 30720000 reads

@TomSmithCGAT
Member

@brianpenghe - Are you running with stats too? If so, we have a modification due very shortly which will dramatically improve the run time (#184).

@brianpenghe

Yes, I did. How can I get the latest version installed?

@TomSmithCGAT
Member

We haven't officially released the latest version (0.5.1) yet, but if you've installed from GitHub you can pull the up-to-date code on the master branch and re-run python setup.py install. If you installed via conda or pip, v0.5.1 should be available by the end of the week.

@brianpenghe

OK, then I will wait for the official release.

@crutching
Author

Hi there,

I went ahead and pulled down master today to look at the changes in stats generation. For comparison, the previous version ran for 3 days and was not done. I'm using the 'unique' method for this test.

Without stats generation

Memory usage is low, I never see it get above 2GB.

2017-09-14 12:12:39,866 INFO Written out 100000 reads
2017-09-14 12:12:44,043 INFO Written out 200000 reads
2017-09-14 12:12:49,273 INFO Written out 300000 reads
2017-09-14 12:13:00,688 INFO Written out 400000 reads
2017-09-14 12:13:02,013 INFO Written out 500000 reads
2017-09-14 12:13:02,851 INFO Written out 600000 reads
2017-09-14 12:13:09,421 INFO Written out 700000 reads
2017-09-14 12:13:14,075 INFO Written out 800000 reads
2017-09-14 12:13:15,203 INFO Parsed 1000000 input reads
2017-09-14 12:13:15,374 INFO Written out 900000 reads
2017-09-14 12:13:20,774 INFO Written out 1000000 reads
2017-09-14 12:13:24,967 INFO Written out 1100000 reads
2017-09-14 12:13:30,195 INFO Written out 1200000 reads
2017-09-14 12:13:35,155 INFO Written out 1300000 reads
2017-09-14 12:14:22,380 INFO Written out 1400000 reads
2017-09-14 12:14:22,939 INFO Written out 1500000 reads
2017-09-14 12:14:24,232 INFO Written out 1600000 reads
2017-09-14 12:14:25,595 INFO Written out 1700000 reads
2017-09-14 12:14:27,220 INFO Written out 1800000 reads
2017-09-14 12:14:28,453 INFO Parsed 2000000 input reads
2017-09-14 12:14:29,179 INFO Written out 1900000 reads
2017-09-14 12:14:30,503 INFO Written out 2000000 reads
2017-09-14 12:14:31,882 INFO Written out 2100000 reads
2017-09-14 12:14:33,198 INFO Written out 2200000 reads
2017-09-14 12:14:34,560 INFO Written out 2300000 reads
2017-09-14 12:14:35,829 INFO Written out 2400000 reads
2017-09-14 12:14:37,142 INFO Written out 2500000 reads
2017-09-14 12:14:38,833 INFO Written out 2600000 reads
2017-09-14 12:14:40,214 INFO Written out 2700000 reads
2017-09-14 12:14:41,729 INFO Written out 2800000 reads
2017-09-14 12:14:42,349 INFO Parsed 3000000 input reads
2017-09-14 12:14:43,100 INFO Written out 2900000 reads
2017-09-14 12:15:35,570 INFO Reads: Input Reads: 3099554
2017-09-14 12:15:35,570 INFO Number of reads out: 2948927
2017-09-14 12:15:35,571 INFO Total number of positions deduplicated: 466932
2017-09-14 12:15:35,571 INFO Mean number of unique UMIs per position: 6.32
2017-09-14 12:15:35,571 INFO Max. number of unique UMIs per position: 87984
# job finished in 181 seconds at Thu Sep 14 12:15:35 2017 -- 177.08  5.76  0.00  0.01 -- 7413b28d-9a14-4186-bcee-aabf529d7541

With stats generation

Memory usage hits around 30GB at several points, corresponding to the intervals below where you see the most time pass.

2017-09-14 10:43:15,063 INFO total_umis 3099554
2017-09-14 10:43:15,063 INFO #umis 2181705
2017-09-14 10:48:10,055 INFO Written out 100000 reads
2017-09-14 10:49:10,599 INFO Written out 200000 reads
2017-09-14 10:49:39,999 INFO Written out 300000 reads
2017-09-14 10:59:57,942 INFO Written out 400000 reads
2017-09-14 11:12:40,618 INFO Written out 500000 reads
2017-09-14 11:21:51,033 INFO Written out 600000 reads
2017-09-14 11:22:37,222 INFO Written out 700000 reads
2017-09-14 11:23:26,258 INFO Written out 800000 reads
2017-09-14 11:30:59,510 INFO Parsed 1000000 input reads
2017-09-14 11:31:20,522 INFO Written out 900000 reads
2017-09-14 11:32:00,258 INFO Written out 1000000 reads
2017-09-14 11:33:15,967 INFO Written out 1100000 reads
2017-09-14 11:33:34,815 INFO Written out 1200000 reads
2017-09-14 11:33:48,856 INFO Written out 1300000 reads
2017-09-14 12:09:21,794 INFO Written out 1400000 reads
2017-09-14 12:11:35,498 INFO Written out 1500000 reads
2017-09-14 12:12:40,654 INFO Written out 1600000 reads
2017-09-14 12:13:15,603 INFO Written out 1700000 reads
2017-09-14 12:21:16,411 INFO Written out 1800000 reads
2017-09-14 12:21:37,870 INFO Parsed 2000000 input reads
2017-09-14 12:22:51,185 INFO Written out 1900000 reads
2017-09-14 12:26:46,100 INFO Written out 2000000 reads
2017-09-14 12:27:25,465 INFO Written out 2100000 reads
2017-09-14 12:29:54,883 INFO Written out 2200000 reads
2017-09-14 13:03:56,843 INFO Written out 2300000 reads
2017-09-14 13:06:38,600 INFO Written out 2400000 reads
2017-09-14 13:07:41,165 INFO Written out 2500000 reads
2017-09-14 13:14:10,360 INFO Written out 2600000 reads
2017-09-14 13:16:19,612 INFO Written out 2700000 reads
2017-09-14 13:21:20,802 INFO Written out 2800000 reads
2017-09-14 13:21:28,892 INFO Parsed 3000000 input reads
2017-09-14 13:21:58,636 INFO Written out 2900000 reads
2017-09-14 13:25:48,439 INFO Reads: Input Reads: 3099554
2017-09-14 13:25:48,439 INFO Number of reads out: 2948927
2017-09-14 13:25:48,440 INFO Total number of positions deduplicated: 466932
2017-09-14 13:25:48,440 INFO Mean number of unique UMIs per position: 6.32
2017-09-14 13:25:48,440 INFO Max. number of unique UMIs per position: 87984
# job finished in 9768 seconds at Thu Sep 14 13:25:48 2017 -- 9452.35 305.11  0.00  0.00 -- 145d84d7-77c5-45cc-9d48-2945be95d806

It does seem that the run proceeds faster than it did before when stats are requested.

@TomSmithCGAT
Member

TomSmithCGAT commented Sep 14, 2017

Hi @jhl667 - Thanks for testing this out. The memory usage is considerably higher than I would expect. For comparison, I tested on a file with more reads but far fewer UMIs per position (max ~25, from memory) and the max memory was ~0.4GB. It would be interesting to know what is taking up so much memory.

@brianpenghe

I'm using the new code. It seems slightly faster but still time-consuming.

2017-09-20 10:58:48,980 INFO Parsed 189000000 input reads
2017-09-20 10:59:32,021 INFO Written out 133000000 reads
2017-09-20 11:02:53,897 INFO Written out 133100000 reads
2017-09-20 11:06:05,092 INFO Written out 133200000 reads
2017-09-20 11:08:40,276 INFO Written out 133300000 reads
2017-09-20 11:12:03,696 INFO Written out 133400000 reads
2017-09-20 11:15:02,841 INFO Written out 133500000 reads
2017-09-20 11:17:40,039 INFO Written out 133600000 reads
2017-09-20 11:17:41,700 INFO Parsed 190000000 input reads
2017-09-20 11:18:41,071 INFO Written out 133700000 reads
2017-09-20 11:19:42,786 INFO Written out 133800000 reads
2017-09-20 11:20:43,888 INFO Written out 133900000 reads
2017-09-20 11:25:04,683 INFO Written out 134000000 reads
2017-09-20 11:28:23,903 INFO Written out 134100000 reads
