Sequential VCF parsing #94
Conversation
Just putting it here for convenience.
Note: opened an issue about the num_records thing here: brentp/cyvcf2#294
If this seems to work reasonably well, I would consider trying to contribute GIL-dropping methods that decode GTs and other large arrays into a supplied numpy array upstream. This would allow us to do this in background threads also.
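For concreteness, this is roughly the pattern such a method would replace (a sketch only, using cyvcf2 calls that exist today; `fill_gt_chunk` and the buffer shape are illustrative, not anything in this PR):

```python
# Sketch: copy GTs from cyvcf2 into a caller-supplied numpy buffer.
# A method that does the inner copy with the GIL released is what we'd
# want to contribute upstream.
import numpy as np
from cyvcf2 import VCF

def fill_gt_chunk(vcf_path, gt_buffer):
    """Fill gt_buffer of shape (chunk_size, n_samples, 2) in place (assumes diploid)."""
    chunk_size = gt_buffer.shape[0]
    for j, variant in enumerate(VCF(vcf_path)):
        if j >= chunk_size:
            break
        # variant.genotypes is a Python list of [allele_0, allele_1, phased]
        # per sample; this per-variant list -> array conversion is exactly the
        # overhead a GIL-dropping decode-into-array method would remove.
        gt_buffer[j] = np.asarray(variant.genotypes, dtype=np.int8)[:, :2]
    return gt_buffer
```

The per-variant `variant.genotypes` list conversion is the Python-level cost here, and removing it is also what would let the decode run usefully in a background thread.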
Will try this out now!
Update. I've added support here for working on "partitioned" VCFs, where a chromosome has been split into contiguous parts. These are consumed in parallel processes. It seems to work quite well on the simulated data I have. @benjeffery is testing these on some real-world horror shows. We're also looking at how it performs on real-world full chromosome VCFs for 200k samples.
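For reference, the general shape of the partitioned approach looks roughly like this (a sketch, assuming an indexed VCF so cyvcf2 region queries work; the paths, region strings and function names are illustrative):

```python
# Sketch: consume contiguous partitions of a chromosome in separate processes,
# using cyvcf2 region queries (requires a tabix/CSI index on the VCF).
import multiprocessing as mp
from cyvcf2 import VCF

def process_partition(args):
    vcf_path, region = args          # e.g. region = "chr20:1-10000000"
    vcf = VCF(vcf_path)
    n = 0
    for variant in vcf(region):      # iterate just this partition
        # ... decode fields into this partition's chunk buffers ...
        n += 1
    return region, n

if __name__ == "__main__":
    vcf_path = "chr20.vcf.gz"        # illustrative
    regions = ["chr20:1-10000000", "chr20:10000001-20000000"]  # contiguous parts
    with mp.Pool(len(regions)) as pool:
        pairs = [(vcf_path, r) for r in regions]
        for region, n in pool.imap_unordered(process_partition, pairs):
            print(region, n)
```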
Great idea from @benjeffery: use a shared memory mutex on the first and last chunks of each partition rather than running odds and evens separately. This will improve parallelism when the number of chunks is < 100.
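A sketch of what that could look like (the lock layout and names are illustrative, not the implementation): each worker takes a lock only when flushing its first or last chunk, since those are the only chunks a neighbouring partition can also write to; interior chunks are flushed lock-free, so all partitions can run concurrently instead of in odd/even passes.

```python
# Sketch: one lock per chunk, taken only when flushing the first or last chunk
# of this worker's partition (the only chunks shared with a neighbour).
import multiprocessing as mp
from contextlib import nullcontext

def make_chunk_locks(num_chunks):
    # Must be created before the worker processes start (or passed via a Pool
    # initializer) so the locks are shared rather than copied.
    return [mp.Lock() for _ in range(num_chunks)]

def flush_chunk(chunk_index, first_chunk, last_chunk, locks):
    is_boundary = chunk_index in (first_chunk, last_chunk)
    guard = locks[chunk_index] if is_boundary else nullcontext()
    with guard:
        pass  # ... write this chunk's buffers to storage ...
```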
Some experiments on UKB WGS data (200k samples), single large VCFs.
Some data points with GeL data:
Wow!
After discussing on the community call yesterday, we're going to move this into sgkit. I'm going to merge this here for now as a useful reference; we can delete it later.
Sticking a prototype of a sequential VCF parsing method here for experimentation.
Currently, this is parsing the 1M sample VCF at a rate of about 50-60 variants per second, which gives about 36 hours for the whole thing (which is totally acceptable). Memory usage seems very predictable and in line with what you would expect. CPU usage averages about 150%, with some peaks when the chunks are being flushed. So, the bottleneck is the main thread, where we're moving information from the decoded VCF record (done in a background thread) into the numpy buffers. The main question then, I guess, is how much this will be slowed down by adding all the necessary complexity of INFO fields etc.
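For anyone following along, this is roughly the shape of the loop (a simplified sketch, not the prototype itself: only POS and GT are shown, and the buffer shapes, queue size and chunk flushing are illustrative):

```python
# Sketch: a background thread iterates the VCF and decodes each record; the
# main thread copies the decoded values into preallocated per-chunk numpy
# buffers and flushes them when full.
import queue
import threading
import numpy as np
from cyvcf2 import VCF

def read_variants(vcf_path, out_queue):
    # Runs in a background thread; htslib work inside cyvcf2 releases the GIL.
    for variant in VCF(vcf_path):
        gts = np.asarray(variant.genotypes, dtype=np.int8)[:, :2]
        out_queue.put((variant.POS, gts))
    out_queue.put(None)  # sentinel: end of input

def parse_sequential(vcf_path, chunk_size, n_samples):
    q = queue.Queue(maxsize=1000)
    threading.Thread(target=read_variants, args=(vcf_path, q), daemon=True).start()

    pos = np.zeros(chunk_size, dtype=np.int64)
    gt = np.zeros((chunk_size, n_samples, 2), dtype=np.int8)
    j = 0
    while (item := q.get()) is not None:
        pos[j], gt[j] = item
        j += 1
        if j == chunk_size:
            # ... flush pos/gt (and any other field buffers) to storage ...
            j = 0
```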
@benjeffery, do you think you could try this out on your ugly VCFs, and see how it goes? Maybe hard-code in a few extra fields to see how it performs when we put in more?