Sequential VCF parsing #94
Conversation
Just putting it here for convenience.
Note: opened an issue about the num_records thing here: brentp/cyvcf2#294
If this seems to work reasonably well, I would consider trying to contribute GIL-dropping methods that decode GTs and other large arrays into a supplied numpy array upstream. This would allow us to do this in background threads also.
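For concreteness, this is roughly the pattern such a method would replace (a sketch only, using cyvcf2 calls that exist today; `fill_gt_chunk` and the buffer shape are illustrative, not anything in this PR):

```python
# Sketch: copy GTs from cyvcf2 into a caller-supplied numpy buffer.
# A method that does the inner copy with the GIL released is what we'd
# want to contribute upstream.
import numpy as np
from cyvcf2 import VCF

def fill_gt_chunk(vcf_path, gt_buffer):
    """Fill gt_buffer of shape (chunk_size, n_samples, 2) in place (assumes diploid)."""
    chunk_size = gt_buffer.shape[0]
    for j, variant in enumerate(VCF(vcf_path)):
        if j >= chunk_size:
            break
        # variant.genotypes is a Python list of [allele_0, allele_1, phased]
        # per sample; this per-variant list -> array conversion is exactly the
        # overhead a GIL-dropping decode-into-array method would remove.
        gt_buffer[j] = np.asarray(variant.genotypes, dtype=np.int8)[:, :2]
    return gt_buffer
```

The per-variant `variant.genotypes` list conversion is the Python-level cost here, and removing it is also what would let the decode run usefully in a background thread.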
Will try this out now!
Update. I've added support here for working on "partitioned" VCFs, where a chromosome has been split into contiguous parts. These are consumed in parallel processes. It seems to work quite well on the simulated data I have. @benjeffery is testing these on some real-world horror shows. We're also looking at how it performs on real-world full chromosome VCFs for 200k samples.
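For reference, the general shape of the partitioned approach looks roughly like this (a sketch, assuming an indexed VCF so cyvcf2 region queries work; the paths, region strings and function names are illustrative):

```python
# Sketch: consume contiguous partitions of a chromosome in separate processes,
# using cyvcf2 region queries (requires a tabix/CSI index on the VCF).
import multiprocessing as mp
from cyvcf2 import VCF

def process_partition(args):
    vcf_path, region = args          # e.g. region = "chr20:1-10000000"
    vcf = VCF(vcf_path)
    n = 0
    for variant in vcf(region):      # iterate just this partition
        # ... decode fields into this partition's chunk buffers ...
        n += 1
    return region, n

if __name__ == "__main__":
    vcf_path = "chr20.vcf.gz"        # illustrative
    regions = ["chr20:1-10000000", "chr20:10000001-20000000"]  # contiguous parts
    with mp.Pool(len(regions)) as pool:
        pairs = [(vcf_path, r) for r in regions]
        for region, n in pool.imap_unordered(process_partition, pairs):
            print(region, n)
```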
Great idea from @benjeffery: use a shared memory mutex on the first and last chunks of each partition rather than running odds and evens separately. This will improve parallelism when the number of chunks is < 100.
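A sketch of what that could look like (the lock layout and names are illustrative, not the implementation): each worker takes a lock only when flushing its first or last chunk, since those are the only chunks a neighbouring partition can also write to; interior chunks are flushed lock-free, so all partitions can run concurrently instead of in odd/even passes.

```python
# Sketch: one lock per chunk, taken only when flushing the first or last chunk
# of this worker's partition (the only chunks shared with a neighbour).
import multiprocessing as mp
from contextlib import nullcontext

def make_chunk_locks(num_chunks):
    # Must be created before the worker processes start (or passed via a Pool
    # initializer) so the locks are shared rather than copied.
    return [mp.Lock() for _ in range(num_chunks)]

def flush_chunk(chunk_index, first_chunk, last_chunk, locks):
    is_boundary = chunk_index in (first_chunk, last_chunk)
    guard = locks[chunk_index] if is_boundary else nullcontext()
    with guard:
        pass  # ... write this chunk's buffers to storage ...
```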
Some experiments on UKB WGS data (200k samples), single large VCFs.
Some data points with GeL data:
Wow!
After discussing on the community call yesterday, we're going to move this into sgkit. I'm going to merge this here for now as a useful reference; we can delete it later.
Sticking a prototype of a sequential VCF parsing method here for experimentation.
Currently, this is parsing the 1M sample VCF at a rate of about 50-60 variants per second, which gives about 36 hours for the whole thing (which is totally acceptable). Memory usage seems very predictable and in line with what you would expect. CPU usage averages about 150%, with some peaks when the chunks are being flushed. So, the bottleneck is the main thread, where we're moving information from the decoded VCF record (done in a background thread) into the numpy buffers. The main question then, I guess, is how much this will be slowed down by adding all the necessary complexity of INFO fields etc.
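For anyone following along, this is roughly the shape of the loop (a simplified sketch, not the prototype itself: only POS and GT are shown, and the buffer shapes, queue size and chunk flushing are illustrative):

```python
# Sketch: a background thread iterates the VCF and decodes each record; the
# main thread copies the decoded values into preallocated per-chunk numpy
# buffers and flushes them when full.
import queue
import threading
import numpy as np
from cyvcf2 import VCF

def read_variants(vcf_path, out_queue):
    # Runs in a background thread; htslib work inside cyvcf2 releases the GIL.
    for variant in VCF(vcf_path):
        gts = np.asarray(variant.genotypes, dtype=np.int8)[:, :2]
        out_queue.put((variant.POS, gts))
    out_queue.put(None)  # sentinel: end of input

def parse_sequential(vcf_path, chunk_size, n_samples):
    q = queue.Queue(maxsize=1000)
    threading.Thread(target=read_variants, args=(vcf_path, q), daemon=True).start()

    pos = np.zeros(chunk_size, dtype=np.int64)
    gt = np.zeros((chunk_size, n_samples, 2), dtype=np.int8)
    j = 0
    while (item := q.get()) is not None:
        pos[j], gt[j] = item
        j += 1
        if j == chunk_size:
            # ... flush pos/gt (and any other field buffers) to storage ...
            j = 0
```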
@benjeffery, do you think you could try this out on your ugly VCFs, and see how it goes? Maybe hard-code in a few extra fields to see how it performs when we put in more?