A more compact and useful representation for unphased data #21

dcdehaas · 2024-11-25T20:46:39Z

Previously, unphased data was stored just like phased data. The only difference was a flag set in the file header; readers still had to traverse the haploid sample lists and count the number of copies per individual based on the ploidy.
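To make the cost of the old layout concrete, here is a minimal sketch of the per-variant counting a reader had to do. The helper name `countCopiesPerIndividual` is hypothetical (it is not a picovcf function), and the sketch assumes that haploid sample `h` belongs to individual `h / ploidy`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of picovcf: given the haploid sample indexes
// listed for one variant, count how many copies each individual carries.
// Assumption: haploid sample h belongs to individual h / ploidy.
std::vector<uint8_t> countCopiesPerIndividual(const std::vector<uint32_t>& haploidSamples,
                                              std::size_t numIndividuals,
                                              uint32_t ploidy) {
    assert(ploidy > 0);
    std::vector<uint8_t> copies(numIndividuals, 0);
    for (uint32_t h : haploidSamples) {
        const std::size_t individual = h / ploidy;
        assert(individual < numIndividuals);
        copies[individual]++;
    }
    return copies;
}
```

Every variant row needs this pass before you know which individuals are homozygous, even if the analysis only cares about one copy count.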

Now the number of copies is stored explicitly in the position index, where 8 bits are reserved for it (so ploidy up to 255 is supported). Each variant row has an associated number of copies, so you can skip any variant whose copy count you don't care about. For example, with diploid data you can restrict an analysis to homozygotes by scanning only variants with numCopies=2, which avoids reading the other rows and can speed up the analysis considerably.
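The skipping described above could look roughly like the following sketch. The `IndexRow` struct, its `position` and `fileOffset` fields, and both function names are assumptions made for illustration; the real on-disk layout is the one documented in IGD.FORMAT.md. Only the 8-bit copy count comes from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory view of one position-index row (not the real layout).
struct IndexRow {
    uint64_t position;   // base-pair position of the variant (assumed field)
    uint8_t numCopies;   // 8 bits, so copy counts 0..255
    uint64_t fileOffset; // where the row's sample list starts (assumed field)
};

// Build a row, checking that the copy count fits in the reserved 8 bits.
IndexRow makeRow(uint64_t position, unsigned copies, uint64_t fileOffset) {
    assert(copies <= 255);
    return IndexRow{position, static_cast<uint8_t>(copies), fileOffset};
}

// Return the indexes of the rows whose copy count equals `wanted`.
// Rows with any other copy count are skipped without touching their sample lists.
std::vector<std::size_t> rowsWithCopies(const std::vector<IndexRow>& index, uint8_t wanted) {
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < index.size(); i++) {
        if (index[i].numCopies == wanted) {
            result.push_back(i);
        }
    }
    return result;
}
```

A homozygote-only scan of diploid data would call `rowsWithCopies(index, 2)` and read sample lists only for the returned rows.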

igdtools/igdtools.cpp

Unphased data can be processed in igdtools for conversion, filtering, or just stats. There is also a --force-unphased flag that lets you de-phase an IGD file (mostly useful for testing tools). Unphased VCF files are converted to unphased IGDs, and phased VCF files are converted to phased IGDs unless --force-unphased is used. VCF files with mixed phasedness can now be used with picovcf (and igdtools) if forceUnphased (or --force-unphased) is set; the resulting IGD is unphased.

A new test/endtoend/run_tests.py file holds non-unit tests; it currently only tests igdtools.
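The VCF-to-IGD phasing rules can be summarized as a small decision function. This is a sketch only: the enum and function names are hypothetical, and reporting mixed input without the flag as `Rejected` is an assumption, since the description does not say how picovcf signals that case.

```cpp
#include <cassert>

// Hypothetical names for illustration; not picovcf's API.
enum class VcfPhasing { Phased, Unphased, Mixed };
enum class IgdOutput { Phased, Unphased, Rejected };

// Decide what kind of IGD a VCF converts to, given the force-unphased setting.
IgdOutput decideIgdOutput(VcfPhasing input, bool forceUnphased) {
    if (forceUnphased) {
        return IgdOutput::Unphased; // the flag de-phases any input
    }
    switch (input) {
    case VcfPhasing::Phased:
        return IgdOutput::Phased;
    case VcfPhasing::Unphased:
        return IgdOutput::Unphased;
    case VcfPhasing::Mixed:
        return IgdOutput::Rejected; // assumption: mixed input needs the flag
    }
    return IgdOutput::Rejected; // not reached for valid enum values
}
```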

dcdehaas added 2 commits November 26, 2024 11:03

Add igdroh example, to compute ROH on an unphased IGD

888b221

Very fast because of the sparse representation. Also run the end-to-end tests at CI time.

dcdehaas force-pushed the unphased branch from 15e5f46 to 888b221 Compare November 26, 2024 17:53

dcdehaas added 3 commits November 26, 2024 14:32

Update IGD.FORMAT.md for unphased data

d0e0af4

Slightly generalize ROH example

77e72ca

Fix documentation comment

e02d58b

dcdehaas merged commit e6a67c9 into main Dec 10, 2024
3 checks passed

dcdehaas deleted the unphased branch December 10, 2024 16:32
