A more compact and useful representation for unphased data #21

Merged: 6 commits, Dec 10, 2024

dcdehaas (Member)

Previously, unphased data was stored just like phased data. The only difference was a flag set in the file header; you still had to traverse haploid sample lists and count the number of copies based on the ploidy.

Now we store the number of copies explicitly in the position index, where we reserve 8 bits for it (so ploidy up to 255 is supported). Each variant row then has an associated number of copies, and you can skip any row whose number of copies you don't care about. For example, with diploid data you can run a homozygote-only analysis by scanning just the rows with numCopies=2, which can speed up the analysis considerably because all other rows are never read.
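
As a rough illustration of the skip-by-copy-count idea, here is a minimal Python sketch. The `IndexRow` type, its field names, and `rows_with_copies` are hypothetical and exist only for this example; they are not the picovcf API and do not reflect the real on-disk IGD layout. The only details taken from the description above are that each row carries a number of copies and that the value fits in 8 bits.

```python
from typing import Iterator, List, NamedTuple

MAX_COPIES = 2**8 - 1  # 8 bits are reserved, so the largest value is 255


class IndexRow(NamedTuple):
    """Hypothetical position-index entry (not the real IGD layout)."""
    position: int
    num_copies: int      # copies of the alternate allele for this row
    samples: List[int]   # individuals carrying exactly num_copies copies


def rows_with_copies(index: List[IndexRow], wanted: int) -> Iterator[IndexRow]:
    """Yield only rows with the requested copy count, skipping the rest."""
    if not 0 <= wanted <= MAX_COPIES:
        raise ValueError(f"num_copies must fit in 8 bits, got {wanted}")
    for row in index:
        if row.num_copies != wanted:
            continue  # the sample list of a skipped row is never touched
        yield row


# Toy diploid index: the same position may appear once per copy count.
index = [
    IndexRow(100, 1, [0, 3]),
    IndexRow(100, 2, [2]),
    IndexRow(250, 1, [1]),
    IndexRow(400, 2, [0, 1]),
]

homozygous = [(r.position, r.samples) for r in rows_with_copies(index, 2)]
```

In this toy index, the homozygote-only scan visits two of the four rows and never looks at the sample lists of the heterozygous ones.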

Unphased data can be processed in igdtools for conversion,
filtering, or just stats.

There is also a --force-unphased flag that lets you de-phase an
IGD file (usually useful for testing tools, if nothing else).
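
To make "de-phasing" concrete, the sketch below turns one variant's phased haploid sample list into per-copy-count lists of individuals. It is only an illustration: the `dephase` function is hypothetical, and it assumes that haploid sample index `h` belongs to individual `h // ploidy`, which may not match how igdtools actually does it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def dephase(haploid_samples: Iterable[int], ploidy: int) -> Dict[int, List[int]]:
    """Map one variant's haploid sample list to {num_copies: [individuals]}.

    Assumption for this sketch: haploid index h belongs to individual
    h // ploidy.
    """
    copies = defaultdict(int)
    for h in haploid_samples:
        copies[h // ploidy] += 1  # count copies per individual

    rows = defaultdict(list)
    for individual, n in sorted(copies.items()):
        rows[n].append(individual)  # one output row per distinct copy count
    return dict(rows)


# Diploid example: haplotypes 0 and 1 are individual 0, 4 is individual 2,
# and 7 is individual 3.
result = dephase([0, 1, 4, 7], ploidy=2)
```

Here individual 0 ends up in the numCopies=2 row, and individuals 2 and 3 share the numCopies=1 row. Which of an individual's haplotypes carried the allele is discarded.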

VCF files that are unphased will get converted to unphased IGDs, and
VCF files that are phased will get converted to phased IGDs, unless
--force-unphased is used. VCF files that have mixed phasedness can
now be used with picovcf (and igdtools) if forceUnphased (or
--force-unphased) is set, and the resulting IGD will be unphased.
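
The conversion rules above can be summarized as a small decision function. This is a sketch, not picovcf code: `output_phasedness` is a hypothetical name, and rejecting mixed input when the flag is not set is an assumption inferred from the wording above. The only VCF detail relied on is that `|` separates phased alleles and `/` separates unphased ones in a GT field.

```python
from typing import Iterable


def output_phasedness(genotypes: Iterable[str], force_unphased: bool = False) -> str:
    """Return "phased" or "unphased" for the IGD a VCF would convert to."""
    saw_phased = saw_unphased = False
    for gt in genotypes:
        if "|" in gt:
            saw_phased = True
        elif "/" in gt:
            saw_unphased = True
        # Haploid calls such as "1" have no separator and are ignored here.

    if force_unphased:
        return "unphased"
    if saw_phased and saw_unphased:
        # Assumed behaviour: mixed input needs the flag to be accepted.
        raise ValueError("mixed phasedness; set force_unphased to convert")
    return "unphased" if saw_unphased else "phased"


phased = output_phasedness(["0|1", "1|1"])
unphased = output_phasedness(["0/1", "1/1"])
forced = output_phasedness(["0|1", "0/1"], force_unphased=True)
```

Under these assumptions the three calls give "phased", "unphased", and "unphased" respectively, while the mixed input without the flag raises an error.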

New test/endtoend/run_tests.py file for non-unit tests, currently
only testing igdtools.
Very fast because of the sparse representation.

Also run the end-to-end tests at CI time.
@dcdehaas dcdehaas merged commit e6a67c9 into main Dec 10, 2024
3 checks passed
@dcdehaas dcdehaas deleted the unphased branch December 10, 2024 16:32