-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
InvalidReferenceBases error for sequences present in GRCh38 #1
Comments
This error is thrown by noodles-vcf. That package has been updated recently so may not throw the error anymore. However, the updates completely changed the API and some features kanpig uses are not longer available, so it's a huge refactor to pull it in. I will put this on the development roadmap. |
The new version of noodles-vcf did indeed fix this issue. This change is in develop and staged for v0.2.1 Note that the behavior for kmer-featurizing non-ATCG characters is to assume the nucleotide is an 'A'. I'll need to document this somewhere in-case it becomes relevant to some users' debugging of their formerly invalid reference bases. |
So, here's a fun update.... tl;dr; is the problem isn't fixed and these invalid reference bases will now cause a corrupted, unusable VCF output. noodles-vcf can now read IUPAC codes in the REF, but after further testing, I found that it cannot write them. Looking into why noodles can't handle this I found it validates reference bases according to the vcf spec § 1.6.1.4 "Each base must be one of A,C,G,T,N (case insensitive).". This means it technically is a VCF issue and these lines can be considered invalid. The update to the new noodles-vcf means that instead of throwing out the invalid line, kanpig is asking noodles to write an invalid line, noodles is raising an error, and then writing a corrupted vcf. Not just corrupted in that the IUPAC code is in REF, but noodles stops halfway through writing the line and produces e.g.
Somehow fixing the problem just made it worse. Now I have to figure out what to do next. The options are:
Option number two is likely more seamless for the user-experience, but I don't know if data should be "sliently" changing since I can imagine a user asking "Where is my Y?". It also creates a little extra overhead checking every base. Once the option is picked, I need to figure out if kanpig should be doing it or I put in a noodles-vcf pull request. vcf spec for how to handle IUPAC in REF
|
These vcf entries will now be parsed and fixed according to vcf spec v4.4 in the output. A warning is printed to the user if any iupac bases were changed. |
I'm getting a few of these errors:
I've checked and these Rs and Ys are just taken from GRCh38, so it's not a VCF issue. It might be good to add support for these ambiguous bases in future updates.
The text was updated successfully, but these errors were encountered: