Called GTs are poorly formatted and sometimes inaccurate #161
Comments
Some clarification questions. For the second input VCF:
Sorry for the incomplete bug report. The expectation is that the genotype call matches the PLs. The PLs for sample1 at chr20: 1274367 are [591,166,123,542,171,547], where the lowest value marks the genotype least likely to be incorrect. Those positions correspond to [1/1, 1/2, 2/2, 1/3, 2/3, 3/3], so the minimum is actually at 2/2. I agree that the PLs are not accurate. That's an issue from the GATK code that you faithfully copied, where we only allow one spanning deletion allele and pick the "best" one using the likelihoods. But given that those are the PLs, the GT should agree.

The VCF spec doesn't seem to specify that the smaller allele number comes first in the genotype (http://samtools.github.io/hts-specs/VCFv4.2.pdf section 1.4.2), but that's been the convention for a long time. Thus, we'd prefer 2/3 to 3/2. I see that you're trying to preserve the order of the alleles from the original variant, but that only matters if the genotype is phased (e.g. 3|2 is permissible).

The zeros where data is missing should be ./., which is justified by the specification's rule that missing values are written as a dot. So if neither allele of the genotype is known, the GT becomes ./.
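To make the PL argument above concrete, here is a minimal Python sketch of picking the GT from the minimum PL for a diploid sample. It assumes the PLs are listed in the standard VCF genotype order over the given alleles. The function names `diploid_genotypes` and `gt_from_min_pl` are hypothetical and are not part of GenomicsDB or GATK.

```python
def diploid_genotypes(alleles):
    """List unphased diploid genotypes in VCF PL order for the given allele numbers.

    For alleles [1, 2, 3] this yields 1/1, 1/2, 2/2, 1/3, 2/3, 3/3.
    """
    ordered = sorted(alleles)
    return [
        (ordered[j], ordered[k])
        for k in range(len(ordered))
        for j in range(k + 1)
    ]


def gt_from_min_pl(pls, alleles):
    """Return the GT string whose PL is smallest (hypothetical helper).

    Ties are resolved in favour of the first genotype in PL order.
    """
    genotypes = diploid_genotypes(alleles)
    if len(pls) != len(genotypes):
        raise ValueError("PL count does not match the number of genotypes")
    best = min(range(len(pls)), key=pls.__getitem__)
    return "/".join(str(a) for a in genotypes[best])


# PLs quoted in this thread for sample1
pls = [591, 166, 123, 542, 171, 547]
print(gt_from_min_pl(pls, [1, 2, 3]))  # 2/2
```

The smallest PL (123) sits at the third position, which is the 2/2 genotype, matching the expectation stated above.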
Some more questions/clarifications:
After poring over the VCF 4.2 spec, it looks like it is NOT required that the GT call match the min PL. It's a convention many tools follow, but not mandatory. The ordering of the alleles is not described in the spec either, so I'm wrong about everything but the 0.
Would your preference be to use the min PL genotype for the GT field in the spanning deletion? This can be done
#161 Fixed CI golden output that if correct would have caught this bug :(
Personally my preference would be to use the min PL genotype for the GT field, but this mode isn't used in GATK so that may not be important.
#161 For spanning deletions, when producing the GT field, the GT corresponds to the minimum PL value. A boolean flag controls this behavior (produce_GT_with_min_PL_value_for_spanning_deletions). The min PL is computed by iterating over all genotype combinations for any ploidy, hence the significant code changes. CI tests using input provided by @ldgauthier are included.
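Iterating over all genotype combinations for any ploidy can be illustrated with the recursive genotype ordering that the VCF specification gives for likelihood fields. The Python below is only a sketch of that idea, not the code from this commit. The names `genotypes_in_pl_order` and `min_pl_genotype` are hypothetical, and alleles are numbered from 0 as in a full VCF record.

```python
def genotypes_in_pl_order(ploidy, num_alleles):
    """Yield genotypes as tuples of allele indices, in VCF PL order."""

    def recurse(remaining, max_allele, suffix):
        for allele in range(max_allele + 1):
            if remaining == 1:
                yield (allele,) + suffix
            else:
                # Alleles to the left may not exceed the one just chosen.
                yield from recurse(remaining - 1, allele, (allele,) + suffix)

    yield from recurse(ploidy, num_alleles - 1, ())


def min_pl_genotype(pls, ploidy, num_alleles):
    """Return the unphased GT string with the smallest PL (hypothetical helper)."""
    genotypes = list(genotypes_in_pl_order(ploidy, num_alleles))
    if len(pls) != len(genotypes):
        raise ValueError("PL count does not match ploidy and allele count")
    best = min(range(len(pls)), key=pls.__getitem__)
    return "/".join(str(a) for a in genotypes[best])


# Diploid, 3 alleles: 0/0, 0/1, 1/1, 0/2, 1/2, 2/2
print(list(genotypes_in_pl_order(2, 3)))
# Triploid, 2 alleles: 0/0/0, 0/0/1, 0/1/1, 1/1/1
print(min_pl_genotype([10, 0, 20, 30], ploidy=3, num_alleles=2))  # 0/0/1
```

Enumerating every genotype is simple but grows quickly with ploidy and allele count, so a production implementation may want to avoid materializing the full list.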
fixed
Importing sample1:
and sample2:
into a GDB and querying it with PRODUCE_GT_FIELD set to true produces:
Note that the GT for the first sample at the last position is output as 2/1 (I'm not sure if that's against the spec, but it's certainly against convention) and after examining the PLs, the best likelihood is actually on the 2/2 genotype.
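The formatting convention at issue here (smaller allele number first for unphased calls, dots for missing alleles) can be sketched in a few lines of Python. This is an illustration only: `format_gt` is a hypothetical helper, missing alleles are represented as `None` purely for this example, and placing missing alleles last in a partially known unphased call is an assumption rather than something the spec requires.

```python
def format_gt(alleles, phased=False):
    """Format allele indices as a VCF GT string (hypothetical helper).

    Unphased calls are sorted so the smaller allele number comes first;
    phased calls keep their order because it carries information.
    Missing alleles (None) are written as '.'.
    """
    separator = "|" if phased else "/"
    if not phased:
        # Known alleles ascending, missing alleles last (an assumption).
        alleles = sorted(alleles, key=lambda a: (a is None, 0 if a is None else a))
    return separator.join("." if a is None else str(a) for a in alleles)


print(format_gt([2, 1]))               # 1/2
print(format_gt([3, 2], phased=True))  # 3|2
print(format_gt([None, None]))         # ./.
```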