Skip to content
This repository has been archived by the owner on Jan 3, 2023. It is now read-only.

Called GTs are poorly formatted and sometimes inaccurate #161

Closed
ldgauthier opened this issue Feb 7, 2018 · 7 comments
Closed

Called GTs are poorly formatted and sometimes inaccurate #161

ldgauthier opened this issue Feb 7, 2018 · 7 comments

Comments

@ldgauthier
Copy link

ldgauthier commented Feb 7, 2018

Importing sample1:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1
chr20   1274366 .       A       G,<NON_REF>     562.77  .       DP=26;MQRankSum=-1.011;MQ_DP=26;QUALapprox=591;RAW_MQ=93600.00;ReadPosRankSum=0.578;VarDP=24    GT:AD:DP:GQ:PL:SB       1/1:1,16,0:16:99:368,166,0,379,171,547:1,0,12,11

and sample2:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample2
chr20   1274365 .       TAA     T,TA,<NON_REF>  562.77  .       DP=26;MQRankSum=-1.011;MQ_DP=26;QUALapprox=591;RAW_MQ=93600.00;ReadPosRankSum=0.578;VarDP=24    GT:AD:DP:GQ:PL:SB       1/2:1,7,16,0:24:99:591,353,368,166,0,123,542,379,171,547:1,0,12,11

into a GDB and querying it with PRODUCE_GT_FIELD set to true produces:

chr20   1274365 .       TAA     T,TA,<NON_REF>  .       .       DP=26;MQRankSum=-1.011e+00;MQ_DP=26;QUALapprox=591;RAW_MQ=93600.00;ReadPosRankSum=0.578;VarDP=24        GT:AD:DP:GQ:PL:SB       0       1/2:1,7,16,0:24:99:591,353,368,166,0,123,542,379,171,547:1,0,12,11
chr20   1274366 .       A       G,*,<NON_REF>   .       .       DP=52;MQRankSum=-1.011e+00;MQ_DP=26;QUALapprox=591;RAW_MQ=93600.00;ReadPosRankSum=0.578;VarDP=24        GT:AD:DP:GQ:PL:SB       1/1:1,16,0,0:16:99:368,166,0,379,171,547,379,171,547,547:1,0,12,11      3/2:1,0,16,0:24:99:591,542,547,166,171,123,542,547,171,547:1,0,12,11
chr20   1274367 .       A       *,<NON_REF>     .       .       DP=26   GT:AD:DP:GQ:PL:SB       0       2/1:1,16,0:24:99:591,166,123,542,171,547:1,0,12,11

Note that the GT for the first sample at the last position is output as 2/1 (I'm not sure if that's against the spec, but it's certainly against convention) and after examining the PLs, the best likelihood is actually on the 2/2 genotype.

@kgururaj
Copy link
Contributor

kgururaj commented Feb 7, 2018

Some clarification questions. For the second input VCF:

  • The spanning deletion corresponds to genotype 2/2 (TAA->TA, allele index 2)
  • The ALT allele TAA->T (allele index 1) gets mapped to <NON_REF> in the lines with the spanning deletion
  • Hence, in GenomicsDB, the GT field which was 1/2 in the input, got mapped to <NON_REF>/* in the lines with spanning deletion. This corresponds to 3/2 in the second line and 2/1 in the third line. I'm assuming this is not the expected behavior. Could you let us know what's expected?

@ldgauthier
Copy link
Author

Sorry for the incomplete bug report.

The expectation is that the genotype call matches the PLs. The PLs for sample1 at chr20: 1274367 are [591,166,123,542,171,547], with the lowest number being the lowest likelihood of being incorrect. Those positions correspond to [1/1, 1/2, 2/2, 1/3, 2/3, 3/3], so the lowest is actually at 2/2. I agree that the PLs are not accurate. That's an issue from the GATK code that you faithfully copied where we only allow one spanning deletion allele and pick the "best" one using the likelihoods. But given that those are the PLs, the GT should agree.

The VCF spec doesn't seem to specify that the smaller allele number comes first in the genotype (http://samtools.github.io/hts-specs/VCFv4.2.pdf section 1.4.2), but that's been the convention for a long time. Thus, we'd prefer 2/3 to 3/2. I see that you're trying to preserve the order of the alleles from the original variant, but that only matters if the genotype is phased (e.g. 3|2 is permissible). The zeros where data is missing should be ./., which is justified by the specification about missing values being specified with a dot. So if neither allele of the genotype is known it goes to ./.

@kgururaj
Copy link
Contributor

kgururaj commented Feb 9, 2018

Some more questions/clarifications:

  • The 0 in the GT field for sample 1 is a bug in GenomicsDB. I tested the C++ code for produce_GT_field but not with Java (htsjdk). My mistake - this has been fixed now.
  • Regarding the GT field for sample 2, currently GenomicsDB just looks at the original value of the GT field and simply prints the same value (taking into account any change in the allele order). Thus, input GT 1/2 in sample 2 at position 1274365 is maintained at 1274366 and 1274367. Is the requirement that for the lines corresponding to the spanning deletion, the GT corresponds to the min PL value?

@ldgauthier
Copy link
Author

After poring over the VCF 4.2 spec, it looks like is it NOT required that the GT call match the min PL. It's a convention many tools follow, but not mandatory. The ordering of the alleles is not described in the spec either, so I'm wrong about everything but the 0.

@kgururaj
Copy link
Contributor

kgururaj commented Feb 9, 2018

Would your preference be to use the min PL genotype for the GT field in the spanning deletion? This can be done

kgururaj added a commit that referenced this issue Feb 12, 2018
#161

Fixed CI golden output that if correct would have caught this bug :(
@ldgauthier
Copy link
Author

Personally my preference would be to use the min PL genotype for the GT field, but this mode isn't used in GATK so that may not be important.

kgururaj added a commit that referenced this issue Feb 15, 2018
#161

Fixed CI golden output that if correct would have caught this bug :(
kgururaj added a commit that referenced this issue Feb 25, 2018
#161

For spanning deletions, when producing the GT field, the GT correponds
to the min value PL. A boolean flag controls this behavior
(produce_GT_with_min_PL_value_for_spanning_deletions). The min PL is
computed by iterating over all genotype combinations for all ploidy -
hence,the significant code changes

CI tests using input provided by @ldgauthier included
@kgururaj
Copy link
Contributor

kgururaj commented Jul 6, 2018

fixed

@kgururaj kgururaj closed this as completed Jul 6, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants