Excessive GT ./. in GVCF output #220
I am also very interested in this. I see similar discrepancies with CombineGVCFs -> GenotypeGVCFs: heterozygous genotypes are called in the joint GVCF with zero read support for the alternate allele. Other people in the issues mention using GATK, but I can't find any examples showing how to do this correctly, and GATK's GenomicsDBImport does not work with Clair3 GVCFs either. The README also mentions using GLNexus to merge GVCFs, but it won't take Clair3 output (at least for my genome), as it appears to dislike soft-masked sequence. I get "Invalid: allele is not a DNA sequence" errors for alleles like "Ga".
@nreid does GLNexus work if you simply unmask the masked sequences, like changing "Ga" to "GA"?
@nreid @aquaskyline Thank you for mentioning GLNexus. I have given it a try using the
The results from GLNexus are less erroneous and more believable, at least in my test. I will close this as completed.
@aquaskyline I just tried to modify the GVCF (not the reference genome) as follows and GLNexus appears to accept the file and produces a joint genotype VCF.
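For anyone hitting the same "allele is not a DNA sequence" error, the modification described above can be done with a few lines of Python. This is a minimal sketch, not an official tool: it simply uppercases the REF and ALT columns of each record so soft-masked alleles like "Ga" become "GA". It assumes a bgzip/gzip-compressed input; file names are placeholders.

```python
# Sketch: uppercase REF/ALT alleles in a (G)VCF so GLNexus accepts
# soft-masked calls like "Ga". Not an official Clair3/GLNexus utility.
import gzip

def unmask_vcf_line(line: str) -> str:
    """Uppercase the REF (column 4) and ALT (column 5) of one VCF record."""
    if line.startswith("#"):          # leave header lines untouched
        return line
    fields = line.rstrip("\n").split("\t")
    fields[3] = fields[3].upper()     # REF
    fields[4] = fields[4].upper()     # ALT (symbolic alleles like <NON_REF> are unaffected)
    return "\t".join(fields) + "\n"

def unmask_vcf(in_path: str, out_path: str) -> None:
    """Rewrite a gzip-compressed GVCF with unmasked alleles."""
    with gzip.open(in_path, "rt") as fin, open(out_path, "w") as fout:
        for line in fin:
            fout.write(unmask_vcf_line(line))
```

Note this only changes allele case in the GVCF; the reference FASTA and coordinates are untouched, which matches the approach described in the comment above.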
The genotypes do seem to match the calls in the individual clair3 VCF files, but one thing I'm not clear on is why REF/REF samples with plenty of read coverage are given missing genotypes. They seem to be passed on from the clair3 GVCF files, where the genotypes are also uncalled, but it's still not clear why. Looking at the alignments in IGV, I don't see any problems. Here is an example: From GLNexus, the "II" in the RNC field indicates no call in GVCF:
For the third sample, with the no-call genotype, here is the GVCF for the surrounding sites:
Here is an IGV screenshot of the pileup for the three samples in the same order. I'd appreciate any insight. I'm adding this comment to this issue because it's generally still on topic. The data are ultralong ONT data from the GIAB Ashkenazi trio. I am exploring approaches for future work. Having a bunch of no-calls where the data seem to support a reference call is problematic for me. Even after filtering down to biallelic SNPs with GLNexus variant Q > 30, I still have 5-10% missing genotypes per sample in this test region.
@aquaskyline May I add that I have similar issues with missing genotypes, as mentioned by @nreid. It is hard to understand why, given that the reads in my dataset are generally 1.5 kb to 10 kb amplicons and the read depths are as high as 1000x. Since my analysis is sensitive to missing data, I let Beagle impute the missing genotypes when phasing the SNPs, which may not really be the best approach. It would go a long way if this could be handled during the GVCF and VCF generation steps in Clair3.
Regarding the issue of missing genotypes (represented as "./.") in Clair3 GVCF output: three PL (Phred-scaled likelihood) values are calculated: hom_ref, hom_alt, and het_alt. These correspond to the likelihood of being homozygous reference, homozygous alternative, and heterozygous alternative, respectively. The calculation of the hom_ref value relies on the reference coverage over the total coverage. If the hom_ref value is not the highest among the three, the genotype changes from "0/0" to "./.". This logic follows the guidelines set by GATK. The presence of "./." does not indicate a failure to make a genotype call. Instead, it only signifies that the candidate site possesses obvious alternative read support, including mismatches and indels.
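The decision rule described above can be sketched in a few lines. This is an illustration of the stated logic, not Clair3's actual code: given the three genotype likelihoods, "0/0" is emitted only when hom_ref is the most likely state; otherwise the genotype becomes "./.".

```python
# Illustration of the rule described above (not Clair3's implementation):
# emit "0/0" only when the hom-ref likelihood is the highest of the three;
# otherwise the GVCF genotype becomes "./.".
def gvcf_ref_genotype(lik_hom_ref: float, lik_het: float, lik_hom_alt: float) -> str:
    if lik_hom_ref >= max(lik_het, lik_hom_alt):
        return "0/0"
    return "./."
```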
The logic is correct, but we are checking our implementation against extremely high coverage (DP>1000). We will get back soon.
@zhengzhenxian thanks for the explanation. There's something I'm not quite grasping though. At position 34283744 in the example above, there are 53 reads in the IGV pileup. 51 of them are A (ref), 2 are C and there are 2 insertions. I'm not clear why the highest likelihood would not be "0/0", as it is at site 34283743, only one base upstream (which has 47 G (ref), 1 A, 5 insertions and 1 deletion), and why, if the highest likelihood is "0/1", is that not the assignment? In the end, the GLNexus output remains "missing", even though there seems to be plenty of read support for a "0/0" genotype call. This seems inconsistent with the behavior of other variant callers and I'm not sure why.
After some investigation, there seem to be problems that lead to an excessive number of './.' calls at very high depth (DP>1000). PS - We verified that 'floating point number overflow' was not happening in Clair3's GVCF output module.
The process for generating a GVCF is as follows. First, we calculate the likelihood of three genotypes: homozygous reference (0/0), heterozygous (0/1), and homozygous alternate (1/1). These likelihoods are computed from factors such as the total coverage at the site, the number of reference bases, and the number of non-reference bases. We also determine the AF of each potential variant. If the AF of any variant exceeds a predefined threshold (default: 0.08), the site is considered a candidate for variant calling and is subsequently processed by either the pileup or the full-alignment model.

Therefore, in cases where a site appears as if it should be homozygous reference (0/0) but is instead denoted as "./.", the most common reason is that the alignment contains multiple potential variants at the same position. These variants, combined with a relatively high count of non-reference bases, can influence the likelihood calculations. This can result in the likelihood of 0/1 being the highest, even though the site does not meet the AF threshold for a variant to be called.

For example, given a site with coverage of 151 and alternate allele counts of {'-GC': 4, '-GCC': 3, 'G': 3, '-g': 2, '*': 2, '+TT': 1, '-gc': 1, '-G': 1, '-gccg': 1}, we have #ref_base=133 and #alt_base=18. With the default base_err of 0.001, the likelihoods are:

likelihood(0/0) = (1-0.001)^133 * 0.001^18 = 8.75e-55

To address this issue, you can adjust the base error rate parameter according to the platform and alignment results you are working with. Based on our experiments, for example, when processing chr20 of a 144x ONT sample, setting base_err to 0.03 significantly reduced the occurrence of missing genotypes. We are planning to provide base_err defaults for various platforms in an upcoming release. Until then, you can manually modify the base_err parameter in shared/param_p.py line 26.
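The arithmetic above can be reproduced directly. The hom-ref likelihood follows the formula given in the comment; for the heterozygous likelihood I assume a simple 50/50 allele model (an assumption for illustration, not Clair3's exact code). With base_err = 0.001 the heterozygous likelihood dominates, so the site is not called 0/0; raising base_err to 0.03 makes hom-ref dominate again, which matches the observation that a higher base_err reduces missing genotypes.

```python
# Worked version of the arithmetic above. lik_het uses a simple 50/50
# allele model, which is an assumption for illustration only.
def lik_hom_ref(n_ref: int, n_alt: int, base_err: float) -> float:
    # Formula from the comment: (1 - e)^#ref * e^#alt
    return (1 - base_err) ** n_ref * base_err ** n_alt

def lik_het(n_ref: int, n_alt: int) -> float:
    # Assumed model: every read has probability 0.5 of either allele
    return 0.5 ** (n_ref + n_alt)

n_ref, n_alt = 133, 18  # the 151x example site from the comment

# default base_err = 0.001: het dominates, so the site is not 0/0
assert lik_hom_ref(n_ref, n_alt, 0.001) < lik_het(n_ref, n_alt)

# base_err = 0.03: hom-ref dominates, recovering the 0/0 call
assert lik_hom_ref(n_ref, n_alt, 0.03) > lik_het(n_ref, n_alt)
```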
Thank you, this explanation makes a lot of sense. So just to clarify one thing: when calculating the likelihoods, are you using this fixed base error probability instead of the actual base qualities in the input BAM file?
As Clair3 outputs GVCFs using the same logic as GATK,
@ShuminBAL @aquaskyline |
@eiwai81. Correct. |
@aquaskyline Thanks. |
@eiwai81 Added |
Hello,
Firstly, thank you for this tool.
I am working on amplicon sequencing data, and my workflow involves generating GVCFs with Clair3 (with GVCF output enabled), which I then use as inputs for the CombineGVCFs and GenotypeGVCFs commands in GATK. However, when I compare a sample's genotype in the final merge_output.vcf produced by Clair3 with the genotype of the same sample in the final combined.vcf from GATK, there are some differences, which affect other results downstream. See below.
These are the parameters I used when generating the GVCFs:
I have tried filtering the population VCF using DP, allele count, or frequency thresholds, but these have had little or no positive effect. So I was wondering whether there are recommendations for parameters I can modify when generating the GVCFs with Clair3 that would improve the genotyping accuracy of GATK.
Thank you.