Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gatk compatibility #153

Closed
shinlin77 opened this issue Nov 24, 2022 · 5 comments
Closed

gatk compatibility #153

shinlin77 opened this issue Nov 24, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@shinlin77
Copy link

I'm using

gatk --java-options "-Xmx10g" CombineGVCFs

on clair3 gvcf output. I encountered this error.

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 97073593: unparsable vcf record with allele TGTGB

The offending input line is

chr3 16902879 . TGTGB T,<NON_REF> 12.03 PASS F GT:GQ:DP:AD:AF:PL 0/1:12:38:26,11,0:0.2895:15,0,57,990,990,990

If this output is actually compatible with a version of GATK, could you please tell what version? The error was generated using v4.1.8.1.

Thanks.

@zhengzhenxian
Copy link
Collaborator

Hi,
We test the GVCF compatibility using GATK version 4.2.6.1. And for the error, it seems that the reference consists of IUPAC base B, not sure your GATK version could parse it.

@Coppini
Copy link

Coppini commented Jan 2, 2023

@aquaskyline @zhengzhenxian

Our group has been facing the same issue with the VCF files we obtain with Clair3, which cause problems when we run QC pipelines on them.

Apparently, while GATK tools seem to handle IUPAC codes in the reference, Clair3 does not.

Our current approach is to check the output VCF from Clair3 and replace any IUPAC degenerate codes in it by N, since N is accepted by VCF standards (the other IUPAC degenerate codes are not).

This has allowed us to proceed and use the generated VCF in other tools.

I noticed there was an old issue related to this on Clair as well (HKU-BAL/Clair#33).

Would it perhaps be possible to handle this automatically inside Clair3, converting these IUPAC codes to N before generating the output VCF?

Current code we've been using:

awk -F'\t' -v OFS='\t' '/^[^#]/{sub(/[RYSWKMBDHV]/, "N", $4) sub(/[RYSWKMBDHV]/, "N", $5) 1'

Which is basically a hacky awk code that replaces any degenerate IUPAC codes in the 4th or 5th fields of the VCF (the two columns containing bases, one for reference and one for alt) by N.

I believe if Clair3 took care of that by itself (or at least had an option that enabled this, like a --convert_iupac_to_n), this would simplify things a lot for further analysis and to make the output VCF (and gVCF) compatible with other tools.

Thanks!

@aquaskyline aquaskyline reopened this Jan 2, 2023
@aquaskyline
Copy link
Member

Understood the problem and am working on a fix.

@aquaskyline aquaskyline added the enhancement New feature or request label Jan 13, 2023
@dennishendriksen
Copy link

Running into the same issue with HTSJDK v3.0.4. The VCF 4.3 specification states:

If the reference sequence contains IUPAC ambiguity codes not allowed by this specification (such as R = A/G),
the ambiguous reference base must be reduced to a concrete base by using the one that is first alphabetically
(thus R as a reference base is converted to A in VCF.)

Personally I would prefer conversion to N as mentioned by @Coppini since that matches the reference sequence value in the cases that I encountered.

Looking forward to a fix, thanks in advance!

@aquaskyline
Copy link
Member

aquaskyline commented Mar 6, 2023

in v1.0.0, IUPAC bases are output as N

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants