Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Malformed VCF due to R reference base #124

Open
dennishendriksen opened this issue Jun 11, 2023 · 3 comments
Open

Malformed VCF due to R reference base #124

dennishendriksen opened this issue Jun 11, 2023 · 3 comments

Comments

@dennishendriksen
Copy link

Hello @tjiangHIT, @Meltpinkg and cuteSV developers,

cuteSV v2.0.3 can produce malformed VCF output containing R nucleotides in the REF column. These are not allowed according to the VCF v4.2 specification: REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive). The VCF v4.3 specification additionaly mentions: IUPAC ambiguity codes should be converted to a concrete base. Downstream tools such as HTSJDK throw an error correctly stating that the VCF is malformed.

For our use case this result in analysis that cannot complete.

Example:

chr10   131592457       cuteSV.DEL.1321 TCCCAGGTTCAAGTGATTCTCTTGCTTAACCCTCCCGAGTAGCTGGGATTACAGGCACCCACAAGAACACCCAGCTGATTTTTGTATTTTTAGCAGAGACAGGGTTTCACTGTGTTGGCCAGGCTGGTCTCGAACTCCTGACCTTGTGATCTGCCTGCCTTGGCCTCCCAAAGTACTGGGATTAATTATTTTTCCTTTTTAAGGTTAAATAATATTCCATTTTGTGGATATGCCACATTTTGTTTATCCATTCATCTGTCAACAGACACTTGGGTTGCTTCCATCTTTTGACTATTGTGAATAATGCTGTTGTGGACATGGGTGTAGAAACATCTCTTTGAGGCTCTGCTTTTAATTCTTTGAGGTATATACCCAGAGGTGTAATTGCTGGATCATGTGAAATCTGAGAAACCACCATATTGTTTCTATAGTTGTGTAGTATCTCACTGTGGTTTTGATTTGCATTTTCCTAATTATTCATGTTGTTGAGCATCTTTTCATGTACTTATTGGTCATTTGTATATCATTGGAGAAATATATATTCAAGTCCTTTGTCTATTTTTTAATTGTGTTGTTTTTTGGTTGTTGAATTGCAAGAGTTCTTTATATATGGATAGTAATCCGTTATCAGATATATAATTTACAAATATTTCCTGCCATTCAGTGTGTTGCCTTTTACTCTGTTGACAGTGTCATTTGATTCACAAAAATTTTTAATATTTACATGTTCCAATTATCTGATTTTTTTGTTGCCTATGCTTTCGGTGTCGTAGCCAAGAAATCCTTGCCAAATGCAATGCCATGAAGCTGTGCCCCTACATTTTCTTGTGAGTATTCTAACTCTCATATCTAAGTCTTTGACTATTTTTAATTTCTGCATATGGTGTAAGGTAAGGGTACAACTTCATTCTTTTGCATGTGGCTATCCAGTTTTCCCAGTAACATTTGTTGAAAAGACTGTCCTTTTCCCTATTGGATAGTCCTAGCAACTTTTTAAAAAATCACAAGGCCATATATACAAGAGTTTATTTCTGGGCTCTCTATTCTATCTCACTGATCTATGTGTCTGTCTATACGTCAATACCACTCTGTTTTTAATACTGTAGATTTTTAGAAATTTTGAAACTAAGAAGTGTGAGACCTCCAACTGTGTTCTTTTTCAAGATTGTTTTTGCTATTTAGGGTCCCTTGAGATTCTATATGAATGTTAGGATAGATTTTTCTAGTTTTGTAAAAAAAAATTGATGTTGGAATTTTAAGATAAATTGCATTTAATCTAGAGACCACATCTTTCAATTTTAGGTCTTCTCATCTATGAACAAAGGATGTCTATTTTTGTAGTGTCTTTAATTTCTTTGAGCAATATTTCATAGTTTTCAGTGTACACATCTTTCACCTCCTTGGTTCAGTTTGTTTCTATTTTTTATTTTGTTTGGTCCCACTTTAAATGAAATTGCTTTCTTAATTTCTTTTTCAGGTTGTTCATTGTTATTGTATAGAAACACAGCTAATTTCTGTATGCTGAGTATTCTGTAAGTTTGCTAATTTTGTTATTAGTTCTATCATGTTTCTTATGGAATCTTTGGGGTTTTCTACATATGAAATTACATCATCTATGAAAGGGATCGTTTTACTTTTTATTTCCCAATTTTAATGCTTTTTATTTCCTAATTTATCTGGTCAAGATTTCCATTACTATGCTGAATTTAAAAGTAGGCATTCTTCCCTTGTGTCTTAGCTTAGAAGAAAAGTTTTCAATCTTTCATCATTAAGTATGATGTTAGCAATGGGCTTTCCATATATGGCCTTAATTATGTTGAGGTAGTTTCCTTCTGTTCCTAGTTTGGTGRATGTTTTTTATCATGGAAAGGTGTTGGATTTTGTCAAATATTTTTCTCCATCAATTGAGATGATCACATGGGAACTGTTTCTTCATTCTGTTAATGTAGTTATTACATTAATTCATTTTCATATGTTGAACTATCCTTGAATTTCAGAAATAAATCCCACGAGGTCATGTGTATAATTTTTTTGATGTGTCACTTAATTCTGTTCACTAATATTTGGTTGAGGATTTTTACATCAGTATTTATCAGAGATATTGATCTGTAGCTTAATTTTATTGTAGTACCTTTGTCTTGCTTTGGTGAAAGAGTAATCTTGGCCTTGAAGAATAAGTTTGAAAGTGTCCCCTTACCTTAAACTTTTTTGGAAACTTTTGAGAAGGATTAGTGTTAACTCTTCTTTAAATGTTTGGTAGAATTCACGAATGAAGCCATCAGCTCCTGGGATTTTCTTTGTTGGCAGATTTTGGATCATTGATTCAATCTCTTTGCTAGTTATATGTCTGTTCGTATTTTCTATTTCTTTGTGGKTTAGTCTTGGTAGGTGGTATATGTCTAGGAATTTATCCATTTTGTCTAGGTTGTCCAATTTTTTGGCATACAAATATTCATACTATTGTCTTATTAATATAATCATTTTATTTCTGTTAAATCAGTGGTAATGTCTGCACTTACATTTCTGATTTTAGTTATTGAGACTTCCCTCTTTTATCTTACTCAGTCGAACTAATTGTTCATTAATTTTGGTGATTTTTTCAAAGAACTGAACTTGGTTTTGCTAACTTACTCTACCATGTTCCTATTCTTTATTTCAGTTGTCTGTACTCTAGTCTTTATTATTTCTTTCCTTCTACTGGATTTGGGTTTAGTGTGTTCTCCCTTTTTCTACTTCTTTAAGGTATAATGTTAGATTGTTAATTTAAGATCTTTCTTCTTGTTTATCATAAGCATTTACACTATAAACTACCCTCCTAGCACAGATTTTGATGCATCTGGTAAGTTTTGGTATGTTTACTGTAGCCCTGCAATATAGTTTGAAGTCAGGTAATGTGATGCCTCCAGCTGTGTTCTTTTTGCTTAGGGTTGCCTTGGCCATTCGGGCTCTTTTTTGGTTCCATATGAATTTTAAAATAGTTTTTTCTAGTTCTGTGAAGAATGTCATTGGTAGCTTAATAAAAATAGCATTGAATCTGTACACTGCTTTGGGCAGTATGGTCATTTTAATAAGATTGATTCTTCCTATCTGTGAGCATGAGATTTTTAAAAATTTGTTTTTGTCTTACCTGATTTCTTTCAGCAGTGCTTTGTAATTCTCACTGCAGAGATCTTTCACCTCCCTGGTTAGCTGTATTCCTAGATATTTTWTCATTTTTGCAGCAATTGTGAATGAGATTGCCTTCCTGATTTGTTTCTCGGCTTGGTTTCTTCTTGTTGTTTGTGTACAGGAATGCTGGTGATTTTTCTACATTGATTTTGTATCCTGAAACTTTGCTGAAGTTGTTTATCAGCTGAAGGAGCTTTTGGGTCRAGACTATGGGTTTTTCTAGATATAGAATCATGTCATCTGCAAATAGGGATAGTCTGATATCCTCTCTTCCTATTTGGATATGCTTTATTTCTTTATTTTGCCTGATTGCTCTGGCTAAGACTTCCAATAATACTTGAATAGGATTGGTGAAAGAAGGCATTCTTGTCACGTGTTGGTTTTCAAAAGGAATTCTTCCAGCTTTTGCCCATTTAGTATGATGTTGCCTGTTAGTTTGTCACATATGGCTCTTATTATTTTGAGTTGTGTTCCAAAACATCATGGTGCTGGTACAAAAACAGGCACATAGACCAATGSAACAGATAGAGAGCCTAGAAATAAGACTGCACACTTACAACCATCTGATCTTCAACAAAGCTGACAAAAACAAGCAATGGGGAAAAGACTCCCTATTCAATAAATGGTACTTGGATAAGTGGCTAGCCATATGCAGAAGATTGAAGGTAGACCCCTTCCTTGCACCATATACCAAAATCAACTCAAGATGGATTAAAGACTTACATATAAAACCCAAAACTATAAAAAACCCTGGGAGACAACCTAGGCAATATTATCCTGTACATAGGAATGGGCAAAGATTTCATGACAAAGCAATCACAAAAGCAATCACAACAAAAACAAAAATTGACAAATAAGATCTAATTAAACTTAAGAGCTTCTGCACAGCAAAAGAAACTATCAGCAGAGTAAACAGACAACCTACAGGATGGCAGAAAATATTTGCATATTATGCATCTGACAAAGGTCTAATATCCAGCATCTATAAGAAACTTAAACAAGTTTATAAGCAAAAAACAAACAACCCCATTAAAAAGGGGGCAAAGGACATGAACACTTCTCAAAAGAAGACATACGTGCAACCAACAAGCATATGAAGAAAAGCTCAATATCACTGATCATTAGAGAAATGCAAATAAAAACCACAACGAGATACTGTCTCACAACAATCAGAATAGCATTATTAAAAATTCAAAAAAATAACAGATACTGGTGAGGTTGTGGAGAAAAGGGACCACTTATACACTGTTGATGAAAGTGTAAGTTAGTTCAACCATTGTGGAAAGCAGTATGGCGATTCTTCAAAGAAAGAGCTAAAAACAGAATTACCATTCAACTCAGGAATCCCATTACTGGGTATATGCCCAGAGGAATATAAATCATTCCACCATAAAGACACATGCACANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAATCTCGGCTCACTGCAACCTCCGT     T       0       q5      IMPRECISE;SVTYPE=DEL;SVLEN=-4698;END=131597155;CIPOS=-13,13;CILEN=-2,2;RE=10;RNAMES=NULL;AF=0.1471;STRAND=+-;AC=0;AN=2;CSQ=deletion|intergenic_variant|MODIFIER||||||||||||||||1||||1|||||||||||||||||||||||||||||||||||||||||||||||||| GT:DR:DV:PL:GQ  ./.:.:.:.:.     0/0:58:10:0,78,458:78   ./.:.:.:.:.

Command:

cuteSV --sample HG002 --genotype --max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 --max_cluster_bias_DEL 100 --diff_ratio_merging_DEL 0.3 --threads 4 vip_AshkenazimTrio_HG002.cram.bam GCA_000001405.15_GRCh38_no_alt_analysis_set.fna cutesv_output.vcf .

We've replaced Sniffles2 with cuteSV which suffers from what seems to be the same issue including the REF_MISMATCH warnings mentioned in that issue. Actually this issue was the primary reason for the switch.

  • Would it be possible to fix the 'R' reference bases issue such that cuteSV produces valid VCF?
  • Would it be possible to update cuteSV such that the content of the REF column matches the reference genome that was used when calling cuteSV?
  • Would it be possible to update cuteSV to produce VCF with records sorted in the order of contigs as defined in the VCF header? This would ease handling operations like merging VCFs downstream.

Greetings,
@dennishendriksen

@Meltpinkg
Copy link
Collaborator

Hello, @dennishendriksen

In cuteSV, the content of REF column is extracted from the input reference genome, so that the R nucleotides comes from the reference genome. For the three questions mentioned above:

  • The 'R' reference bases can be modified via using sed commands on the output VCF.
  • The content of the REF column is extracted from the reference genome so that it is already matched.
  • cuteSV sorts the VCF file in the lexicographical order now. To change the way of sorting records, you can use commands to modify the output VCF to sort it in other ways.

Hope it will help.

Best,
Shuqi

@dennishendriksen
Copy link
Author

Thank you for your quick reply.

The 'R' reference bases can be modified via using sed commands on the output VCF.

I can confirm that this is indeed what is stored in the reference genome. The sed workaround will probably work, thank you for the suggestion.

In order to produce valid VCF you might consider the following for a proper fix:

From https://samtools.github.io/hts-specs/VCFv4.3.pdf page 8:
If the reference sequence contains IUPAC ambiguity codes not allowed by this specification (such as R = A/G), the ambiguous reference base must be reduced to a concrete base by using the one that is first alphabetically (thus R as a reference base is converted to A in VCF.)

The content of the REF column is extracted from the reference genome so that it is already matched.
I keep seeing differences between the REF column content of cuteSV and the reference genome, for example:

chr10   132172946       cuteSV.BND.9    N       ]chr4:182834343]N       0       q5      IMPRECISE;SVTYPE=BND;RE=5;RNAMES=NULL;AF=0.1282;AC=0;AN=2

Check:

$ samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz chr10:132172946-132172946
>chr10:132172946-132172946
C

Alternative check:

$ bcftools norm --check-ref w --fasta-ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz cutesv.vcf.gz
REF_MISMATCH    chr10   132172946       N       C

Additional examples:

REF_MISMATCH    chr1    143203947       N       A
REF_MISMATCH    chr1    143203948       N       A
REF_MISMATCH    chr1    143211397       N       G
REF_MISMATCH    chr1    143211405       N       T
REF_MISMATCH    chr1    143242873       N       G
REF_MISMATCH    chr10   26893473        N       G
REF_MISMATCH    chr10   26894430        N       C
REF_MISMATCH    chr10   26894431        N       C
REF_MISMATCH    chr10   26894434        N       T
REF_MISMATCH    chr10   38918271        N       A
REF_MISMATCH    chr10   38918278        N       T
REF_MISMATCH    chr10   38918279        N       G
REF_MISMATCH    chr10   38918285        N       G
REF_MISMATCH    chr10   38969688        N       C
REF_MISMATCH    chr10   38969692        N       C
REF_MISMATCH    chr10   38969722        N       G
REF_MISMATCH    chr10   38969741        N       C
REF_MISMATCH    chr10   38969749        N       C
REF_MISMATCH    chr10   38969765        N       C
REF_MISMATCH    chr10   42066267        N       C
REF_MISMATCH    chr10   42066268        N       G
REF_MISMATCH    chr10   42119794        N       A
REF_MISMATCH    chr10   132172946       N       C
REF_MISMATCH    chr10   132172954       N       T
REF_MISMATCH    chr10   133004611       N       A
REF_MISMATCH    chr10   133740610       N       C
REF_MISMATCH    chr10   133740611       N       G
REF_MISMATCH    chr10   133740612       N       C
REF_MISMATCH    chr11   175283  N       C
REF_MISMATCH    chr11   176543  N       T
REF_MISMATCH    chr17   21290420        N       T

This seems like an cuteSV issue? Working around this issue is tricky, because the N REF is part of the ALT as well (e.g. N ]chr4:182834343]N ). I'm curious to hear about your thoughts.

cuteSV sorts the VCF file in the lexicographical order now. To change the way of sorting records, you can use commands to modify the output VCF to sort it in other ways.

That makes perfectly sense. I'll do some postprocessing afterwards.

@dennishendriksen
Copy link
Author

dennishendriksen commented Nov 24, 2023

This seems like an cuteSV issue? Working around this issue is tricky, because the N REF is part of the ALT as well (e.g. N ]chr4:182834343]N ). I'm curious to hear about your thoughts.

We are still struggling with working around this issue. Any thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants