Malformed VCF output due to R nucleotide in reference #401

Open
dennishendriksen opened this issue Jun 1, 2023 · 1 comment

Comments

@dennishendriksen

Hello @fritzsedlazeck and Sniffles developers,

Sniffles v2.0.7 can produce malformed VCF output containing R nucleotides in the REF column. These are not allowed by the VCF v4.2 specification: "REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive)." The VCF v4.3 specification additionally states that IUPAC ambiguity codes should be converted to a concrete base. Downstream tools such as HTSJDK throw an error, correctly stating that the VCF is malformed.

For our use case, this results in an analysis that cannot complete.
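
As a stopgap, the REF column could be post-processed so that downstream tools accept the file. The sketch below assumes the VCF v4.3 rule of reducing an ambiguity code to the alphabetically first base it represents (so R becomes A). `fix_ref_iupac` is a hypothetical helper name, not part of Sniffles or bcftools, and it expects an uncompressed VCF on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite IUPAC ambiguity codes in the REF column (field 4)
# to the alphabetically first base they stand for, preserving case.
# Header lines and all other columns pass through unchanged.
fix_ref_iupac() {
    awk 'BEGIN { FS = OFS = "\t" }
        /^#/ { print; next }
        {
            ref = $4
            gsub(/[RWMDHV]/, "A", ref); gsub(/[rwmdhv]/, "a", ref)  # codes that include A
            gsub(/[YSB]/, "C", ref);    gsub(/[ysb]/, "c", ref)      # codes without A but with C
            gsub(/K/, "G", ref);        gsub(/k/, "g", ref)          # K = G/T
            $4 = ref
            print
        }'
}
```

It could be used along the lines of `gzip -dc vip_9_long_read_sv.vcf.gz | fix_ref_iupac > fixed.vcf`. The rewritten REF no longer matches the reference FASTA character for character at those positions, so tools that compare REF against the FASTA may still flag them.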

Example:

chr10	131592769	Sniffles2.DEL.7E6M9	TGGACATGGGTGTAGAAACATCTCTTTGAGGCTCTGCTTTTAATTCTTTGAGGTATATACCCAGAGGTGTAATTGCTGGATCATGTGAAATCTGAGAAACCACCATATTGTTTCTATAGTTGTGTAGTATCTCACTGTGGTTTTGATTTGCATTTTCCTAATTATTCATGTTGTTGAGCATCTTTTCATGTACTTATTGGTCATTTGTATATCATTGGAGAAATATATATTCAAGTCCTTTGTCTATTTTTTAATTGTGTTGTTTTTTGGTTGTTGAATTGCAAGAGTTCTTTATATATGGATAGTAATCCGTTATCAGATATATAATTTACAAATATTTCCTGCCATTCAGTGTGTTGCCTTTTACTCTGTTGACAGTGTCATTTGATTCACAAAAATTTTTAATATTTACATGTTCCAATTATCTGATTTTTTTGTTGCCTATGCTTTCGGTGTCGTAGCCAAGAAATCCTTGCCAAATGCAATGCCATGAAGCTGTGCCCCTACATTTTCTTGTGAGTATTCTAACTCTCATATCTAAGTCTTTGACTATTTTTAATTTCTGCATATGGTGTAAGGTAAGGGTACAACTTCATTCTTTTGCATGTGGCTATCCAGTTTTCCCAGTAACATTTGTTGAAAAGACTGTCCTTTTCCCTATTGGATAGTCCTAGCAACTTTTTAAAAAATCACAAGGCCATATATACAAGAGTTTATTTCTGGGCTCTCTATTCTATCTCACTGATCTATGTGTCTGTCTATACGTCAATACCACTCTGTTTTTAATACTGTAGATTTTTAGAAATTTTGAAACTAAGAAGTGTGAGACCTCCAACTGTGTTCTTTTTCAAGATTGTTTTTGCTATTTAGGGTCCCTTGAGATTCTATATGAATGTTAGGATAGATTTTTCTAGTTTTGTAAAAAAAAATTGATGTTGGAATTTTAAGATAAATTGCATTTAATCTAGAGACCACATCTTTCAATTTTAGGTCTTCTCATCTATGAACAAAGGATGTCTATTTTTGTAGTGTCTTTAATTTCTTTGAGCAATATTTCATAGTTTTCAGTGTACACATCTTTCACCTCCTTGGTTCAGTTTGTTTCTATTTTTTATTTTGTTTGGTCCCACTTTAAATGAAATTGCTTTCTTAATTTCTTTTTCAGGTTGTTCATTGTTATTGTATAGAAACACAGCTAATTTCTGTATGCTGAGTATTCTGTAAGTTTGCTAATTTTGTTATTAGTTCTATCATGTTTCTTATGGAATCTTTGGGGTTTTCTACATATGAAATTACATCATCTATGAAAGGGATCGTTTTACTTTTTATTTCCCAATTTTAATGCTTTTTATTTCCTAATTTATCTGGTCAAGATTTCCATTACTATGCTGAATTTAAAAGTAGGCATTCTTCCCTTGTGTCTTAGCTTAGAAGAAAAGTTTTCAATCTTTCATCATTAAGTATGATGTTAGCAATGGGCTTTCCATATATGGCCTTAATTATGTTGAGGTAGTTTCCTTCTGTTCCTAGTTTGGTGRATGTTTTTTATCATGGAAAGGTGTTGGATTTTGTCAAATATTTTTCTCCATCAATTGAGATGATCACATGGGAACTGTTTCTTCATTCTGTTAATGTAGTTATTACATTAATTCATTTTCATATGTTGAACTATCCTTGAATTTCAGAAATAAATCCCACGAGGTCATGTGTATAATTTTTTTGATGTGTCACTTAATTCTGTTCACTAATATTTGGTTGAGGATTTTTACATCAGTATTTATCAGAGATATTGATCTGTAGCTTAATTTTATTGTAGTACCTTTGTCTTGCTTTGGTGAAAGAGTAATCTTGGCCTTGAAGAATAAGTTTGAAAGTGTCCCCTTACCTTAAACTTTTTTGGAAACTTTTGAGAAGGATTAGTGTTAACTCTTCTTTAAATGTTTGGTAGAATTCACGAATGAAGC
CATCAGCTCCTGGGATTTTCTTTGTTGGCAGATTTTGGATCATTGATTCAATCTCTTTGCTAGTTATATGTCTGTTCGTATTTTCTATTTCTTTGTGGKTTAGTCTTGGTAGGTGGTATATGTCTAGGAATTTATCCATTTTGTCTAGGTTGTCCAATTTTTTGGCATACAAATATTCATACTATTGTCTTATTAATATAATCATTTTATTTCTGTTAAATCAGTGGTAATGTCTGCACTTACATTTCTGATTTTAGTTATTGAGACTTCCCTCTTTTATCTTACTCAGTCGAACTAATTGTTCATTAATTTTGGTGATTTTTTCAAAGAACTGAACTTGGTTTTGCTAACTTACTCTACCATGTTCCTATTCTTTATTTCAGTTGTCTGTACTCTAGTCTTTATTATTTCTTTCCTTCTACTGGATTTGGGTTTAGTGTGTTCTCCCTTTTTCTACTTCTTTAAGGTATAATGTTAGATTGTTAATTTAAGATCTTTCTTCTTGTTTATCATAAGCATTTACACTATAAACTACCCTCCTAGCACAGATTTTGATGCATCTGGTAAGTTTTGGTATGTTTACTGTAGCCCTGCAATATAGTTTGAAGTCAGGTAATGTGATGCCTCCAGCTGTGTTCTTTTTGCTTAGGGTTGCCTTGGCCATTCGGGCTCTTTTTTGGTTCCATATGAATTTTAAAATAGTTTTTTCTAGTTCTGTGAAGAATGTCATTGGTAGCTTAATAAAAATAGCATTGAATCTGTACACTGCTTTGGGCAGTATGGTCATTTTAATAAGATTGATTCTTCCTATCTGTGAGCATGAGATTTTTAAAAATTTGTTTTTGTCTTACCTGATTTCTTTCAGCAGTGCTTTGTAATTCTCACTGCAGAGATCTTTCACCTCCCTGGTTAGCTGTATTCCTAGATATTTTWTCATTTTTGCAGCAATTGTGAATGAGATTGCCTTCCTGATTTGTTTCTCGGCTTGGTTTCTTCTTGTTGTTTGTGTACAGGAATGCTGGTGATTTTTCTACATTGATTTTGTATCCTGAAACTTTGCTGAAGTTGTTTATCAGCTGAAGGAGCTTTTGGGTCRAGACTATGGGTTTTTCTAGATATAGAATCATGTCATCTGCAAATAGGGATAGTCTGATATCCTCTCTTCCTATTTGGATATGCTTTATTTCTTTATTTTGCCTGATTGCTCTGGCTAAGACTTCCAATAATACTTGAATAGGATTGGTGAAAGAAGGCATTCTTGTCACGTGTTGGTTTTCAAAAGGAATTCTTCCAGCTTTTGCCCATTTAGTATGATGTTGCCTGTTAGTTTGTCACATATGGCTCTTATTATTTTGAGTTGTGTTCCAAAACATCATGGTGCTGGTACAAAAACAGGCACATAGACCAATGSAACAGATAGAGAGCCTAGAAATAAGACTGCACACTTACAACCATCTGATCTTCAACAAAGCTGACAAAAACAAGCAATGGGGAAAAGACTCCCTATTCAATAAATGGTACTTGGATAAGTGGCTAGCCATATGCAGAAGATTGAAGGTAGACCCCTTCCTTGCACCATATACCAAAATCAACTCAAGATGGATTAAAGACTTACATATAAAACCCAAAACTATAAAAAACCCTGGGAGACAACCTAGGCAATATTATCCTGTACATAGGAATGGGCAAAGATTTCATGACAAAGCAATCACAAAAGCAATCACAACAAAAACAAAAATTGACAAATAAGATCTAATTAAACTTAAGAGCTTCTGCACAGCAAAAGAAACTATCAGCAGAGTAAACAGACAACCTACAGGATGGCAGAAAATATTTGCATATTATGCATCTGACAAAGGTCTAATATCCAGCATCTATAAGAAACTTAAACAAGTTTATAAGCAAAAAACAAACAACCCCATTAAAAAGGGGGCAAAGGACATGAACACTTCTCAAAAGAAGACATACGTGCAACCAACAAGCATATGAAGAAAAGCTCA
ATATCACTGATCATTAGAGAAATGCAAATAAAAACCACAACGAGATACTGTCTCACAACAATCAGAATAGCATTATTAAAAATTCAAAAAAATAACAGATACTGGTGAGGTTGTGGAGAAAAGGGACCACTTATACACTGTTGATGAAAGTGTAAGTTAGTTCAACCATTGTGGAAAGCAGTATGGCGATTCTTCAAAGAAAGAGCTAAAAACAGAATTACCATTCAACTCAGGAATCCCATTACTGGGTATATGCCCAGAGGAATATAAATCATTCCACCATAAAGACACATGCACANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAATCTCGGCTCACTGCAACCTCCGTCTCCCAGGTTCAAGTGATTCTCTTGCTTAACCCTCCCGAGTAGCTGGGATTACAGGCACCCACCAGAACACCCAGCTGATTTTTGTATTTTTAGCAGAGACAGGGTTTCACTGTGTTGGCCAGGCTGGTCTCGAACTCCTGACCTTGTGATCTGCCTGCCTTGGCCTCCCAAAGTACTGGGATTAATTATTTTTCCTTTTTAAGGTTAAATAATATTCCATTTTGTGGATATGCCACATTTTGTTTATCCATTCATCTGTCAACAGACACTTGGGTTGCTTCCATCTTTTGACTATTGTGAATAATGCTGT	N	60	PASS	PRECISE;SVTYPE=DEL;SVLEN=-4698;END=131597467;SUPPORT=8;COVERAGE=61,60,53,47,51;STRAND=+-;AC=2;STDEV_LEN=2.854;STDEV_POS=350.578;SUPP_VEC=111;AN=6;CSQ=deletion|intergenic_variant|MODIFIER||||||||||||||||1||||1||||||||||||||||||||||||||||||||||||||||||||||||||	GT:GQ:DR:DV:ID	0/1:9:29:9:Sniffles2.DEL.9AEES9,Sniffles2.DEL.9AEFS9,Sniffles2.DEL.9AF2S9	0/0:60:59:2:Sniffles2.DEL.DEDDS9,Sniffles2.DEL.DEDFS9	0/1:60:36:32:Sniffles2.DEL.FBCAS9,Sniffles2.DEL.FBCBS9
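
Records like this one can be found with a quick scan of the REF column. This is a minimal sketch, assuming an uncompressed VCF on stdin; `find_invalid_ref` is a hypothetical helper name. For each offending record it prints CHROM, POS, ID and the disallowed REF characters in order of appearance. Besides R, the REF above also appears to contain K, W and S.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report records whose REF (field 4) contains anything
# other than A, C, G, T or N (case-insensitive).
find_invalid_ref() {
    awk -F'\t' '
        /^#/ { next }                    # skip header lines
        {
            bad = toupper($4)
            gsub(/[ACGTN]/, "", bad)     # strip every allowed base
            if (bad != "") print $1 "\t" $2 "\t" $3 "\t" bad
        }'
}
```

For example: `gzip -dc vip_9_long_read_sv.vcf.gz | find_invalid_ref`.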

We were able to reproduce the issue with the GIAB HG002 trio, using the .snf files in malformed_vcf_issue.zip:

    # Excerpt from a wrapper function; CMD_SNIFFLES2 holds the Sniffles2 command.
    # Merge the three per-sample .snf files into one multi-sample VCF.
    args=()
    args+=("--input" HG002_9.snf HG003_9.snf HG004_9.snf)
    args+=("--reference" "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz")
    args+=("--tandem-repeats" "human_GRCh38_no_alt_analysis_set.trf.bed")
    args+=("--vcf" "vip_9_long_read_sv.vcf.gz")
    args+=("--threads" "4")

    ${CMD_SNIFFLES2} "${args[@]}"

Greetings,
@dennishendriksen

@dennishendriksen

Running

    bcftools norm --output out.vcf --fasta-ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz --check-ref wx vip_9_long_read_sv.vcf.gz

shows many more issues in the REF column:

    ...
    REF_MISMATCH    chr10   133765397       N       A
    REF_MISMATCH    chr10   133779600       N       A
    REF_MISMATCH    chr10   133785181       N       A
    REF_MISMATCH    chr10   133786526       N       C
    Lines   total/split/realigned/skipped:  2251/0/3/1055
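
To see which combinations dominate, the warnings can be tallied. This is a rough sketch, assuming the bcftools messages were captured in a text file (they are presumably written to stderr) and that the fourth and fifth columns of each REF_MISMATCH line are the two alleles being compared. `summarize_ref_mismatch` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count REF_MISMATCH lines per (column 4, column 5) pair
# and print "count<TAB>col4<TAB>col5", most frequent first.
summarize_ref_mismatch() {
    awk '$1 == "REF_MISMATCH" { n[$4 "\t" $5]++ }
         END { for (k in n) print n[k] "\t" k }' | sort -k1,1nr
}
```

For example, after running the bcftools command with `2> norm.log` appended: `summarize_ref_mismatch < norm.log`.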

The commands to generate the .snf files are similar to:

    # Excerpt from a wrapper function; CMD_SNIFFLES2 holds the Sniffles2 command.
    # Call SVs for one sample and write the per-sample .snf file.
    args=()
    args+=("--input" "vip_AshkenazimTrio_HG002_sliced.cram")
    args+=("--reference" "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz")
    args+=("--tandem-repeats" "human_GRCh38_no_alt_analysis_set.trf.bed")
    args+=("--snf" "HG002_9.snf")
    args+=("--sample-id" "HG002")
    args+=("--threads" "4")

    ${CMD_SNIFFLES2} "${args[@]}"
