Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

END coordinates in VCF files #527

Open
nspies opened this issue Aug 15, 2017 · 4 comments
Open

END coordinates in VCF files #527

nspies opened this issue Aug 15, 2017 · 4 comments
Labels

Comments

@nspies
Copy link
Contributor

nspies commented Aug 15, 2017

I'm trying to parse VCF files with perhaps incomplete headers. This generally works okay, but I'm not sure how best to parse the "END" coordinate if it's not defined in the header. It seems to only be parsed into variant.stop when "END" been defined in the header -- which would be okay, except I also can't get it out of variant.info since pysam insists that "END" is a reserved keyword (https://github.com/pysam-developers/pysam/blob/master/pysam/libcbcf.pyx#L2525). I'm not sure how much of this behavior is caused by pysam or by htslib, but ideally variant.stop would get set irrespective of the header value. I can get around this with this awkward workaround:

if not "END" in vcf.header.info:
    vcf.header.add_line(
        """##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate">"""

See also #402 -- I regularly see VCF files without any headers at all and it'd be great if pysam could deal more nicely with this case.

@bioinformed
Copy link
Member

Hi @nspies: I've run some tests locally and it does appear that variant.stop is set correctly even when the original file header does not include metadata for END.

example.vcf

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001
20      1555555 .       A       <NON_REF>       99      PASS    END=2555555     GT      0/0

test.py

import pysam
vars = pysam.VariantFile('foo_vcf42.vcf')
for record in vars:
    print(record.contig, record.start, record.stop)

output

[W::vcf_parse] Contig '20' is not defined in the header. (Quick workaround: index the file with tabix.)
20 1555554 2555555

If END was ignored, then the output would have been:

20 1555554 1555555

@nspies
Copy link
Contributor Author

nspies commented Aug 22, 2017

Hrm, I'm not seeing what you're seeing. If I copy your vcf and python script, I get this when using the latest release or the git development version (0.12):

$ python test.py
pysam version: 0.12
[W::vcf_parse] Contig '20' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] INFO 'END' is not defined in the header, assuming Type=String
20 1555554 1555555

I've tried this in python 3.5 and 3.6 on mac and linux with the same result.

@bioinformed
Copy link
Member

That isn't good. Let me do some more digging and get back to you.

@creisle
Copy link

creisle commented Mar 26, 2018

I am having a similar (the same?) issue to this. Loading a bcf file through pysam will give me different results if I load it as a bcf directly vs if I use bcftools to convert it to a vcf first.

The stop/end is the same as record.pos when loaded from the bcf file (152450946) but if I read it from the converted vcf it is correct (152451266)

This was using pysam 0.14.1

the vcf record

1	152450946	DUP00000307	N	<DUP>	.	LowQual	IMPRECISE;SVTYPE=DUP;SVMETHOD=EMBL.DELLYv0.7.3;CHR2=1;END=152451266;INSLEN=0;PE=2;MAPQ=33;CT=5to3;CIPOS=-578,578;CIEND=-578,578	GT:GL:GQ:FT:RCL:RC:RCR:CN:DR:DV:RR:RV	0/0:0,-23.0804,-485.79:10000:PASS:118:231:107:2:87:1:0:0	0/0:0,-9.33875,-217.7:93:PASS:40:106:51:2:38:1:0:0

I am able to replicate this with the vcf if I remove the END definition from in the header

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants