Coordinate system query #2

Open
cbergman opened this issue Jun 4, 2015 · 2 comments

cbergman commented Jun 4, 2015

In the RetroSeq VCF file, the position of each TE insertion relative to the reference is given in 1-based coordinates in the POS column. In addition, there is a pair of consecutive coordinates in the INFO field, the first of which corresponds to the POS column and the second to the next base in the genome. Does this imply that the predicted insertion would integrate between the first and second positions in the INFO field? In other words, to convert RetroSeq predictions to 0-based coordinates, do we (i) use the two coordinates in the INFO field, or (ii) subtract 1 from the POS column to make a new start position in 0-based coordinates?

tk2 (Owner) commented Jun 9, 2015

Yes, that is correct. But to be honest, I never consider the breakpoints to be accurate to the exact bp. Some mini local assembly and realignment could get them to bp accuracy; I just never got around to implementing that.

cbergman (Author) commented Jul 2, 2015

Thanks and sorry for the slow reply.

We are assuming that "that is correct" refers to "Does this imply that the predicted insertion would integrate between the first and second positions in the INFO field?".

This means that RetroSeq is using the INFO field to represent the TE insertion location (which is in reality inter-base) on 1-based coordinates by annotating a consecutive span of 2 nucleotides, with the insertion site being between the first and second nucleotide. This 2-nucleotide span cannot be represented directly in the POS column of the VCF file, which only allows a 1-based single nucleotide feature to be annotated.

To convert RetroSeq output to 0-based BED format in https://github.com/bergmanlab/mcclintock, we will maintain the 2-nucleotide framework, and thus annotate POS-1 for the start and POS+1 for the end of the 2-nucleotide interval.
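
For anyone following along, here is a minimal sketch of that conversion in Python. It assumes the 1-based POS value has already been read from the VCF record, and that the target is a standard 0-based, half-open BED interval. The function names `retroseq_pos_to_bed` and `format_bed_line` are hypothetical and are not taken from RetroSeq or McClintock.

```python
from typing import Tuple


def retroseq_pos_to_bed(chrom: str, pos: int) -> Tuple[str, int, int]:
    """Map a 1-based VCF POS to a 0-based, half-open BED interval.

    The insertion is assumed to lie between the 1-based bases POS and POS+1.
    That 2-nucleotide span becomes [POS-1, POS+1) in BED coordinates.
    """
    if pos < 1:
        raise ValueError("VCF POS is 1-based and must be >= 1")
    start = pos - 1  # 0-based index of the first base of the span
    end = pos + 1    # exclusive end, one past the second base of the span
    return chrom, start, end


def format_bed_line(chrom: str, start: int, end: int) -> str:
    """Render the interval as a tab-separated three-column BED line."""
    return f"{chrom}\t{start}\t{end}"


if __name__ == "__main__":
    # Illustrative values only, not real RetroSeq output.
    print(format_bed_line(*retroseq_pos_to_bed("2L", 100)))
```

The resulting interval always has length 2 (end minus start), which matches the 2-nucleotide framework described above.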
