XM tag when dealing with CIGAR insertions/deletions #135
Hi Martin, We have generally tried to settle for the following mode of operation:
The C at position 4 should have been called as Unknown context ( In your Example 2 I would in fact expect the context to change to U (instead of |
Hi Felix, Thanks for your swift response. For the second example, here is the read;
and here is the resulting BAM entry;
I think if I'd seen everything behaving in the same way, e.g. as described in example 1, then I wouldn't really have queried it. I've seen other similar examples too, but it takes a long time to extract all the information needed to dig into what should have happened, so I've stuck with these two for now :-) FYI this is bismark VN:v0.18.2 with bowtie2 version 2.3.2 |
Thanks for that. I have now run this read and established that we are in fact padding insertions with
This explains why the context changes from |
I have now changed the methylation call behaviour so that both
|
Hi Felix, Thanks for your swift response to this. This looks like the more sensible option and probably what you originally intended when you wrote it. I will have a play with it over the next day or so to see how it looks before closing the issue. |
Great, thanks. |
This looks exactly how I was expecting and consistent throughout. Thanks for that. |
Thanks for the feedback and for spotting it in the first place. I'll try to include it in a new release soon. |
Hi, Sorry, I've found that it's still doing something odd, but with directional this time. I don't know whether this is because the CIGAR strings are more complex with this directional library, or whether the functions differ between directional and non-directional within bismark.
Example 3
I've highlighted the base with a
Read
Result
View
Example 4
With this one, I would have expected it to have been a
Read
Result
View
|
I'll take a look soon, was already feeling a bit bored anyway... |
Re Example 3: I think technically the result is correct, because the MD:Z field says that the C at the position you marked with
64M1D22M1D14M CIGAR
MD:Z:1C2C2C3C0C18C12C3C9C0C0C3^T3C1C1C5C4C0C2 ^C 3C1C1C0C5
So because the C is deleted, there won't be a methylation call for this position. I would however agree that it is a difficult decision to say that the |
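For anyone wanting to check this kind of thing by script rather than by eye, here is a minimal sketch that tokenises the CIGAR and MD:Z strings quoted above and reports where the deleted reference bases fall. The function names are made up for illustration and this is not how Bismark does it internally. It assumes the spaces around `^C` in the MD:Z value were only there for emphasis, and it reports 0-based offsets along the aligned stretch of reference.

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def parse_md(md):
    """Split an MD:Z value into match / mismatch / deletion tokens."""
    tokens = []
    for num, deleted, mismatch in re.findall(r"(\d+)|\^([A-Z]+)|([A-Z])", md):
        if num:
            tokens.append(("match", int(num)))
        elif deleted:
            tokens.append(("deletion", deleted))
        else:
            tokens.append(("mismatch", mismatch))
    return tokens

def deleted_reference_bases(md):
    """Return (0-based aligned reference offset, deleted bases) per deletion."""
    offset = 0
    deletions = []
    for kind, value in parse_md(md):
        if kind == "match":
            offset += value
        elif kind == "mismatch":
            offset += 1
        else:  # deletion: bases present in the reference but absent from the read
            deletions.append((offset, value))
            offset += len(value)
    return deletions

cigar = "64M1D22M1D14M"
# Spaces around ^C removed (assumed to be emphasis only).
md = "1C2C2C3C0C18C12C3C9C0C0C3^T3C1C1C5C4C0C2^C3C1C1C0C5"

print(parse_cigar(cigar))
print(deleted_reference_bases(md))
```

The two deletions land after 64 and then 22 aligned reference bases, which matches the `64M1D22M1D14M` layout, so the MD:Z field and the CIGAR string agree on where the `^T` and `^C` sit.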
Are you sure it's a C that's deleted? Following it through from the REF to the Modified REF it's a
Where the D is the base that's deleted and the * is the C/T in question. EDIT (this is giving me a headache) In this example, going by the CIGAR string, it's not a
So that makes sense, but by not marking it as a
|
Deconstructed, the read looks like this:
64M1D22M1D14M CIGAR
The first deleted position (^T) is irrelevant as there wouldn't have been a methylation call anyway. The second position (^C) does not get a methylation call as the C is deleted from the read (whether this is actually true or not is open for debate, see below). The two cytosines before the
Now the last portion of the read and C>T converted genome look like this:
This shows that during the alignment there is basically a stretch of 10 Ts in the genome, but only 9 Ts in the read. It is arguably extremely difficult, if not impossible, to say correctly which of those 10 potential bases is really deleted, and I am sure that in such a case it is a design decision in Bowtie 2 that it will arbitrarily pick the first base of the homopolymer stretch. I hope it becomes clearer in this example? |
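The ambiguity can be seen with a toy example. This is only an illustrative sketch: the flanking bases are invented and only the run of 10 Ts mirrors the case above. Deleting any one of the 10 Ts gives exactly the same read, so the read alone cannot tell an aligner which base was lost.

```python
def delete_at(seq, i):
    """Return seq with the base at 0-based index i removed."""
    return seq[:i] + seq[i + 1:]

# Hypothetical converted genome fragment: made-up flanks around a run of 10 Ts.
genome = "AC" + "T" * 10 + "GA"
run_start = 2

# Delete each of the 10 Ts in turn and collect the distinct resulting reads.
reads = {delete_at(genome, run_start + k) for k in range(10)}

print(len(reads))  # 1: every choice of deleted T produces the same read
```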
And here is example 4:
9M 1D 16M 1I 74M To me it appears that the C on the reverse strand is in context |
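Counting positions by hand through a CIGAR string with both a deletion and an insertion is error-prone, so here is a rough sketch of how read positions could be mapped to reference offsets for the CIGAR above. The function name `read_to_ref` is invented for this illustration and offsets are 0-based relative to the alignment start; it is not taken from Bismark or Bowtie 2.

```python
import re

def read_to_ref(cigar):
    """Map each read index to a 0-based reference offset (None if inserted)."""
    mapping = []
    ref = 0
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(n)
        if op in "M=X":      # consumes read and reference
            mapping.extend(range(ref, ref + n))
            ref += n
        elif op in "IS":     # consumes read only: no reference base to look up
            mapping.extend([None] * n)
        elif op in "DN":     # consumes reference only: skipped in the read
            ref += n
        # H and P consume neither read nor reference
    return mapping

m = read_to_ref("9M1D16M1I74M")
print(len(m))        # read length
print(m[8], m[9])    # offsets either side of the 1D
print(m[24], m[25], m[26])  # offsets around the 1I
```

Read index 9 maps to reference offset 10 because offset 9 is the deleted base, and read index 25 has no reference offset at all because it is the inserted base.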
Morning Felix, You're right, I was miscounting the location of those more complex CIGAR modifications. Thanks for taking the time to check through them. Cheers, Martin |
You're welcome. If there are no further issues I will start preparing the new release later today. Cheers, Felix |
That's great, I don't think there are any more issues. Cheers, Martin |
Hi Felix,
I hope you're well. Sorry, this is a long one, markdown is quite difficult to use to show this kind of stuff.
I think I've found an issue with the way XM tags are being created with regard to the CIGAR operations made to the reference in order to facilitate a match. It appears (but I can't be sure) that the context is being extracted in a way that doesn't take into account these operations.
Example 1
The following non-directional read;
mapped against TAIR10 gives the following mapping result (tags removed for brevity);
CIGAR String = 71M1D4M
So in this case, the final unmethylated C in the original reference would be a CTG, which is a CHG context, but it's being marked as a CHH, presumably because in the modified reference the G is removed and it becomes a CTT, i.e. a CHH context.
Example 2
I have another example with an Insertion to the reference to make the read match. The read is;
(Again non-directional TAIR10). The mapping result is as follows (this time the opposite strand);
CIGAR String = 4M1I69M
In this instance the first unmethylated CHH encountered spans that insertion. The A/G is identified and, working backwards since it's the opposite strand, we get a CAG where G is the unmethylated C. Reverse complementing this to make it easier to discuss, we get CTG, which would make this a CHG context; however, it's being marked as a CHH.
Thoughts?
In the first example it is as though the context is being calculated from the raw or modified reference, but in the second example it is as though something else is happening. Maybe there is a +/- 1 thing happening where it's looking at AGC to calculate the context.
When looking at a particular cytosine in a read, I would expect the context to be calculated from the original reference, since that base in the read is supposed to represent that particular base in the reference. Is that correct and the intended behaviour?
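To make the expectation concrete, here is a small sketch of the usual context definition (CpG, CHG, CHH, where H is A, C or T) applied to the triplets from the two examples. The helper names are invented for illustration and this is not Bismark's code; it only classifies a cytosine from the three reference bases it is given.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def revcomp(seq):
    """Reverse complement a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def context(triplet):
    """Classify a cytosine from itself plus its two downstream reference bases."""
    if len(triplet) != 3 or triplet[0] != "C":
        raise ValueError("expected a 3-base string starting with C")
    second, third = triplet[1], triplet[2]
    if second == "G":
        return "CpG"
    if "N" in (second, third):
        return "unknown"
    return "CHG" if third == "G" else "CHH"

# Example 1: original reference vs. reference with the G removed.
print(context("CTG"), context("CTT"))

# Example 2: opposite strand, so reverse complement before classifying.
print(revcomp("CAG"), context(revcomp("CAG")))
```

With the original reference triplets both examples come out as CHG, while the triplet with the G removed comes out as CHH, which is the discrepancy described above.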
Hopefully this makes sense.