XM tag when dealing with CIGAR insertions/deletions #135

martinjvickers · 2017-10-09T12:08:14Z

Hi Felix,

I hope you're well. Sorry, this is a long one, markdown is quite difficult to use to show this kind of stuff.

I think I've found an issue with the way XM tags are being created with regard to the CIGAR operations made to the reference in order to facilitate a match. It appears (but I can't be sure) that the context is being extracted in a way that doesn't take into account these operations.

Example 1

The following non-directional read;

@NS500422:333:H3TJVBGXY:1:11101:10255:2527
CTTAAAACAAACTTTAACTTCCCTTCCTAACAAACCAACAAATTAAAAATATAAAAACAATTCCTTACCAAAAAT
+
AAAAAE6E/EAEE6EE/AEEEAEAEEEEE/AEEEEEEEEEAAEE/EAEEE6E6EE/EEEE6EEAEE/AEEEEEEE

mapped against TAIR10 gives the following mapping result (tags removed for brevity);

NS500422:333:H3TJVBGXY:1:11101:10255:2527     0       2       3282285 71M1D4M ATTTTTGGTAAGGAATTGTTTTTATATTTTTAATTTGTTGGTTTGTTAGGAAGGGAAGTTAAAGTTTGTTTTAAG     XM:Z:.h.hx..........xz..h....h..h.h...h.......h.z....................h...hh.....

CIGAR String = 71M1D4M

ORIG_REF  ACTCCTGGTAAGGAACCGTCTTTACATCTCTAACTTGTTGGCTCGTTAGGAAGGGAAGTTAAAGCTTGCCTGTAAG
MOD_REF   ACTCCTGGTAAGGAACCGTCTTTACATCTCTAACTTGTTGGCTCGTTAGGAAGGGAAGTTAAAGCTTGCCT TAAG
READ      ATTTTTGGTAAGGAATTGTTTTTATATTTTTAATTTGTTGGTTTGTTAGGAAGGGAAGTTAAAGTTTGTTT TAAG
XM:Z:     .h.hx..........xz..h....h..h.h...h.......h.z....................h...hh. ....

So in this case, the final unmethylated C in the original reference would be a CTG which is a CHG context but it's being marked as a CHH presumably because in the modified reference the G is removed and it becomes a CTT to become a CHH context.

Example 2

I have another example with an Insertion to the reference to make the read match. The read is;

@NS500422:333:H3TJVBGXY:1:11101:10593:2969
CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEAA6EEEAEEAEEEEEEE

(Again non-directional TAIR10). The mapping result is as follows (this time the opposite strand);

NS500422:333:H3TJVBGXY:1:11101:10593:2969     16      4       3165898 4M1I69M CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA      XM:Z:......h....Z...X.h............h.hh..h.......hhh..............h.........h.h

CIGAR String = 4M1I69M

ORIG_REF   CCGCAGCCTCGACTGAGACCCTTAAATAAGAGGTTGATTCATAGGGATACAAAAATAAAAGAAATAAATAGAGA
MOD_REF    CCGC AGCCTCGACTGAGACCCTTAAATAAGAGGTTGATTCATAGGGATACAAAAATAAAAGAAATAAATAGAGA
READ       CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA
XM:Z:      ......h....Z...X.h............h.hh..h.......hhh..............h.........h.h

In this instance the first unmethylated CHH encountered spans that insertion. The A/G is identified and working backwards since it's the opposite strand we get a CAG where G is the unmethylated C. So reverse complement this to make it easier to discuss, we get CTG which would make this a CHG context, however it's being marked as a CHH.

Thoughts?

In the first example it is as though the context is being calculated from the raw or modified reference, but in the second example it is as though something else is happening. Maybe there is a +/- 1 thing happening where it's looking at AGC to calculate the context.

When looking at a particular cytosine in a read I would expect that the context to be calculated from the original reference since that base in the read is supposed to represent that particular base in the reference. Is that correct and the intended behaviour?

Hopefully this makes sense.

The text was updated successfully, but these errors were encountered:

FelixKrueger · 2017-10-09T15:45:46Z

Hi Martin,

We have generally tried to settle for the following mode of operation:

For deletions we do indeed apply the same deletion to the reference genome and use the resulting sequence to determine the sequence context. So in your Example 1 the CTG would become CTT which is a CHH context. If you look in real data this seems to really happen and be the right thing to do. It is not very obvious if there is a change from CHH to CHG (at least not in mammalian systems, it might already be much more obvious in TAIR10), but when you change a CpG to CHH and vice versa you are almost guaranteed to see a state change from unmethylated to methylated if we wouldn't take the deletions into account (assuming ~80% overall methylation in CpG context and ~0% methylation in non-CG context).
For insertions it is more complicated to determine the context, and I am not sure there is aright way to do it. One can obviously not use the genomic sequence because the insertion as such might change the context, but we also don't really trust the actual base calls in the read so much that we want to use the read to make the context call. So for the actual methylation call we insert Ns into the reference for the context call, like so:

original reference: CAGC   CAATTA
Read:               CAGCATACAATTA
Modified Reference: CAGCNNNCAATTA

The C at position 4 should have been called as Unknown context (U) because we can't actually say with certainty what the context was.

In your Example 2 I would in fact expect the context to change to U (instead of h), which is strange.... Would you mind sending me that one entry of the BAM file and ideally also as FastQ entry so I can investigate?

martinjvickers · 2017-10-09T16:16:10Z

Hi Felix,

Thanks for your swift response. For the second example, here is the read;

@NS500422:333:H3TJVBGXY:1:11101:10593:2969
CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEAA6EEEAEEAEEEEEEE

and here is the resulting BAM entry;

NS500422:333:H3TJVBGXY:1:11101:10593:2969	16	4	3165898	0	4M1I69M	*	0	0	CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEAA6EEEAEEAEEEEEEE	NM:i:14	MD:Z:2G2G10G12G1G0G2G7G0G0G14G9G1G0	XM:Z:......h....Z...X.h............h.hh..h.......hhh..............h.........h.h	XR:Z:GA	XG:Z:GA

I think if I'd have seen everything behaving in the same way, e.g. as described in example 1, then I wouldn't have really queried it. I've seen other similar examples too but it takes a long time to extract all the information to dig into what should have happened so I've stuck with these two for now :-)

FYI this is bismark VN:v0.18.2 with bowtie2 version 2.3.2

FelixKrueger · 2017-10-10T11:00:19Z

Thanks for that. I have now run this read and established that we are in fact padding insertions with X and not with N. (it's obviously quite a while since I have written this...).
Here is the read alignment:

  CCGC-AGCCTCGACTGAGACCCTTAAATAAGAGGTTGATTCATAGGGATACAAAAATAAAAGAAATAAATAGAG genomic sequence
  CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA Read sequence
AACCGCXAGCCTCGACTGAGACCCTTAAATAAGAGGTTGATTCATAGGGATACAAAAATAAAAGAAATAAATAGAG extracted corresponding genomic sequence
  ......h....Z...X.h............h.hh..h.......hhh..............h.........h.h methylation call

This explains why the context changes from CHG to CHH, but I no longer sure this is what we actually want... Maybe I should add conditionals to the methylation call procedure to assign Unknown context if there was either an N or X in the up- or downstream sequence, would you agree?

FelixKrueger · 2017-10-10T11:21:56Z

I have now changed the changed the methylation call behaviour so that both N and X will result in Unknown context ab2d010. Here is the new output:

CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA
......u....Z...X.h............h.hh..h.......hhh..............h.........h.h

martinjvickers · 2017-10-11T08:38:27Z

Hi Felix,

Thanks for your swift response to this. This looks like the more sensible option and probably what you originally intended when you wrote it. I will have a play with it over the next day or so to see how it looks before closing the issue.

FelixKrueger · 2017-10-11T08:39:49Z

Great, thanks.

martinjvickers · 2017-10-11T12:53:09Z

This looks exactly how I was expecting and consistent throughout. Thanks for that.

FelixKrueger · 2017-10-11T12:58:43Z

Thanks for the feedback and for spotting it in the first place. I'll try to include it in a new release soon.

martinjvickers · 2017-10-12T10:08:20Z

Hi,

Sorry, I've found that it's still doing something odd but with directional this time. I don't know if this is because with this directional library the CIGAR strings are more complex or are the functions are different between directional and non-directional within bismark?

Example 3

I've highlighted the base with a * in the view below. Based upon what we said about example 1 from yesterday, shouldn't this base have been counted?

Read

@HS3:420:C3EHMACXX:8:1101:11002:18642_21468
TTTTTTTTAGTTTGGATTATAAGAATTTTGTTGATTGAAGAAAATAGGTTAAATTAAATTTTAATTTTTTTTTTAGATAATTTTTATTTTTTTTTATTTT
+
CCCFFFFFFDFFHHFGHHGGIJ?FGHJJJ?GH?BGHBGGDHIJIJII@HGHJIIHIIEIIJIJEHHHHHDDDDDB?AADCDDEDDBDDEEDDDDDACDEE

Result

HS3:420:C3EHMACXX:8:1101:11002:18642_21468	0	3	391275	0	64M1D22M1D14M	*	0	0	TTTTTTTTAGTTTGGATTATAAGAATTTTGTTGATTGAAGAAAATAGGTTAAATTAAATTTTAATTTTTTTTTTAGATAATTTTTATTTTTTTTTATTTT	CCCFFFFFFDFFHHFGHHGGIJ?FGHJJJ?GH?BGHBGGDHIJIJII@HGHJIIHIIEIIJIJEHHHHHDDDDDB?AADCDDEDDBDDEEDDDDDACDEE	NM:i:23	MD:Z:1C2C2C3C0C18C12C3C9C0C0C3^T3C1C1C5C4C0C2^C3C1C1C0C5	XM:Z:.h..h..x...xz..................z............x...h.........hhh......h.h.h.....h....hh.....h.h.hh.....	XR:Z:CT	XG:Z:CT

View

REF  TTTCTTCTTCAGTCCGGATTATAAGAATTTTGTCGATTGAAGAAAACAGGCTAAATTAAACCCTAATTTTCTCTCTTAGACAATTCCTACTTTCTCTCCATTTTTC
MOD  TTTCTTCTTCAGTCCGGATTATAAGAATTTTGTCGATTGAAGAAAACAGGCTAAATTAAACCCTAATTTCTCTCTTAGACAATTCCTACTTCTCTCCATTTTTC
READ   TTTTTTTTAGTTTGGATTATAAGAATTTTGTTGATTGAAGAAAATAGGTTAAATTAAATTTTAATTTTTTTTTTAGATAATTTTTATTTTTTTTTATTTT
       .h..h..x...xz..................z............x...h.........hhh......h.h.h.....h....hh.....h.h.hh.....
                                                                                             *

Example 4

With this one, I would have expected it to have been a u.

Read

@HS3:420:C3EHMACXX:8:1101:10457:4989_3714
ATATAATTTTTTTGTAAGCTATTTATTATTTTTTTGTATTATTTTATTTTATTTTTTTGTTATATTTTTTTTTTTGGTATTTTTTTTTTTTTTATATTTT
+
CCCFFFFFHHHHHGGIJIHIHHIJJJJJJJJJJJJHHIIJJJJJJJJJJJJJJJJJHFFEEEEEEEFEDDDDDDDD<A@CDEEDDDDDDD&5<8:3>BDD

Result

HS3:420:C3EHMACXX:8:1101:10457:4989_3714	16	1	18445446	0	9M1D16M1I74M	*	0	0	AAAATATAAAAAAAAAAAAAATACCAAAAAAAAAAATATAACAAAAAAATAAAATAAAATAATACAAAAAAATAATAAATAGCTTACAAAAAAATTATAT	DDB>3:8<5&DDDDDDDEEDC@A<DDDDDDDDEFEEEEEEEFFHJJJJJJJJJJJJJJJJJIIHHJJJJJJJJJJJJIHHIHIJIGGHHHHHFFFFFCCC	NM:i:25	MD:Z:1G1G1G3^C7G1G0G7G6G1G12G6G11G0G2G2G0G2G0A7G1G0G0G4G1	XM:Z:.h.h.h..........h.hh........h......h.h............h......h...........hh..h..hh..h........h.hhh....h.	XR:Z:CT	XG:Z:GA

View

REF  ATAGAGTGTAACAAAAAAAGAGGATACCAAGAAAAAAGTGTAACAAAAAAATGAAATAAGATAATACAAAAGGATGATGGATGACTTACAAGAGGGTTATGTTTG
MOD  ATAGAGTGTAAAAAAAAAGAGGATACCANAGAAAAAAGTGTAACAAAAAAATGAAATAAGATAATACAAAAGGATGATGGATGACTTACAAGAGGGTTATGTTTG
READ   AAAATATAAAAAAAAAAAAAATACCAAAAAAAAAAATATAACAAAAAAATAAAATAAAATAATACAAAAAAATAATAAATAGCTTACAAAAAAATTATAT
       .h.h.h..........h.hh........h......h.h............h......h...........hh..h..hh..h........h.hhh....h.
                                   *

FelixKrueger · 2017-10-12T10:24:21Z

I'll take a look soon, was already feeling a bit bored anyway...

FelixKrueger · 2017-10-12T13:43:15Z

Re Example 3:

I think technically the result is correct, because the MD:Z field says that the C at the position you marked with * is deleted:

64M1D22M1D14M CIGAR

MD:Z:1C2C2C3C0C18C12C3C9C0C0C3^T3C1C1C5C4C0C2 ^C 3C1C1C0C5 MD:Z:

So because the C is deleted, there won't be a methylation call for this positions. I would however agree that it is a difficult decision to say that the C is deleted when the genomic sequence is CTTT and the read sequence is TTTT, because in mapping terms it will simply look like TTTT anyway. I reckon if one can't be exactly sure then it is probably best to completely ignore the position completely (which is what happens). Will look at Example 4 next.

martinjvickers · 2017-10-12T14:08:15Z

Are you sure it's a C that's deleted? Following it through from the REF to the Modified REF it's a T that is deleted from the reference to make this part of it work.

RAW  CTACTTTCTC
MOD  CTACTT CTC
READ TTATTT TTT
     h..... h.h
        *  D

Where the D is the base that's deleted and the * is the C/T in question.

EDIT (this is giving me a headache) In this example, going by the CIGAR string, it's not a C that is being deleted it's actually the first T that's been removed from the reference;

RAW  CTACTTTCTC
MOD  CTAC TTCTC
READ TTAT TTTTT
     h... ..h.h
        *D

So that makes sense, but by not marking it as a CHH because of a base being deleted goes against what we said in example 1 where we said that in the following example the final h should be a CHH rather than a CHG because the modified reference would show a CTT because of the deleted G from the original reference.

RAW  CCTGTAAG
MOD  CCT TAAG
READ TTT TAAG
     hh. ....

FelixKrueger · 2017-10-12T15:34:07Z

Deconstructed, the read looks like this:

REF    TCTTCTTCAGTCCGGATTATAAGAATTTTGTCGATTGAAGAAAACAGGCTAAATTAAACCCTAATTTTCTCTCTTAGACAATTCCTACTTTCTCTCCATTTT
MOD    TCTTCTTCAGTCCGGATTATAAGAATTTTGTCGATTGAAGAAAACAGGCTAAATTAAACCCTAA TTTCTCTCTTAGACAATTCCTA TTTCTCTCCATTTT
READ   TTTTTTTTAGTTTGGATTATAAGAATTTTGTTGATTGAAGAAAATAGGTTAAATTAAATTTTAA TTTTTTTTTTAGATAATTTTTA TTTTTTTTTATTTT
                                                                       D                      D
       MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM MMMMMMMMMMMMMMMMMMMMMM MMMMMMMMMMMMMM
       .h..h..x...xz..................z............x...h.........hhh... ...h.h.h.....h....hh.. ...h.h.hh.....

                                                                                              *

64M1D22M1D14M CIGAR
MD:Z:1C2C2C3C0C18C12C3C9C0C0C3^T3C1C1C5C4C0C2 ^C 3C1C1C0C5 MD:Z:

The first deleted position (^T) is irrelevant as there wouldn't have been a methylation call anyway. The second position (^C) does not get a methylation call as the C is deleted from the read (whether this is actually true or not is open for debate, see below). The two cytosines before the * are in CCT and CTA context, and are thus correctly classified as hh.

Now the last portion of the read and C>T converted genome look like this:

CTTTCTCTCCATTTT true genome 
TTTTTTTTTTATTTT C>T genome
 TTTTTTTTTATTTT read

This shows that during the alignment there is basically a stretch or 10 Ts in the genome, but only 9 Ts in the read. It is arguably extremely difficult, if not impossible, to correctly say which of those 10 potential bases is really deleted, and I am sure that in such a case it is a design decision in Bowtie 2 that it will arbitrarily pick the first one of this stretch of homo-polymers.

I hope it becomes clearer in this example?

FelixKrueger · 2017-10-13T08:20:01Z

And here is example 4:

REF  ATAGAGTGTAACAAAAAAAGAGGATACC AAGAAAAAAGTGTAACAAAAAAATGAAATAAGATAATACAAAAGGATGATGGATGACTTACAAGAGGGTTATGTTTG
MOD  ATAGAGTGTAA AAAAAAAGAGGATACCXAAGAAAAAAGTGTAACAAAAAAATGAAATAAGATAATACAAAAGGATGATGGATGACTTACAAGAGGGTTATGTTTG
READ   AAAATATAA AAAAAAAAAAAATACCAAAAAAAAAAATATAACAAAAAAATAAAATAAAATAATACAAAAAAATAATAAATAGCTTACAAAAAAATTATAT
       MMMMMMMMM MMMMMMMMMMMMMMMM MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
                D                I 
       .h.h.h... .......h.hh........h......h.h............h......h...........hh..h..hh..h........h.hhh....h.
                                    *

9M 1D 16M 1I 74M
MD:Z:1G1G1G3^C7G1G0G7G6G1G12G6G11G0G2G2G0G2G0A7G1G0G0G4G1
XR:Z:CT XG:Z:GA

To me it appears that the C on the reverse strand is in context CTT and this is called correctly as h. The X would be the third downstream base in this example. Does this make sense?

martinjvickers · 2017-10-13T09:53:25Z

Morning Felix,

You're right, I was miscounting the location of those more complex CIGAR modifications. Thanks for taking the time to check through them.

Cheers,

Martin

FelixKrueger · 2017-10-13T09:55:46Z

You're welcome. If there are no further issues I will start preparing the new release later today. Cheers, Felix

martinjvickers · 2017-10-13T10:06:54Z

That's great, I don't think there are any more issues. Cheers, Martin

martinjvickers closed this as completed Oct 11, 2017

martinjvickers reopened this Oct 12, 2017

martinjvickers closed this as completed Oct 13, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

XM tag when dealing with CIGAR insertions/deletions #135

XM tag when dealing with CIGAR insertions/deletions #135

martinjvickers commented Oct 9, 2017

FelixKrueger commented Oct 9, 2017

martinjvickers commented Oct 9, 2017

FelixKrueger commented Oct 10, 2017

FelixKrueger commented Oct 10, 2017

martinjvickers commented Oct 11, 2017

FelixKrueger commented Oct 11, 2017

martinjvickers commented Oct 11, 2017

FelixKrueger commented Oct 11, 2017

martinjvickers commented Oct 12, 2017 •

edited

Loading

FelixKrueger commented Oct 12, 2017

FelixKrueger commented Oct 12, 2017

martinjvickers commented Oct 12, 2017 •

edited

Loading

FelixKrueger commented Oct 12, 2017

FelixKrueger commented Oct 13, 2017

martinjvickers commented Oct 13, 2017

FelixKrueger commented Oct 13, 2017

martinjvickers commented Oct 13, 2017

XM tag when dealing with CIGAR insertions/deletions #135

XM tag when dealing with CIGAR insertions/deletions #135

Comments

martinjvickers commented Oct 9, 2017

Example 1

Example 2

Thoughts?

FelixKrueger commented Oct 9, 2017

martinjvickers commented Oct 9, 2017

FelixKrueger commented Oct 10, 2017

FelixKrueger commented Oct 10, 2017

martinjvickers commented Oct 11, 2017

FelixKrueger commented Oct 11, 2017

martinjvickers commented Oct 11, 2017

FelixKrueger commented Oct 11, 2017

martinjvickers commented Oct 12, 2017 • edited Loading

Example 3

Read

Result

View

Example 4

Read

Result

View

FelixKrueger commented Oct 12, 2017

FelixKrueger commented Oct 12, 2017

martinjvickers commented Oct 12, 2017 • edited Loading

FelixKrueger commented Oct 12, 2017

FelixKrueger commented Oct 13, 2017

martinjvickers commented Oct 13, 2017

FelixKrueger commented Oct 13, 2017

martinjvickers commented Oct 13, 2017

martinjvickers commented Oct 12, 2017 •

edited

Loading

martinjvickers commented Oct 12, 2017 •

edited

Loading