Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XM tag when dealing with CIGAR insertions/deletions #135

Closed
martinjvickers opened this issue Oct 9, 2017 · 17 comments
Closed

XM tag when dealing with CIGAR insertions/deletions #135

martinjvickers opened this issue Oct 9, 2017 · 17 comments

Comments

@martinjvickers
Copy link

Hi Felix,

I hope you're well. Sorry, this is a long one, markdown is quite difficult to use to show this kind of stuff.

I think I've found an issue with the way XM tags are being created with regard to the CIGAR operations made to the reference in order to facilitate a match. It appears (but I can't be sure) that the context is being extracted in a way that doesn't take into account these operations.

Example 1

The following non-directional read;

@NS500422:333:H3TJVBGXY:1:11101:10255:2527
CTTAAAACAAACTTTAACTTCCCTTCCTAACAAACCAACAAATTAAAAATATAAAAACAATTCCTTACCAAAAAT
+
AAAAAE6E/EAEE6EE/AEEEAEAEEEEE/AEEEEEEEEEAAEE/EAEEE6E6EE/EEEE6EEAEE/AEEEEEEE

mapped against TAIR10 gives the following mapping result (tags removed for brevity);

NS500422:333:H3TJVBGXY:1:11101:10255:2527     0       2       3282285 71M1D4M ATTTTTGGTAAGGAATTGTTTTTATATTTTTAATTTGTTGGTTTGTTAGGAAGGGAAGTTAAAGTTTGTTTTAAG     XM:Z:.h.hx..........xz..h....h..h.h...h.......h.z....................h...hh.....

CIGAR String = 71M1D4M

ORIG_REF  ACTCCTGGTAAGGAACCGTCTTTACATCTCTAACTTGTTGGCTCGTTAGGAAGGGAAGTTAAAGCTTGCCTGTAAG
MOD_REF   ACTCCTGGTAAGGAACCGTCTTTACATCTCTAACTTGTTGGCTCGTTAGGAAGGGAAGTTAAAGCTTGCCT TAAG
READ      ATTTTTGGTAAGGAATTGTTTTTATATTTTTAATTTGTTGGTTTGTTAGGAAGGGAAGTTAAAGTTTGTTT TAAG
XM:Z:     .h.hx..........xz..h....h..h.h...h.......h.z....................h...hh. ....

So in this case, the final unmethylated C in the original reference would be a CTG which is a CHG context but it's being marked as a CHH presumably because in the modified reference the G is removed and it becomes a CTT to become a CHH context.

Example 2

I have another example with an Insertion to the reference to make the read match. The read is;

@NS500422:333:H3TJVBGXY:1:11101:10593:2969
CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEAA6EEEAEEAEEEEEEE

(Again non-directional TAIR10). The mapping result is as follows (this time the opposite strand);

NS500422:333:H3TJVBGXY:1:11101:10593:2969     16      4       3165898 4M1I69M CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA      XM:Z:......h....Z...X.h............h.hh..h.......hhh..............h.........h.h

CIGAR String = 4M1I69M

ORIG_REF   CCGCAGCCTCGACTGAGACCCTTAAATAAGAGGTTGATTCATAGGGATACAAAAATAAAAGAAATAAATAGAGA
MOD_REF    CCGC AGCCTCGACTGAGACCCTTAAATAAGAGGTTGATTCATAGGGATACAAAAATAAAAGAAATAAATAGAGA
READ       CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA
XM:Z:      ......h....Z...X.h............h.hh..h.......hhh..............h.........h.h

In this instance the first unmethylated CHH encountered spans that insertion. The A/G is identified and working backwards since it's the opposite strand we get a CAG where G is the unmethylated C. So reverse complement this to make it easier to discuss, we get CTG which would make this a CHG context, however it's being marked as a CHH.

Thoughts?

In the first example it is as though the context is being calculated from the raw or modified reference, but in the second example it is as though something else is happening. Maybe there is a +/- 1 thing happening where it's looking at AGC to calculate the context.

When looking at a particular cytosine in a read I would expect that the context to be calculated from the original reference since that base in the read is supposed to represent that particular base in the reference. Is that correct and the intended behaviour?

Hopefully this makes sense.

@FelixKrueger
Copy link
Owner

Hi Martin,

We have generally tried to settle for the following mode of operation:

  • For deletions we do indeed apply the same deletion to the reference genome and use the resulting sequence to determine the sequence context. So in your Example 1 the CTG would become CTT which is a CHH context. If you look in real data this seems to really happen and be the right thing to do. It is not very obvious if there is a change from CHH to CHG (at least not in mammalian systems, it might already be much more obvious in TAIR10), but when you change a CpG to CHH and vice versa you are almost guaranteed to see a state change from unmethylated to methylated if we wouldn't take the deletions into account (assuming ~80% overall methylation in CpG context and ~0% methylation in non-CG context).

  • For insertions it is more complicated to determine the context, and I am not sure there is aright way to do it. One can obviously not use the genomic sequence because the insertion as such might change the context, but we also don't really trust the actual base calls in the read so much that we want to use the read to make the context call. So for the actual methylation call we insert Ns into the reference for the context call, like so:

original reference: CAGC   CAATTA
Read:               CAGCATACAATTA
Modified Reference: CAGCNNNCAATTA

The C at position 4 should have been called as Unknown context (U) because we can't actually say with certainty what the context was.

In your Example 2 I would in fact expect the context to change to U (instead of h), which is strange.... Would you mind sending me that one entry of the BAM file and ideally also as FastQ entry so I can investigate?

@martinjvickers
Copy link
Author

Hi Felix,

Thanks for your swift response. For the second example, here is the read;

@NS500422:333:H3TJVBGXY:1:11101:10593:2969
CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEAA6EEEAEEAEEEEEEE

and here is the resulting BAM entry;

NS500422:333:H3TJVBGXY:1:11101:10593:2969	16	4	3165898	0	4M1I69M	*	0	0	CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEAA6EEEAEEAEEEEEEE	NM:i:14	MD:Z:2G2G10G12G1G0G2G7G0G0G14G9G1G0	XM:Z:......h....Z...X.h............h.hh..h.......hhh..............h.........h.h	XR:Z:GA	XG:Z:GA

I think if I'd have seen everything behaving in the same way, e.g. as described in example 1, then I wouldn't have really queried it. I've seen other similar examples too but it takes a long time to extract all the information to dig into what should have happened so I've stuck with these two for now :-)

FYI this is bismark VN:v0.18.2 with bowtie2 version 2.3.2

@FelixKrueger
Copy link
Owner

Thanks for that. I have now run this read and established that we are in fact padding insertions with X and not with N. (it's obviously quite a while since I have written this...).
Here is the read alignment:

  CCGC-AGCCTCGACTGAGACCCTTAAATAAGAGGTTGATTCATAGGGATACAAAAATAAAAGAAATAAATAGAG genomic sequence
  CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA Read sequence
AACCGCXAGCCTCGACTGAGACCCTTAAATAAGAGGTTGATTCATAGGGATACAAAAATAAAAGAAATAAATAGAG extracted corresponding genomic sequence
  ......h....Z...X.h............h.hh..h.......hhh..............h.........h.h methylation call

This explains why the context changes from CHG to CHH, but I no longer sure this is what we actually want... Maybe I should add conditionals to the methylation call procedure to assign Unknown context if there was either an N or X in the up- or downstream sequence, would you agree?

@FelixKrueger
Copy link
Owner

I have now changed the changed the methylation call behaviour so that both N and X will result in Unknown context ab2d010. Here is the new output:

CCCCGAACCTCGACTGAAACCCTTAAATAAAAAATTAATTCATAAAAATACAAAAATAAAAAAAATAAATAAAA
......u....Z...X.h............h.hh..h.......hhh..............h.........h.h 

@martinjvickers
Copy link
Author

Hi Felix,

Thanks for your swift response to this. This looks like the more sensible option and probably what you originally intended when you wrote it. I will have a play with it over the next day or so to see how it looks before closing the issue.

@FelixKrueger
Copy link
Owner

Great, thanks.

@martinjvickers
Copy link
Author

This looks exactly how I was expecting and consistent throughout. Thanks for that.

@FelixKrueger
Copy link
Owner

Thanks for the feedback and for spotting it in the first place. I'll try to include it in a new release soon.

@martinjvickers
Copy link
Author

martinjvickers commented Oct 12, 2017

Hi,

Sorry, I've found that it's still doing something odd but with directional this time. I don't know if this is because with this directional library the CIGAR strings are more complex or are the functions are different between directional and non-directional within bismark?

Example 3

I've highlighted the base with a * in the view below. Based upon what we said about example 1 from yesterday, shouldn't this base have been counted?

Read
@HS3:420:C3EHMACXX:8:1101:11002:18642_21468
TTTTTTTTAGTTTGGATTATAAGAATTTTGTTGATTGAAGAAAATAGGTTAAATTAAATTTTAATTTTTTTTTTAGATAATTTTTATTTTTTTTTATTTT
+
CCCFFFFFFDFFHHFGHHGGIJ?FGHJJJ?GH?BGHBGGDHIJIJII@HGHJIIHIIEIIJIJEHHHHHDDDDDB?AADCDDEDDBDDEEDDDDDACDEE
Result
HS3:420:C3EHMACXX:8:1101:11002:18642_21468	0	3	391275	0	64M1D22M1D14M	*	0	0	TTTTTTTTAGTTTGGATTATAAGAATTTTGTTGATTGAAGAAAATAGGTTAAATTAAATTTTAATTTTTTTTTTAGATAATTTTTATTTTTTTTTATTTT	CCCFFFFFFDFFHHFGHHGGIJ?FGHJJJ?GH?BGHBGGDHIJIJII@HGHJIIHIIEIIJIJEHHHHHDDDDDB?AADCDDEDDBDDEEDDDDDACDEE	NM:i:23	MD:Z:1C2C2C3C0C18C12C3C9C0C0C3^T3C1C1C5C4C0C2^C3C1C1C0C5	XM:Z:.h..h..x...xz..................z............x...h.........hhh......h.h.h.....h....hh.....h.h.hh.....	XR:Z:CT	XG:Z:CT
View
REF  TTTCTTCTTCAGTCCGGATTATAAGAATTTTGTCGATTGAAGAAAACAGGCTAAATTAAACCCTAATTTTCTCTCTTAGACAATTCCTACTTTCTCTCCATTTTTC
MOD  TTTCTTCTTCAGTCCGGATTATAAGAATTTTGTCGATTGAAGAAAACAGGCTAAATTAAACCCTAATTTCTCTCTTAGACAATTCCTACTTCTCTCCATTTTTC
READ   TTTTTTTTAGTTTGGATTATAAGAATTTTGTTGATTGAAGAAAATAGGTTAAATTAAATTTTAATTTTTTTTTTAGATAATTTTTATTTTTTTTTATTTT
       .h..h..x...xz..................z............x...h.........hhh......h.h.h.....h....hh.....h.h.hh.....
                                                                                             *

Example 4

With this one, I would have expected it to have been a u.

Read
@HS3:420:C3EHMACXX:8:1101:10457:4989_3714
ATATAATTTTTTTGTAAGCTATTTATTATTTTTTTGTATTATTTTATTTTATTTTTTTGTTATATTTTTTTTTTTGGTATTTTTTTTTTTTTTATATTTT
+
CCCFFFFFHHHHHGGIJIHIHHIJJJJJJJJJJJJHHIIJJJJJJJJJJJJJJJJJHFFEEEEEEEFEDDDDDDDD<A@CDEEDDDDDDD&5<8:3>BDD
Result
HS3:420:C3EHMACXX:8:1101:10457:4989_3714	16	1	18445446	0	9M1D16M1I74M	*	0	0	AAAATATAAAAAAAAAAAAAATACCAAAAAAAAAAATATAACAAAAAAATAAAATAAAATAATACAAAAAAATAATAAATAGCTTACAAAAAAATTATAT	DDB>3:8<5&DDDDDDDEEDC@A<DDDDDDDDEFEEEEEEEFFHJJJJJJJJJJJJJJJJJIIHHJJJJJJJJJJJJIHHIHIJIGGHHHHHFFFFFCCC	NM:i:25	MD:Z:1G1G1G3^C7G1G0G7G6G1G12G6G11G0G2G2G0G2G0A7G1G0G0G4G1	XM:Z:.h.h.h..........h.hh........h......h.h............h......h...........hh..h..hh..h........h.hhh....h.	XR:Z:CT	XG:Z:GA

View
REF  ATAGAGTGTAACAAAAAAAGAGGATACCAAGAAAAAAGTGTAACAAAAAAATGAAATAAGATAATACAAAAGGATGATGGATGACTTACAAGAGGGTTATGTTTG
MOD  ATAGAGTGTAAAAAAAAAGAGGATACCANAGAAAAAAGTGTAACAAAAAAATGAAATAAGATAATACAAAAGGATGATGGATGACTTACAAGAGGGTTATGTTTG
READ   AAAATATAAAAAAAAAAAAAATACCAAAAAAAAAAATATAACAAAAAAATAAAATAAAATAATACAAAAAAATAATAAATAGCTTACAAAAAAATTATAT
       .h.h.h..........h.hh........h......h.h............h......h...........hh..h..hh..h........h.hhh....h.
                                   *

@FelixKrueger
Copy link
Owner

I'll take a look soon, was already feeling a bit bored anyway...

@FelixKrueger
Copy link
Owner

Re Example 3:

I think technically the result is correct, because the MD:Z field says that the C at the position you marked with * is deleted:

64M1D22M1D14M CIGAR

MD:Z:1C2C2C3C0C18C12C3C9C0C0C3^T3C1C1C5C4C0C2 ^C 3C1C1C0C5 MD:Z:

So because the C is deleted, there won't be a methylation call for this positions. I would however agree that it is a difficult decision to say that the C is deleted when the genomic sequence is CTTT and the read sequence is TTTT, because in mapping terms it will simply look like TTTT anyway. I reckon if one can't be exactly sure then it is probably best to completely ignore the position completely (which is what happens). Will look at Example 4 next.

@martinjvickers
Copy link
Author

martinjvickers commented Oct 12, 2017

Are you sure it's a C that's deleted? Following it through from the REF to the Modified REF it's a T that is deleted from the reference to make this part of it work.

RAW  CTACTTTCTC
MOD  CTACTT CTC
READ TTATTT TTT
     h..... h.h
        *  D

Where the D is the base that's deleted and the * is the C/T in question.

EDIT (this is giving me a headache) In this example, going by the CIGAR string, it's not a C that is being deleted it's actually the first T that's been removed from the reference;

RAW  CTACTTTCTC
MOD  CTAC TTCTC
READ TTAT TTTTT
     h... ..h.h
        *D

So that makes sense, but by not marking it as a CHH because of a base being deleted goes against what we said in example 1 where we said that in the following example the final h should be a CHH rather than a CHG because the modified reference would show a CTT because of the deleted G from the original reference.

RAW  CCTGTAAG
MOD  CCT TAAG
READ TTT TAAG
     hh. ....

@FelixKrueger
Copy link
Owner

Deconstructed, the read looks like this:

REF    TCTTCTTCAGTCCGGATTATAAGAATTTTGTCGATTGAAGAAAACAGGCTAAATTAAACCCTAATTTTCTCTCTTAGACAATTCCTACTTTCTCTCCATTTT
MOD    TCTTCTTCAGTCCGGATTATAAGAATTTTGTCGATTGAAGAAAACAGGCTAAATTAAACCCTAA TTTCTCTCTTAGACAATTCCTA TTTCTCTCCATTTT
READ   TTTTTTTTAGTTTGGATTATAAGAATTTTGTTGATTGAAGAAAATAGGTTAAATTAAATTTTAA TTTTTTTTTTAGATAATTTTTA TTTTTTTTTATTTT
                                                                       D                      D
       MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM MMMMMMMMMMMMMMMMMMMMMM MMMMMMMMMMMMMM
       .h..h..x...xz..................z............x...h.........hhh... ...h.h.h.....h....hh.. ...h.h.hh.....

                                                                                              *

64M1D22M1D14M CIGAR
MD:Z:1C2C2C3C0C18C12C3C9C0C0C3^T3C1C1C5C4C0C2 ^C 3C1C1C0C5 MD:Z:

The first deleted position (^T) is irrelevant as there wouldn't have been a methylation call anyway. The second position (^C) does not get a methylation call as the C is deleted from the read (whether this is actually true or not is open for debate, see below). The two cytosines before the * are in CCT and CTA context, and are thus correctly classified as hh.

Now the last portion of the read and C>T converted genome look like this:

CTTTCTCTCCATTTT true genome 
TTTTTTTTTTATTTT C>T genome
 TTTTTTTTTATTTT read

This shows that during the alignment there is basically a stretch or 10 Ts in the genome, but only 9 Ts in the read. It is arguably extremely difficult, if not impossible, to correctly say which of those 10 potential bases is really deleted, and I am sure that in such a case it is a design decision in Bowtie 2 that it will arbitrarily pick the first one of this stretch of homo-polymers.

I hope it becomes clearer in this example?

@FelixKrueger
Copy link
Owner

And here is example 4:

REF  ATAGAGTGTAACAAAAAAAGAGGATACC AAGAAAAAAGTGTAACAAAAAAATGAAATAAGATAATACAAAAGGATGATGGATGACTTACAAGAGGGTTATGTTTG
MOD  ATAGAGTGTAA AAAAAAAGAGGATACCXAAGAAAAAAGTGTAACAAAAAAATGAAATAAGATAATACAAAAGGATGATGGATGACTTACAAGAGGGTTATGTTTG
READ   AAAATATAA AAAAAAAAAAAATACCAAAAAAAAAAATATAACAAAAAAATAAAATAAAATAATACAAAAAAATAATAAATAGCTTACAAAAAAATTATAT
       MMMMMMMMM MMMMMMMMMMMMMMMM MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
                D                I 
       .h.h.h... .......h.hh........h......h.h............h......h...........hh..h..hh..h........h.hhh....h.
                                    *

9M 1D 16M 1I 74M
MD:Z:1G1G1G3^C7G1G0G7G6G1G12G6G11G0G2G2G0G2G0A7G1G0G0G4G1
XR:Z:CT XG:Z:GA

To me it appears that the C on the reverse strand is in context CTT and this is called correctly as h. The X would be the third downstream base in this example. Does this make sense?

@martinjvickers
Copy link
Author

Morning Felix,

You're right, I was miscounting the location of those more complex CIGAR modifications. Thanks for taking the time to check through them.

Cheers,

Martin

@FelixKrueger
Copy link
Owner

You're welcome. If there are no further issues I will start preparing the new release later today. Cheers, Felix

@martinjvickers
Copy link
Author

That's great, I don't think there are any more issues. Cheers, Martin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants