-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Amalgamate multiple CIGAR ops into single entry. #1607
Conversation
Multiple matching (or sequence (mis)matching)) ops (e.g. 10M40M) give a different VCF using BAQ than a single operation of the same length (e.g. 50M). This change compresses the multiple operations into one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small corrections, and a suggestion which I can take or leave either way.
It could also do with test data demonstrating the problem, but I can't reproduce it! I'll take a look at the original issue and see if I can get something smaller to test on.
realn.c
Outdated
for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) { | ||
int op = cigar[k]&0xf, l = cigar[k]>>4; | ||
|
||
// concatinate alignment matches (including sequence (mis)matches) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor typo of concatenate.
realn.c
Outdated
// concatinate alignment matches (including sequence (mis)matches) | ||
// otherwise 50M50M gives a different result to 100M | ||
if (op == BAM_CMATCH || op == BAM_CEQUAL || op == BAM_CDIFF) { | ||
if (((k + 1) < c->n_cigar)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that it matters, but curious why the extra set of parentheses are here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember, I think I had something else in there originally.
int next_op = bam_cigar_op(cigar[k + 1]); | ||
|
||
if (next_op == BAM_CMATCH || next_op == BAM_CEQUAL || next_op == BAM_CDIFF) { | ||
len += l; | ||
continue; | ||
} | ||
} | ||
|
||
// last of M/X/= ops | ||
l += len; | ||
len = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this construction a bit complex to follow.
It does work, but adding to an additional len variable which is 0 normally or >0 for previous M/=/X, and then using len in the last step when switching away from M/=/X, feels a bit clumsy.
Something like this feels easier for me to understand personally. Thoughts?
@@ -269,7 +269,21 @@ int sam_prob_realn(bam1_t *b, const char *ref, hts_pos_t ref_len, int flag) {
uint8_t *left = tseq;
uint8_t *rght = tref;
for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) {
- int op = cigar[k]&0xf, l = cigar[k]>>4;
+ int op = bam_cigar_op(cigar[k]), l = bam_cigar_oplen(cigar[k]);
+
+ if (op == BAM_CMATCH || op == BAM_CEQUAL || op == BAM_CDIFF) {
+ // concatenate alignment matches including M/=/X,
+ // otherwise 50M50M gives a different result to 100M
+ while (k + 1 < c->n_cigar &&
+ (bam_cigar_op(cigar[k+1]) == BAM_CMATCH ||
+ bam_cigar_op(cigar[k+1]) == BAM_CEQUAL ||
+ bam_cigar_op(cigar[k+1]) == BAM_CDIFF)) {
+ l += bam_cigar_oplen(cigar[++k]);
+ }
+ }
+
+ //fprintf(stderr, "Cigar %c %d\n", BAM_CIGAR_STR[op], l);
+
if (l == 0) continue;
if (op == BAM_CMATCH || op == BAM_CEQUAL || op == BAM_CDIFF) {
// Sanity check running off the end of the sequence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, a while loop inside for loop changing the for loop's controlling variable doesn't look great to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough - different views. I'm happy as it is then (bar obvious typo).
To replicate the original error you also need the version of bcftools without the fix. This change keeps the ID=16 numbers the same, the bcftools one has the alternative bases. Also, the test data the provided by the users was the smallest they could produce to show the error. |
I've managed to produce something really tiny now, starting with the original test data provided. I think that's just simulated data, but I don't know the license on it so will try and figure out the cause of it going wrong to produce our own. The easiest way I found of reproducing it also, and what we'd need here, is via test/test_realn. We already have test files, so we can just add one more with two entries using =/X vs M. |
Here is some test data. test/realn03.fa
test/realn03.sam
Then run:
You should see the same BQ tag in both cases, but don't with develop. There are already test files for test_realn, so this can just be another 1 to add and put into test.pl I've verified it's only affecting extended mode, which tallies with the code changes. |
Also minor changes to realn.c.
Amalgamate multiple CIGAR ops into single entry. Multiple matching (or sequence (mis)matching)) ops (e.g. 10M40M) give a different VCF using BAQ than a single operation of the same length (e.g. 50M). This change compresses the multiple operations into one.
Multiple matching (or sequence (mis)matching)) ops (e.g. 10M40M) give a different VCF using BAQ than a single operation of the same length (e.g. 50M). This change compresses the multiple operations into one.
In combination with samtools/bcftools#1914 it should fix #1597.