Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Amalgamate multiple CIGAR ops into single entry. #1607

Merged
merged 3 commits into from
May 23, 2023

Conversation

whitwham
Copy link
Contributor

@whitwham whitwham commented Apr 24, 2023

Multiple matching (or sequence (mis)matching)) ops (e.g. 10M40M) give a different VCF using BAQ than a single operation of the same length (e.g. 50M). This change compresses the multiple operations into one.

In combination with samtools/bcftools#1914 it should fix #1597.

Multiple matching (or sequence (mis)matching)) ops (e.g. 10M40M) give a different VCF using BAQ than a single operation of the same length (e.g. 50M).  This change compresses the multiple operations into one.
Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small corrections, and a suggestion which I can take or leave either way.

It could also do with test data demonstrating the problem, but I can't reproduce it! I'll take a look at the original issue and see if I can get something smaller to test on.

realn.c Outdated
for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) {
int op = cigar[k]&0xf, l = cigar[k]>>4;

// concatinate alignment matches (including sequence (mis)matches)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor typo of concatenate.

realn.c Outdated
// concatinate alignment matches (including sequence (mis)matches)
// otherwise 50M50M gives a different result to 100M
if (op == BAM_CMATCH || op == BAM_CEQUAL || op == BAM_CDIFF) {
if (((k + 1) < c->n_cigar)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that it matters, but curious why the extra set of parentheses are here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember, I think I had something else in there originally.

Comment on lines +280 to +290
int next_op = bam_cigar_op(cigar[k + 1]);

if (next_op == BAM_CMATCH || next_op == BAM_CEQUAL || next_op == BAM_CDIFF) {
len += l;
continue;
}
}

// last of M/X/= ops
l += len;
len = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this construction a bit complex to follow.

It does work, but adding to an additional len variable which is 0 normally or >0 for previous M/=/X, and then using len in the last step when switching away from M/=/X, feels a bit clumsy.

Something like this feels easier for me to understand personally. Thoughts?

@@ -269,7 +269,21 @@ int sam_prob_realn(bam1_t *b, const char *ref, hts_pos_t ref_len, int flag) {
             uint8_t *left = tseq;
             uint8_t *rght = tref;
             for (k = 0, x = c->pos, y = 0; k < c->n_cigar; ++k) {
-                int op = cigar[k]&0xf, l = cigar[k]>>4;
+                int op = bam_cigar_op(cigar[k]), l = bam_cigar_oplen(cigar[k]);
+
+                if (op == BAM_CMATCH || op == BAM_CEQUAL || op == BAM_CDIFF) {
+                    // concatenate alignment matches including M/=/X,
+                    // otherwise 50M50M gives a different result to 100M
+                    while (k + 1 < c->n_cigar &&
+                           (bam_cigar_op(cigar[k+1]) == BAM_CMATCH ||
+                            bam_cigar_op(cigar[k+1]) == BAM_CEQUAL ||
+                            bam_cigar_op(cigar[k+1]) == BAM_CDIFF)) {
+                        l += bam_cigar_oplen(cigar[++k]);
+                    }
+                }
+
+                //fprintf(stderr, "Cigar %c %d\n", BAM_CIGAR_STR[op], l);
+
                 if (l == 0) continue;
                 if (op == BAM_CMATCH || op == BAM_CEQUAL || op == BAM_CDIFF) {
                     // Sanity check running off the end of the sequence

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, a while loop inside for loop changing the for loop's controlling variable doesn't look great to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough - different views. I'm happy as it is then (bar obvious typo).

@whitwham
Copy link
Contributor Author

whitwham commented May 2, 2023

Some small corrections, and a suggestion which I can take or leave either way.

It could also do with test data demonstrating the problem, but I can't reproduce it! I'll take a look at the original issue and see if I can get something smaller to test on.

To replicate the original error you also need the version of bcftools without the fix. This change keeps the ID=16 numbers the same, the bcftools one has the alternative bases.

Also, the test data the provided by the users was the smallest they could produce to show the error.

@jkbonfield
Copy link
Contributor

I've managed to produce something really tiny now, starting with the original test data provided. I think that's just simulated data, but I don't know the license on it so will try and figure out the cause of it going wrong to produce our own.

The easiest way I found of reproducing it also, and what we'd need here, is via test/test_realn. We already have test files, so we can just add one more with two entries using =/X vs M.

@jkbonfield
Copy link
Contributor

jkbonfield commented May 2, 2023

Here is some test data.

test/realn03.fa

>MX
CGTCTACTACG

test/realn03.sam

@HD	VN:1.6	SO:coordinate
@SQ	SN:MX	LN:11
M	64	MX	1	60	11M	*	0	0	CGTCTCCTACG	IIIIIIIIIII
X	64	MX	1	60	5=1X5=	*	0	0	CGTCTCCTACG	IIIIIIIIIII

Then run:

./test/test_realn -e -f test/realn03.fa -i test/realn03.sam

You should see the same BQ tag in both cases, but don't with develop. There are already test files for test_realn, so this can just be another 1 to add and put into test.pl

I've verified it's only affecting extended mode, which tallies with the code changes.

whitwham added 2 commits May 10, 2023 11:42
Also minor changes to realn.c.
@jkbonfield jkbonfield merged commit 878cff4 into samtools:develop May 23, 2023
vasudeva8 pushed a commit to vasudeva8/htslib that referenced this pull request May 30, 2023
Amalgamate multiple CIGAR ops into single entry.

Multiple matching (or sequence (mis)matching)) ops (e.g. 10M40M) give a different VCF using BAQ than a single operation of the same length (e.g. 50M).  This change compresses the multiple operations into one.
@whitwham whitwham deleted the xme_error branch September 29, 2023 11:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bcftools mpileup behaves differently for cigar M vs =/X
2 participants