Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cigars of secondary alignments are always set to the primary one #157

Closed
1pakch opened this issue Apr 26, 2018 · 7 comments
Closed

Cigars of secondary alignments are always set to the primary one #157

1pakch opened this issue Apr 26, 2018 · 7 comments

Comments

@1pakch
Copy link
Contributor

1pakch commented Apr 26, 2018

I have noticed that the secondary alignments do not have informative cigars:

# Realign a sequence from human chrM from hg19
In [72]: q = a.seq('chrM', 2500, 2600)

# Secondary alignments have mismatches but their cigar strings do not reflect that
In [73]: print('\n'.join(map(str, a.map(q))))
0       100     +       chrM    16571   2500    2600    100     100     54      tp:A:P  ts:A:.  cg:Z:100M
0       100     +       chr6    171115067       62284098        62284198        97      100     0       tp:A:S  ts:A:.  cg:Z:100M
0       100     -       chr3    198022430       96336151        96336251        96      100     0       tp:A:S  ts:A:.  cg:Z:100M

Is it a bug or by design? Is there a way to get informative cigars for all alignments?

@lh3 lh3 added the question label Apr 27, 2018
@lh3
Copy link
Owner

lh3 commented Apr 27, 2018

What do you mean by "informative cigars"? "100M" is the right CIGAR for all three alignments.

@1pakch
Copy link
Contributor Author

1pakch commented Apr 27, 2018

SAM spec v1.5 includes = and X operators which can be used instead of M to tell matches from mismatches.

I see now that you have cs tag for that, but it's not exposed in mappy for instance. Is there a public C API for getting positions of insertions/deletions/overwrites?

The cs tag is reconstructed from CIGAR and some additional data in

minimap2/format.c

Lines 136 to 142 in aef7b07

static void write_cs_core(kstring_t *s, const uint8_t *tseq, const uint8_t *qseq, const mm_reg1_t *r, char *tmp, int no_iden)
{
int i, q_off, t_off;
mm_sprintf_lite(s, "\tcs:Z:");
for (i = q_off = t_off = 0; i < r->p->n_cigar; ++i) {
int j, op = r->p->cigar[i]&0xf, len = r->p->cigar[i]>>4;
assert(op >= 0 && op <= 3);

This additional data (qseq and tseq) is, in turn, generated in

minimap2/format.c

Lines 220 to 234 in aef7b07

qseq = (uint8_t*)kmalloc(km, r->qe - r->qs);
tseq = (uint8_t*)kmalloc(km, r->re - r->rs);
tmp = (char*)kmalloc(km, r->re - r->rs > r->qe - r->qs? r->re - r->rs + 1 : r->qe - r->qs + 1);
mm_idx_getseq(mi, r->rid, r->rs, r->re, tseq);
if (!r->rev) {
for (i = r->qs; i < r->qe; ++i)
qseq[i - r->qs] = seq_nt4_table[(uint8_t)t->seq[i]];
} else {
for (i = r->qs; i < r->qe; ++i) {
uint8_t c = seq_nt4_table[(uint8_t)t->seq[i]];
qseq[r->qe - i - 1] = c >= 4? 4 : 3 - c;
}
}
if (is_MD) write_MD_core(s, tseq, qseq, r, tmp);
else write_cs_core(s, tseq, qseq, r, tmp, no_iden);

I am not quite sure what exactly is being done there since this definition

minimap2/sdust.c

Lines 24 to 26 in aef7b07

unsigned char seq_nt4_table[256] = {
0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,

is a bit cryptic

@1pakch
Copy link
Contributor Author

1pakch commented Apr 27, 2018

It looks like minimap uses 4 bits for cigar codes and only 4 values are actually used, so one could have separate codes for = and X without changing the layout. BTW, the cigar codes are hardcoded which makes it difficult to locate all the places where they are used.

@lh3 lh3 added the duplicate label Apr 27, 2018
@1pakch
Copy link
Contributor Author

1pakch commented Apr 27, 2018

I see there is a PR #156 actually

@lh3
Copy link
Owner

lh3 commented Apr 27, 2018

Then this is a duplicate to #124. I think X/= is the second biggest mistake in the SAM spec (the top one being the bit flag). It is against the principle of alignment, inflates file sizes and has little use in practice as it doesn't tell us what are the deleted and mismatching bases. Typically I wouldn't consider X/= at all. Nonetheless, pacbio required their BAMs to have X/=. I could make an exception for them as a vendor, even though that is a wrong decision. Now #156 is pending. I will merge some form of it in future, but mappy won't support X/=.

I am not quite sure what exactly is being done there since this definition

That is the typical way to encode bases to integers.

Is there a public C API for getting positions of insertions/deletions/overwrites?

Positions of insertions/deletions are encoded in CIGAR. For now, you can retrieve reference sequences with seq() and find mismatches. I will consider to expose "cs" to mappy. Won't be soon.

@1pakch
Copy link
Contributor Author

1pakch commented Apr 27, 2018

Thanks for the explanations.

Exposing cs to mappy is not a priority for me as well, but I would be interested in having access to cs tags when using minimap from C/C++. What is your opinion on changing the current code so that instead of cs tag being rendered into a string one would create an array of c structs first?

This intermediate representation could be used by library users and it would make it easier to expose to Python in meaningful way.

@lh3
Copy link
Owner

lh3 commented May 30, 2018

With #156 effectively merged (with small changes), I am closing this issue. Thanks a lot for participating the discussions, which has helped the decision making.

@lh3 lh3 closed this as completed May 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants