Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE REQUEST] Add input-fmt-option to show number of match bases? #1436

Closed
y9c opened this issue May 15, 2022 · 2 comments · Fixed by #1441
Closed

[FEATURE REQUEST] Add input-fmt-option to show number of match bases? #1436

y9c opened this issue May 15, 2022 · 2 comments · Fixed by #1441
Assignees

Comments

@y9c
Copy link

y9c commented May 15, 2022

The document is a little confusion for qlen and rlen.

  • qlen: no. query bases
  • rlen: no. reference bases

It took me some time to realize that qlen includes match bases and soft-clipped bases, while rlen includes the intron skip. So, there is no parameters recording the number of matched bases, exclude the soft-clipped and intron-skip region. It is possible to have another parameter for "number of matched bases", say mlen?

Thanks!

@jkbonfield
Copy link
Contributor

It doesn't sound too hard to add an extra parameter which is based on the CIGAR seq "match" count, but that would be the number of bases with a known alignment coordinate and not strictly matching as it's not possible to tell SNP from REF match using CIGAR alone. Note this would also treat "N" (ref skip) and "D" (deletion) equivalent too, as neither consume sequence bases.

Is that sufficient?

However note that's actually how it already works. It just calls the htslib bam_cigar2qlen and bam_cigar2rlen which is already doing everything you ask for except for also including soft-clips. So potentially it may be better resolved by a function to return the number of soft-clipped bases. That would then allow e.g. qlen - sclen.

Would that be preferable to a new term such as mlen which could be ambiguous: see cigar = vs cigar M opcode.

@y9c
Copy link
Author

y9c commented May 24, 2022

Thank you @jkbonfield for the suggestion. If a sclen function is available, that can also solve this isssue.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue May 27, 2022
This is the length of soft-clips, both left and right end.
It may be combined with qlen (qlen-sclen) to obtain the number of
bases in the query sequence that have been aligned to the genome.  Ie
it provides a way to compare local-alignment vs global-alignment
length.

Fixes samtools#1436
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants