Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bam2msa small softclipping fix #56

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

TheGreatJack
Copy link

Adding softclip and hardclip hadling to output correct reads with those kind of alignments

I was testing the "bam2msa" script, and it was really helpful, specially for the insertion ignoring functionality. But i found some problems regarding softclipped alignments as the program doesn't skip the necessary bases before writing to the sequence record. If you look at the third row of this msa that i made using BioExt v0.21.9 you will see that it doesn't fit quite well. In particular that read had this CIGAR string "22S56M63S" :

Before_zoom_01

I simply modified the CIGAR matching regex so that it also matches "S" and "H" for softclippled or hardclippled bases. The downstream code logic seems to be able to handle this modification without major problems. After said modification the same row in the previous alignment now fits as expected:

After_zoom_01

I also found good fit in other reads that also had softclipping. I understand that these heavily softclipped alignments may be a signal for chimeric reads or alignment errors. But that is something that can be filtered out before using bam2msa(See samclip). A warning to the user may be good. But in case someone wants to include softclipped reads or uses the program without filtering, this would be less of a headache.

I briefly checked for the effects on the function "gapful" and "clip" from misc. And the code logic seems robust to this change. I'm not too familiar with testing. I just directly tested "bam2msa"

Adding softclip and hardclip hadling to output correct reads with these kind of alignments
Hard clipping removed. Hard clipping doesn't include the "clipped" sequence for a bam record. So no need to skip over the query sequence. Adding "H" it in the match would clip things that it shouldn't in hardclipped alignments.
@TheGreatJack
Copy link
Author

Had to commit a second time because matching "H" would cause unnecessary skipping over the query sequence. This is because hardclipped sections of a query sequence arent included in the alignment record.. Silly mistake on my part.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant