Add soft-clip and hard-clip handling so that reads with those kinds of alignments are output correctly
I was testing the "bam2msa" script, and it was really helpful, especially the option to ignore insertions. However, I found a problem with soft-clipped alignments: the program doesn't skip the clipped bases before writing to the sequence record. If you look at the third row of this MSA, which I made using BioExt v0.21.9, you will see that it doesn't fit well. That read had the CIGAR string "22S56M63S":
I simply modified the CIGAR-matching regex so that it also matches "S" and "H", for soft-clipped and hard-clipped bases. The downstream code logic seems to handle this modification without major problems. After the modification, the same row in the previous alignment fits as expected:
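To illustrate why the regex matters, here is a minimal sketch, not BioExt's actual code. The pattern and the names `CIGAR_RE`, `parse_cigar`, and `aligned_query` are hypothetical and only assume that a CIGAR string is tokenized with a regex and then walked operation by operation. If "S" is missing from the character class, "22S56M63S" tokenizes to just `56M`, so the 56 aligned bases are taken from the start of the read instead of from offset 22.

```python
import re

# Hypothetical pattern for illustration only (not BioExt's real regex):
# M, I, D plus the clipping operations S and H.
CIGAR_RE = re.compile(r"(\d+)([MIDSH])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def aligned_query(seq, cigar):
    """Return the read bases placed against the reference.

    Insertions are ignored, deletions become gaps, soft clips are
    skipped in the read, and hard clips are not present in the read.
    """
    pos = 0  # current offset into the read sequence
    out = []
    for length, op in parse_cigar(cigar):
        if op == "M":
            out.append(seq[pos:pos + length])
            pos += length
        elif op in ("I", "S"):
            pos += length  # consumes read bases, emits nothing
        elif op == "D":
            out.append("-" * length)  # consumes reference only
        # "H": bases are absent from seq, so nothing to skip
    return "".join(out)


# Synthetic 141-base read: 22 clipped, 56 aligned, 63 clipped.
read = "a" * 22 + "C" * 56 + "g" * 63
print(aligned_query(read, "22S56M63S"))
```

With the clip operations in the pattern, only the 56 middle bases are emitted for this synthetic read.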
I also found a good fit in other reads that had soft clipping. I understand that heavily soft-clipped alignments may be a sign of chimeric reads or alignment errors, but that is something that can be filtered out before using bam2msa (see samclip). A warning to the user may be good. But in case someone wants to include soft-clipped reads, or uses the program without filtering, this would be less of a headache.
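As a rough idea of what such a warning could look like, here is a sketch under stated assumptions. The function names and the 50% threshold are hypothetical and not part of bam2msa or BioExt; the only input assumed is the read's CIGAR string.

```python
import re
import warnings

# Standard SAM CIGAR operation letters.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_fraction(cigar):
    """Fraction of the stored read bases that are soft-clipped."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Operations that consume read bases: M, I, S, = and X.
    query_len = sum(n for n, op in ops if op in "MIS=X")
    clipped = sum(n for n, op in ops if op == "S")
    return clipped / query_len if query_len else 0.0


def warn_if_heavily_clipped(name, cigar, max_fraction=0.5):
    """Warn (and return True) when a read exceeds the assumed threshold."""
    frac = soft_clip_fraction(cigar)
    if frac > max_fraction:
        warnings.warn(f"read {name}: {frac:.0%} of bases are soft-clipped")
        return True
    return False
```

For the read above, "22S56M63S" has 85 of 141 bases soft-clipped, so it would trigger the warning at the assumed threshold.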
I briefly checked the effects on the functions "gapful" and "clip" from misc, and the code logic seems robust to this change. I'm not too familiar with testing, so I only tested "bam2msa" directly.