Filter SAM file for soft and hard clipped alignments
Most short read aligners perform
local alignment
of reads to the reference genome.
Examples includes bwa mem
, minimap2
, and bowtie2
(unless in --end-to-end
mode).
This means the ends of the read may not be part of the best alignment.
This can be caused by:
- adapter sequences (aren't in the reference)
- poor quality bases (mismatches only make the alignment score worse)
- structural variation in your sample compared to the reference
- reads overlapping the start and end of contigs (including circular genomes)
Read aligners output a SAM file.
Column 6 in this format stores the
CIGAR string.
which describes which parts of the read aligned and which didn't.
The unaligned ends of the read can be "soft" or "hard" clipped,
denoted with S
and H
at each end of the CIGAR string.
It is possible for both types to be present, but that is not common.
Soft and hard don't mean anything biologically, they just refer
to whether the full read sequence is in the SAM file or not.
Some examples of a 100bp aligned read:
100M
- the read is fully aligned30S70M
- soft clipped 30 bp on left end, followed by 70 aligned bp.95M5H
- 95 aligned bp, followed by 5 bp unaligned (hard clipped).25S40M35S
- only the middle 40bp of the read aligned
One may wish to remove these alignments to avoid downstream problems.
In particular, samclip
was designed to remove clipped alignments
to improve variant calling, by removing suspicious local aligments
causing false positives near structural variation. However, it does
keep them if they hit the ends of contigs, which is particularly
important given the lower coverage often observed at those locations.
samclip
has no dependencies except Perl 5.10
or higher. It only uses core modules, so no CPAN needed.
% cd /usr/local/bin # choose a folder in your $PATH
% wget https://raw.githubusercontent.com/tseemann/samclip/master/samclip
% chmod +x samclip
% brew install brewsci/bio/samclip
% conda install -c bioconda -c conda-forge samclip
% git clone https://github.com/tseemann/samclip.git
% cp samclip/samclip /usr/local/bin # choose a folder in your $PATH
% ./samclip --version
samclip 0.2
% ./samclip --help
SYNOPSIS
Filter SAM file for soft & hard clipped alignments
AUTHOR
Torsten Seemann (@torstenseemann)
USAGE
% samclip --ref ref.fa < in.sam > out.sam
% minimap2 ref.fa R1.fq R2.fq | samclip --ref ref.fa | samtools sort > out.bam
OPTIONS
--help This help
--version Print version and exit
--ref FASTA Reference genome - needs FASTA.fai index
--max NUM Maximum clip length to allow (default=5)
--invert Output rejected SAM lines and ignore good ones
--debug Print verbose debug info to stderr
--progress N Print progress every NUM records (default=100000,none=0)
HOMEPAGE
https://github.com/tseemann/samclip
% samclip --ref ref.fa < in.sam > out.sam
% samclip --ref ref.fa in1.sam in2.sam > out.sam
# the -h is to emsure the SAM header is included
% samtools view -h in.bam | samclip --ref ref.fa | samtools sort > out.bam
% bwa mem ref.fa R1.fq R2.fq | samclip --ref ref.fa | samtools sort > out.bam
--ref FILE
should be a FASTA file indexed withsamtools faidx FILE
--max INTEGER
is the maximum soft+hard clipping to allow, use 0 to reject all bar contig edges--invert
will output the records that would have clipped and discard the good ones--progress N
will print a progress message every N records it processes, use 0 to disable--debug
is verbose debugging information for testing purposes
Submit feedback to the Issue Tracker