[New]: HISAT2 and SLAM-mode; [Retired]: Bowtie 1
For the upcoming version Bismark has undergone some substantial changes, which sometimes affect more than one module within the Bismark suite. Here is a short description of the major changes:
[Retired]: Bowtie 1 support
- Bowtie (1) support, and all of its options, has been completely dropped from bismark_genome_preparation and bismark. This decision was not made lightly, but it seems no one is using the original Bowtie short read aligner anymore, even short reads have moved on...
- Consequently, the option --vanilla and its handling has been removed from a number of modules (bismark_genome_preparation, bismark, bismark_methylation_extractor and deduplicate_bismark). Too bad, I liked that name...
[Added]: HISAT2 support
- Instead, the DNA and RNA aligner HISAT2 has been added as a new choice of aligner. The reason for this is not necessarily that RNA methylation is now a thing, but certain alignment modes (see below) do require splice-aware mapping if we don't want to miss out on a whole class of (spliced) alignments. Bowtie 2 is the default mode; HISAT2 alignments can be enabled with the option --hisat2.
- Similar to the Bowtie 2 mode, alignments with HISAT2 are restricted to global (end-to-end) alignments, i.e. soft-clipping is disabled. Furthermore, in paired-end mode, the options --no-mixed and --no-discordant are permanently enabled, meaning that only properly aligned read pairs are reported.
- As the --hisat2 mode supports spliced alignments, the new CIGAR operation N is now supported in all Bismark modules (this includes bismark_genome_preparation, bismark, bismark_methylation_extractor, deduplicate_bismark and some others).
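
To illustrate what handling the CIGAR operation N involves, here is a minimal Python sketch (not Bismark's own Perl code) that splits an alignment into the reference blocks it covers. The function name `exon_blocks` and the 0-based, end-exclusive coordinates are assumptions made for this example; the operation semantics follow the SAM specification, where N skips a stretch of the reference without consuming read bases.

```python
import re

# One CIGAR element: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def exon_blocks(start, cigar):
    """Return the reference intervals covered by a (possibly spliced) alignment.

    `start` is the 0-based leftmost reference position; intervals are
    (start, end) tuples with an exclusive end. Blocks are split at N only.
    """
    blocks = []
    block_start = start
    pos = start
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=XD":
            # Matches, mismatches and deletions all advance along the reference.
            pos += n
        elif op == "N":
            # A skipped region (intron): close the current block and jump over it.
            if pos > block_start:
                blocks.append((block_start, pos))
            pos += n
            block_start = pos
        # I, S, H and P do not consume reference bases, so they are ignored here.
    if pos > block_start:
        blocks.append((block_start, pos))
    return blocks
```

For a read with CIGAR 50M1000N25M, this yields two blocks separated by a 1000 bp gap, which is the information a downstream module needs in order to assign calls to the right genomic positions.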
At the time of writing this, the --hisat2 mode appears to be working as expected. It should be mentioned however that we have not done a lot of testing of these new files, so comments and feedback are welcome.
SLAM-seq mode
We also added a new, experimental and completely different type of alignment for SLAM-seq type data (option --slam). This fairly recent method to interrogate newly synthesized messenger RNA is akin to bisulfite conversion, in that newly synthesized RNA may contain T to C conversions following an alkylation reaction (original publication and https://www.nature.com/articles/nmeth.4435). The new Bismark alignment mode --slam performs T>C conversions of both the genome (in the genome preparation step) and the reads (in the subsequent Bismark alignment step). Currently, the rest of the processing of SLAM-seq data hijacks the standard methylation pipeline:
- T>C conversions are written out as methylation events in CpG context, while T-T matches are scored as unmethylated events in CpG context. Other cytosine contexts are not being used.
So in a nutshell: methylation calls in --slam mode are either Ts (unmethylated calls = matches at T positions), or T to C mismatches (methylated calls = C mismatches at T positions).
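
The scoring rule above can be sketched in a few lines of Python. This is an illustration only, not the Bismark implementation: the function name `slam_calls` and the string labels are hypothetical, and the sketch assumes an ungapped forward-strand alignment in which the reference and read strings have equal length.

```python
def slam_calls(ref, read):
    """Score each reference T position of an ungapped forward-strand alignment.

    Returns a list of (offset, call) tuples, where call is
    "unmethylated" for a T-T match and "methylated" for a T>C mismatch.
    Any other base at a T position is not scored.
    """
    calls = []
    for offset, (ref_base, read_base) in enumerate(zip(ref.upper(), read.upper())):
        if ref_base != "T":
            continue  # only T positions of the reference are informative
        if read_base == "T":
            calls.append((offset, "unmethylated"))  # T-T match
        elif read_base == "C":
            calls.append((offset, "methylated"))  # T>C conversion
    return calls
```

For example, aligning the read ATCGA against the reference ATTGT gives one unmethylated call at offset 1 and one methylated call at offset 2; the T/A mismatch at offset 4 is ignored.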
It should be noted that this is currently an experimental workflow. One might argue that T/C conversion-aware (or T/C mis-mapping agnostic) mapping is currently not necessary for SLAM-seq, NASC-Seq, or scSLAM-seq data, as the labeling reaction is very inefficient (only 1 in 50 to 200 newly incorporated Ts is a 4sU, which may get alkylated). This might be true, for now. If and when the conversion reaction improves over time, C/T agnostic mapping, similar to bisulfite-Seq data, might very well become necessary.
Here is a screenshot of a comparison of aligning the same data (SLAM-seq-like) with Bismark in Bowtie 2 mode (top track) and HISAT2 mode (middle track). Alignments with HISAT2 recover a lot more alignments to short exons, as well as exon-exon spanning reads (evidenced in bottom track):
- Added documentation for NOMe-seq or scNMT-seq processing.
bismark
- Dropped support for Bowtie
- Removed all traces of --vanilla
- Added support for HISAT2 with option --hisat2.
- Added HISAT2 option --no-spliced-aligments to disable spliced alignments altogether
- Added HISAT2 option --known-splicesite-infile <path> to provide a list of known splice sites.
- Added option --slam to allow T/C mismatch agnostic mapping (3-letter alignment). More here.
- Added a new option --icpc to truncate read IDs at the first space (or tab) it encounters in the (FastQ) read ID, which are sometimes used to add comments to a FastQ entry (instead of replacing them with underscores which is the default behaviour).
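
The difference between the default read ID handling and --icpc can be sketched as follows. This is a Python illustration rather than Bismark's Perl code; the function name `normalize_read_id` is hypothetical, and collapsing a run of consecutive spaces or tabs into a single underscore is an assumption of this sketch.

```python
import re


def normalize_read_id(read_id, icpc=False):
    """Return the read ID as it would appear in the alignment output.

    icpc=True  : truncate at the first space or tab (drops FastQ comments).
    icpc=False : keep the comment but replace spaces/tabs with underscores
                 (runs are collapsed to one underscore; an assumption here).
    """
    if icpc:
        return re.split(r"[ \t]", read_id, maxsplit=1)[0]
    return re.sub(r"[ \t]+", "_", read_id)
```

With the ID `SRR1234.1 1:N:0:ACGT`, the default returns `SRR1234.1_1:N:0:ACGT`, while `icpc=True` returns `SRR1234.1`.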
bismark_genome_preparation
- Dropped support for Bowtie
- Added support for HISAT2 with option --hisat2.
- Added option --slam. Instead of performing an in-silico bisulfite conversion, this mode transforms T to C (forward strand), or A to G (reverse strand). The folder structure and rest of the indexing process is currently exactly the same as for bisulfite sequences, but this might change at some point. This means that a genome prepared in --slam mode is currently indistinguishable from a true Bisulfite Genome (until the alignments are in) so please make sure you name the genome folder appropriately to avoid confusion.
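
As a rough illustration of the --slam genome transformation described above, the following Python sketch produces the two converted versions of a sequence. It is not the bismark_genome_preparation code; the function name `slam_convert` is hypothetical, and the sketch works on a single uppercase-normalized sequence string rather than on FastA files.

```python
def slam_convert(seq):
    """Return the (forward, reverse) SLAM-converted versions of a sequence.

    Forward strand: every T becomes C.
    Reverse strand: every A becomes G (the same change seen from the top strand).
    """
    seq = seq.upper()
    forward = seq.replace("T", "C")
    reverse = seq.replace("A", "G")
    return forward, reverse
```

For the input ACGTTA, the forward-converted sequence is ACGCCA and the reverse-converted sequence is GCGTTG. Both are 3-letter sequences, which is why the resulting index looks just like a bisulfite genome on disk.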
deduplicate_bismark
- Removed all traces of --vanilla
- --bam mode is now the default. Uncompressed SAM output may still be obtained using the new option --sam
- Added new option -o/--outfile <basename>. File endings such as .bam, .sam, .txt or .gz are removed from this basename, and .deduplicated.bam (or .multiple.deduplicated.bam in --multiple mode) is then appended for consistency reasons.
- Added support for new CIGAR operation N
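
The output naming for -o/--outfile can be sketched in Python as below. This is an illustration under stated assumptions, not the deduplicate_bismark code: the function name `dedup_output_name` is hypothetical, endings are stripped repeatedly (so that `sample.txt.gz` is reduced to `sample`), and only the default BAM output naming is covered.

```python
import re

# File endings that are removed from the user-supplied basename.
ENDING = re.compile(r"\.(bam|sam|txt|gz)$")


def dedup_output_name(basename, multiple=False):
    """Derive the deduplicated output file name from a user-supplied basename."""
    stem = basename
    while True:
        stripped = ENDING.sub("", stem)
        if stripped == stem:
            break  # nothing left to strip
        stem = stripped
    suffix = ".multiple.deduplicated.bam" if multiple else ".deduplicated.bam"
    return stem + suffix
```

So `sample.bam` becomes `sample.deduplicated.bam`, and `sample.txt.gz` in --multiple mode becomes `sample.multiple.deduplicated.bam`.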
bismark_methylation_extractor
- Added support for new CIGAR operation N for all extraction modes
- Removed all traces of --vanilla
bismark2summary/bismark2report
- Adapted to work with Bismark HISAT2 reports instead of Bowtie 1 reports.
bam2nuc
- Spliced reads are now also skipped when determining the genomic base composition (as are reads with InDels).
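
A filter of this kind can be expressed as a simple check on the CIGAR string. The Python sketch below is only an illustration of the idea, not the bam2nuc code; the function name `skip_for_base_composition` is hypothetical.

```python
import re


def skip_for_base_composition(cigar):
    """Return True if a read should be skipped: it is spliced (N) or has InDels (I/D)."""
    operations = set(re.findall(r"[MIDNSHP=X]", cigar))
    return bool(operations & {"I", "D", "N"})
```

A plain `50M` alignment is kept, whereas `20M500N30M` (spliced) and `10M1I39M` (insertion) are skipped.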