MIRFLOWZ is a Snakemake workflow for mapping miRNAs and isomiRs.
The workflow lives inside this repository and will be available for you to run after following the installation instructions laid out in this section.
Navigate to the desired path on your file system, then clone the repository and change into it with:
git clone https://github.com/zavolanlab/mirflowz.git
cd mirflowz
For improved reproducibility and reusability of the workflow, as well as an easy means to run it on a high performance computing (HPC) cluster managed, e.g., by Slurm, all steps of the workflow run inside isolated environments (Singularity containers or Conda environments). As a consequence, running this workflow has only a few individual dependencies. These are managed by the package manager Conda, which needs to be installed on your system before proceeding.
If you do not already have Conda installed globally on your system, we recommend that you install Miniconda. For faster creation of the environment (and Conda environments in general), you can also install Mamba on top of Conda. In that case, replace `conda` with `mamba` in the commands below (particularly in `conda env create`).
Create and activate the environment with necessary dependencies with Conda:
conda env create -f environment.yml
conda activate mirflowz
If you plan to run MIRFLOWZ via Conda, we recommend setting strict channel priority with the following command for faster environment creation, especially if you will run the workflow on an HPC cluster:
conda config --set channel_priority strict
If you plan to run MIRFLOWZ via Singularity and do not already have it installed globally on your system, you must further update the Conda environment using `environment.root.yml` with the command below. Note that the environment must be activated before you can update it.
conda env update -f environment.root.yml
Note that you will need to have root permissions on your system to be able to install Singularity. If you want to run MIRFLOWZ on an HPC cluster (recommended in almost all cases), ask your systems administrator about Singularity.
If you would like to contribute to MIRFLOWZ development, you may find it useful to further update your environment with the development dependencies:
conda env update -f environment.dev.yml
Several tests are provided to check the integrity of the installation. Follow the instructions in this section to make sure the workflow is ready to use.
Execute one of the following commands to run the test workflow on your local machine:
- Test workflow on local machine with Singularity:
bash test/test_workflow_local_with_singularity.sh
- Test workflow on local machine with Conda:
bash test/test_workflow_local_with_conda.sh
Execute one of the following commands to run the test workflow on a Slurm-managed high-performance computing (HPC) cluster:
- Test workflow with Singularity:
bash test/test_workflow_slurm_with_singularity.sh
- Test workflow with Conda:
bash test/test_workflow_slurm_with_conda.sh
Execute the following command to generate a rule graph image for the workflow. The output will be found in the `images/` directory in the repository root.
bash test/test_rule_graph.sh
You can see the rule graph below in the workflow description section.
After successfully running the tests above, you can run the following command to remove all artifacts generated by the test runs:
bash test/test_cleanup.sh
Now that your virtual environment is set up and the workflow is deployed and tested, you can go ahead and run the workflow on your samples.
It is suggested to have all the input files for a given run (or hard links pointing to them) inside a dedicated directory, for instance under the MIRFLOWZ root directory. This way, it is easier to keep the data together, set up Singularity access to them and reproduce analyses.
Refer to `test/test_files/sample_table.tsv` to see what the sample table must look like, or use it as a template.
touch path/to/your/sample/table.tsv
Fill the sample table according to the following requirements:
- `sample`: Arbitrary name for the miRNA sequencing library.
- `sample_file`: Path to the miRNA sequencing library file. The path must be relative to the directory where the workflow will be run.
- `adapter`: Sequence of the 3'-end adapter used during library preparation.
- `format`: One of `fa`/`fasta` or `fq`/`fastq`, if the library file is in FASTA or FASTQ format, respectively.
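As a minimal sketch, a two-library sample table could be written as shown below. The four column names come from the requirements above; the sample names, file paths, adapter sequence and output location are placeholders, and the header row is assumed to use the column names verbatim (compare with `test/test_files/sample_table.tsv` before relying on it).

```shell
# Hypothetical output location; replace with your own path.
table="$(mktemp -d)/sample_table.tsv"

# Write a tab-separated table: a header row plus one row per library.
# printf reuses the format string for every group of four arguments.
# All values below are placeholders, not real data.
printf '%s\t%s\t%s\t%s\n' \
    sample sample_file adapter format \
    lib_1 data/lib_1.fastq AACTGTAGGCACCATCAAT fastq \
    lib_2 data/lib_2.fasta AACTGTAGGCACCATCAAT fasta \
    > "$table"

# Sanity check: every row must have exactly four tab-separated fields.
awk -F '\t' 'NF != 4 { bad = 1 } END { exit bad }' "$table" && echo "table OK"
```

The field-count check is a cheap way to catch rows where spaces were typed instead of tabs.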
There are 4 files you must provide:
- A gzipped FASTA file containing reference sequences, typically the genome of the source/organism from which the library was extracted.
- A gzipped GTF file with matching gene annotations for the reference sequences above.
MIRFLOWZ expects both the reference sequence and gene annotation files to follow Ensembl style/formatting. If you obtained these files from a source other than Ensembl, you must ensure that they adhere to the expected format by converting them, if necessary.
- An uncompressed GFF3 file with microRNA annotations for the reference sequences above.
MIRFLOWZ expects the miRNA annotations to follow miRBase style/formatting. If you obtained this file from a source other than miRBase, you must ensure that it adheres to the expected format by converting it, if necessary.
- An uncompressed tab-separated file with a mapping between the reference names used in the miRNA annotation file (column 1; "UCSC style") and in the gene annotations and reference sequence files (column 2; "Ensembl style"). Values in column 1 are expected to be unique, no header is expected, and any additional columns will be ignored. This resource provides such files for various organisms, and in the expected format.
- OPTIONAL: A BED6 file with regions for which to produce ASCII-style pileups. If not provided, no pileups are generated. See here for the expected format.
General note: If you want to process the genome resources before use (e.g., filtering), you can do that, but make sure the formats of any modified resource files meet the formatting expectations outlined above!
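If you build or edit the reference name mapping table yourself, a quick check of the uniqueness requirement on column 1 can save a failed run. The following is only a sketch: the file is a temporary placeholder with three illustrative rows, not a complete mapping for any organism.

```shell
# Hypothetical mapping table with placeholder content
# (column 1: UCSC style, column 2: Ensembl style, no header).
map="$(mktemp)"
printf '%s\t%s\n' chr1 1 chr2 2 chrM MT > "$map"

# List values that occur more than once in column 1.
# Empty output means all values are unique.
duplicates="$(cut -f 1 "$map" | sort | uniq -d)"
if [ -z "$duplicates" ]; then
    echo "column 1 is unique"
else
    echo "duplicated names in column 1: $duplicates" >&2
fi
```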
We recommend creating a copy of the configuration file template:
cp config/config_template.yaml path/to/config.yaml
Open the new copy in your editor of choice and adjust the configuration parameters to your liking. The template explains what each of the parameters means and how you can meaningfully adjust it.
With all the required files in place, you can now run the workflow locally via Singularity with the following command:
snakemake \
--snakefile="path/to/Snakefile" \
--cores 4 \
--configfile="path/to/config.yaml" \
--use-singularity \
--singularity-args "--bind ${PWD}/../" \
--printshellcmds \
--rerun-incomplete \
--verbose
NOTE: Depending on your working directory, you may not need to use the parameters `--snakefile` and `--configfile`. For instance, if the `Snakefile` is in the same directory or the `workflow/` directory is beneath the current working directory, there is no need for the `--snakefile` parameter. Refer to the Snakemake documentation for more information.
After successful execution of the workflow, results and logs will be found in the `results/` and `logs/` directories, respectively.
Upon successful execution of MIRFLOWZ, the tool automatically removes all intermediate files generated during the process. The final outputs comprise:
- A SAM file containing alignments intersecting a pri-miR locus. These alignments intersect with extended start and/or end positions specified in the provided pri-miR annotations. Please note that they may not contribute to the final counting and may not appear in the final table.
- A SAM file containing alignments intersecting a mature miRNA locus. Similar to the previous file, these alignments intersect with extended start and/or end positions specified in the provided miRNA annotations. They may not contribute to the final counting and might be absent from the final table.
- A BAM file containing the set of alignments contributing to the final counting and its corresponding index file (`.bam.bai`).
- Table(s) containing the counting data from all libraries for (iso)miRs and/or pri-miRs. Each row corresponds to a miRNA species, and each column represents a sample library. Each read is counted towards all the annotated miRNA species it aligns to, with 1/n, where n is the number of genomic and/or transcriptomic loci that read aligns to.
- OPTIONAL. ASCII-style pileups of read alignments produced for individual libraries, combinations of libraries and/or all libraries of a given run. The exact number and nature of the outputs depends on the workflow inputs/parameters. See the pileups section for a detailed description.
To retain all intermediate files, include `--no-hooks` in the workflow call.
snakemake \
--snakefile="path/to/Snakefile" \
--cores 4 \
--configfile="path/to/config.yaml" \
--use-conda \
--printshellcmds \
--rerun-incomplete \
--no-hooks \
--verbose
After successful execution of the workflow, the intermediate files will be found in the `results/intermediates` directory.
Snakemake provides the option to generate a detailed HTML report on runtime statistics, workflow topology and results. If you want to create a Snakemake report, you must run the following command:
snakemake \
--snakefile="path/to/Snakefile" \
--configfile="path/to/config.yaml" \
--report="snakemake_report.html"
NOTE: The report creation must be done after running the workflow in order to have the runtime statistics and the results.
MIRFLOWZ consists of a main `Snakefile` and four functional modules. In the `Snakefile`, the configuration file is validated and the various modules are imported. In addition, handlers for both successful and failed runs are set. If the workflow finishes without any errors, all intermediate files are removed; otherwise, a log file is created. To keep the intermediate files upon completion, use the `--no-hooks` CLI argument when running the pipeline.
The modules (1) process the genome resources, (2) map and (3) quantify the reads, and (4) generate pileups, as described in detail below.
NOTE: MIRFLOWZ uses the notation provided by miRBase (i.e., "miRNA primary transcript" for precursors and "miRNA" for the canonical mature miRNA). This implies that precursors are named "pri-miRs" across the workflow instead of "pre-miRs". This decision reflects the lack of a guarantee that "miRNA primary transcripts" are full pre-miR (and pre-miR only) sequences.
The MIRFLOWZ workflow initially processes and indexes the genome resources
provided by the user. The regions corresponding to mature miRNAs are extended
by a fixed but user-adjustable number of nucleotides on both sides to
accommodate isomiR species with shifted start and/or end positions. If
necessary, pri-miR loci are extended to adjust to the new miRNA coordinates.
In addition, to account for the different genomic locations at which a miRNA sequence can be annotated, the names of these sequences are modified to have the format `SPECIES-mir-NAME-#` for pri-miRs and `SPECIES-miR-NAME-#-ARM` or `SPECIES-miR-NAME-#` for mature miRNAs with both or just one arm, respectively, where `#` is the replica number.
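To illustrate the coordinate extension described above, the sketch below widens a single GFF3 record by a fixed number of nucleotides on both sides. It is not the workflow's own implementation: the record, the extension value and the clamping rule are assumptions made for the example, and the chromosome length is not taken into account.

```shell
# Number of nucleotides to add on each side (placeholder value).
extension=6

# One placeholder GFF3 record for a mature miRNA
# (start = column 4, end = column 5).
record="$(printf 'chr1\t.\tmiRNA\t100\t121\t.\t+\t.\tID=example')"

# Extend both ends; clamp the start at 1 because GFF3 coordinates are 1-based.
extended="$(printf '%s\n' "$record" | awk -v ext="$extension" '
    BEGIN { FS = OFS = "\t" }
    /^#/  { print; next }    # pass comment lines through unchanged
    {
        $4 = ($4 - ext < 1) ? 1 : $4 - ext
        $5 = $5 + ext
        print
    }')"
printf '%s\n' "$extended"
```

With these placeholder values, the locus 100-121 becomes 94-127, so reads starting or ending up to six nucleotides outside the annotated miRNA still fall within the extended locus.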
The user-provided short-read small RNA-seq libraries undergo quality filtering (skipped if libraries are provided in FASTA rather than FASTQ), followed by adapter removal. The resulting reads are independently mapped to both the genome and the transcriptome using two distinct aligners: Segemehl and our in-house tool Oligomap.
Segemehl implements a fast heuristic strategy that returns the alignment(s) with the smallest edit distance. Oligomap, on the other hand, implements a slower and more restricted approach that reports all the alignments with an edit distance of at most 1. Combining the fast, flexible results with the strict selection yields higher-fidelity results than using either tool alone.
Two merging steps are performed in order to collect all the alignments in a single file. In the first one, the transcriptome and the genome mappings from both aligners are fused and only those alignments with an NH (number of hits) smaller than the provided threshold are kept. In the second step, transcriptomic coordinates are converted into genomic ones and alignments are combined into a single file. Duplicate alignments resulting from the partially redundant mapping strategy are discarded and only the best alignments for each read are retained (i.e., the ones with the smallest edit distance). In addition, because aggregating the alignments changes the NH of a read, a second filtering according to the new NH is performed.
If a read aligns to more loci than a specified threshold, it is removed for two reasons: (1) performance, as the file size can rapidly increase, and (2) the fact that each read contributes `1/N` to each count, where `N` is the number of genomic loci it aligns to, so a large `N` makes the contribution negligible.
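The effect of such a threshold can be sketched with a small `awk` filter on the standard SAM `NH:i:` tag. This is an illustration only, not the script the workflow uses: the records and the threshold are placeholders, and treating the threshold as an inclusive upper bound is an assumption.

```shell
# Maximum number of loci a read may align to (placeholder value).
max_nh=2

# Three placeholder SAM records with the NH tag as the only optional field.
sam="$(mktemp)"
{
    printf 'read_a\t0\tchr1\t100\t255\t22M\t*\t0\t0\t*\t*\tNH:i:1\n'
    printf 'read_b\t0\tchr1\t500\t255\t22M\t*\t0\t0\t*\t*\tNH:i:2\n'
    printf 'read_c\t0\tchr2\t900\t255\t22M\t*\t0\t0\t*\t*\tNH:i:7\n'
} > "$sam"

# Keep header lines and alignments whose NH value does not exceed the threshold.
# Alignments without an NH tag are dropped in this sketch.
kept="$(awk -v max="$max_nh" -F '\t' '
    /^@/ { print; next }
    {
        nh = 0
        for (i = 12; i <= NF; i++)
            if ($i ~ /^NH:i:/) nh = substr($i, 6) + 0
        if (nh > 0 && nh <= max) print
    }' "$sam")"
printf '%s\n' "$kept" | cut -f 1
```

Here `read_c`, which aligns to seven loci, is discarded; it would have contributed only 1/7 to each count.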
A final filter is applied to further increase the classification accuracy and reduce the number of multimappers. Given that isomiRs are known to present more mismatches than InDels when compared to the canonical sequence they come from, when a read has been mapped to multiple genomic locations, only the alignments with the fewest InDels are kept. Note that some multimappers might still be present if the number of InDels and mismatches is the same across alignments.
The filtered alignments are subsequently intersected with the user-provided, pre-processed miRNA annotation files using BEDTools. Each alignment is classified according to the miRNA species it fully intersects with in order to do the counts.
Counts are tabulated separately for reads consistent with miRNA precursors, mature miRNAs and/or isomiRs, and all library counts are merged into a single table. Note that a read is only counted towards a given miRNA (or isomiR) species if one of its alignments falls fully within the (previously extended) locus annotated for that miRNA. Specifically, a read contributes `1/N` to each miRNA for which that is the case, where `N` is the total number of genomic loci the read aligns to. Under this criterion, the counts for a precursor include reads that intersect with its mature arm(s), its hairpin sequence and/or the whole precursor itself.
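The fractional counting rule can be made concrete with a small worked example. The input layout below (read name, miRNA name, number of loci `N`) is a hypothetical simplification chosen for the sketch, and the read and miRNA names are placeholders; the workflow derives this information from the intersected alignments.

```shell
# Placeholder input: one line per (read, miRNA) intersection, with the total
# number of genomic loci N the read aligns to in column 3.
hits="$(mktemp)"
printf '%s\t%s\t%s\n' \
    read_a mir-x 1 \
    read_b mir-x 2 \
    read_b mir-y 2 \
    read_c mir-y 4 \
    > "$hits"

# Add 1/N to the count of every miRNA a read fully intersects with.
counts="$(awk -F '\t' '
    { count[$2] += 1 / $3 }
    END { for (name in count) printf "%s\t%.2f\n", name, count[name] }
' "$hits" | sort)"
printf '%s\n' "$counts"
```

In this example `mir-x` receives 1 + 1/2 = 1.50 and `mir-y` receives 1/2 + 1/4 = 0.75.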
A sequence is considered to be an isomiR if it has a shift on either end, an InDel or a mismatch in its sequence when compared to the canonical miRNA it maps to and intersects with.
MIRFLOWZ employs an unambiguous notation to classify isomiRs using the format `miRNA_name|5p-shift|3p-shift|CIGAR|MD`, where `5p-shift` and `3p-shift` represent the differences between the annotated mature miRNA start and end positions and those of the read alignment, respectively.
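The sketch below assembles a label in this format from placeholder values. The sign convention (read position minus annotated position), the way the shifts are printed, the miRNA name and the coordinates are all assumptions made for illustration; consult the workflow documentation for the exact definition.

```shell
# Placeholder coordinates (1-based, inclusive) on the forward strand.
annot_start=100; annot_end=121    # annotated mature miRNA
read_start=101;  read_end=120     # read alignment (20 nt)

# Assumed convention: shift = read position - annotated position.
shift_5p=$(( read_start - annot_start ))
shift_3p=$(( read_end - annot_end ))

# Placeholder CIGAR string and MD tag for a 20-nt perfect match.
label="$(printf '%s|%d|%d|%s|%s' example-miR "$shift_5p" "$shift_3p" 20M 20)"
echo "$label"
```

Under these assumptions, the read starts one nucleotide after the annotated start and ends one nucleotide before the annotated end, giving `example-miR|1|-1|20M|20`.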
Finally, to visualize the distribution of read alignments around miRNA loci, ASCII-style alignment pileups are optionally generated for user-defined regions of interest.
The schema below is a visual representation of the individual workflow steps and how they are related:
NOTE: For a detailed description of each rule along with some examples, please refer to the workflow documentation.
MIRFLOWZ is an open-source project which relies on community contributions. You are welcome to participate by submitting bug reports or feature requests, taking part in discussions, or proposing fixes and other code changes. Please refer to the contributing guidelines if you are interested in contributing.
This project is covered by the MIT License.
For questions or suggestions regarding the code, please use the issue tracker. Do not hesitate to contact us via email for any other inquiries.