NanoSim-H is a simulator of Oxford Nanopore reads that captures the technology-specific features of ONT data, and allows for adjustments upon improvement of Nanopore sequencing technology. NanoSim-H has been derived from NanoSim, a software package developed by Chen Yang at Canada's Michael Smith Genome Sciences Centre. The fork was created from version 1.0.1 and the versions of NanoSim-H and NanoSim are kept synchronized.
NanoSim-H is implemented in Python and uses R for model fitting.
In silico reads can be simulated from a given reference genome using `nanosim-h`.
The NanoSim-H package is distributed with several precomputed error profiles, but
additional profiles can be computed using the `nanosim-h-train` command.
The main improvements compared to NanoSim are:
- Support for Python 3
- Support for RNF read names
- Installation from PyPI
- Error profiles distributed with the main package
- Automatic testing using Travis
- Reproducible simulations (setting a seed for the pseudorandom number generator)
- Improved interface with new parameters (e.g., for merging all contigs) and a progress bar
- Several minor bugs fixed
Simulation of 100 reads from an E. coli genome:
pip install --upgrade nanosim-h
curl "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=fasta&val=545778205&sendto=on" | \
nanosim-h -n 100 -
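To sanity-check the result of the quick example, the read count and length range can be inspected with a few lines of Python. This is only a sketch: it assumes the default output prefix `simulated`, so that reads land in `simulated.fa`, and the helper names `read_fasta_lengths` and `summarize` are made up for illustration.

```python
import statistics
from pathlib import Path


def read_fasta_lengths(path):
    """Return a list of sequence lengths, one per FASTA record."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths):
    """Return (count, shortest, median, longest) for a non-empty list of lengths."""
    return len(lengths), min(lengths), statistics.median(lengths), max(lengths)


if __name__ == "__main__":
    fasta = Path("simulated.fa")  # assumption: default output prefix "simulated"
    if fasta.exists():
        print(summarize(read_fasta_lengths(fasta)))
```

With `-n 100`, the reported count should be 100 if the simulation finished normally.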
From Bioconda (recommended):
conda install -y -c bioconda nanosim-h
From PyPI:
pip install --upgrade nanosim-h
From GitHub:
git clone https://github.com/karel-brinda/nanosim-h
cd nanosim-h
pip install --upgrade .
or
git clone https://github.com/karel-brinda/nanosim-h
cd nanosim-h
python setup.py install
Dependencies:
For read simulation:
For computing new error profiles:
When installed using Bioconda, all NanoSim-H dependencies are installed automatically. When installed using pip, all dependencies for read simulation are installed automatically.
The simulation stage takes a reference genome and, optionally, an error profile as input, and outputs simulated reads in FASTA format.
$ nanosim-h --help
usage: nanosim-h [-h] [-p str] [-o str] [-n int] [-u float] [-m float]
                 [-i float] [-d float] [-s int] [--circular] [--perfect]
                 [--merge-contigs] [--rnf] [--rnf-add-cigar] [--max-len int]
                 [--min-len int] [--kmer-bias int]
                 <reference.fa>

Program: NanoSim-H - a simulator of Oxford Nanopore reads.
Version: 1.1.0.3
Authors: Chen Yang <cheny@bcgsc.ca> - author of the original software package (NanoSim)
         Karel Brinda <kbrinda@hsph.harvard.edu> - author of the NanoSim-H fork

positional arguments:
  <reference.fa>        reference genome (- for standard input)

optional arguments:
  -h, --help            show this help message and exit
  -p str, --profile str
                        error profile - one of precomputed profiles ('yeast',
                        'ecoli_R9_2D', 'ecoli_R7', 'ecoli_UCSC1b',
                        'ecoli_R9_1D', 'ecoli_R7.3') or own directory with an
                        error profile [ecoli_R9_2D]
  -o str, --out-pref str
                        prefix of output file [simulated]
  -n int, --number int  number of generated reads [10000]
  -u float, --unalign-rate float
                        rate of unaligned reads [detect from the error profile]
  -m float, --mis-rate float
                        mismatch rate (weight tuning) [1.0]
  -i float, --ins-rate float
                        insertion rate (weight tuning) [1.0]
  -d float, --del-rate float
                        deletion rate (weight tuning) [1.0]
  -s int, --seed int    initial seed for the pseudorandom number generator (0
                        for random) [42]
  --circular            circular simulation (linear otherwise)
  --perfect             output perfect reads, no mutations
  --merge-contigs       merge contigs from the reference
  --rnf                 use RNF format for read names
  --rnf-add-cigar       add cigar to RNF names (not fully debugged, yet)
  --max-len int         maximum read length [inf]
  --min-len int         minimum read length [50]
  --kmer-bias int       prohibits homopolymers with length >= n bases in
                        output reads [6]

Examples:
  nanosim-h --circular ecoli_ref.fasta
  nanosim-h --circular --perfect ecoli_ref.fasta
  nanosim-h -p yeast --kmer-bias 0 yeast_ref.fasta

Note: using `--max-len` and `--min-len` changes the read length distribution. If the range between them is too small, the program runs more slowly.
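The `--kmer-bias` option is described as prohibiting homopolymers of length >= n in output reads. One way to verify this on simulated reads is to measure the longest homopolymer run in each sequence. The sketch below is an independent check, not NanoSim-H's own code; the function names are hypothetical, and treating `n = 0` as "no limit" is an assumption based on the `--kmer-bias 0` example.

```python
from itertools import groupby


def longest_homopolymer(seq):
    """Return the length of the longest run of a single repeated base in seq."""
    return max((sum(1 for _ in run) for _, run in groupby(seq.upper())), default=0)


def violates_kmer_bias(seq, n=6):
    """Return True if seq contains a homopolymer of length >= n.

    n = 0 is treated as "no limit" (assumption, see the lead-in).
    """
    return n > 0 and longest_homopolymer(seq) >= n
```

Applied to every read in the output FASTA file, `violates_kmer_bias` should return False whenever the same `n` was passed to `--kmer-bias`.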
Examples:
To simulate reads from an E. coli genome, use circular mode, because the genome is circular.
nanosim-h --circular Ecoli_ref.fasta
To simulate only perfect reads (no SNPs or indels), so that only the read length distribution is simulated:
nanosim-h --circular --perfect Ecoli_ref.fasta
To simulate reads from a S. cerevisiae genome with no k-mer bias, use linear mode (the default), because the genome is linear.
nanosim-h -p yeast --kmer-bias 0 yeast_ref.fasta
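When simulations are driven from a script, the same commands can be assembled programmatically. The following is a minimal sketch that uses only options listed in the help text above; the function name `build_nanosim_h_command` and its keyword arguments are hypothetical and not part of NanoSim-H.

```python
import shlex


def build_nanosim_h_command(reference, profile=None, out_pref=None, number=None,
                            seed=None, kmer_bias=None, circular=False,
                            perfect=False):
    """Build an argument list for nanosim-h from options in its help text."""
    cmd = ["nanosim-h"]
    if profile is not None:
        cmd += ["-p", str(profile)]
    if out_pref is not None:
        cmd += ["-o", str(out_pref)]
    if number is not None:
        cmd += ["-n", str(number)]
    if seed is not None:
        cmd += ["-s", str(seed)]
    if kmer_bias is not None:
        cmd += ["--kmer-bias", str(kmer_bias)]
    if circular:
        cmd.append("--circular")
    if perfect:
        cmd.append("--perfect")
    cmd.append(reference)  # the reference genome is the positional argument
    return cmd


if __name__ == "__main__":
    # Print the yeast example as a shell-quoted command line.
    print(shlex.join(build_nanosim_h_command("yeast_ref.fasta",
                                             profile="yeast", kmer_bias=0)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once `nanosim-h` is installed and on the `PATH`.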
Output files:
- `simulated.log` – Log file of the simulation process.
- `simulated.fa` – FASTA file of simulated reads. Reads can contain information about how they were created, either in RNF or in the original NanoSim naming convention.
RNF naming convention
See the associated RNF paper and RNF specification.
NanoSim naming convention
Each read has "unaligned", "aligned", or "perfect" in its header, which indicates its error rate. "unaligned" means that the read has an error rate over 90% and cannot be aligned. "aligned" reads have the same error rate as the training reads. "perfect" reads have no errors.
To explain the information in the header, we have two examples:
>ref|NC-001137|-[chromosome=V]_468529_unaligned_0_F_0_3236_0
- All information before the first `_` is chromosome information. `468529` is the start position, and `unaligned` indicates that the read should not align to the reference. The first `0` is the sequence index. `F` represents the forward strand. `0_3236_0` means that the sequence extracted from the reference is 3236 bases long.
>ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2
- This is an aligned read coming from chromosome XI at position 115406. `16565` is the sequence index. `R` represents the reverse complement strand. `92_12710_2` means that this read has a 92-base head region (which cannot be aligned), followed by a 12710-base middle region, and then a 2-base tail region.
The information in the header can help users to locate the read easily.
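The two example headers can be decoded mechanically. The sketch below splits a header from the right into the eight fields described above; the class and function names are hypothetical. Splitting from the right gives the same result as splitting at the first `_` for the examples shown and also tolerates a chromosome name that contains `_`, although whether such names occur in practice is not stated here.

```python
from typing import NamedTuple


class NanoSimRead(NamedTuple):
    chromosome: str
    start: int
    status: str   # "aligned", "unaligned", or "perfect"
    index: int
    strand: str   # "F" (forward) or "R" (reverse complement)
    head: int
    middle: int
    tail: int

    @property
    def length(self):
        """Total read length: head + middle + tail regions."""
        return self.head + self.middle + self.tail


def parse_nanosim_header(header):
    """Parse a read name in the NanoSim naming convention."""
    name = header.lstrip(">").split()[0]
    # The last seven underscore-separated fields are fixed; the rest is the chromosome.
    chrom, start, status, index, strand, head, middle, tail = name.rsplit("_", 7)
    if status not in ("aligned", "unaligned", "perfect"):
        raise ValueError(f"unexpected read status: {status!r}")
    if strand not in ("F", "R"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return NanoSimRead(chrom, int(start), status, int(index), strand,
                       int(head), int(middle), int(tail))


if __name__ == "__main__":
    read = parse_nanosim_header(
        ">ref|NC-001143|-[chromosome=XI]_115406_aligned_16565_R_92_12710_2")
    print(read.chromosome, read.start, read.strand, read.length)
```

For the second example header, the head, middle, and tail regions add up to a total read length of 12804 bases.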
- `simulated.errors.txt` – List of introduced errors. The output contains the error type, position, original bases, and current bases.
The characterization stage takes a reference and a training read set in FASTA format as input. Users can also provide their own alignment file in MAF format.
Profiles distributed with NanoSim-H:
- `ecoli_R7`
- `ecoli_R7.3`
- `ecoli_R9_1D`
- `ecoli_R9_2D` (default error profile for read simulation)
- `ecoli_UCSC1b`
- `yeast`
New error profiles:
A new error profile can be obtained using the `nanosim-h-train` command.
$ nanosim-h-train --help
usage: nanosim-h-train [-h] [-i str] [-m str] [-b int] [--no-model-fit]
                       <reference.fa> <profile.dir>

Program: NanoSim-H-Train - compute an error profile for NanoSim-H.
Version: 1.1.0.3
Authors: Chen Yang <cheny@bcgsc.ca> - author of the original software package (NanoSim)
         Karel Brinda <kbrinda@hsph.harvard.edu> - author of the NanoSim-H fork

positional arguments:
  <reference.fa>        reference genome of the training reads
  <profile.dir>         error profile dir

optional arguments:
  -h, --help            show this help message and exit
  -i str, --infile str  training ONT real reads, must be fasta files
  -m str, --maf str     user can provide their own alignment file, with maf
                        extension
  -b int, --num-bins int
                        number of bins (for development) [20]
  --no-model-fit        no model fitting
Files associated with an error profile:
- `aligned_length_ecdf` – Length distribution of aligned regions on aligned reads.
- `aligned_reads_ecdf` – Length distribution of aligned reads.
- `align_ratio` – Empirical distribution of align ratio of each read.
- `besthit.maf` – The best alignment of each read based on length.
- `match.hist`, `mis.hist`, `ins.hist`, `del.hist` – Histograms of matches, mismatches, insertions, and deletions.
- `first_match.hist` – Histogram of the first match length of each alignment.
- `error_markov_model` – Markov model of error types.
- `ht_ratio` – Empirical distribution of the head region vs total unaligned region.
- `training.maf` – The output of LAST, alignment file in MAF format.
- `match_markov_model` – Markov model of the length of matches (stretches of correct base calls).
- `model_profile` – Fitted model for errors.
- `processed.maf` – A re-formatted MAF file for user-provided alignment file.
- `unaligned_length_ecdf` – Length distribution of unaligned reads.
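After training, it can be handy to see which of the files listed above are present in a profile directory. The sketch below assumes that the files sit directly in the directory under exactly these names, which the list does not confirm (real profiles may use a prefix or a different layout). Since `training.maf` and `processed.maf` come from different inputs, a missing file is not necessarily an error. The function name is hypothetical.

```python
from pathlib import Path

# File names copied from the list above; the on-disk layout is an assumption.
DOCUMENTED_PROFILE_FILES = (
    "aligned_length_ecdf", "aligned_reads_ecdf", "align_ratio", "besthit.maf",
    "match.hist", "mis.hist", "ins.hist", "del.hist", "first_match.hist",
    "error_markov_model", "ht_ratio", "training.maf", "match_markov_model",
    "model_profile", "processed.maf", "unaligned_length_ecdf",
)


def profile_file_report(profile_dir):
    """Map each documented file name to True if it exists in profile_dir."""
    root = Path(profile_dir)
    return {name: (root / name).is_file() for name in DOCUMENTED_PROFILE_FILES}


if __name__ == "__main__":
    # "my_profile" is a placeholder directory name.
    for name, present in profile_file_report("my_profile").items():
        print(("ok      " if present else "missing ") + name)
```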