This page compiles links to tutorials, written by numerous authors, covering many of the steps involved in whole genome sequence (WGS) analysis of prokaryotic organisms. Some of these steps involve concepts that apply generally to whole genome sequencing of other organisms (e.g. read QC), although in many cases the recommended software would differ. The first step for any aspiring bioinformatician, at any level, is to build familiarity with the Linux command line, which provides access to powerful and flexible tools and applications.
The links and tutorials listed below were not written, and are not owned, by the author of this page unless explicitly noted. We take no responsibility for their maintenance or accuracy.
- Linux command line
- Programming
  - Python
  - Perl
  - R
- Core Concepts in WGS
  - Whole Genome Sequencing (WGS)
  - Library Preparation
  - Sequencing Technology
  - Coverage
  - Sequencing Reads
    - Short Reads
    - Long Reads
- Read QC
- Mapping and Variant Calling
- Assembly
- Assembly QC
- Annotation
- Phylogenomics
- Pangenomics
- K-mer and related
- Databases
  - NCBI
  - ENA
  - BIGSdb
  - Enterobase
- Servers
  - EDGE
Familiarity with the Linux command line is usually the first step for budding informaticians. Many tools are designed or distributed only for Linux-based systems. In addition, many powerful command-line operations, such as iterating through batches of files, can dramatically shorten and simplify workflows.
- Introduction to the command line (swcarpentry) - this tutorial describes the command line and covers file operations, loops and some more advanced operations.
- Bash for Genomics – a tutorial on using Bash with genomics data.
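The batch iteration mentioned above can also be sketched in Python, for readers who prefer a script to a shell loop. This is a minimal illustration only: the `.fastq.gz` suffix, the `reads` directory name and the function name `list_read_files` are assumptions made for the example, not something prescribed by the tutorials linked here.

```python
from pathlib import Path


def list_read_files(directory, suffix=".fastq.gz"):
    """Return the files in `directory` whose names end with `suffix`, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )


if __name__ == "__main__":
    # "reads" is a hypothetical directory name used only for illustration.
    reads_dir = Path("reads")
    if reads_dir.is_dir():
        for read_file in list_read_files(reads_dir):
            # Replace this print with whatever per-file step is needed.
            print(f"Processing {read_file.name}")
```

Collecting the file list in one function keeps the per-file step separate, so the same loop can be reused for different tools.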
Picking up a programming language gives an informatician more flexibility in how they approach analysis workflows. Scripts can automate many complex tasks in a more bespoke way than loops on the command line. There are excellent tutorials online for many languages. Python is widely considered the most popular and powerful language for bioinformatics, with Perl a (debatably) close second. R is often used to perform advanced statistical analyses and to produce publication-quality figures.
- Official Perl tutorial page - includes a free book on Perl programming.
- Official Python tutorial page - multiple tutorials for all levels.
- R for beginners – basic introduction to R and statistical analysis.
- ggplot2 tutorial – an incredibly flexible and powerful family of packages for creating figures using the grammar of graphics.
- Short read sequencing library preparation concepts (BitesizeBio)
- Overview of past and current WGS sequencing technologies
- Illumina Sequencing (Video)
- Nanopore (Video)
- PacBio (Video)
Sequence coverage, or depth of coverage, is the number of times a base in the target genome is covered by a read. For example, 30x coverage means that, on average, each base in the target genome is covered by 30 reads.
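As a rough worked example of this definition, average coverage is commonly estimated as the total number of sequenced bases divided by the genome size. The sketch below assumes all reads have the same length. The function name and the example numbers (a hypothetical 5 Mb genome sequenced with 150 bp reads) are illustrative only.

```python
def estimate_coverage(num_reads, read_length, genome_size):
    """Estimate mean depth of coverage as total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return (num_reads * read_length) / genome_size


# Hypothetical example: 1,000,000 reads of 150 bp against a 5 Mb genome.
coverage = estimate_coverage(num_reads=1_000_000, read_length=150, genome_size=5_000_000)
print(f"{coverage:.0f}x")  # prints "30x"
```

This is an average: real coverage varies along the genome, so some regions will fall well above or below the estimate.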
- Introduction to paired-end reads – slides introducing paired-end and mate-pair reads.
- Intro to long reads and long read technologies (Slides, Torsten Seemann)
- FastQC – an introduction to FastQC, a tool for assessing multiple read quality metrics.
- Trimmomatic manual - a tool for trimming reads and removing adapter sequences.
- snippy - a tool for mapping (BWA) and variant calling.
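To give a sense of the kind of metric the read QC tools listed above report, the sketch below computes the mean Phred quality of a single read from its FASTQ quality string. It assumes the Phred+33 encoding used by current Illumina output. The function name is hypothetical, and this is not how FastQC or Trimmomatic are implemented.

```python
def mean_phred_quality(quality_string, offset=33):
    """Return the mean Phred score of a FASTQ quality string (Phred+33 by default)."""
    if not quality_string:
        raise ValueError("quality string is empty")
    # Each character encodes one base quality as ASCII code minus the offset.
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


# 'I' encodes Phred 40 and '5' encodes Phred 20 under Phred+33.
print(mean_phred_quality("IIII5555"))  # prints 30.0
```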