Bioinformatics curriculum

This document is an incomplete list of suitable topics to study to learn the basics of DNA sequencing-based bioinformatics.

Learn to work on the Linux command line

As most bioinformatics tools are written to run in a Linux environment, it is important to learn how to work on the command line. In addition, high-performance computing resources are normally accessed via a terminal interface. There is a lot to learn, but after learning the 20 or so most used commands you can start to be productive on the command line!

Basic file handling and navigation: ls, cd, mkdir, cp, mv, rm, cat, less/more, chmod, etc.
At least one terminal editor: nano, emacs, vim
Learn to use terminal multiplexers: screen or preferably tmux
Advanced tools:
- Pipes and input/output redirection (|, <, >, stdin, stdout, and stderr)
- Shell scripts
- Regular expressions in grep, sed, awk...
- Non-standard power tools, e.g. GNU parallel, eBay's tsv-utils, etc.
Learn to work remotely over SSH
- Connect to remote computers
- Transfer files to/from remote computer using command line interface, e.g. scp, sftp, rsync, lftp
- Work on shared cluster systems with job submission systems (e.g. Slurm on UPPMAX)
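
To give a feel for what regular expressions do, here is a minimal sketch in Python's `re` module of the kind of pattern matching and substitution you would normally do with grep or sed. The FASTA text and its header layout (`sample=...`) are made up for this example.

```python
import re

# Hypothetical FASTA text; the header layout is invented for illustration.
fasta_text = """\
>seq1 sample=A length=8
ACGTACGT
>seq2 sample=B length=4
GGCC
"""

# Similar in spirit to `grep '^>'`: match only lines that start with '>',
# and capture the sequence ID and the sample name from each header.
header_pattern = re.compile(r"^>(\S+)\s+sample=(\w+)", flags=re.MULTILINE)
headers = header_pattern.findall(fasta_text)
print(headers)

# Similar in spirit to a sed substitution: on header lines, keep only the
# ID and drop the rest of the line. Sequence lines are left untouched.
short_headers = re.sub(r"^(>\S+).*$", r"\1", fasta_text, flags=re.MULTILINE)
print(short_headers)
```

The same patterns carry over to grep, sed, and awk with small syntax differences, so time spent on regular expressions pays off in every tool.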

ExplainShell does an amazing job at explaining the different components of a command line. Try it out!

Linux books:

The Linux Command Line
Bash Pocket Reference
Linux, macOS, Windows and more command line reference - Nothing extra

Programming

Knowing basic programming is essential for a bioinformatician. Programming is often used to handle input and output files, pre-process data files, create plots, and create workflows that run several different tools in a specified order. A well-constructed bioinformatics data analysis is reproducible, meaning that anyone can run the same analysis on a different computer using the same input files and get the same results. This is challenging in practice, so treat all scripts written in the course of a bioinformatics analysis project as the "log book" or "lab book" of how the analysis was actually performed. And while it can be rewarding to do a quick analysis of some output files at the command line, you should make it a habit to always include everything you do to the data in a script that you can come back to in the future when you have forgotten exactly what you did.
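
One way to make a script act as a lab book is to have it record exactly which inputs and parameters it ran with. The following is a minimal sketch of that idea, not an established standard: the function names, the `min_quality` parameter, and the record layout are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of_file(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_record(input_paths, parameters):
    """Collect checksums of the inputs and the parameters used (hypothetical layout)."""
    return {
        "inputs": {str(path): sha256_of_file(path) for path in input_paths},
        "parameters": parameters,
    }


# Demonstration with a throwaway file standing in for real input data.
with tempfile.TemporaryDirectory() as tmpdir:
    reads = Path(tmpdir) / "reads.fastq"
    reads.write_bytes(b"@r1\nACGT\n+\nIIII\n")
    record = build_run_record([reads], {"min_quality": 20})
    record_json = json.dumps(record, indent=2, sort_keys=True)
    print(record_json)
```

Saving such a record next to the output files lets you check later that a rerun really used the same input files.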

Version control systems

There are several version control systems that you can use to maintain a versioned history of, for example, program code. The most popular version control system in widespread use today is Git. Three common places to publish code are GitHub, GitLab, and Bitbucket. They all work pretty much the same.

There are some very good guides and tutorials listed below that will introduce you to the vocabulary and concepts concerning version control. Version control rocks(!) and is a crucial tool in a bioinformatician's tool belt, so take the opportunity to learn it as soon as possible. When you start writing code, you will eventually encounter a situation where you want to make changes to the code, but without losing the older version (that you know worked). Version control makes it possible to go back in time to older versions of the code, without having to mess with copies of files called my_code_version-20181015_final_final2.py. It will make your life so much easier and you will enjoy it!

Tutorials

Here are some nice introductions to version control:

git - the simple guide
GitHub resources for learning git
Blog post: A visual guide to version control
Blog post: Source control for scientists and soloists
Online tutorial: Atlassian Git tutorials and introduction to workflows

Academic accounts on GitHub

GitHub has some nice resources for research/education. Check out their education portal! You can also get a free researcher account that enables unlimited free private repositories.

Python

Python is the most common (and in my opinion the easiest to learn) programming language. It is typically available on all Linux systems. Some resources for learning about Python in general:

https://automatetheboringstuff.com/
http://rosalind.info (specifically "Python Village", but then later also "Bioinformatics Stronghold")

There was a big debate a couple of years ago about which version of Python to learn. That discussion is no longer relevant: you should learn Python 3. Start by installing the latest available version (3.6 or later). There are several ways to download and install Python, but I recommend learning to use conda. There is a conda getting started guide that is OK once you are familiar with the command line.

Unfortunately, there is no de facto standard integrated development environment (IDE) for Python like there is for R (i.e. RStudio, see more below). The most common alternatives are probably Microsoft's Visual Studio Code and JetBrains' PyCharm; both are great and cross-platform. VS Code is free for everyone, and a free community edition (without professional support) is available for PyCharm. Another important Python programming tool you should learn is Jupyter. It is a tool to work with interactive programming notebooks where you can combine blocks of Markdown-formatted notes with individually executable code blocks (with inline plots!). It is actually not specific to Python: its name refers to the languages Julia, Python, and R (JuPyteR), and it now runs more than a hundred different language kernels. It is often used in bioinformatics analyses and is becoming increasingly common as a way to share how analyses and plots were made for scientific papers.

R

R is by far the most commonly used language/environment for any type of data analysis that requires statistics. A bioinformatician has to be familiar with R. There is a very good Integrated Development Environment (IDE) available for R: RStudio. Ensure you become familiar with R, RStudio, and R Markdown (kind of like Jupyter notebooks, but focused on R).

Databases

There are several kinds of database systems, but the most common are relational database systems (often called SQL databases). There are others, especially NoSQL databases, that are gaining popularity (MongoDB is a NoSQL database that is seeing some use in bioinformatics applications). A bioinformatician can definitely benefit from learning the basics of SQL and a NoSQL system.
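
To get a first taste of SQL without installing a database server, you can use the `sqlite3` module in Python's standard library. The sketch below uses an in-memory database; the `samples` table, its columns, and the values are invented for illustration.

```python
import sqlite3

# An in-memory database disappears when the connection is closed.
connection = sqlite3.connect(":memory:")

# Hypothetical table of sequencing samples.
connection.execute(
    "CREATE TABLE samples ("
    "sample_id TEXT PRIMARY KEY, body_site TEXT, read_count INTEGER)"
)
connection.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("S1", "gut", 120000), ("S2", "gut", 95000), ("S3", "skin", 40000)],
)

# Count samples and sum reads per body site.
rows = connection.execute(
    "SELECT body_site, COUNT(*), SUM(read_count) FROM samples "
    "GROUP BY body_site ORDER BY body_site"
).fetchall()
print(rows)

connection.close()
```

The `?` placeholders let the database driver insert the values safely, which is a good habit to pick up from the start instead of building SQL strings by hand.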

A word on coding style

Using a consistent coding style is important to ensure code readability (you are going to read your code much more than you write it). Python has a style document called PEP 8 (Python Enhancement Proposal number 8), which is a great starting point for a standard Python coding style. Every Python programmer should read PEP 8 and try their best to follow it, to make it easy for other Python programmers to read and understand their code. In addition, have a look at The Zen of Python (i.e. PEP 20).
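
As a small illustration of a few PEP 8 conventions (snake_case function and variable names, UPPER_CASE constants, four-space indentation, a docstring, and spaces around operators), here is a short sketch. The function and the threshold value are made up for this example and are not taken from PEP 8 itself.

```python
MIN_GC_FRACTION = 0.4  # module-level constants use UPPER_CASE names


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    if not sequence:
        return 0.0
    upper_sequence = sequence.upper()
    gc_count = upper_sequence.count("G") + upper_sequence.count("C")
    return gc_count / len(upper_sequence)


example_fraction = gc_content("ACGTGGCC")
print(example_fraction >= MIN_GC_FRACTION)
```

Tools that check or automatically apply a style can take over much of this work once you know what the conventions are.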

There are style guides for R and SQL as well. A decent style guide for R is explained in the R for Data Science book (see link below).

Workflows

Workflow managers are tools that help you write reliable and easy-to-use bioinformatics workflows. They make it easy to run several different programs after each other, or sometimes in parallel. This is an advanced bioinformatics topic that will be most useful after you have learnt the basics of programming (in either Python or R) and started using established bioinformatics tools to process your data.

Nextflow (also check out nf-core)
Snakemake
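
The core idea behind a workflow manager can be sketched in a few lines: describe which step depends on which, and let the tool work out a valid running order. The example below uses `graphlib` from the Python standard library (Python 3.9 or later). It only illustrates the concept and is not how Nextflow or Snakemake are implemented; the step names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical analysis steps: each key depends on the steps in its set.
dependencies = {
    "quality_filter": set(),
    "map_reads": {"quality_filter"},
    "assemble": {"quality_filter"},
    "summarize": {"map_reads", "assemble"},
}

# static_order() yields every step after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

Here `map_reads` and `assemble` do not depend on each other, which is exactly the situation where a real workflow manager can run steps in parallel.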

General data analysis

General techniques

Ordination: PCA/PCoA, NMDS, t-SNE, OPLS-DA etc.
Classification: LDA, Decision trees, Random Forests, SVM, ANN, ROC curves, supervised/unsupervised, etc.
Clustering: Hierarchical, k-means, etc.
GUSTAME is a very useful field guide to multivariate statistics

Statistics

Simple hypothesis tests (t-tests, etc.)
Multiple testing correction: Bonferroni, FDR
ANOVA
Regression
Differential abundance testing
- Published article: ANCOM
- Official tutorial: DESeq2 (for RNA-seq, but applicable to metagenomics as well)
- Official user's guide: edgeR
- Build your own differential abundance tool
TileStats Videos
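
To make multiple testing correction concrete, here is a minimal pure-Python sketch of Bonferroni correction and Benjamini-Hochberg adjusted p-values (the most common FDR procedure). The function names and the example p-values are made up; for real analyses you would normally rely on a tested statistics package.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values never
    # increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


example_pvalues = [0.01, 0.04, 0.03, 0.005]  # invented values
print(bonferroni(example_pvalues))
print(benjamini_hochberg(example_pvalues))
```

Bonferroni controls the chance of any false positive and is therefore stricter; Benjamini-Hochberg controls the expected fraction of false positives among the results you call significant.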

Python

Book: Python for Data Analysis
Book: Python Data Science Handbook
Blog post: Greg Reda's Intro to pandas data structures
YouTube channel: Kevin Markham's pandas video series
Package docs: Pandas' 10 minutes to pandas
Package docs: Seaborn
Package docs: Jupyter

R

(Online) book: R for Data Science (If you're only going to read one book about R; this is the one!)
(Online) book: Orchestrating Microbiome Analysis
Online course: Coursera R-programming
Online course: Data Science in a Box

Databases

Online course: Stanford's Mini-Courses, Practical Relational Databases and SQL (https://lagunita.stanford.edu/courses/DB/2014/SelfPaced/about)
SQL
- Python documentation: SQLite3
- Online tutorial: PostgreSQL
NoSQL
- Official website: MongoDB

Bioinformatics stuff

Overviews/course materials for 16S and shotgun

General sequencing stuff

Common file formats:
- FASTA / FASTQ
- SAM/BAM (see Samtools)
- (Newick)
Quality assessment
- FastQC
Adapter trimming and quality filtering
- fastp, most likely the only quality trimming and filtering tool you will need.
- BBDuk (part of the BBMap suite of tools, see link below)
- Older, but still commonly used tools: Trimmomatic, FASTX-toolkit, TrimGalore! I recommend you avoid these when writing new analyses.
Mapping/aligning reads (maybe also something general on sequence alignment)
- Bowtie2
- BWA
- BBMap
- BLAST
- USEARCH
- VSEARCH
- Minimap2
- HMMER (also read about Profile Hidden Markov Models in this fantastic book, available as PDF: Durbin et al. 1998, Biological Sequence Analysis)
Assembling reads
- MEGAHIT
- SPAdes
Phylogenetic trees
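
To see what is inside a FASTQ file, here is a minimal sketch of a reader in Python. It assumes the common layout of exactly four lines per record and the Phred+33 quality encoding; the function names and the example reads are made up. For real work, prefer an established parser.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        # The read ID is the first word after the leading '@'.
        yield header[1:].split()[0], sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred quality score, assuming Phred+33 encoded characters."""
    return sum(ord(char) - offset for char in quality) / len(quality)


# Two invented reads in an in-memory "file".
example = io.StringIO("@read1 demo\nACGT\n+\nIIII\n@read2\nGG\n+\n!5\n")
records = list(read_fastq(example))
for read_id, sequence, quality in records:
    print(read_id, sequence, mean_phred(quality))
```

Each quality character encodes one Phred score: with the +33 offset, `I` is 40 and `!` is 0.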

16S read processing and OTU picking

QIIME2
DADA2
The Unoise suite
For functional predictions (use with caution): PICRUSt or Tax4Fun

16S taxonomic annotation

RDP
SILVA

General 16S tools

QIIME2 (again)
Mothur
MicrobiomeAnalyst
The Huttenhower Galaxy server
Take a look at Luisa's review

Shotgun metagenomics

Taxonomic profiling
- Using marker genes: MetaPhlAn2 and mOTU
- Using whole genome references: Kraken, Kaiju, etc.
Functional profiling
- HUMAnN2
- SUPER-FOCUS
- Mapping to gene database (e.g. IGC)
Metagenome assembly
- MEGAHIT
Binning
- Binning tools: CONCOCT, MaxBin, etc.
- Check quality of bins: CheckM
Metagenome-assembled genomes, MAGs

Online resources for bioinformatics questions

seqanswers.com
biostars.com
SciLifeLab Slack
wiki for metagenomics-related terminology
