Table of Contents:
- Setting up
- Connecting to Puhti
- Interactive use of Puhti
- QC and trimming for Nanopore reads
- Metagenomic assembly with metaFlye
- QC and trimming for Illumina reads
- Assembly polishing
- Assembly QC
- Taxonomic composition
- Genome-resolved metagenomics
- Genome taxonomy
- Genome annotation with Bakta
- Viromics
- Visualizing Metaphlan results
- Launch Visual Studio Code
- Only on the first time: open Extensions (one of the icons on the left) and install "Remote - SSH"
- In the lower-left corner you will have a (green) button with "><" (hover over it and it says "Open a Remote Window"); click it
- Choose "Connect Current Window to Host..."
- Type in user@puhti.csc.fi and hit "Enter" (change "user" to your own CSC username)
- Type your password and hit "Enter"
- In the following dialogue, type yes and hit "Enter"
- When the lower-left corner says "SSH: puhti.csc.fi", you're connected.
- From the menu select Terminal > New Terminal and you should see a new panel. This is the command line.
- When you need to log out, just type exit in the terminal/command line and hit "Enter" (or you can click the lower-left corner and choose "Close Remote Connection").
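If you prefer a plain terminal over VS Code, the same connection can also be opened with a normal SSH client; replace your-username with your own CSC username:

ssh your-username@puhti.csc.fi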
The main course directory is located in /scratch/project_2001499. There you will set up your own directory where you will perform all the tasks for this course.
First list all projects you're affiliated with in CSC.
csc-workspaces
You should see the course project MBDP_Metagenomics_2022.
So let's create a folder for you inside the scratch folder; you can find the path in the output of the previous command.
cd PATH/TO/COURSE/SCRATCH
mkdir $USER
Check with ls: which folder did mkdir $USER create?
This directory (/scratch/project_2001499/your-user-name) is your own working directory. Every time you log into Puhti, you should use cd to navigate to this directory.
Go to your own folder and clone the course repository there.
git clone https://github.com/MBDP-bioinformatics-courses/MBDP_Metagenomics_2022.git
Check what was downloaded and go to that folder. Then again check what is inside. All the scripts are to be run in this folder.
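For example, the checks could look like this (the folder name comes from the repository URL above):

ls
cd MBDP_Metagenomics_2022
ls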
The raw data used in this course can be found in /scratch/project_2001499/Data/. Instead of copying the data, we will use links to this folder in all of the tasks that need it.
Why don't we want 14 students copying data to their own folders?
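One way to answer that is to check how much disk space the shared data actually takes, and then imagine 14 copies of it:

du -sh /scratch/project_2001499/Data/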
We have both Nanopore long-read data and Illumina short-read data from one of the sewage treatment plants described in the article. Make separate folders for the two data types inside your own data folder, 01_DATA.
cd 01_DATA
mkdir Nanopore
mkdir Illumina
Then go first to the Nanopore folder and make a softlink to the raw long-read data. After making the softlink, check the content of the folder with ls.
cd Nanopore
ln -s /scratch/project_2001499/Data/Nanopore/SRR11673980.fastq.gz .
cd ..
Then do the same for the short-read data. Again, after making the softlinks, check the content of the folder with ls.
cd Illumina
ln -s /scratch/project_2001499/Data/Illumina/*.fastq.gz .
cd ..
How many files (links) did you have in the Nanopore folder? What about the Illumina folder?
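If you don't feel like counting by eye, you can let the shell do it for you (run this inside your 01_DATA folder):

ls Nanopore/ | wc -l
ls Illumina/ | wc -l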
Puhti uses a scheduling system called SLURM. Most jobs are sent to the queue, but smaller jobs can be run interactively.
An interactive session is launched with sinteractive.
You can specify the resources you need for your interactive work interactively with sinteractive -i, or you can give them as options to sinteractive.
You always need to specify the accounting project (-A, --account). Otherwise, for small jobs, you can use the default resources (see the table below).
| Option | Function | Default | Max |
| --- | --- | --- | --- |
| -i, --interactive | Set resources interactively | | |
| -t, --time | Reservation in minutes or in format d-hh:mm:ss | 24:00:00 | 7-00:00:00 |
| -m, --mem | Memory in MB | 2000 | 76000 |
| -j, --jobname | Job name | interactive | |
| -c, --cores | Number of cores | 1 | 8 |
| -A, --account | Accounting project | | |
| -d, --tmp | $TMPDIR size (in GiB) | 32 | 760 |
| -g, --gpu | Number of GPUs | 0 | 0 |
Read more about interactive use of Puhti.
Mini manual for screen:
- screen -S NAME - open a screen and give it a session name NAME
- screen - open a new screen without specifying any name
- screen -ls - list all open sessions
- ctrl + a + d - detach from a session (from inside the screen)
- screen -r NAME - re-attach to a detached session using the name
- screen -rD - re-attach to an attached session
- exit - close the screen and kill all processes running inside the screen (from inside the screen)
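As a concrete example, a typical screen workflow on Puhti could look like the following; the session name long_job is just an example:

screen -S long_job        # open a named screen session
# start your long-running command inside the screen, then detach with ctrl + a + d
screen -ls                # check which sessions are open
screen -r long_job        # re-attach to the session later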
The QC for the Nanopore reads can be done with NanoPlot. It is a plotting tool for long-read sequencing data and alignments. You can read more about it from here: NanoPlot.
This run requires some computing resources, so use sinteractive to log in to a computing node:
sinteractive -A project_2001499 -m 50G -c 4
NanoPlot is not pre-installed on Puhti, but it has been installed for the course under /scratch/project_2001499/envs/nanoQC and can be run from there.
Generate graphs for visualizing read quality and length distribution:
/scratch/project_2001499/envs/nanoQC/bin/NanoPlot \
-o 01_DATA/nanoplot_out \
-t 4 \
-f png \
--fastq softlink-to/raw_nanopore_reads.fastq.gz
Transfer the output from NanoPlot (NanoPlot-report.html) to your own computer and open it with any browser.
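One way to do the transfer is with scp from your own computer. The path below assumes you ran NanoPlot inside your course folder as above, so adjust it to match your own username and folder:

scp your-username@puhti.csc.fi:/scratch/project_2001499/your-user-name/MBDP_Metagenomics_2022/01_DATA/nanoplot_out/NanoPlot-report.html .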
Based on the results from NanoPlot:
- How many reads do we have?
- How much data do we have overall?
- What does the read length distribution look like and what is the mean read length?
- How about the read quality?
- Can you see any pattern between read length and read quality?
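If you want a quick sanity check of the read count reported by NanoPlot, you can count the FASTQ lines and divide by four (this assumes standard four-line FASTQ records):

echo $(( $(zcat 01_DATA/Nanopore/SRR11673980.fastq.gz | wc -l) / 4 ))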
As in the article, we will use Filtlong and Porechop for quality and adapter trimming of the Nanopore reads.
As we have quite a lot of data, trimming also takes a while. Submitting it as a batch job means we can use more resources, and the job should also run faster.
Make a batch job script using the example below and store it under scripts. But remember to check all file paths before you submit the job.
#!/bin/bash -l
#SBATCH --job-name trimming
#SBATCH --output 00_LOGS/trimming-%j.out
#SBATCH --error 00_LOGS/trimming-%j.err
#SBATCH --time 24:00:00
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 16
#SBATCH --mem 80G
#SBATCH --account project_2001499
#SBATCH --gres=nvme:200
/scratch/project_2001499/envs/nanoFiltering/bin/filtlong \
--min_length 4000 \
--min_mean_q 80 \
softlink-to/your_raw_nanopore_reads.fastq.gz |\
gzip > 02_TRIMMED_DATA/SRR11673980_trimmed.fastq.gz
/scratch/project_2001499/envs/nanoFiltering/bin/porechop \
-i 02_TRIMMED_DATA/SRR11673980_trimmed.fastq.gz \
-o 02_TRIMMED_DATA/SRR11673980_chop.fastq.gz \
--threads $SLURM_CPUS_PER_TASK
When you are sure everything is ok, submit the job with sbatch.
sbatch scripts/YOUR_SCRIPT.sh
After you have submitted the job, the system will print you the job id. Write that down.
You can also check the status of your job with squeue.
squeue -l -u $USER
And when the job is finished, you can use seff to see how much of the requested resources was actually used.
seff JOBID
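If you notice a mistake in the script only after submitting, you don't have to wait for the job to fail; you can cancel it with scancel and resubmit:

scancel JOBID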
After the trimming has finished and everything looks ok, we can move on.
It is always a good idea to check that the trimming step did what it was supposed to do, so we'll rerun the QC step on the trimmed data.
sinteractive -A project_2001499 -m 50G -c 4
/scratch/project_2001499/envs/nanoQC/bin/NanoPlot \
-o 02_TRIMMED_DATA/NANOPLOT \
-t 4 \
-f png \
--fastq 02_TRIMMED_DATA/SRR11673980_chop.fastq.gz
And if it looks good as well, we can move on to the assembly step.
For the assembly we will use Flye, a long-read de novo assembler that also handles metagenomic data. Before you start the assembly, have a look at the Flye manual, especially the parts about Nanopore data and metagenome assembly.
After you have read the manual, you can make a batch job script for Flye in the scripts folder.
Batch job script for assembly with metaFlye:
#!/bin/bash -l
#SBATCH --job-name assembly
#SBATCH --output 00_LOGS/assembly-%j.out
#SBATCH --error 00_LOGS/assembly-%j.err
#SBATCH --time 24:00:00
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 40
#SBATCH --mem 300G
#SBATCH --account project_2001499
#SBATCH --gres=nvme:200
/scratch/project_2001499/envs/Flye/bin/flye \
--nano-raw 02_TRIMMED_DATA/SRR11673980_chop.fastq.gz \
--threads $SLURM_CPUS_PER_TASK \
--meta \
--out-dir 03_ASSEMBLY
Submit the job to the queue.
sbatch scripts/YOUR_SCRIPT_NAME
Using the commands from above, you can check if your job is already running.
QC for the raw data takes a few minutes, depending on the allocation. QC does not require a lot of memory and can be run on an interactive node using sinteractive.
Open an interactive session and load the biokit module:
sinteractive -A project_2001499 -c 4
module load biokit
Run FastQC on the files stored in the Illumina folder. What do the -o and -t flags refer to?
mkdir 01_DATA/FASTQC
fastqc -o 01_DATA/FASTQC 01_DATA/Illumina/*.fastq.gz -t 4
FastQC makes a report for each file and as we have several, we can use MultiQC to combine them into one report.
MultiQC is not included in biokit, so we need to load it separately. It is wise to first unload all other modules with module purge.
module purge
module load multiqc/1.12
multiqc --interactive -o 01_DATA/ 01_DATA/FASTQC/*
Copy the resulting HTML file to your local machine. Have a look at the QC report with your favourite browser.
After inspecting the output, it should be clear that we need to do some trimming.
What kind of trimming do you think should be done?
We will do the trimming with cutadapt. The cutadapt module has to be loaded separately.
module purge
module load cutadapt/3.5
Have a look at the help pages for cutadapt.
cutadapt -h
The adapter sequences that you want to trim are given after -a and -A.
What is the difference between -a and -A?
And what is specified with the options -o and -p?
And how about -m and -j?
You can also find the answers in the Cutadapt manual.
Cutadapt can handle paired-end reads, but since we have several samples, we will make a for loop that trims each sample separately. The for loop will replace ${file} in the script with each of the sample IDs specified on the first line.
for file in SRR11674041 SRR11674042 SRR11674043
do
cutadapt \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-o 02_TRIMMED_DATA/${file}_trimmed_R1.fastq.gz \
-p 02_TRIMMED_DATA/${file}_trimmed_R2.fastq.gz \
01_DATA/Illumina/${file}_1.fastq.gz \
01_DATA/Illumina/${file}_2.fastq.gz \
--minimum-length 50 \
--cores 4 \
> 00_LOGS/cutadapt_${file}.log
done
You could now check each of the cutadapt log files and answer:
- How many read pairs did we have originally?
- How many reads contained adapters?
- How many read pairs were removed because they were too short?
- How many base calls were quality-trimmed?
- Overall, what is the percentage of base pairs that were kept?
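Instead of opening every log separately, you can pull the key summary lines out with grep; note that the exact wording of these lines can vary slightly between cutadapt versions:

grep "Pairs written" 00_LOGS/cutadapt_*.log
grep "too short" 00_LOGS/cutadapt_*.log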
Then make a new folder (FASTQC) for the QC files of the trimmed data and run FastQC and MultiQC again as you did before trimming:
FastQC:
module purge
module load biokit
mkdir 02_TRIMMED_DATA/FASTQC
fastqc -o 02_TRIMMED_DATA/FASTQC 02_TRIMMED_DATA/*_trimmed_R?.fastq.gz -t 4
MultiQC:
module purge
module load multiqc/1.12
multiqc --interactive -o 02_TRIMMED_DATA/ 02_TRIMMED_DATA/FASTQC/*
Copy the resulting HTML file to your local machine as earlier and see how well the trimming went.
Did you find any problems with the sequences? If needed, we could proceed to further quality control using Prinseq.
To leave the interactive node, type exit.
It is possible to polish the long-read assembly either using the original reads, as was done in the article, or using better-quality data, such as short reads, from the same sample.
However, the polishing step is very slow, so we won't run it on the course. We will rely on the polishing step done by metaFlye.
But as an example, here are the steps for polishing our assembly using the trimmed short-read data from the same sample.
# module load bowtie2/2.4.4
# bowtie2-build 03_ASSEMBLY/assembly.fasta \
# 04_POLISH/assembly
# bowtie2 -1 02_TRIMMED_DATA/SRR11674041_trimmed_R1.fastq.gz \
# -2 02_TRIMMED_DATA/SRR11674041_trimmed_R2.fastq.gz \
# -S 04_POLISH/assembly.sam \
# -x 04_POLISH/assembly \
# --threads $SLURM_CPUS_PER_TASK \
# --no-unal
# module purge
# module load samtools/1.6.1
# samtools view -F 4 -bS 04_POLISH/assembly.sam | samtools sort > 04_POLISH/assembly.bam
# samtools index 04_POLISH/assembly.bam
# module purge
# module load biojava/1.8
# java -Xmx128G -jar /scratch/project_2001499/envs/pilon/pilon-1.24.jar \
# --genome 03_ASSEMBLY/assembly.fasta \
# --bam 04_POLISH/assembly.bam \
# --outdir 04_POLISH/ \
# --output pilon \
# --changes
After the metagenomic assembly has finished, we will use the metagenomic version of QUAST (Quality Assessment Tool for Genome Assemblies) for (comparing and) evaluating our assembly.
module purge
module load quast/5.2.0
metaquast.py \
-o 03_ASSEMBLY/QUAST \
03_ASSEMBLY/assembly.fasta \
--max-ref-number 0 \
--fast \
--threads 4
Now you can copy the file 03_ASSEMBLY/QUAST/report.tsv to your own computer and inspect the quality control results in the program of your preference.
Next we will also analyze individual reads in addition to the assembly-based approaches. Which files would you use for this?
We will use a tool called Metaphlan4 to analyze these reads. From the paired-end Illumina reads, only the R1 reads are used in the following analyses.
#!/bin/bash -l
#SBATCH --job-name metaphlan
#SBATCH --output 00_LOGS/metaphlan-%j.out
#SBATCH --error 00_LOGS/metaphlan-%j.err
#SBATCH --time 06:00:00
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 8
#SBATCH --mem 50G
#SBATCH --account project_2001499
# load metaphlan4
module load metaphlan/4.0.2
# compare the three fastq files against the Metaphlan database that has been downloaded and formatted into the databases folder
for file in SRR11674041 SRR11674042 SRR11674043
do
metaphlan \
02_TRIMMED_DATA/${file}_trimmed_R1.fastq.gz \
--input_type fastq \
--bowtie2db /scratch/project_2001499/databases/metaphlan/ \
--bowtie2out 07_METAPHLAN/${file}.bowtie2.bz2 \
--nproc $SLURM_CPUS_PER_TASK \
-o 07_METAPHLAN/${file}_metaphlan.txt
done
Once finished, we need to merge the output files:
module load metaphlan/4.0.2
cd 07_METAPHLAN
merge_metaphlan_tables.py ./*_metaphlan.txt > merged_abundance_table.txt
This table can be opened in a spreadsheet program (such as Excel) or any gene expression analysis program, inspected with less (example below), or visualized graphically as shown in the later sections of this tutorial:
less -S merged_abundance_table.txt
The first few lines look like:
#mpa_vJan21_CHOCOPhlAnSGB_202103
clade_name SRR11674041_metaphlan SRR11674042_metaphlan SRR11674043_metaphlan
k__Bacteria 98.57217 96.94795 97.23816
k__Archaea 1.42783 3.05205 2.76184
k__Bacteria|p__Proteobacteria 47.18289 25.00974 29.88884
k__Bacteria|p__Bacteroidetes 16.60339 18.75608 3.5224
k__Bacteria|p__Actinobacteria 14.8151 33.82423 45.40604
k__Bacteria|p__Firmicutes 7.48232 7.56776 6.52967
k__Bacteria|p__Nitrospirae 6.9101 7.82463 8.53506
k__Bacteria|p__Candidatus_Bipolaricaulota 2.44231 2.11629 2.0278
k__Bacteria|p__Thermotogae 2.01616 1.18616 0.67207
k__Archaea|p__Euryarchaeota 1.42783 3.05205 2.76184
k__Bacteria|p__Synergistetes 0.58856 0.30465 0.41901
k__Bacteria|p__Candidatus_Cloacimonetes 0.29277 0.16749 0.03304
k__Bacteria|p__Cyanobacteria 0.09528 0.09195 0.0
k__Bacteria|p__Candidatus_Saccharibacteria 0.05373 0.06808 0.19078
After this, let's make a species-level abundance table. Without a redirect the result is printed to the screen; add > filename at the end if you want to save it to a file.
grep -E "s__|clade" merged_abundance_table.txt \
| grep -v "t__" \
| sed "s/^.*|//g" \
| sed "s/clade_name/species/g" \
| sed "s/SRS[0-9]*-//g"
And here is a species-level abundance table with the full taxonomy, saved to a file:
awk '$1 ~ "clade_name" || $1 ~ "s__" {print $0}' merged_abundance_table.txt | grep -v "|t__" > merged_species_table.txt
We will visualise the Metaphlan results using R. We will use RStudio through the Puhti web interface, together with the R packages vegan, tidyverse, phyloseq and microViz. Only microViz needs to be installed yourself on Puhti. You can find instructions on how to install your own packages on Puhti from here.
Log in to www.puhti.csc.fi and choose RStudio from the apps. Use the default resources and click Launch.
Commands for R:
# setup path to installed packages
.libPaths(c("/projappl/project_2001499/project_rpackages_r421", .libPaths()))
libpath <- .libPaths()[1]
# load packages
library(vegan)
library(tidyverse)
library(phyloseq)
library(microViz)
# read in the data, MAKE SURE THE PATH IS CORRECT
species <- read_tsv(file="/scratch/project_2001499/YOUR_FOLDER_NAME/MBDP_Metagenomics_2022/07_METAPHLAN/merged_species_table.txt")
# number of species detected
species %>%
select(-clade_name) %>%
specnumber(MARGIN=2)
# Shannon diversity
species %>%
select(-clade_name) %>%
diversity(MARGIN=2)
## make phyloseq object
# taxonomy table
TAX <- species %>%
select(clade_name) %>%
separate(clade_name, "[|]", into=c("kingdom", "phylum", "class", "order", "family", "genus", "species")) %>%
as.matrix %>%
tax_table()
# count table
OTU <- species %>%
select(-clade_name) %>%
otu_table(taxa_are_rows=TRUE)
# phyloseq object
ps <- phyloseq(OTU, TAX)
# have a look at the phyloseq object
ps
# interactive visualization with microViz package
ord_explore(ps)
If you want to learn more, go to microViz or phyloseq manual and explore.
Other great resources for microbiome data analysis in R (by Leo Lahti et al.):
- miaverse
- Orchestrating Microbiome Analysis
- Introduction to microbiome data science
- Introduction to the microbiome R package
And a text book with examples:
Anvi'o is an analysis and visualization platform for omics data. We will use anvi'o for binning contigs into metagenome-assembled genomes (MAGs).
You should definitely take a look at their website and maybe even join their discord channel.
Before creating the contigs database in anvi'o, we need to do some reformatting of our assembly file.
The reformatting program removes contigs shorter than 1 000 bp and simplifies the sequence names.
module load anvio/7.1
anvi-script-reformat-fasta \
--min-len 1000 \
--simplify-names \
-o 05_ANVIO/contigs.fasta \
--report-file 05_ANVIO/reformat_report.txt \
03_ASSEMBLY/assembly.fasta
The first step in our genome-resolved metagenomics pipeline is the construction of a contigs database from our metagenomic contigs. During this step anvi'o calls genes, calculates the nucleotide composition of each contig, and annotates the identified genes in three different steps.
We will run the first steps as a batch job.
Batch job script; save it to the scripts folder.
#!/bin/bash -l
#SBATCH --job-name anvi-contigs-db
#SBATCH --output 00_LOGS/anvi-contigs-db-%j.out
#SBATCH --error 00_LOGS/anvi-contigs-db-%j.err
#SBATCH --time 03:00:00
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 20
#SBATCH --mem 30G
#SBATCH --account project_2001499
#SBATCH --gres=nvme:200
module purge
module load anvio/7.1
# generate contigs DB
anvi-gen-contigs-database \
-f 05_ANVIO/contigs.fasta \
-o 05_ANVIO/CONTIGS.db \
-n LongReadAssembly \
-T $SLURM_CPUS_PER_TASK
# annotate marker genes
anvi-run-hmms \
-c 05_ANVIO/CONTIGS.db \
-T $SLURM_CPUS_PER_TASK
# annotate COGs
anvi-run-ncbi-cogs \
-c 05_ANVIO/CONTIGS.db \
--cog-data-dir /scratch/project_2001499/databases/anvio/ \
-T $SLURM_CPUS_PER_TASK
# annotate single-copy core genes
anvi-run-scg-taxonomy \
-c 05_ANVIO/CONTIGS.db \
--scgs-taxonomy-data-dir /scratch/project_2001499/databases/anvio/ \
-T $SLURM_CPUS_PER_TASK
Submit the job.
sbatch scripts/YOUR_SCRIPT_NAME
The differential coverage for each contig is calculated by mapping sequencing reads to the assembly.
We will use Bowtie2 to map the short-read Illumina data to our assembly (the anvio-reformatted version of it).
Mapping batch job script; again, save it to the scripts folder:
#!/bin/bash -l
#SBATCH --job-name anvi-mapping
#SBATCH --output 00_LOGS/anvi-mapping-%j.out
#SBATCH --error 00_LOGS/anvi-mapping-%j.err
#SBATCH --time 2:00:00
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 40
#SBATCH --mem 80G
#SBATCH --account project_2001499
#SBATCH --gres=nvme:200
module load bowtie2/2.4.4
bowtie2-build --threads $SLURM_CPUS_PER_TASK 05_ANVIO/contigs.fasta 05_ANVIO/contigs
for file in SRR11674041 SRR11674042 SRR11674043
do
module load bowtie2/2.4.4
bowtie2 \
-1 02_TRIMMED_DATA/${file}_trimmed_R1.fastq.gz \
-2 02_TRIMMED_DATA/${file}_trimmed_R2.fastq.gz \
-x 05_ANVIO/contigs \
-S 05_ANVIO/${file}.sam \
--threads $SLURM_CPUS_PER_TASK \
--no-unal
module purge
module load samtools
samtools view -@ $SLURM_CPUS_PER_TASK -F 4 -bS 05_ANVIO/${file}.sam |\
samtools sort -@ $SLURM_CPUS_PER_TASK > 05_ANVIO/${file}.bam
samtools index -@ $SLURM_CPUS_PER_TASK 05_ANVIO/${file}.bam
rm 05_ANVIO/${file}.sam
done
Submit the job.
sbatch scripts/YOUR_SCRIPT_NAME
When the contigs database and all three mappings are ready, we can make the profile databases and finally merge all three profile databases into one.
Once again, save the batch job script to the scripts folder.
#!/bin/bash -l
#SBATCH --job-name anvi-profiling
#SBATCH --output 00_LOGS/anvi-profiling-%j.out
#SBATCH --error 00_LOGS/anvi-profiling-%j.err
#SBATCH --time 2:00:00
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 20
#SBATCH --mem 80G
#SBATCH --account project_2001499
#SBATCH --gres=nvme:200
for file in SRR11674041 SRR11674042 SRR11674043
do
module load anvio/7.1
anvi-profile \
-i 05_ANVIO/${file}.bam \
-c 05_ANVIO/CONTIGS.db \
-S ${file} \
--min-contig-length 5000 \
-o 05_ANVIO/${file}_PROFILE \
-T $SLURM_CPUS_PER_TASK
done
anvi-merge \
-o 05_ANVIO/SAMPLES-MERGED \
-c 05_ANVIO/CONTIGS.db \
--enforce-hierarchical-clustering \
05_ANVIO/*_PROFILE/PROFILE.db
Submit the job.
sbatch scripts/YOUR_SCRIPT_NAME
While the profile job is running, we can quickly estimate the taxonomic composition of our assembled metagenome based on the single-copy core genes found in it.
module load anvio/7.1
anvi-estimate-scg-taxonomy \
-c 05_ANVIO/CONTIGS.db \
--metagenome-mode \
--output-file 05_ANVIO/scg-taxonomy.txt
Have a look at the output.
less -S 05_ANVIO/scg-taxonomy.txt
Then everything should be ready and we can start the manual binning part.
We will go through the first steps together and provide you with the port number that you will need in the next steps.
sinteractive -A project_2001499 --mem 20G
module load anvio/7.1
export ANVIOPORT=YOUR_PORT_HERE
Start the interactive interface:
anvi-interactive \
-c 05_ANVIO/CONTIGS.db \
-p 05_ANVIO/SAMPLES-MERGED/PROFILE.db \
--server-only \
--port-number $ANVIOPORT
After making the pre-clustering collection (PreCluster), start refining the clusters.
anvi-refine \
-c 05_ANVIO/CONTIGS.db \
-p 05_ANVIO/SAMPLES-MERGED/PROFILE.db \
--server-only \
--port-number $ANVIOPORT \
--collection-name PreCluster \
--bin-id Bin_1
When all clusters have been refined a bit more, we can make a new collection from them.
anvi-rename-bins \
-c 05_ANVIO/CONTIGS.db \
-p 05_ANVIO/SAMPLES-MERGED/PROFILE.db \
--collection-to-read PreCluster \
--collection-to-write PreBins \
--prefix Preliminary \
--report-file 05_ANVIO/PreBin_report.txt
And then we can make a summary of the clusters or bins we have so far.
anvi-summarize \
-c 05_ANVIO/CONTIGS.db \
-p 05_ANVIO/SAMPLES-MERGED/PROFILE.db \
--collection-name PreBins \
--output-dir 05_ANVIO/PreBins_SUMMARY \
--quick-summary
Download the output folder to your own computer and have a look at the index.html file inside it.
Do you have any promising bins?
Then proceed with refining the most promising bins with anvi-refine. Remember to specify the new collection name and check the new bin names. The goal is to get at least a few good bins that we can call MAGs. Then, when you're happy with your work, you can run the anvi-summarize command on your collection without the --quick-summary option.
Refine a bin: anvi-refine. For refining you need to give the contigs database (-c 05_ANVIO/CONTIGS.db), the profile database (-p 05_ANVIO/SAMPLES-MERGED/PROFILE.db), the collection you created in the previous step (--collection-name PreBins) and the name of the preliminary bin you want to refine (-b Preliminary_Bin_00002). A full example command is shown below.
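Putting those pieces together, a full refine command could look like the sketch below; the collection and bin names here are just the examples from above, so use the ones you actually created:

anvi-refine \
-c 05_ANVIO/CONTIGS.db \
-p 05_ANVIO/SAMPLES-MERGED/PROFILE.db \
--collection-name PreBins \
-b Preliminary_Bin_00002 \
--server-only \
--port-number $ANVIOPORT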
Rename all bins and make a new collection from them: anvi-rename-bins.
Summarize a collection: anvi-summarize.
After binning you hopefully have some good quality MAGs. The next step is to give some names to these metagenome assembled genomes. We will use Genome Taxonomy Database (GTDB) and a tool called GTDB-Tk for this.
GTDB-Tk needs a database for the taxonomic annotation, and downloading it has already been done for you. The instructions can be found here.
And the actual commands were:
############# (THIS HAS BEEN DONE ALREADY) #########################
## download gtdb database
# wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_v2_data.tar.gz
# tar -xzf gtdbtk_v2_data.tar.gz
############# (THIS HAS BEEN DONE ALREADY) #########################
If you're connected to an interactive node, exit from it with exit. We need more resources for the next step.
sinteractive -A project_2001499 --mem 70G -c 6
First we need to store the location of the database in the $GTDBTK_DATA_PATH environment variable. Otherwise GTDB-Tk won't find it and will complain.
export GTDBTK_DATA_PATH=/scratch/project_2001499/databases/GTDB/release207_v2/
Then we can have a look at the different options of the tool.
singularity exec --bind $GTDBTK_DATA_PATH:$GTDBTK_DATA_PATH,$PWD:$PWD,$TMPDIR:/tmp \
/scratch/project_2001499/envs/gtdb-tk.sif \
gtdbtk -h
And then actually run the program on a few selected MAGs. The MAG sequences should all be in one folder; you can, for example, copy your selected genomes into such a folder or make softlinks to them there.
singularity exec --bind $GTDBTK_DATA_PATH:$GTDBTK_DATA_PATH,$PWD:$PWD,$TMPDIR:/tmp \
/scratch/project_2001499/envs/gtdb-tk.sif \
gtdbtk classify_wf \
-x fasta \
--genome_dir PATH/TO/GENOME/FOLDER \
--out_dir OUTPUT/FOLDER \
--cpus 6 \
--tmpdir /tmp
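When the classification has finished, the results are written to summary files in the output folder. For bacterial MAGs the file is usually called gtdbtk.bac120.summary.tsv (and gtdbtk.ar53.summary.tsv for archaeal MAGs), but check your own output folder for the exact file names:

less -S OUTPUT/FOLDER/gtdbtk.bac120.summary.tsv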
Now we can annotate some of our MAGs using Bakta.
Allocate some resources for the job.
sinteractive -A project_2001499 -c 4
And then run Bakta on your favourite MAGs.
/scratch/project_2001499/envs/bakta/bin/bakta \
INPUT \
--db /scratch/project_2001499/databases/bakta/db/ \
--prefix GENOME_NAME \
--threads 4 \
--output OUTPUT
After this, let's parse one of the output files. Go to your OUTPUT folder (the name you gave in the previous step) and run the following one-liner.
grep -o "KEGG:K....." *.gff3 | tr ":" "\t" > mygenome_kegg_ids.txt
Then let's visualize the metabolic pathways with KEGG-mapper.
The assembled bulk metagenome also contains viral sequences. There are various bioinformatics tools for detecting viral sequences in metagenomes; they are based on different algorithms and thus perform differently. During this course, we will use Virsorter2 and Lazypipe. We will also look at sample results from the What-the-Phage pipeline (you will get ready-made output files).
Questions to think about:
- Why does the metagenome also contain viral DNA?
- Which types of viruses can be seen in a bulk metagenome obtained from an environmental DNA sample? Can we find, e.g., RNA viruses there?
Make a directory called Virsorter2 in your 06_VIROMICS directory and make a batch job script there.
Sample script:
Make sure you change the $USER in export INPUT_PATH=/scratch/project_2001499/$USER/MBDP_Metagenomics_2022/03_ASSEMBLY/ to your own directory name.
#!/bin/bash
#SBATCH --job-name=virsorter2
#SBATCH --account=project_2001499
#SBATCH --time=2-00:00:00
#SBATCH --cpus-per-task=30
#SBATCH --mem=15G
#SBATCH --partition=small
#SBATCH --error=virsorter2_err_%j.txt
export INPUT_PATH=/scratch/project_2001499/$USER/MBDP_Metagenomics_2022/03_ASSEMBLY/
apptainer exec \
--bind $INPUT_PATH:$INPUT_PATH,$PWD:$PWD \
/scratch/project_2001499/envs/virsorter2/virsorter2.sif \
virsorter run \
-w virsorter2.out \
-i $INPUT_PATH/assembly.fasta \
--min-length 1500 \
--include-groups "dsDNAphage,ssDNA,RNA,NCLDV,lavidaviridae" \
-j 30 \
all
Here virsorter2.out is the output (results) directory (you can name it yourself) and assembly.fasta is the assembly input file. The option --include-groups specifies the virus groups to search for (Virsorter version 2.2.3, which we have installed on Puhti, searches only for the dsDNAphage and ssDNA virus groups by default; we can add more). Explore more of the options in the manual or by calling:
apptainer exec --bind $PWD:$PWD /scratch/project_2001499/envs/virsorter2/virsorter2.sif virsorter run -h
Note that you need to specify the path to your assembly file.
Submit a batch job.
The useful output files are final-viral-boundary.tsv, final-viral-score.tsv, and final-viral-combined.fa. Copy them to your computer and explore them.
The final-viral-boundary.tsv file contains information about the contigs that were identified as viral. Full sequences identified as viral have the suffix "||full"; partial sequences identified as viral have the suffix "||{i}index_partial" ("{i}" is a number from 0 up to the maximum number of viral fragments found in that contig).
- How many contigs were annotated as viral? How many full and partial?
- Which virus groups were predicted?
- The group column shows the viral group whose classifier gives a high score, but can it be used as a reliable classification?
- Which groups can Virsorter2 identify (check the publication) and which would you expect to see in our sample? Compare this to the ones predicted.
final-viral-score.tsv shows the score of each viral sequence across groups. The score ranges from 0 to 1, and a higher score means the sequence is more likely to be viral.
- Check the score for different contigs, how many have 1.000?
- Check the number of hallmark genes found for different contigs: are there many viral contigs with 0 viral genes, and what is the maximum number found?
- Why are some contigs with 0 viral genes still annotated as viral? Check the working principles of Virsorter2 from the publication.
- What is the length of the contigs? The smallest vs the biggest? What is the average bacteriophage genome length in principle?
final-viral-combined.fa contains the viral contig sequences in FASTA format. They can be used for downstream analyses.
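A quick way to count the viral contigs (and peek at their names) directly on Puhti before copying anything is with grep:

grep -c ">" final-viral-combined.fa
grep ">" final-viral-combined.fa | head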
We will use CheckV to assess the quality and completeness of the obtained viral contigs.
CheckV needs a database; it is in /scratch/project_2001499/checkv-db (and this is specified in the command below).
Move to the Virsorter2 output directory in your 06_VIROMICS directory and run CheckV interactively from there:
sinteractive -A project_2001499 -m 10G -c 8
Run the job:
apptainer exec --bind $PWD:$PWD,/scratch/project_2001499/checkv-db/db:/db \
/scratch/project_2001499/envs/checkV/checkV.sif \
checkv end_to_end \
final-viral-combined.fa \
checkv_out \
-t 8 \
-d /db
Note that you need to specify the CheckV output directory (checkv_out in this sample script).
The useful files are quality_summary.tsv, completeness.tsv, and complete_genomes.tsv. Copy them to your computer and explore them.
quality_summary.tsv is the main output file.
- Are there any proviruses predicted?
- Are most contigs of low, medium or high quality? See how the quality corresponds to completeness.
- Are there 100% complete viral genomes listed?
- Are there contigs with kmer_freq > 1? This indicates that the viral genome is represented multiple times in the contig, which is quite rare.
- What warnings does the table contain?
completeness.tsv shows how the completeness was estimated, i.e. AAI- and HMM-based completeness. See the publication and FAQs for more details about these two methods.
complete_genomes.tsv shows all viral genomes identified as complete ones.
- How many complete genomes are there in your dataset?
- What are the confidence levels for complete genomes predictions?
In practice, if you continue with downstream applications (not during this course), you might want to get rid of possible false positives in the final set of contigs identified as viral. For example, a subset of viral contigs that have at least 1 viral gene or at least 10 kbp long and 50% complete can be selected. The exact selection criteria would depend on specific research questions you have and thus, on the type of analysis you would like to perform. Here you can find a useful protocol for manual curation of Virsorter2 output.
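As a rough sketch of such a filter, the awk one-liner below keeps contigs with at least one viral gene, or contigs that are at least 10 kbp long and at least 50% complete. The column names viral_genes, contig_length and completeness are assumptions based on a typical CheckV quality_summary.tsv header, so check your own file and adjust them if needed:

awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) col[$i]=i; print; next}
($(col["viral_genes"]) >= 1) || ($(col["contig_length"]) >= 10000 && $(col["completeness"]) >= 50)' \
checkv_out/quality_summary.tsv > checkv_out/quality_summary_filtered.tsv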
Lazypipe is a pipeline for identifying virus sequences in metagenomes, developed at the University of Helsinki. It is available as a preinstalled module on Puhti. The input for the pipeline is next-generation sequencing data. We will use Illumina reads from the same type of samples as the long-read data that you used for the assembly with metaFlye (and Virsorter2).
Make a directory called Lazypipe in your 06_VIROMICS directory and launch the Lazypipe batch job from there. Note that you need to specify the path to the Illumina reads.
module load r-env-singularity
module load biokit
module load lazypipe
cp /appl/soft/bio/lazypipe/2.0/lazypipe/default.config.yaml config.yaml
lazypipe.pl
sbatch-lazypipe -1 /PATH_TO_TRIMMED_ILLUMINA_READS/SRR11674042_trimmed_R1.fastq.gz
You will be interactively asked for the information needed to construct the batch job:
- accounting project
- maximum duration of the job (default 24 hours) - choose 24 hours
- memory reservation (default 8G) - choose 20G
- number of computing cores to use (default 8)
- email notifications - provide your email address to be notified when the batch job is finished
Useful output files: qc.readsurv.jpeg, abund_table.xlsx, krona_graph.html, and contigs.annot.xlsx. Download them to your computer and explore them.
qc.readsurv.jpeg is the quality control (QC) plot tracking retained reads after each pipeline step.
abund_table.xlsx shows which taxa were identified in the sample and how many read pairs (readn) and contigs (consign) were assigned to each taxon. Search for the viruses.
- What viral taxa can be found?
- Do they have many contigs assigned?
krona_graph.html displays the estimated taxon abundances. This graph is interactive: you can zoom into different groups and see their abundance as a percentage. Select the viruses slice of the graph to see which virus groups were identified.
- What is the abundance of viruses in the studied sample?
- Which of the identified virus groups are the most and the least abundant?
- Which bacterial groups are the most abundant?
contigs.annot.xlsx contains all info for the classified contigs separately for viruses (not bacteriophages), bacteria, bacteriophages, eukaryotes, and unknown. Explore the viruses and bacteriophages tabs.
- Which putative ORF functions could be found for viral (including bacteriophages) sequences?
Viral contigs for possible downstream analyses are found in contigs_vi.fa.
“What the Phage” (WtP) is a parallel multitool pipeline for phage prediction combined with annotation and classification. It includes 11 phage-predicting tools that are all run in parallel. The user can then compare the results of the different prediction tools, summarised in charts and tables, and make their own decision about which dataset to proceed with.
Due to time limits and some technical challenges associated with running this pipeline, we will only see sample output files during this course. However, you could try it later yourself, following the authors' instructions as well as CSC notes about this pipeline installation.
The sample output files for this course were obtained from the same Illumina read dataset as used for Lazypipe. The reads were assembled using the Megahit assembler. The output summary file to download and explore is /scratch/project_2001499/WtP_output/final_report.html.
Explore different tabs of the report. It might be easier to view the tables when they are downloaded as .xlsx or .csv, rather than using the browser.
Overview shows a plot summarizing the prediction performance of each tool for the studied sample. Blue bars on the left show the total number of identified phage contigs per tool. Black bars show the number of contigs that each tool (or a combination of tools) has uniquely identified. A dot matrix below refers to the combinations of tools that identified the same contigs as phages.
- Which tools were applied for phage predictions?
- Which tool gave the highest number of phage contigs? Which gave the lowest?
- Are the predicted phage contigs consistent between different tools (see the dots matrix)?
Phage prediction table contains a list of all predicted phage contigs and shows which tool gave which prediction and with which score.
- Do predictions from different tools overlap?
CheckV output lists all the predicted contigs and shows their quality based on CheckV.
- What is the typical phage contig length in this dataset? What is the lowest and highest?
- Are most contigs of high/low/medium quality?
- Are there complete contigs in the list?
- Are there proviruses?
- Are there contigs with no viral genes?
Phage annotations shows predicted functions obtained for phage ORFs.
- Which functions were predicted?
- Could you try to categorise them (e.g. DNA modification, virion structural genes, etc.)?
- Which functional predictions are specific for viruses?
Taxonomic phage classification shows taxonomical predictions for phage contigs.
- Could any contigs be classified?
To sum up, how would you proceed with the WtP results taking into account the possibility of having false positives? For further analyses, would you
- take all predictions? Or
- take predictions only from some tools? Or
- make a subset based on the quality check?
Compare the data you got from Virsorter2, What-the-Phage and Lazypipe.
- What have you learnt about the viruses present in the studied datasets with each of these tools?
- What would you consider to be the pluses and minuses of each tested tool, in terms of usage and the final data types produced?
- If you need to study viruses in a metagenome in future, which of the tested tools would you select for your own work?
We are not proceeding with further viromics analyses during this course, but if you would like to test other viral identification tools and pipelines, see references in the lecture slides. For downstream applications, you could check, for example:
- vConTACT v.2.0 for taxonomic assignments of the identified virus genomes;
- iVirus 2.0, an analytic platform with many tools, protocols and public datasets;
- IMG/VR, the largest collection of viral sequences obtained from metagenomes;
- Review of in silico methods to link viral sequences to their hosts.