From 716020442e5416d5f8507ae9efd0ad35d0ad3bae Mon Sep 17 00:00:00 2001
From: "C. Titus Brown" <titus@idyll.org>
Date: Mon, 29 Apr 2024 03:04:48 -0700
Subject: [PATCH] MRG: add more text (#8)

* add more text

* add extra exercise

* add citation

* add more text
---
 docs/amr.md                         | 25 +++++++++++--
 docs/comparing-metagenomes.md       | 54 +++++++++++++++++++++++------
 docs/index.md                       |  9 ++++-
 docs/single-metagenomes-taxonomy.md | 15 ++++++--
 4 files changed, 88 insertions(+), 15 deletions(-)

diff --git a/docs/amr.md b/docs/amr.md
index a805019..a9e563d 100644
--- a/docs/amr.md
+++ b/docs/amr.md
@@ -63,8 +63,9 @@ And, finally, run AMRfinder on the proteins:
 ```
 amrfinder -p CD136.assembly.faa -t 16 -o CD136.amrfinder.tsv --plus
 ```
+(This will take under a minute.)
 
-This will produce a spreadsheet named `CD136.amrfinder.tsv` that
+AMRfinder will produce a spreadsheet named `CD136.amrfinder.tsv` that
 contains a number of columns - you can see the list like so, using
 `csvtk headers`:
 
@@ -79,5 +80,25 @@ Run:
 csvtk -t cut -f "% Coverage of reference sequence","HMM description" CD136.amrfinder.tsv 
 ```
 
-<!-- @CTB say something output the files.  -->
+and you will see:
+```
+% Coverage of reference sequence        HMM description
+89.41   CfxA family broad-spectrum class A beta-lactamase
+87.59   23S ribosomal RNA methyltransferase Erm
+52.84   NA
+100.00  macrolide efflux MFS transporter Mef(En2)
+100.00  lincosamide nucleotidyltransferase Lnu(AN2)
+100.00  CepA family extended-spectrum class A beta-lactamase
+```
+
+The first column here is the percentage of the known (reference) sequence
+that is present in the metagenome, and the second is the description of
+the match.
+
+Note: If you wanted to get the abundance of these genes in the metagenome,
+you would have to find the DNA contig that the relevant gene was on,
+using the column "Protein identifier", and then map the metagenome
+reads to it to get the abundance. This is because assembly collapses
+the abundance information for the output contigs, and you have to recover
+it through other means.
 
diff --git a/docs/comparing-metagenomes.md b/docs/comparing-metagenomes.md
index 42f3665..c9c010a 100644
--- a/docs/comparing-metagenomes.md
+++ b/docs/comparing-metagenomes.md
@@ -1,8 +1,17 @@
 # Comparing metagenomes
 
+This tutorial uses [sourmash](https://sourmash.readthedocs.io/) to do
+comparisons of multiple metagenomes based on weighted and unweighted
+k-mer content.
+
+In this tutorial, you will learn how to create distance matrices and
+ordination plots from metagenome content. Importantly, this tutorial
+is *reference* and *annotation* free - it will work equally well on
+any metagenome.
+
 ## First, create a conda software environment and a working directory.
 
-To install software, run:
+To install the necessary software, run:
 ```
 mamba create -n smash -y sourmash scikit-learn
 conda activate smash
@@ -14,14 +23,12 @@ mkdir ~/compare-metag
 cd ~/compare-metag
 ```
 
-
 ## Comparing based on content
 
-<!-- * reference free, annotation free @CTB -->
-
 Here we are going to use the
+[`sourmash compare`](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-compare-compare-many-signatures) and
 [`sourmash plot`](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-plot-cluster-and-visualize-comparisons-of-many-signatures)
-command to compare and cluster many metagenomes based on their content - not their annotation or assemblies.
+commands to compare and cluster many metagenomes based on their content.
 
 As with the [single metagenome analysis](single-metagenomes-taxonomy.md), we have two options here: with, or without abundance information.
 
@@ -114,19 +121,46 @@ If you plot this via MDS, you'll see a clear separation:
 Points to discuss:
 
 * what does this all mean, in ~microbial terms? Hint: ask Mani to
-  revist how the test data sets were generated!
+  revisit how the test data sets were generated! Alternatively,
+  go on to the next section!
+  
+## Extra: examining taxonomy
 
-<!--
+If we quickly run our [taxonomy analysis](single-metagenomes-taxonomy.md) on
+one of the other samples, we can start to see some of the reasons why the
+samples differ in diversity but not in richness:
 
-## Comparing based on taxonomy
+```
+mamba activate tax
 
+sourmash scripts fastgather ../data/tutorial_other/CD240.sig.zip \
+    ../databases/gtdb-rs214-k31.zip -o CD240.x.gtdb-rs214.fastgather.csv -c 16
 
+sourmash gather ../data/tutorial_other/CD240.sig.zip \
+    ../databases/gtdb-rs214-k31.zip -o CD240.x.gtdb-rs214.gather.csv \
+    --picklist CD240.x.gtdb-rs214.fastgather.csv:match_name:ident
+    
+sourmash tax metagenome -g CD240.x.gtdb-rs214.gather.csv \
+    -t ../single-metag/gtdb-rs214.lineages.sqldb -F human
 ```
-mamba create -y -n workshop-r r-base r-tidyverse r-vegan r-ape r-rcolorbrewer
 
+You should see:
 ```
+sample name    proportion   cANI   lineage
+-----------    ----------   ----   -------
+CD240             42.2%     94.0%  d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides uniformis
+CD240             19.5%     94.5%  d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides fragilis
+CD240             12.6%     94.1%  d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Tannerellaceae;g__Parabacteroides;s__Parabacteroides distasonis
+CD240             11.7%     91.2%  d__Bacteria;p__Bacillota_A;c__Clostridia;o__Oscillospirales;f__Acutalibacteraceae;g__Ruminococcus_E;s__Ruminococcus_E bromii_B
+CD240             11.4%     -      unclassified
+CD240              2.6%     91.4%  d__Bacteria;p__Bacillota_A;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__Faecalibacterium;s__Faecalibacterium prausnitzii_D
+```
+
+That's right - both samples have similar species, but the abundances of those
+species are quite different.
 
--->
+Note that in this case that's not an accident: the dataset was created
+specifically to contain only five species ;).
 
 ---
 
diff --git a/docs/index.md b/docs/index.md
index 446721c..9981467 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,6 +1,7 @@
 # Introduction
 
-<!-- @CTB stuff about workshop -->
+These are tutorials for the PIG-PARADIGM workshop on metagenomics,
+Apr 29th, 2024, given at Wageningen.
 
 Tutorials:
 
@@ -12,3 +13,9 @@ Tutorials:
 
 Data originally from
 [the MIntO tutorial data](https://zenodo.org/records/6369313).
+
+## More information
+
+Authors: Anneliek ter Horst and C. Titus Brown
+
+See the GitHub repo at [ngs-docs/2024-pig-paradigm-workshop](https://github.com/ngs-docs/2024-pig-paradigm-workshop).
diff --git a/docs/single-metagenomes-taxonomy.md b/docs/single-metagenomes-taxonomy.md
index 32e6e9c..61170e3 100644
--- a/docs/single-metagenomes-taxonomy.md
+++ b/docs/single-metagenomes-taxonomy.md
@@ -1,5 +1,18 @@
 # Analyzing a single metagenome for taxonomy
 
+This tutorial uses [sourmash](https://sourmash.readthedocs.io/) to do
+various k-mer based analyses of Illumina shotgun metagenome content.
+
+In this tutorial, you will learn:
+
+* how to look at what genomes share content with a metagenome;
+* how to examine the abundance of metagenome content without a reference;
+* how to summarize the taxonomic content of a metagenome.
+
+We will be using the taxonomic classification system as benchmarked in
+[Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0),
+which is both very *sensitive* and quite *specific*.
+
 ## Creating a working directory
 
 Run:
@@ -90,8 +103,6 @@ Points to discuss:
   content is present in the reference database.  Some of this is
   probably erroneous data or host contamination.
   
-<!-- @CTB details: discuss weighted/unweighted more? and... what's in a metagenome, anyway? -->
-
 ### K-mer abundance histogram
 
 Let's examine this data set further. First, let's take a look at the