Update contents

bioinfodlsu · Jan 18, 2025 · 7b8a29b
1 parent 8754fe2
commit 7b8a29b
Showing 14 changed files with 463 additions and 749 deletions.
diff --git a/antibiotic-resistance.md b/antibiotic-resistance.md
@@ -1 +1 @@
-# ARG
+Under construction
diff --git a/data-analysis-overview.md b/data-analysis-overview.md
@@ -1,13 +1,22 @@
 # Data analysis overview
 The raw input data is in the form of short sequences of length around 100 bp.
-Prior to the actual data analyis, it is generally advised to clean and trim the reads, but these are rather mechanical, fairly easy to do, and not really interesting. We will skip ahead to the more interesting parts of the analysi workflow.
+Prior to the actual data analysis, it is generally advised to clean and trim the reads, but these steps are rather mechanical, fairly easy to do, and not really interesting. We will skip ahead to the more interesting parts of the analysis workflow.
 
 The workflow, of course,  depends on the research objectives. Some of the typical questions that researchers try to answer are:
 
-## Assembly-free 
+1. What are the different species of bacteria that are present in the sample?
+2. In what relative abundance are they present?
+3. Are there differences in the bacterial community profiles among sites?
+4. What kinds of antibiotic resistance genes are present? In what proportion?
+5. What kind of mobile genetic elements are present? In what proportion?
+
 
+## Assembly-based approach
+One approach is to first **assemble** the reads, which means to stitch the reads together to reconstruct the chromosome which was fragmented during the sequencing process.
+This is a challenging task and is computationally demanding. 
+We will not delve into this approach in this workshop.
+
+
+## Assembly-free
+What we will explore is the assembly-free approach, in which the reads are compared against reference databases to answer the questions posed above. In particular, we will look at Questions 1 to 3.
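
To make "comparing reads against a reference database" concrete, here is a minimal toy sketch. It assumes a made-up two-sequence "database" and a simple shared k-mer vote; the names `species_A`, `species_B`, `kmers`, and `classify_read` are hypothetical and are not taken from the workshop material or from any particular tool.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_read(read, reference_kmers, k):
    """Assign a read to the reference sharing the most k-mers with it.

    Returns None when the read shares no k-mer with any reference.
    """
    read_kmers = kmers(read, k)
    best_label, best_hits = None, 0
    for label, ref_set in reference_kmers.items():
        hits = len(read_kmers & ref_set)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label


# Hypothetical toy "database": two made-up reference sequences.
references = {
    "species_A": "ACGTACGTTAGCCGATAGCTAGGCTA",
    "species_B": "TTGACCATGGCATTACGGATCCAGTA",
}
K = 8  # k-mer length, chosen arbitrarily for this toy example
reference_kmers = {label: kmers(seq, K) for label, seq in references.items()}

# Toy reads: one cut from each reference, plus one that matches neither.
reads = [
    references["species_A"][6:19],
    references["species_B"][5:18],
    "GGGGGGGGGGGGG",
]

# Tally how many reads were assigned to each reference (None = unassigned).
counts = Counter(classify_read(read, reference_kmers, K) for read in reads)
print(counts)
```

The tally in `counts` is the raw material for Questions 1 and 2: which references were hit, and how often. Real databases and reads are far larger, so this only illustrates the idea.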
 
-## Assembly-based
-$$
-x
-$$
diff --git a/data-generation.md b/data-generation.md
@@ -1,11 +1,27 @@
-# Data generation process
+# Metagenomics and the data generation process
+
+## What is metagenomics?
+For the purposes of this workshop, we define metagenomics as the application of high-throughput sequencing to DNA extracted directly from environmental, uncultured samples. 
+For example, later we will look at samples of the microbial communities found in hospital wastewater.
 
 ## From samples to sequences
+The figure below shows how we go from environmental samples, in this case hospital wastewater, to sequence data.
+
+![metagenomics-sample-to-seq](imgs/metagenomics-sample-seq-slim-jpg.jpeg)
+
+Metagenomic sampling produces a lot of sequence data, which is the starting point of bioinformatics analysis.
+
 
 ## How the statistician sees it
 Metagenomics data is compositional data.
-Most analysis will be based on relative abundance.
+
+There is meaning only in the relative abundances observed in the sample.
+
+![compositional-data](imgs/compositional.jpeg)
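
A small sketch of what "compositional" means in practice: the counts below are invented for illustration, and `relative_abundance` is a hypothetical helper name. Two samples that differ only in sequencing depth end up with the same relative abundances, so the totals themselves carry no information.

```python
def relative_abundance(counts):
    """Convert raw per-taxon read counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalise a sample with zero reads")
    return {taxon: n / total for taxon, n in counts.items()}


# Invented counts: the "deep" sample is the same community sequenced 10x deeper.
shallow = {"taxon_A": 50, "taxon_B": 30, "taxon_C": 20}
deep = {taxon: 10 * n for taxon, n in shallow.items()}

print(relative_abundance(shallow))
print(relative_abundance(deep))
```

Both calls describe the same composition even though the raw counts differ tenfold.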
 
 ## Discussion
 Are you planning to use metagenomics for your study?
-What is the experiment design?
+What is the experiment design?
+
+## Further reading
+- Gloor et al., [Microbiome Datasets Are Compositional: And This Is Not Optional](https://doi.org/10.3389/fmicb.2017.02224), Front. Microbiol., 2017.
diff --git a/data-sneak-peek.ipynb b/data-sneak-peek.ipynb
@@ -5,16 +5,16 @@
    "id": "ae154a92-a5a7-4ff7-a98c-f610e30b7939",
    "metadata": {},
    "source": [
-    "# Data sneak peek\n",
+    "# Data\n",
     "\n",
-    "For hands-on activity, we will make use of downsampled dataset obtained from the following study:\n",
+    "For the hands-on activity, we will make use of a part of the dataset generated by the following study:\n",
     "\n",
     "Metagenomic Analysis of the Abundance and Composition of Antibiotic Resistance Genes in Hospital Wastewater in Benin, Burkina Faso, and Finland. Markkanen MA, Haukka K, Pärnänen KMM, Dougnon VT, Bonkoungou IJO, Garba Z, Tinto H, Sarekoski A, Karkman A, Kantele A, Virta MPJ. mSphere 8(1): e0053822 (2023 Feb)\n",
     "\n",
     "The full dataset is available here: https://www.ebi.ac.uk/ena/browser/view/PRJEB47975 . \n",
     "\n",
-    "For this workshop, we will use a subset of the dataset and downsample each read set.\n",
-    "## Select \n",
+    "For this workshop, we will use only the following 8 (heavily downsampled) samples.\n",
+    "\n",
     "| Country |  Sample ID | ENA Accession ID |\n",
     "| ------- | ---------- | ---------------- | \n",
     "| Finland | FH1 | ERR7015395 | \n",
@@ -24,18 +24,131 @@
     "| Benin   | BH1 | ERR7015311 |\n",
     "| Benin   | BH2 | ERR7015312 |\n",
     "| Benin | BH3 | ERR7015313 |\n",
-    "| Benin | BH4 | ERR7015315 |"
+    "| Benin | BH4 | ERR7015315 |\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ad88a8b8-cf74-4093-ac89-52085c7a3697",
+   "metadata": {},
+   "source": [
+    "# Data Sneak Peek\n",
+    "If you followed the instructions in the Section [Getting Started](getting-started.md) correctly, you should be able to see the files by doing:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "1afe8e35-8739-49d9-87eb-5aad588c2355",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "BH1_1.fastq.gz\tBH3_1.fastq.gz\tFH1_1.fastq.gz\tFH3_1.fastq.gz\tmetadata.txt\n",
+      "BH1_2.fastq.gz\tBH3_2.fastq.gz\tFH1_2.fastq.gz\tFH3_2.fastq.gz\n",
+      "BH2_1.fastq.gz\tBH4_1.fastq.gz\tFH2_1.fastq.gz\tFH4_1.fastq.gz\n",
+      "BH2_2.fastq.gz\tBH4_2.fastq.gz\tFH2_2.fastq.gz\tFH4_2.fastq.gz\n"
+     ]
+    }
+   ],
+   "source": [
+    "! ls data/metagenome_samples"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "95f194d8-7969-4efb-a3c5-4ff0aab21768",
+   "metadata": {},
+   "source": [
+    "The sequence data is in `fastq` format, and for each sample, there is a pair of fastq files, e.g. `BH1_1.fastq.gz` and `BH1_2.fastq.gz`. The pair comes from the fact that when a long DNA molecule is sequenced, it is first fragmented into pieces, and each piece is read from both ends. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe6102d0-12e9-4c2f-accb-7d64f7f0c8a6",
+   "metadata": {},
+   "source": [
+    "You can have a look inside one of the files by opening a terminal (click the blue + button at the top left, then click Terminal) and running the following command:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "29d15bf9-8721-4fbc-a141-6e98160b58f7",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "@ERR7015311.1 A00464:196:HFFTTDSXY:2:1101:10004:15687/1\n",
+      "GGCCATGTCGGCGCGCTCGGCGGGGAACTCGCGGATGTCGGCGCCGGCGGCGAAGTTGCCTCCCTCGCCGCGCACGATCACGCAGCGCAGCGCATTGTCGGCCGCCAGCTGGTCGAACACCGCGCGCAGCTCGCCCCACATGCCCACGGTG\n",
+      "+\n",
+      "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF\n",
+      "@ERR7015311.2 A00464:196:HFFTTDSXY:2:1101:10004:16063/1\n",
+      "GCCAGCATCGCGAGCTCGGCTTCGACCACGGCGATGTCGGCCTTGTCCTTGGCGTTGCGCAGCGCCTCTTCAGCGCGCTGGCGCGCTTCCAGGGCACGGGCCTCGTCCAGGTCGGCGGCACGGATGGCCGTGTCGGCCAGGACCGTGACGC\n",
+      "+\n",
+      "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF\n",
+      "@ERR7015311.3 A00464:196:HFFTTDSXY:2:1101:10004:18505/1\n",
+      "ACGCGGTGTTCGAAGGCGCGGTCATCGCCGTGGCGGCGCATCACCTTTATCCCGCCGTGCCGTACTGGCTGGTGGCGCTGGCGGTGGTGATTTACAGCGTGCTGCTGATCTTCGGCAGCGTGCAGCGCTGGTTGGACAAGTTCAACGGCGT\n",
+      "\n",
+      "gzip: stdout: Broken pipe\n"
+     ]
+    }
+   ],
+   "source": [
+    "! zcat data/metagenome_samples/BH1_1.fastq.gz | head "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4a978700-8081-4897-99dd-71192ab2606e",
+   "metadata": {},
+   "source": [
+    "A block of four lines corresponds to one sequence entry, often called a `read`. The first line, starting with @, contains the read ID. The next line contains the actual sequence; in the example above, it is about 150 nucleotides long. The third line is a separator starting with +. The fourth line gives a quality score for the base called at each position of the read. \n",
+    "The number of lines in one file is:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "d80c652d-a933-417e-bcff-d27530e85cc5",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "4000000\n"
+     ]
+    }
+   ],
+   "source": [
+    "! zcat data/metagenome_samples/BH1_1.fastq.gz | wc -l"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "80c4431c-d848-4f05-bbe7-39346e72f20a",
+   "id": "9035da8b-dbed-4df1-9b38-e25ee06c86d8",
    "metadata": {},
    "source": [
-    "You can obtain the dataset by doing a wget. \n",
+    "That means there are 1 million reads in one file, since each read takes four lines.\n",
+    "\n",
+    "The original read set for this sample has 54.9 million reads, as can be seen here: https://www.ebi.ac.uk/ena/browser/view/ERR7015311 .  \n",
     "\n",
-    "For the purposes of this workshop, we will  downsample each set to 1 million paired reads."
+    "As you can see, we have drastically downsampled for this workshop."
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a3e00bfc-b911-4b94-8cec-f241e7dc7411",
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {
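
The four-line record layout described in the notebook can also be read from Python. The following is a minimal sketch, not part of the workshop notebooks: `read_fastq`, `phred_scores`, and `count_reads` are hypothetical helper names, and the quality decoding assumes the Phred+33 encoding, which the notebook does not state.

```python
import gzip
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a text handle of 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # third line: separator starting with '+'
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into integer scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


def count_reads(path):
    """Count the reads in a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in read_fastq(handle))


# Self-contained demo on an in-memory record, so no data file is needed.
demo = io.StringIO("@demo_read/1\nACGT\n+\nFF:,\n")
for read_id, sequence, quality in read_fastq(demo):
    print(read_id, len(sequence), phred_scores(quality))

# With the workshop data in place, one could call, for example:
# count_reads("data/metagenome_samples/BH1_1.fastq.gz")
```

Counting records this way should agree with dividing the `wc -l` line count by four.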

diff --git a/getting-started.md b/getting-started.md
@@ -2,18 +2,11 @@
 
 To get started, you will need to go through the following steps.
 
-1. Download the Jupyter notebooks prepard for this workshop.
-2. Download the dataset for this workshop.
-3. Install Docker.
-4. Pull the Docker image prepared for this workshop.
-5. Spin up a Docker container.
-
-
 ## 1. Download Jupyter notebooks
-Jupyter notebooks needed for this workshop are provided in the GitHub repository: https://github.com/bioinfodlsu/metagenomics-workshop. Click the (green) Code button, then download zip, then unzip, and place the `metagenomics-workshop-main` folder to your preferred location. 
+Jupyter notebooks needed for this workshop are provided in the GitHub repository: https://github.com/bioinfodlsu/metagenomics-workshop. Click the (green) Code button and choose to download the zip file. Unzip it and place the `metagenomics-workshop-main` folder at your preferred location. 
 
 ## 2. Download dataset 
-A dataset containing pre-downloaded databases and some metagenomic samples can be downloaded from here. 
+A dataset containing pre-downloaded databases and some metagenomic samples can be downloaded from [here](https://drive.google.com/drive/folders/1tznvhlFp6oowjlWsESU6kSLEhT8G41kv?usp=share_link). 
 Download, unzip, and place the `data` folder inside `metagenomics-workshop-main` folder.
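
Before moving on, it can help to confirm that the `data` folder ended up in the right place. The sketch below assumes the file names shown by `ls data/metagenome_samples` in the data notebook and is meant to be run from the top level of `metagenomics-workshop-main`; `expected_files` and `missing_files` are hypothetical helper names, not part of the workshop material.

```python
from pathlib import Path

# Sample IDs as listed in the data notebook (4 from Benin, 4 from Finland).
SAMPLES = ["BH1", "BH2", "BH3", "BH4", "FH1", "FH2", "FH3", "FH4"]


def expected_files():
    """File names expected in data/metagenome_samples: paired FASTQ files plus metadata."""
    files = [f"{sample}_{mate}.fastq.gz" for sample in SAMPLES for mate in (1, 2)]
    files.append("metadata.txt")
    return files


def missing_files(workshop_root):
    """Return the expected file names that are not present under workshop_root."""
    sample_dir = Path(workshop_root) / "data" / "metagenome_samples"
    return [name for name in expected_files() if not (sample_dir / name).is_file()]


# Check the current directory, assumed to be metagenomics-workshop-main.
missing = missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected sample files are in place.")
```

If anything is reported missing, re-check that the unzipped folder is named `data` and sits directly inside `metagenomics-workshop-main`.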
 
 ## 3. Install Docker
@@ -58,14 +51,15 @@ In the Command Prompt/Terminal, execute the following to download/pull the Docke
 
 `docker pull ghcr.io/bioinfodlsu/metagenomics-workshop/deploy:latest`
 
-### 5. Spin up a Docker container
+## 5. Spin up a Docker container
 In the Command Prompt/Terminal, launch a Docker container by executing the following, replacing `path_to_metagenomics-workshop-main` with the actual path on your system. If you are at the top-level of the `metagenomics-workshop-main` folder, the path is simply a dot `.`.
 
 ```
 docker run -it --rm -p 8888:8888 -v path_to_metagenomics-workshop-main:/home/jovyan/work ghcr.io/bioinfodlsu/metagenomics-workshop/deploy:latest
 ```
 Once the container is running, a link to the Jupyter Lab interface, including the authentication token (e.g., http://127.0.0.1:8888/?token=your_token), will appear in the terminal or command prompt after starting the container. Copy and paste this link into your browser.
 
+A JupyterLab interface should appear. The left pane is the file browser. If you click on the folder `work`, you should see the raw version of this page as well as the other notebooks and pages.
 <!---
 Here’s a breakdown of the directories being mounted:
 - **Data**: Your local path `path_to_your_data_directory` containing the data will be mounted to  `/home/jovyan/data` in the container. The data should be downloaded from the data folder located in this Drive Link (https://drive.google.com/drive/folders/1pfcwepIvSYmJ_wBp668jbVYR8nekrSF3?usp=sharing). Ensure that the required data files are placed in this directory before running the container.

diff --git a/going-beyond.md b/going-beyond.md
@@ -4,4 +4,3 @@ Under Construction.
 ## Assembly-based analysis workflows
 
 ## Study design considerations
-Perhaps this should have been discussed the first
diff --git a/imgs/.DS_Store b/imgs/.DS_Store
diff --git a/intro.md b/intro.md
@@ -4,7 +4,14 @@
 Metagenomics is the application of high-throughput sequencing to DNA extracted directly from environmental, uncultured samples. 
 With sequencing becoming more accessible to labs in the Philippines, we are witnessing increasing use of metagenomics to study microbial communities in environmental, ecological, agricultural, epidemiological, and clinical settings.
 The power of metagenomics, however, comes with the challenging task of handling and analyzing large volumes of sequence data.
-This one-day, hands-on workshop will use the case of shotgun metagenomic samples of hospital waste-water to demonstrate bioinformatics workflows for typical tasks such as identifying microbial composition and diversity, making statistical comparisons between samples, detecting presence and diversity of antibiotic resistance genes, etc.
+This hands-on workshop will use the case of shotgun metagenomic samples of hospital waste-water to demonstrate bioinformatics workflows for typical tasks such as identifying microbial composition and diversity, making statistical comparisons between samples, detecting presence and diversity of antibiotic resistance genes, etc.
+
+## Instructor
+Anish M.S. Shrestha, Ph.D. \
+Head, Bioinformatics Lab, AdRIC, De La Salle University Manila.
+
+with contributions to the learning material from:
+Daphne Go and Paul Yu.
 
 ## Target Participants
 This workshop would be most beneficial to life scientists and early-career bioinformaticians who have or plan to use metagenomics-based approaches in their research. Prior exposure to Unix command line and R would be helpful. 
@@ -27,26 +34,16 @@ The workshop is composed of the following modules:
 
 ## Learning material
 The content is organized as a Jupyter Book, which is a collection of Jupyter Notebooks.
-A Jupyter Notebook is a shareable document that can contain computer code.
-Click the hamburger menu on the top-left to access each notebook, which you will be run on the server set up for this workshop. 
-Alternatively, you can also run them locally on your machine. See here for instructions for installing Jupyter Notebook on your system.
+Go to [Getting Started](getting-started.md) for instructions on how to access a copy of the material, data, and software environment.
 
 
 ## Some resources for preparation
 - Survey/review articles on metagenomics :
     - Quince et al.: Shotgun metagenomics, from sampling to analysis, Nature Biotechnology, 35:9 (2017)
 - A basic tutorial on R can be found [here](https://github.com/bioinfodlsu/basic-r-tutorial).
-- Guide to use Jupyter notebook locally (optional)
+- [Jupyter Lab Guide](https://jupyterlab.readthedocs.io/en/latest/)
 
-## Instructor
-Anish M.S. Shrestha, Ph.D. \
-Head, Bioinformatics Lab, AdRIC, De La Salle University Manila.
-
-with contributions to the learning material from:
 
-Daphne Go \
-Paul Yu \
-Jiaan Santos 
 
 
 ```{tableofcontents}