Nyvac096Microbiome
Jacob A. Cram jcram@umces.edu cramjaco@gmail.com
This directory accompanies the manuscript "The human gut microbiota associated with baseline immune status and response to HIV vaccines". As of this writing, the manuscript has been accepted by PLOS ONE but is not yet published. This directory contains materials to run both the upstream processing of the microbiome data (demultiplexing with QIIME, sequence variant assignment with DADA2, phylogenetic tree with phangorn, SV taxonomic assignment with DADA2) and the downstream analysis (statistics in a notebook file in R).
The downstream analysis can be run without re-doing the upstream portion. We default to using the files generated by the upstream analysis from our initial run, for consistency between runs. The results appear to vary somewhat between upstream runs.
Notes on R
I have not had success with all of the subsequent dependencies when using conda to install R.
https://unix.stackexchange.com/questions/149451/install-r-in-my-own-directory
When I tested these scripts I built R 3.6.1 from source on a clean virtual box containing Ubuntu 18.04.
My R build required the following packages, which I installed with apt.
build-essential fort77 xorg-dev liblzma-dev libblas-dev gfortran gcc-multilib gobjc++ libreadline-dev libbz2-dev libcurl4-openssl-dev texlive-fonts-extra texinfo default-jdk libssl-dev libxml2-dev t1-xfree86-nonfree ttf-xfree86-nonfree ttf-xfree86-nonfree-syriac xfonts-75dpi xfonts-100dpi libcairo2-dev
wget http://cran.rstudio.com/src/base/R-3/R-3.6.1.tar.gz
tar xvf R-3.6.1.tar.gz
cd R-3.6.1
./configure --prefix=$HOME/R
make && make install
And of course, add R to path. I did this by adding
export PATH=$PATH:$HOME/R-3.6.1/bin
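to my .bashrc file and rebooting. Note that this path points at the build tree, which assumes the tarball was unpacked in $HOME; with the --prefix used above, make install also places a copy of R in $HOME/R/bin.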
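After restarting, it is worth confirming that the shell picks up the R you just built. The following is a minimal sketch and is not part of this repository: version_ge and check_r_version are hypothetical helper names, the 3.5.0 default reflects the breakaway requirement mentioned at the end of this README, and sort -V assumes GNU coreutils (as on Ubuntu 18.04).

```shell
# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Check that the R on PATH is at least the required version (default 3.5.0).
check_r_version() {
    required="${1:-3.5.0}"
    if ! command -v R >/dev/null 2>&1; then
        echo "R was not found on PATH" >&2
        return 1
    fi
    # The first line of `R --version` looks like: R version 3.6.1 (2019-07-05) ...
    found="$(R --version | head -n 1 | awk '{print $3}')"
    if version_ge "$found" "$required"; then
        echo "R $found at $(command -v R) is new enough (>= $required)"
    else
        echo "R $found at $(command -v R) is older than $required" >&2
        return 1
    fi
}
```

Calling check_r_version with no arguments uses the 3.5.0 default; pass a different version string (for example check_r_version 3.6.1) to be stricter.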
No longer requires jupyter notebook. Required for python scripts in the upstream analysis.
I have used both local RStudio and RStudio Server for this. Most recently I used RStudio Server 1.2.1335.
- To run the demultiplexing, you will need to install qiime1. I recommend using Anaconda to set up the following environment:
conda create -n qiime1 numpy=1.10 python=2.7 qiime matplotlib=1.4.3 mock nose -c bioconda
I ran the upstream analysis in fall 2017, and all work was done on R version 3.4.1. I have not re-run the upstream analysis since then.
### RStudio or RStudio Server
Some dependencies were required on my system. I have root access and so used sudo apt install. If you are doing this on a cluster, you may need to install many of these locally or get your system administrator to do it for you.
To make the igraph R package able to run, you need to modify your Anaconda directory slightly, as per this GitHub issue: igraph/rigraph#275 (comment)
To do this, navigate in the terminal to your anaconda directory. In my case this is done with
cd ~/anaconda3
and then deactivate all local copies of libgfortran.so.4.0.0
find . -name "libgfortran.so.4.0.0" -execdir mv {} {}_off ';'
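Because this rename changes files inside your Anaconda installation, it can help to preview what will be touched and to have a way back. The following is a minimal sketch, not part of this repository: toggle_libgfortran is a hypothetical helper name, and it uses -exec with full paths rather than the -execdir form shown above, which is a substitution on my part and should give the same result.

```shell
# Usage: toggle_libgfortran DIR [list|off|on]
#   list  print every libgfortran.so.4.0.0 under DIR (changes nothing)
#   off   rename each one to libgfortran.so.4.0.0_off
#   on    rename each *_off copy back to its original name
toggle_libgfortran() {
    root="$1"
    mode="${2:-list}"
    case "$mode" in
        list)
            find "$root" -name 'libgfortran.so.4.0.0' -print
            ;;
        off)
            find "$root" -name 'libgfortran.so.4.0.0' \
                -exec sh -c 'mv "$1" "$1_off"' _ {} ';'
            ;;
        on)
            find "$root" -name 'libgfortran.so.4.0.0_off' \
                -exec sh -c 'mv "$1" "${1%_off}"' _ {} ';'
            ;;
        *)
            echo "usage: toggle_libgfortran DIR [list|off|on]" >&2
            return 2
            ;;
    esac
}
```

For example, toggle_libgfortran ~/anaconda3 list shows the affected files, toggle_libgfortran ~/anaconda3 off performs the rename, and toggle_libgfortran ~/anaconda3 on undoes it.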
Now you are ready to install R packages. I've set up packrat to do this for you. In theory, all you have to do is navigate to the project directory
cd ~/Nyvac_096_Microbiome
And then run R
from the terminal.
The packrat library should bootstrap itself and then install all of the necessary R packages.
If that doesn't happen, try running
install.packages('packrat')
and then restoring from snapshot
packrat::restore()
There are some packages that I don't load with library; rather, I call their functions by specifying the package name, e.g. rsample::bootstraps(). These need to be installed manually, unless packrat starts tracking them. You may need to run the following:
install.packages(c('rsample'))
I'm still looking for other packages of this kind.
Note - I had been trying to use conda to install R packages, but did not have success.
Activate IRkernel from within R to connect it to jupyter notebook.
Jupyter notebook must be installed, and the system restarted, before this command will work:
IRkernel::installspec()
Upstream analysis is not necessary to redo the downstream analysis.
The order for this analysis is:
- demultiplex
- call SVs
- make tree
- generate taxonomic information.
These scripts can be run in order by calling, from the scripts/ directory,
all_upstream.sh
On systems running slurm (such as Fred Hutch's rhino cluster), you can call sbatch scripts/upstream.sbatch
in order to request a 16-node cluster. This should take about 8 hours to run. The slow step is remaking the phylogenetic tree. If I were doing this again from scratch, I'd probably use RAxML.
all_upstream.sh just calls other scripts. The individual pieces can be run as follows:
- To demultiplex, run
sh scripts/demultBothPlates.sh
- The next three scripts must be run inside of the nyvac-lab-2 environment
source attach nyvac-lab-2
- To remake the dada2 sequence variants, run
Rscript dada2work-March2018Run.R
. One can also open the R script and run it in any R interpreter. (This is true of all of the subsequent R steps.) Doing so makes debugging substantially easier.
- To make the phylogenetic tree, run
Rscript makeTree.R
- To generate taxonomic information, first acquire the necessary training data by running
sh pull_training.sh
. Then run
Rscript dada2taxonomy-March2018Run.R
.
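The steps above depend on each other, so if you run them by hand it is safest to stop at the first failure. The following is a minimal sketch of such a driver, not the actual contents of all_upstream.sh: run_in_order and run_upstream are hypothetical names, the commands are copied as written in the list above (their working directories are an assumption), and the sketch does not attach the nyvac-lab-2 environment for you.

```shell
# Run each command line in order and stop at the first one that fails.
run_in_order() {
    for step in "$@"; do
        echo ">>> $step"
        sh -c "$step" || { echo "FAILED: $step" >&2; return 1; }
    done
}

# The upstream steps in the order given in this README.
# Assumption: the nyvac-lab-2 environment is already attached for the R steps.
run_upstream() {
    run_in_order \
        "sh scripts/demultBothPlates.sh" \
        "Rscript dada2work-March2018Run.R" \
        "Rscript makeTree.R" \
        "sh pull_training.sh" \
        "Rscript dada2taxonomy-March2018Run.R"
}
```

Calling run_upstream starts the whole chain; run_in_order can also be given any subset of the commands to resume from a later step.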
The downstream analysis can be run independently of the upstream analysis. It defaults to using data from the data/ directory. In theory, all one should need to do is open the Mar2018_096.ipynb file in jupyter notebook or jupyter lab and run all of the cells.
If you want to run it on re-analyzed data, comment out the line upOriginal <- TRUE
and uncomment the line #upOriginal <- FALSE
.
If you want the script to run faster, set jnperm <- 9999
. The cost of this is that the p-values are not calculated as precisely. If you want p-values that don't fluctuate from run to run, set jnperm <- 99999
and maybe go get lunch or similar while the file is running.
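To get a feel for this trade-off, the sketch below computes two standard quantities for a permutation test. It is illustrative only and makes assumptions about the notebook: that jnperm is the number of permutations, and that p-values follow the common (b + 1) / (N + 1) convention, where b is the number of permuted statistics at least as extreme as the observed one. min_perm_p and perm_p_se are hypothetical helper names.

```shell
# Smallest p-value that N permutations can report: 1 / (N + 1).
min_perm_p() {
    awk -v n="$1" 'BEGIN { printf "%.6f\n", 1 / (n + 1) }'
}

# Approximate Monte Carlo standard error of an estimated p-value p
# based on N permutations: sqrt(p * (1 - p) / N).
perm_p_se() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.5f\n", sqrt(p * (1 - p) / n) }'
}

min_perm_p 9999     # prints 0.000100
min_perm_p 99999    # prints 0.000010
```

Comparing perm_p_se 0.05 9999 with perm_p_se 0.05 99999 shows how much the run-to-run fluctuation of a p-value near 0.05 shrinks with ten times as many permutations.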
In order to use the breakaway package by adw96, which I need to calculate richness (and confidence intervals, and to run appropriate statistics), I need an R version > 3.5. This branch is for the newest version, 3.6.1.
This change led to new bugs, now resolved, and took care of some old bugs. I have rewritten the readme to accommodate these things.