PandemicPreventionInstitute/SHED

Sequencing Hub for Environmental Data

This repository holds the bioinformatic pipeline used by the Sequencing Hub for Environmental Data (SHED).

Installation instructions

Installing the required programs on a PPI machine takes several steps. Pipeline-specific installation instructions are in the readme.md in the backend/ directory; this readme.md covers the prerequisites for installation on a PPI machine.

The pipeline is designed to work with Conda and Mamba (a derivative of Conda). Both need to be installed.

The pipeline will not work properly unless it runs in a Mamba environment rather than a plain Conda environment. Start by creating a clean environment with nothing installed (i.e., no default packages):

conda create --name myenv --no-default-packages
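
Since both Conda and Mamba must be installed before the environment is created, a quick pre-flight check can catch a missing tool early. The following is a minimal sketch; require_tools is a hypothetical helper name, not part of the pipeline.

```shell
# Hypothetical helper: return 0 only if every named command is found on PATH.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing required tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use (commented out so that sourcing this block has no side effects):
# require_tools conda mamba && conda create --name myenv --no-default-packages
```

The function reports every missing tool rather than stopping at the first one, so a single run shows everything that still needs installing.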

Then activate this environment and, in your home directory, run the following command to download a compressed copy of the SRA toolkit:

curl -L --output sratoolkit.tar.gz http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-mac64.tar.gz

Note that this command is similar to the tar fetch suggested on the SRA install instructions website, but the -L flag is added so that curl follows redirects (-L is curl's redirect-following flag; -R only preserves the remote file's timestamp). Without it, the downloaded file is not a gzip-compressed archive of the required files but an empty file, which causes the subsequent steps to fail.
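
Because a failed download can leave behind an empty file, it may help to verify the archive before extracting it. The following is a minimal sketch; is_valid_gzip is a hypothetical helper name, and it checks only gzip integrity, not the archive's contents.

```shell
# Hypothetical helper: succeed only if the file exists, is non-empty,
# and passes gzip's built-in integrity test.
is_valid_gzip() {
    archive="$1"
    if [ ! -s "$archive" ]; then
        echo "$archive is missing or empty" >&2
        return 1
    fi
    if ! gzip -t "$archive" 2>/dev/null; then
        echo "$archive is not a valid gzip file" >&2
        return 1
    fi
    return 0
}

# Example use (commented out so that sourcing this block has no side effects):
# is_valid_gzip sratoolkit.tar.gz && tar -vxzf sratoolkit.tar.gz
```

The check also rejects a small HTML redirect page saved under the archive's name, which is another way a download can go wrong.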

You can also download a prebuilt binary from the SRA GitHub.

Next, as the SRA install instructions suggest, extract the archive:

tar -vxzf sratoolkit.tar.gz

Then append the toolkit's bin directory to your PATH:

export PATH=$PATH:$PWD/sratoolkit.3.0.0-mac64/bin

BE SURE that this path points correctly to the unpacked SRA toolkit. The unpacked directory name includes the version that was downloaded, so adjust sratoolkit.3.0.0-mac64 if yours differs. The SRA toolkit version should be at least 3.0.0 to ensure that the required functionality is available. Package managers such as Homebrew and Anaconda/Bioconda currently install versions below 3.0, which is why a manual install is necessary.
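
The 3.0.0 minimum can be checked with a small version comparison. The following is a minimal sketch; version_ge is a hypothetical helper name, and it assumes purely numeric dotted versions with at most three fields. The commented example also assumes that fastq-dump --version prints a line ending in the version number, which should be confirmed against your installed toolkit.

```shell
# Hypothetical helper: succeed if dotted version $1 is >= dotted version $2.
# Assumes numeric fields only; missing fields are treated as 0.
version_ge() {
    have="$1"
    want="$2"
    old_ifs="$IFS"
    IFS=.
    # Unquoted expansion splits each version on dots into positional parameters.
    set -- $have
    h1="${1:-0}"; h2="${2:-0}"; h3="${3:-0}"
    set -- $want
    w1="${1:-0}"; w2="${2:-0}"; w3="${3:-0}"
    IFS="$old_ifs"
    # Compare field by field, most significant first.
    [ "$h1" -gt "$w1" ] && return 0
    [ "$h1" -lt "$w1" ] && return 1
    [ "$h2" -gt "$w2" ] && return 0
    [ "$h2" -lt "$w2" ] && return 1
    [ "$h3" -ge "$w3" ]
}

# Example use (commented out; the awk parsing is an assumption about the output format):
# installed=$(fastq-dump --version | awk '/:/ {print $NF}')
# version_ge "$installed" 3.0.0 || echo "SRA toolkit is older than 3.0.0" >&2
```

Comparing field by field as integers avoids the string-comparison trap in which 10.0.0 would sort before 3.0.0.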

Running this export command works until the shell is closed. To make the export happen automatically whenever the environment is activated, add it to the Conda environment's activation scripts as follows:

  • Locate the directory for the conda environment by running echo $CONDA_PREFIX in the activated environment.
  • Move to that directory (cd $CONDA_PREFIX) and create the following folders and files:
mkdir -p ./etc/conda/activate.d
mkdir -p ./etc/conda/deactivate.d
touch ./etc/conda/activate.d/env_vars.sh
touch ./etc/conda/deactivate.d/env_vars.sh
  • Edit ./etc/conda/activate.d/env_vars.sh so that it looks like the following:
#!/bin/sh

export PATH="$PATH:/Users/zsusswein/sratoolkit.3.0.0-mac64/bin"

Replace zsusswein in the path with your own username.
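
The activation script can also be generated rather than edited by hand, which avoids hard-coding a username. The following is a minimal sketch; write_activate_hook is a hypothetical helper name, and the toolkit path in the commented example is an assumption that should match wherever you unpacked the toolkit.

```shell
# Hypothetical helper: write an activate.d script inside the given environment
# prefix that appends the given toolkit bin directory to PATH.
write_activate_hook() {
    env_prefix="$1"    # e.g. "$CONDA_PREFIX"
    toolkit_bin="$2"   # e.g. "$HOME/sratoolkit.3.0.0-mac64/bin"
    mkdir -p "$env_prefix/etc/conda/activate.d" || return 1
    # \$PATH stays literal in the generated file; $toolkit_bin is expanded now.
    cat > "$env_prefix/etc/conda/activate.d/env_vars.sh" <<EOF
#!/bin/sh

export PATH="\$PATH:$toolkit_bin"
EOF
}

# Example use (commented out so that sourcing this block has no side effects):
# write_activate_hook "$CONDA_PREFIX" "$HOME/sratoolkit.3.0.0-mac64/bin"
```

The generated file has the same contents as the hand-edited version, with the toolkit path filled in from the second argument.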

Although some of the required packages install automatically when running the pipeline in backend/ (e.g., fastx_toolkit), the SRA toolkit does not. Everything else is installed automatically, at the appropriate version, on the pipeline's first run once the appropriate Mamba environment is set up. See backend/readme.md for more specific setup information.

Developer settings

If you're going to be working on the pipeline, please set up pre-commit hooks. Install the pre-commit package manager, then install the pre-commit hooks in the SHED/ directory after cloning a local copy of the repository. As usual, please work on a branch off of dev and open a PR with any changes.
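
The branch workflow might look like the following sketch. start_feature_branch is a hypothetical helper name; it assumes it is run inside the cloned repository, that a local dev branch exists, and that pre-commit, if installed (for example with pip install pre-commit), is on PATH.

```shell
# Hypothetical helper: create a working branch off dev and install the git hooks.
start_feature_branch() {
    branch_name="$1"
    # Switch to dev first so that the new branch starts from it.
    git checkout dev && git checkout -b "$branch_name" || return 1
    # Install the hooks only if the pre-commit tool is available.
    if command -v pre-commit >/dev/null 2>&1; then
        pre-commit install
    else
        echo "pre-commit not found; install it before committing" >&2
    fi
}

# Example use (commented out so that sourcing this block has no side effects):
# cd SHED && start_feature_branch my-feature
```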