Open source tumor amplicon pipeline, i.e. an alternative bioinformatic pipeline for AmpliconDS that works for any AmpliconDS library given a proper manifest file.
This program is designed to run through a NextSeq or MiSeq run directory looking for FASTQ files located in ${current directory or specified directory}/Data/Intensities/BaseCalls/
Note: MiSeq folders start with the structure YYMMDD_machinename_NNNN where machinename starts with an 'M', ex: 140729_M01382_0050_000000000-AAE8K
NextSeq folders start with the structure YYMMDD_machinename_NNNN where machinename starts with an 'N', ex: 140729_N01382_0050_000000000-AAE8K
AltAmpDs expects YYMMDD_machinename with machinename starting with 'M' or 'N'. The code must be modified, or the folders renamed, if it is run on HiSeq output.
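The naming rule above can be illustrated with a small shell function. This is only a sketch of the rule as described in this README, not the check that runAltPipeline.sh itself performs; the function name detect_platform is hypothetical.

```shell
# Classify a run folder by its name (hypothetical helper, not part of the pipeline).
# Prints MiSeq or NextSeq; returns 1 if the name does not match YYMMDD_M... or YYMMDD_N...
detect_platform() {
    local name
    name=$(basename "$1")
    case "$name" in
        [0-9][0-9][0-9][0-9][0-9][0-9]_M*) echo "MiSeq" ;;
        [0-9][0-9][0-9][0-9][0-9][0-9]_N*) echo "NextSeq" ;;
        *) echo "unrecognized run folder name: $name" >&2; return 1 ;;
    esac
}

detect_platform "140729_M01382_0050_000000000-AAE8K"   # prints MiSeq
detect_platform "140729_N01382_0050_000000000-AAE8K"   # prints NextSeq
```

A HiSeq folder name would fall through to the last branch, which matches the note that such folders need renaming or a code change.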
The main script file to run is runAltPipeline.sh.
bash /<OTA-pipeline directory>/runAltPipeline.sh -h #to get help and see the different parameters
bash /<OTA-pipeline directory>/runAltPipeline.sh -s /<OTA-pipeline directory>/trusight_tumor_pipeline.sh > output_alt_pipeline_run.txt 2>&1&
nohup bash /<OTA-pipeline directory>/runAltPipeline.sh -debugging true -validation true > output_alt_pipeline_run.txt 2>&1&
It is highly recommended to create a shell alias to make running the pipeline easier:
cd ~
vim ./.bashrc
In the .bashrc file, under the "# User specific aliases and functions" section, add the following (modify as appropriate for your machine):
alias runAltPipeline='nohup bash /<OTA-pipeline directory>/runAltPipeline.sh > output_alt_pipeline_run.txt 2>&1&'
alias debugRunAltPipeline='nohup bash /<OTA-pipeline directory>/runAltPipeline.sh -debugging true -validation true > output_alt_pipeline_run.txt 2>&1&'
alias validationRunAltPipeline='nohup bash /<OTA-pipeline directory>/runAltPipeline.sh -validation true > output_alt_pipeline_run.txt 2>&1&'
where runAltPipeline is the default. debugRunAltPipeline and validationRunAltPipeline do not remove temporary files, and debugRunAltPipeline has fewer restrictions on region depth (for use when testing the pipeline with very small artificial FASTQs).
Note: when running the above commands the user needs to be in the top directory of a NextSeq or MiSeq folder, as a home_dir was not specified.
PIPELINE_DIR: this variable needs to be set in the ~/.bashrc file
cd ~
vim ./.bashrc
###add this under the alias section, modify to point to the main directory folder of this repository
PIPELINE_DIR=/home/ec2-user/ampDsTs;export PIPELINE_DIR
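Before launching a run it can help to confirm that the variable is actually visible in the current shell. The following is a minimal sketch, assuming that runAltPipeline.sh sits directly inside PIPELINE_DIR (as the example commands in this README suggest); check_pipeline_dir is a hypothetical helper, not part of the repository.

```shell
# Hypothetical sanity check for the PIPELINE_DIR setting (not part of the pipeline).
# Assumes runAltPipeline.sh lives directly inside $PIPELINE_DIR.
check_pipeline_dir() {
    if [ -z "${PIPELINE_DIR:-}" ]; then
        echo "PIPELINE_DIR is not set; export it in ~/.bashrc and open a new shell" >&2
        return 1
    fi
    if [ ! -f "$PIPELINE_DIR/runAltPipeline.sh" ]; then
        echo "runAltPipeline.sh not found in $PIPELINE_DIR" >&2
        return 1
    fi
    echo "PIPELINE_DIR looks usable: $PIPELINE_DIR"
}

# Example: only start the pipeline when the check passes.
# check_pipeline_dir && nohup bash "$PIPELINE_DIR/runAltPipeline.sh" > output_alt_pipeline_run.txt 2>&1 &
```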
THREADS - number of threads to use when calling functions that support a multi-threaded workflow (default 25)
this parameter can be changed by specifying the -threads parameter when calling runAltPipeline.sh
MEMORY - integer: the amount of memory for the Java virtual machine to use (default 16)
active_case_limit - integer: number of cases to process at one time (default 8)
nohup bash $PIPELINE_DIR/runAltPipeline.sh -threads 25 -memory 16 -active_case_limit 8 > output_alt_pipeline_run.txt 2>&1&
There is a script called download_dependencies.sh that will help you download all of these programs if you are running on Ubuntu; similar code for Red Hat is commented out, and the comment markers can be removed if necessary. Please note that this script will NOT download GATK and Annovar, as those programs have license agreements. The script downloads all dependencies into the directory it currently resides in. To launch it, type in the terminal:
bash download_dependencies.sh
bash - variables need to be set in the ~/.bashrc file, and some of the code uses bash syntax, so make sure bash is installed on your Linux distribution
To install, proceed to install in the order below
git
sudo yum install git #Red-hat
sudo apt-get install git #ubuntu
zip
sudo yum install unzip #Red-hat
sudo apt-get install unzip #ubuntu
java
sudo yum install java-1.8.0-openjdk-devel #Red-hat
sudo apt-get install openjdk-8-jdk #Ubuntu
wget
sudo yum install wget #Red-hat
sudo apt-get install wget #Ubuntu
gcc
sudo yum install gcc #red-hat
sudo apt-get install gcc #ubuntu
python-devel
sudo yum install python-devel #Red-hat
sudo apt-get install python-dev #Ubuntu
zlib
sudo yum install zlib-devel #Red-hat
sudo apt-get install zlib1g-dev #ubuntu
g++
sudo yum install gcc-c++ #red-hat
sudo apt-get install g++ #ubuntu
curses
sudo apt-get install libncurses5-dev libncursesw5-dev #ubuntu
sudo yum install ncurses-devel ncurses #red-hat
download the git repository
git clone https://github.com/schneiderthomas/AltAmpDs
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm #red-hat
sudo yum install epel-release-latest-7.noarch.rpm #red-hat
sudo yum install python-pip #red-hat
sudo apt-get install python-pip #ubuntu
sudo pip install biopython==1.66
sudo pip install pysam==0.8.4
sudo pip install pyvcf==0.6.7
sudo pip install pandas==0.16.2
sudo pip install regex==2015.3.18
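As an alternative to typing the five pip commands one by one, the same pinned versions can be kept in a requirements file. This is a sketch only; the repository is not known to ship such a file, and the helper name write_requirements and the file name requirements.txt are assumptions.

```shell
# Hypothetical helper: write the pinned Python packages listed above to a file.
write_requirements() {
    cat > "$1" <<'EOF'
biopython==1.66
pysam==0.8.4
pyvcf==0.6.7
pandas==0.16.2
regex==2015.3.18
EOF
}

# Example (file name is an assumption):
# write_requirements requirements.txt && sudo pip install -r requirements.txt
```

Keeping the pins in one file makes it easier to reproduce the same environment on a second machine.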
zenity - to display dialog boxes from the shell script (to let the tech know that processing is done)
sudo yum install zenity #red-hat
sudo apt-get install zenity #ubuntu
xterm
sudo yum install xterm #red-hat
sudo yum install xorg-x11-xauth.x86_64 xorg-x11-server-utils.x86_64 dbus-x11.x86_64 #red-hat
sudo apt-get install xterm xorg dbus #ubuntu
bcl2fastq (v2.17) to convert files from bcl to fastq files
#optional, already provided as zip
#Red-hat
wget 'ftp://webdata2:webdata2@ussd-ftp.illumina.com/downloads/software/bcl2fastq/bcl2fastq2-v2.17.1.14-Linux-x86_64.zip'
unzip bcl2fastq2-v2.17.1.14-Linux-x86_64.zip
sudo yum localinstall bcl2fastq2-v2.17.1.14-Linux-x86_64.rpm
#if Ubuntu
sudo apt-get install alien dpkg-dev debhelper build-essential #needed for Ubuntu
sudo alien bcl2fastq2-v2.17.1.14-Linux-x86_64.rpm
sudo dpkg -i bcl2fastq2-v2.17.1.14-Linux-x86_64.deb
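After working through the installs above, a quick check of which commands are on the PATH can catch a skipped step. The sketch below uses only the shell built-in `command -v`; missing_commands is a hypothetical helper, and the command names in the example are assumed to be the ones the packages above provide.

```shell
# Hypothetical helper: report which of the given commands are not on the PATH.
# Prints one "missing: NAME" line per absent command and returns 1 if any is absent.
missing_commands() {
    local cmd status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            status=1
        fi
    done
    return "$status"
}

# Example (command names are assumptions based on the packages above):
# missing_commands git unzip java wget gcc g++ pip zenity xterm bcl2fastq
```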
These programs need to be downloaded and/or compiled, and their resulting directories need to be placed in this directory. A more recent version may be used, but there may be some compatibility issues with the pipeline as it is.
GATK 3.5 (VERY IMPORTANT AT LEAST 3.5)
annovar - 2014-11-12
freebayes v0.9.20
bcftools-1.2
FastQC v0.11.3
htslib-1.2.1
IGVTools 2.3.57
picard 2.10
samtools_1.2
snpeff 4.1g 2015-05-17
varscan v2.3.9
bwa 0.7.10
vcflib v.1.0.0
CoverageQC - for debugging
bedtools2 -> Version 2.26.0
Trimmomatic 0.33
Please note that annovar and GATK have license agreements that must be accepted before you download them, and therefore they cannot be downloaded using the above script. Instructions to download these programs are below:
#annovar Please download annovar. Version 2014-11-12 was used originally and is therefore the preferred version to ensure compatibility. To download Annovar click here. After downloading annovar, place the folder entitled "annovar" in the current directory. Note: the original splicing threshold for annovar is 2; this can be changed by opening the file table_annovar.pl and modifying the line
$sc = "annotate_variation.pl -geneanno -buildver $buildver -dbtype $protocol -hgvs -outfile $tempfile.$protocol -exonsort $queryfile $dbloc";
to
$sc = "annotate_variation.pl -geneanno -buildver $buildver -dbtype $protocol -splicing_threshold 5 -hgvs -outfile $tempfile.$protocol -exonsort $queryfile $dbloc";
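The same edit can be applied with sed instead of by hand. This is a sketch, assuming the line in your copy of table_annovar.pl matches the one quoted above exactly; set_splicing_threshold is a hypothetical helper, and a backup of the file is kept with a .bak suffix.

```shell
# Hypothetical helper: insert -splicing_threshold N after "-dbtype $protocol"
# in table_annovar.pl. Keeps a backup copy as FILE.bak.
set_splicing_threshold() {
    local file=$1 threshold=$2
    # Only accept a plain non-negative integer, since the value is spliced into the sed script.
    case "$threshold" in
        ''|*[!0-9]*) echo "threshold must be an integer" >&2; return 1 ;;
    esac
    # \$ matches the literal dollar sign of the Perl variable $protocol.
    sed -i.bak 's/-dbtype \$protocol -hgvs/-dbtype $protocol -splicing_threshold '"$threshold"' -hgvs/' "$file"
}

# Example (path is an assumption):
# set_splicing_threshold annovar/table_annovar.pl 5
```

Because the pattern requires "-hgvs" to follow "-dbtype $protocol" directly, running the function a second time leaves an already modified line unchanged.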
#GATK Version 3.5 is being used for this pipeline; get the latest software here. If you download a version higher than 3.5, line 34 in amplicon_ds_pipeline.sh needs to be changed as appropriate.
In this repository there is a folder called ART; in it you will find shell scripts that can be used to create artificial FASTQ files similar to an AmpliconDS run. ART version ChocolateCherryCake-03-19-2015 was used in these scripts.
Download the latest ART program here.
Will be downloaded if you use the download_dependencies.sh script.
download_dependencies.sh will install clinvar, cosmic, exac, snp and 1000g in the annovar directory; see download_dependencies.sh for details.
- the shell script that runs through the current directory (unless another is given) and feeds files to the pipeline shell script (the location can be specified with the -s option; the default parameters are at the top of the shell script and can be changed if one moves the directory)
- the directory has a folder structure
BaseDirectory -> Data -> Intensities -> BaseCalls
will exit if this is not seen
- there needs to be an even number of FASTQ files (not including the Undetermined FASTQ files) because there will always be either two FASTQ files (or 8 for a NextSeq folder with no lane splitting) in the AmpliconDS pipeline; the script will exit if it does not see this
- if no FASTQ files are present then there need to be tiffs in the BaseDirectory -> Images folder or bcl files in BaseFolder -> Data -> Intensities -> BaseCalls -> L001 & L002 & L003 & L004 so bcl2fastq can turn the images or bcl files into FASTQ files
- the BaseFolder name has to start like 160518_N or 160518_M, where the first part is a number followed by an underscore and either the letter N or the letter M (this tells the script whether it is dealing with a NextSeq or MiSeq folder); if the name does not start like this the script should exit with an error
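The directory and FASTQ-count requirements listed above can be checked by hand before starting a run. The function below is a rough sketch of those two checks, not the logic in runAltPipeline.sh itself; the name preflight_run_dir, the '*.fastq*' file pattern, and the return codes are assumptions made for illustration.

```shell
# Hypothetical pre-flight check for a run folder (not part of the pipeline).
# Returns 0 when BaseCalls exists and holds an even, non-zero number of FASTQ files,
# 2 when no FASTQ files are present (bcl2fastq conversion would be needed), 1 otherwise.
preflight_run_dir() {
    local run_dir=${1:-$PWD}
    local basecalls="$run_dir/Data/Intensities/BaseCalls"
    local count
    if [ ! -d "$basecalls" ]; then
        echo "missing directory: $basecalls" >&2
        return 1
    fi
    # Count FASTQ files directly in BaseCalls, ignoring the Undetermined ones.
    count=$(find "$basecalls" -maxdepth 1 -type f -name '*.fastq*' ! -name 'Undetermined*' | wc -l | tr -d ' ')
    if [ "$count" -eq 0 ]; then
        echo "no FASTQ files found; tiff or bcl files are needed for conversion" >&2
        return 2
    fi
    if [ $((count % 2)) -ne 0 ]; then
        echo "odd number of FASTQ files: $count" >&2
        return 1
    fi
    echo "$count FASTQ files found"
}

# Example: run from the top directory of a NextSeq or MiSeq folder.
# preflight_run_dir
```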