A series of scripts to automate DNA sequence aligning and quality control workflows via list-based batch submission and parallel processing
For greater detail about everything, please see the wiki for this repository
sequence_handling is a series of scripts to automate and speed up DNA sequence aligning and quality control through the use of our workflow outlined here. This repository contains two general kinds of scripts: Shell Scripts and Batch Submission Scripts, with one exception.
The former group is designed to be run directly from the command line. These serve as partial dependency installers, a way to generate a list for batch submission, QSub starters, and others that have issues with either running in parallel or using the Portable Batch System due to memory issues. Running any of these scripts without any arguments generates a usage message for more details. Each script is named entirely in lower-case letters.
The latter group is designed to run the workflow in batch and in parallel. These scripts use a list of sequences, with full sequence paths, as their input and utilize GNU Parallel to speed up the analysis and work they are designed for. Due to the length of time and resources needed for these scripts to run, they are designed to be submitted to a job scheduler, specifically the Portable Batch System. Each script is named using capital and lower-case letters.
Finally, there is one script that is neither designed to run directly from the shell nor submitted to a job scheduler. This script, plot_cov.R, is designed to be called by Plot_Coverage.sh for creating coverage plots. This is done automatically; one does not need to change this script unless they wish to change the graphing parameters.
NOTE: the latter group of scripts and read_mapping_start.sh are designed to use the Portable Batch System and run on the Minnesota Supercomputing Institute. Heavy modifications will need to be made if not using these systems.
Piping one sample alone through this workflow can take over 12 hours to completely run. Most sequence handling jobs are not dealing with one sample, so the amount of time to run this workflow increases drastically. List-based batch submission reduces the amount of typing that one has to do, and enables parallel processing to decrease time spent waiting for samples to finish. An example list is shown below:
/home/path_to_sample/sample_001_R1.fastq.gz
/home/path_to_sample/sample_001_R2.fastq.gz
/home/path_to_sample/sample_003_R1.fastq.gz
/home/path_to_sample/sample_003_R2.fastq.gz
Parallel processing decreases the amount of time by running multiple jobs at once and keeping track of which are done, which are running, and which have yet to be run. This workflow, with the list-based batch submissions and parallel processing, both simplifies and quickens the process of sequence handling.
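The list-driven pattern described above can be sketched roughly as follows. This is a minimal illustration and not code from this repository: the temporary list file is hypothetical, `basename` stands in for a real per-sample step, and `xargs -P` is used as a stand-in for GNU Parallel so the sketch runs on systems where Parallel is not installed.

```shell
#!/bin/bash
# Write a hypothetical sample list; the paths are placeholders, not real data.
workdir=$(mktemp -d)
sample_list="$workdir/samples.txt"
cat > "$sample_list" <<'EOF'
/home/path_to_sample/sample_001_R1.fastq.gz
/home/path_to_sample/sample_001_R2.fastq.gz
/home/path_to_sample/sample_003_R1.fastq.gz
/home/path_to_sample/sample_003_R2.fastq.gz
EOF

# Run one job per list entry, up to 4 at a time.
# "basename" is a placeholder for a real step such as trimming or FastQC.
# With GNU Parallel, the equivalent would be roughly:
#   parallel --jobs 4 basename {} < "$sample_list"
results=$(xargs -P 4 -n 1 basename < "$sample_list" | sort)
echo "$results"
```

Because jobs finish in no fixed order, the output is sorted before use.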
No, with the one exception of Plot_Coverage.sh and plot_cov.R, no two scripts are entirely dependent on one another. While all these scripts are designed to easily use the output from one to the next, these scripts are not required to achieve the end result of sequence_handling. If you prefer tools other than the ones used within this workflow, you can modify or replace any or all of the scripts offered in sequence_handling. This creates a pseudo-modularity for the entire workflow that allows for customization for each and every user.
Due to the pseudo-modularity of this workflow, specific dependencies for each individual script are listed below. Some general dependencies for the workflow as a whole are listed here:
- A quality trimmer, such as Seqqs, Sickle, and Scythe
- Tools for plotting results, such as R
- SAM file processing utilities, such as SAMTools and Picard
- A quality control mechanism, such as FastQC
- A read mapper, such as The Burrows-Wheeler Aligner (BWA)
- GNU Parallel
Please note that this is not a complete list of dependencies. Check below for specific dependencies for each desired script.
When running these scripts on the Minnesota Supercomputing Institute's (MSI) resources, most dependencies are included through MSI's module system. These modules are set to be automatically called by each script that calls upon them. However, some dependencies are not available through MSI; please check each script for which dependencies need to be installed separately.
NOTE: Running any of these scripts without arguments generates a usage message for greater detail about how to use them
The installer.sh script installs Seqqs, Sickle, and Scythe for use with the Quality_Trimming.sh script. It also has options for installing Bioawk, SAMTools, and R, all dependencies for various scripts within this package.
The installer.sh script depends on Git, Wget, the GNU Compiler Collection (GCC), and GNU Make to run.
The sample_list_generator.sh script creates a list of samples using a directory tree for its searching. This will find all samples in a given directory and its subdirectories. Only use this if you are using all samples within a directory tree. sample_list_generator.sh is designed to be run from the command line directly.
The sample_list_generator.sh script has no external dependencies.
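The directory-tree search described above can be approximated with `find`. The sketch below is an assumption about the general approach, not the actual contents of sample_list_generator.sh; the directory layout, file names, and the `.fastq.gz` extension are all hypothetical.

```shell
#!/bin/bash
# Build a throwaway directory tree with hypothetical sample files.
project=$(mktemp -d)
mkdir -p "$project/run1" "$project/run2/lane1"
touch "$project/run1/sample_001_R1.fastq.gz" \
      "$project/run2/lane1/sample_003_R1.fastq.gz" \
      "$project/run1/notes.txt"

# Collect every file with the chosen extension, at any depth, as a full path.
extension=".fastq.gz"   # assumed extension; adjust for your data
find "$project" -type f -name "*${extension}" | sort > "$project/sample_list.txt"
cat "$project/sample_list.txt"
```

Note that every matching file under the tree ends up in the list, which is why this approach only suits projects where all samples in the tree are wanted.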
The read_counts.sh script calls Bioawk to get accurate counts for read number for a list of samples. Output is written to a tab-delimited file with sample name drawn from the file name for the list of samples.
The read_counts.sh script depends on Bioawk to run.
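To illustrate the kind of tab-delimited output described above, here is a minimal sketch that counts reads for each file in a list. It is not the repository's implementation: it uses plain `awk` and assumes strict four-line FASTQ records, whereas read_counts.sh relies on Bioawk, which parses records properly. The demo file and its contents are made up.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Create a tiny, made-up FASTQ file containing two reads, and a list pointing to it.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' \
    | gzip > "$workdir/sample_001_R1.fastq.gz"
echo "$workdir/sample_001_R1.fastq.gz" > "$workdir/samples.txt"

# Write one tab-delimited line per sample: sample name, read count.
# Stand-in for Bioawk, whose usual form for this is:
#   bioawk -c fastx 'END { print NR }' file.fastq.gz
while read -r fastq; do
    name=$(basename "$fastq" .fastq.gz)
    reads=$(gzip -cd "$fastq" | awk 'END { print NR / 4 }')
    printf '%s\t%s\n' "$name" "$reads"
done < "$workdir/samples.txt" > "$workdir/read_counts.txt"

cat "$workdir/read_counts.txt"
```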
The read_mapping_start.sh script generates a series of QSub submissions for use with the Portable Batch System on MSI's resources. It starts a series of BWA sessions to map reads back to a reference genome.
The read_mapping_start.sh script depends on the Portable Batch System and BWA to run.
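As a rough picture of what generating one submission per sample can look like, the dry run below pairs forward and reverse reads and prints a `qsub` command for each pair instead of submitting it. Everything specific here is an assumption for illustration: the reference and output paths, the `_R1`/`_R2` naming, the use of `bwa mem`, and the resource request are not taken from read_mapping_start.sh.

```shell
#!/bin/bash
# Dry run: print the submission commands rather than calling qsub.
reference="/home/path_to_reference/reference.fasta"   # hypothetical path
outdir="/home/path_to_output"                          # hypothetical path
samples="/home/path_to_sample/sample_001_R1.fastq.gz
/home/path_to_sample/sample_001_R2.fastq.gz
/home/path_to_sample/sample_003_R1.fastq.gz
/home/path_to_sample/sample_003_R2.fastq.gz"

# For each forward read, derive the matching reverse read and sample name,
# then print a command that pipes a BWA job into qsub.
submissions=$(echo "$samples" | grep '_R1\.fastq\.gz$' | while read -r forward; do
    reverse="${forward%_R1.fastq.gz}_R2.fastq.gz"
    name=$(basename "$forward" _R1.fastq.gz)
    # Threads, nodes, and walltime are placeholder values.
    echo "echo \"bwa mem -t 8 $reference $forward $reverse > $outdir/$name.sam\" | qsub -l nodes=1:ppn=8,walltime=12:00:00"
done)

echo "$submissions"
```

Printing the commands first makes it easy to check the pairing before anything reaches the scheduler.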
NOTE: Each of these scripts contains usage information within the script itself. Furthermore, all values for these scripts are hard-coded into the script itself. Please open each script using your favourite text editor (e.g., Vim, Sublime Text, Visual Studio Code, etc.) to read usage information and set values.
The Assess_Quality.sh script runs FastQC on the command line on a series of samples organized in a project directory for quality control. In addition, a list of all output zip files will be generated for use with the Read_Depths.sh script. Our recommendation is using this both before and after quality trimming and before read mapping. This script is designed to be run using the Portable Batch System.
The Assess_Quality.sh script depends on FastQC, the Portable Batch System, and GNU Parallel to run.
The Read_Depths.sh script utilizes the output from FastQC to calculate the read depths for a batch of samples and outputs them into one convenient text file.
The Read_Depths.sh script depends on the Portable Batch System and GNU Parallel to run.
The Quality_Trimming.sh script runs trim_autoplot.sh (part of the Seqqs repository on GitHub) on a series of samples organized in a project directory. In addition to requiring Seqqs to be installed, this also requires GNU Parallel to be installed on the system.
The Quality_Trimming.sh script depends on Sickle, Scythe, Seqqs, R, the Portable Batch System, and GNU Parallel to run.
The SAM_Processing_SAMTools.sh script converts the SAM files from read mapping with BWA to the BAM format using SAMTools. In the conversion process, it will sort and deduplicate the data for the finished BAM file, also using SAMTools. Alignment statistics will also be generated for both raw and finished BAM files. A list of finished BAM files will be generated at the end of this script.
The SAM_Processing_SAMTools.sh script depends on SAMTools, the Portable Batch System, and GNU Parallel to run.
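The steps described above (convert, sort, deduplicate, and collect statistics for the raw and finished files) might map onto SAMTools commands along these lines. This sketch only prints the commands for one hypothetical SAM file; the file names are invented, the command syntax assumes SAMTools 1.x, and the exact commands and options in SAM_Processing_SAMTools.sh may differ.

```shell
#!/bin/bash
# Print (do not run) a plausible sequence of SAMTools steps for one sample.
sam="/home/path_to_sample/sample_001.sam"   # hypothetical input file
name=$(basename "$sam" .sam)

plan=$(cat <<EOF
samtools view -b -o ${name}_raw.bam $sam
samtools flagstat ${name}_raw.bam > ${name}_raw_stats.txt
samtools sort -o ${name}_sorted.bam ${name}_raw.bam
samtools rmdup ${name}_sorted.bam ${name}_finished.bam
samtools flagstat ${name}_finished.bam > ${name}_finished_stats.txt
EOF
)

echo "$plan"
```

Sorting comes before deduplication because `samtools rmdup` expects coordinate-sorted input.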
The SAM_Processing_Picard.sh script converts the SAM files from read mapping with BWA to the BAM format using SAMTools. In the conversion process, it will sort and deduplicate the data for the finished BAM file, using Picard. Alignment statistics will also be generated for both raw and finished BAM files. A list of finished BAM files will be generated at the end of this script.
NOTE: This script is extremely resource intensive; please use with caution.
NOTE: This script has not been tested; use with caution.
The SAM_Processing_Picard.sh script depends on SAMTools, Picard, the Portable Batch System, and GNU Parallel to run.
The Coverage_Map.sh script generates coverage maps from BAM files using BEDTools. This map is in text format and is used for making coverage plots. In addition to generating coverage maps, this script will create a list of all the coverage maps generated for use in other scripts.
The Coverage_Map.sh script depends on BEDTools, the Portable Batch System, and GNU Parallel to run.
The Plot_Coverage.sh script creates plots using R based off of coverage maps. It will generate three plots: one showing coverage across the genome, one showing coverage across exons, and one showing coverage across genes. This script uses plot_cov.R to generate the plots.
The Plot_Coverage.sh script depends on the plot_cov.R script, R, the Portable Batch System, and GNU Parallel to run.
The plot_cov.R script is the graphical brains behind the Plot_Coverage.sh script. The latter will automatically call upon the former to create the coverage plots based off coverage maps. It is not necessary to open this script directly, except for making modifications to the graphical parameters.
The plot_cov.R script has no external dependencies.
- Generalize read_counts.sh for any project. DONE!
- Add better list-out methods. DONE!
- Fix memory issues with Read_Mapping.sh
- Redesign read mapping scripts. DONE!
- Add coverage map script to workflow
- Finish integrating Coverage_Map.sh with the rest of the pipeline. DONE!
- Get Plot_Coverage.sh and plot_cov.R integrated into the pipeline. DONE!
- Add information about plot_cov.R to the README. DONE!
- Add script to easily convert SAM files from Read_Mapping.sh to BAM files for Coverage_Map.sh. DONE!ish... DONE!
- Add Deduplication script
- Get Deduplication.sh / SAM_Processing_Picard.sh working
- Add read mapping statistics via samtools flagstat. DONE! This is integrated into SAM_Processing_SAMTools.sh
- Incorporate variant calling scripts into the pipeline
- keep README updated