Python package to run a Sequence Quality Control pipeline, based on workflows defined by IGMM.
- FastQC
- Cutadapt
- Python 3.x
- Pyyaml
We recommend using the conda package manager and making use of virtual environments. This tool is also available in the bioconda channel, which has the benefit of automatically installing all pre-requisites along with the tool.
There are two main ways to install the package.
$ conda create -n my_env -c bioconda python=3
This creates a clean Python 3 environment in which to install and run the tool. If you already have a conda environment you wish to use, make sure the bioconda channel is added to that environment, or to your conda configuration as a whole.
$ conda install bioexcel_seqqc
This one line installs BioExcel_SeqQC and all of its dependencies.
If you wish to install manually, follow the steps below. We still recommend using some kind of virtual environment. Before running the workflow, install the pre-requisite tools and ensure they are on your $PATH.
$ git clone https://github.com/bioexcel/BioExcel_SeqQC.git
$ cd BioExcel_SeqQC
$ python setup.py install
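For a manual install, it can help to confirm that the pre-requisite tools really are on your $PATH before running the workflow. The following is a minimal sketch, not part of the package: `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and the executable names `fastqc` and `cutadapt` are assumed to match your installation.

```python
import shutil

# Executable names are assumptions: "fastqc" and "cutadapt" are the usual
# command names for these tools, but your installation may differ.
REQUIRED_TOOLS = ["fastqc", "cutadapt"]


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on $PATH: " + ", ".join(missing))
    else:
        print("All pre-requisite tools were found on $PATH.")
```

Run it inside the same environment you intend to use for the workflow, since $PATH can differ between environments.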
Once installed, there are several ways to use the tool. The easiest is to call the executable script, which runs the whole workflow based on several options and arguments the user can modify. Find these using:
$ bxcl_seqqc -h
An example of basic usage of the pipeline is:
$ bxcl_seqqc --files in1.fa in2.fa --threads 4 --outdir ./output
The tool runs an automated set of checks based on output from FastQC. The default decisions reflect our partner's preferences, but they can be changed. First, output an example configuration file (which contains the default values):
$ bxcl_seqqc --printconfig
The file lists the summary outputs from FastQC and the decision to make for each: whether the files should be trimmed or rechecked, taking into account whether they have already been trimmed automatically.
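To illustrate the kind of decision this configuration drives, here is a minimal sketch. It is not the tool's implementation or its configuration format. It assumes FastQC's `summary.txt` layout of one tab-separated `STATUS<TAB>module<TAB>filename` line per check; the `RULES` mapping, the action names, and the functions `parse_summary` and `decide` are all hypothetical.

```python
# Hypothetical rules: which (FastQC module, status) pairs call for trimming.
# Illustrative only; the real defaults are in the file written by
# --printconfig.
RULES = {
    ("Adapter Content", "FAIL"),
    ("Adapter Content", "WARN"),
    ("Per base sequence quality", "FAIL"),
}


def parse_summary(text):
    """Parse FastQC summary.txt content into {module_name: status}."""
    results = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        status, module, _filename = line.split("\t", 2)
        results[module] = status
    return results


def decide(results, already_trimmed):
    """Return 'continue', 'trim', or 'stop' for one input file."""
    flagged = [m for m, s in results.items() if (m, s) in RULES]
    if not flagged:
        return "continue"
    # If the file was already trimmed and problems remain, trimming again
    # is unlikely to help, so hand the decision back to the user.
    return "stop" if already_trimmed else "trim"
```

Passing `already_trimmed` explicitly mirrors the point above that the decision depends on whether trimming has already happened, which prevents a trim-and-recheck loop from running indefinitely.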
In addition to the executable version, the tool is installed as a Python package, so each stage can be imported as a module into other scripts if the user wishes to build more specialised, complex, or extended workflows. Each function creates and returns a Python subprocess.
import bioexcel_seqqc
import bioexcel_seqqc.runfastqc as rfq
import bioexcel_seqqc.runtrim as rt
# infiles, fqcdir, tmpdir, trimdir and threads must be defined by the calling script
# Do things before running FastQC
fqc_process = rfq.run_fqc(infiles, fqcdir, tmpdir, threads)
fqc_process.wait()
# Do things after FastQC, and before trimming low quality reads
trim_process = rt.trimQC(infiles, trimdir, threads)
trim_process.wait()
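Because each stage function returns a subprocess, a calling script will usually want to check how that process exited before moving on to the next stage. The sketch below assumes the returned object behaves like a standard `subprocess.Popen`. `wait_or_fail` is a hypothetical helper, and a short Python child process stands in for a real stage so the example runs without FastQC or Cutadapt installed.

```python
import subprocess
import sys


def wait_or_fail(process, stage_name):
    """Block until `process` exits; raise if its exit code is non-zero."""
    returncode = process.wait()
    if returncode != 0:
        raise RuntimeError(f"{stage_name} exited with code {returncode}")
    return returncode


if __name__ == "__main__":
    # Stand-in for a stage such as rfq.run_fqc(...): any Popen object works.
    stand_in = subprocess.Popen([sys.executable, "-c", "print('stage done')"])
    wait_or_fail(stand_in, "stand-in stage")
```

Raising on a non-zero exit code stops a custom workflow at the failing stage instead of letting later stages run on incomplete output.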
Our pipeline consists of three main stages: runfastqc, checkfastqc and runtrim. Each stage exists as a Python module, as shown above, and each module contains specific functions that execute the tools listed. The diagram below shows each of these stages, with colour coding to show which tools are used in each module, as well as useful output files. For this work, the module checkfastqc was developed specifically to remove the human intervention required to check output from FastQC before continuing with trimming or further analysis.
Each module can also be executed independently of the main executable workflow. For example, if Cutadapt fails, the runtrim stage can be re-run on its own from the command line as
$ python -m bioexcel_seqqc.runtrim <arguments>