Customisation
Falco is a software framework for the analysis of RNA-seq data that harnesses the power and flexibility of publicly available cloud computing resources. The base version of Falco provides a choice of two aligners (STAR and HISAT) and two feature quantification tools (featureCounts and HTSeq), giving the user several aligner/quantification combinations to choose from. However, there may be situations where the user wants to use an aligner or quantification tool that is not provided with Falco. This documentation describes how to integrate such customisations with Falco.
The main Falco program is written in the Python programming language, and is located (relative to its installation directory) at:
source/spark_runner/run_pipeline_multiple_files_combined.py
Any changes to this file should be completed by an experienced Python programmer.
Whether adding a custom aligner or feature quantification program, the custom software needs to be made available to Falco. When Falco launches a cluster of cloud computing resources, it automatically installs the required software on each node of the cluster. This process is controlled by a script located at:
source/cluster_creator/install_software.sh
The name of this script is also configurable in the config file:
emr_cluster.config
The field is called bootstrap_scripts and is in the [EMR] section of the config file. The same configuration file also contains the software_installer_location field, which specifies the AWS S3 location of all the software to be copied to the cluster nodes.
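As an illustration of the two fields described above, the following minimal sketch reads them with Python's standard configparser module. The config excerpt and its values (script name, bucket path) are placeholders, and the comma-separated handling of bootstrap_scripts is an assumption; check your own emr_cluster.config for the real values and format.

```python
import configparser

# Hypothetical excerpt of emr_cluster.config; both values are placeholders.
EXAMPLE_CONFIG = """
[EMR]
bootstrap_scripts = install_software.sh
software_installer_location = s3://your-bucket/falco/software
"""


def read_install_settings(config_text):
    """Return (bootstrap script names, S3 software location) from the [EMR] section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    emr = parser["EMR"]
    # Assumption: several scripts, if present, are separated by commas.
    scripts = [name.strip() for name in emr["bootstrap_scripts"].split(",") if name.strip()]
    return scripts, emr["software_installer_location"]


scripts, location = read_install_settings(EXAMPLE_CONFIG)
print(scripts, location)
```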
To make your software available to Falco, you will need to:
- upload the software, in the form of a .tar.gz file or similar, to a location in AWS S3.
  - note that any .gz files will automatically be unzipped as part of the Falco installation process.
- add code to the install_software.sh file to install the software.
  - you may wish to add a symbolic link to your software after it is installed; see existing examples in the installation script.
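The packaging part of the first step can be sketched with Python's standard tarfile module. The tool name my_aligner and the file aligner_bin are hypothetical stand-ins for your own software; this is one possible way to build the archive, not a required one.

```python
import os
import tarfile
import tempfile


def package_tool(tool_dir, archive_path):
    """Bundle tool_dir into a gzip-compressed tar archive and return its member names."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps only the top-level folder name inside the archive.
        archive.add(tool_dir, arcname=os.path.basename(tool_dir))
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


# Demonstration with a throwaway directory standing in for a real tool.
work_dir = tempfile.mkdtemp()
tool_dir = os.path.join(work_dir, "my_aligner")  # hypothetical tool name
os.makedirs(tool_dir)
with open(os.path.join(tool_dir, "aligner_bin"), "w") as handle:
    handle.write("#!/bin/sh\necho placeholder\n")

members = package_tool(tool_dir, os.path.join(work_dir, "my_aligner.tar.gz"))
print(members)
```

The resulting archive can then be copied to S3 with any S3 client, for example the AWS CLI command aws s3 cp.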
To add a new aligner to Falco, a new function needs to be added to the main Python program for Falco.
The inputs to the function are:
- the unique sample id
- a list of input fastq file names (max 2 elements for paired-end data)
- alignment output directory
The outputs of the function are:
- .sam file name (alignment data)
- a list with alignment quality control statistics - each element is a tuple in the format of (<sample_id> \t QC_<tool_name>_<metric_name>, metric_value)
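The quality control tuple format described above can be illustrated with a small helper. The helper name make_qc_entry, the tool name myaligner and the metric names are hypothetical; only the tuple layout is taken from the description above.

```python
def make_qc_entry(sample_id, tool_name, metric_name, metric_value):
    """Build one QC tuple: ("<sample_id>\\tQC_<tool_name>_<metric_name>", metric_value)."""
    key = "{}\tQC_{}_{}".format(sample_id, tool_name, metric_name)
    return (key, metric_value)


# Hypothetical metrics for one sample.
aligner_qc_output = [
    make_qc_entry("sample_01", "myaligner", "total_reads", 1000),
    make_qc_entry("sample_01", "myaligner", "uniquely_mapped", 900),
]
print(aligner_qc_output)
```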
The body of the function does the following:
- issues a unix shell command to execute the alignment software
- checks for errors
- collects quality control statistics (Optional)
- returns the required output
The following is skeleton Python code for the align_reads function:
def align_reads_custom(sample_name, file_names, alignment_output_dir):
    # This skeleton assumes that Popen, PIPE, shlex, APPLICATION_FOLDER,
    # GENOME_REFERENCES_FOLDER and parser_result are available in the main Falco program.

    # If a paired read flag is required
    # paired_read = len(file_names) == 2

    print("Aligning reads...")

    # Construct the shell command for executing the alignment tool
    aligner_args = "{app_folder}/aligner/aligner_bin {aligner_extra_args} {index_folder} {fastq_file_names} {output_folder}".\
        format(app_folder=APPLICATION_FOLDER,
               aligner_extra_args="" if parser_result.aligner_extra_args is None else parser_result.aligner_extra_args,
               index_folder=GENOME_REFERENCES_FOLDER + "/aligner_index",
               fastq_file_names=" ".join(file_names),
               output_folder=alignment_output_dir)
    print("Command: " + aligner_args)

    # Execute the shell command
    aligner_process = Popen(shlex.split(aligner_args), stdout=PIPE, stderr=PIPE)
    aligner_out, aligner_error = aligner_process.communicate()

    # Check for error using return code
    if aligner_process.returncode != 0:
        raise ValueError("Aligner failed to complete (Non-zero return code)!\n"
                         "Aligner stdout: {std_out} \nAligner stderr: {std_err}".format(std_out=aligner_out,
                                                                                        std_err=aligner_error))
    print('Completed reads alignment')

    aligner_qc_output = []  # Optional - collect quality control metrics produced by aligner
    sam_file_name_output = "Aligned.out.sam"  # Name of the .sam file written by the aligner

    return sam_file_name_output, aligner_qc_output
The aligner_args variable contains the shell command to be executed.
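To make the command construction concrete, the sketch below builds the same kind of command string with placeholder values and shows the argument list that shlex.split passes to Popen. The folder paths, the file names and the --threads option are hypothetical; substitute whatever your aligner actually expects.

```python
import shlex

# Placeholder values standing in for the constants defined in the main Falco program.
APPLICATION_FOLDER = "/app"
GENOME_REFERENCES_FOLDER = "/app/genome_ref"


def build_aligner_command(file_names, output_dir, extra_args=None):
    """Format the aligner command line and split it into an argument list."""
    command = "{app_folder}/aligner/aligner_bin {extra} {index_folder} {fastq} {output_folder}".format(
        app_folder=APPLICATION_FOLDER,
        extra="" if extra_args is None else extra_args,
        index_folder=GENOME_REFERENCES_FOLDER + "/aligner_index",
        fastq=" ".join(file_names),
        output_folder=output_dir)
    # shlex.split tokenises the string the way a POSIX shell would, without running a shell.
    return shlex.split(command)


argv = build_aligner_command(["s1_R1.fastq", "s1_R2.fastq"], "/tmp/out", "--threads 4")
print(argv)
```

Because the string is split on whitespace, an empty extra-arguments value leaves no stray empty argument, but file names containing spaces would need quoting (for example with shlex.quote) before being joined.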
To add a new feature quantification tool to Falco, a new function needs to be added to the main Python program for Falco.
The inputs to the function are:
- the unique sample id
- the path to the alignment output
- a boolean indicating whether the sample's reads are paired-end (as opposed to single-end)
- counter output directory
The outputs of the function are:
- a list of the counter output - each element is a tuple in the format of (<sample_id> \t <gene_identifier>, count)
- a list with count quality control statistics - each element is a tuple in the format of (<sample_id> \t QC_<tool_name>_<metric_name>, metric_value)
The body of the function does the following:
- issues a unix shell command to execute the feature quantification software
- checks for errors
- collects quality control statistics (Optional)
- returns the required output
The following is skeleton Python code for the count_reads function:
def count_reads_featurecount(sample_name, aligned_output_filepath, paired_reads, counter_output_dir):
    # This skeleton assumes that Popen, PIPE, shlex, APPLICATION_FOLDER,
    # GENOME_REFERENCES_FOLDER and parser_result are available in the main Falco program.
    print("Counting reads...")

    # Construct the shell command for executing the quantification tool
    counter_args = "{app_folder}/counter/counter_bin {counter_extra_args} {aligned_file} {genome_ref_folder}/{annotation_file} {output_folder}".\
        format(app_folder=APPLICATION_FOLDER,
               counter_extra_args="" if parser_result.counter_extra_args is None else parser_result.counter_extra_args,
               aligned_file=aligned_output_filepath,
               genome_ref_folder=GENOME_REFERENCES_FOLDER + "/genome_ref",
               annotation_file=parser_result.annotation_file,
               output_folder=counter_output_dir)
    print("Command: " + counter_args)

    # Execute the shell command
    counter_process = Popen(shlex.split(counter_args), stdout=PIPE, stderr=PIPE)
    counter_out, counter_error = counter_process.communicate()

    # Check for error using return code
    if counter_process.returncode != 0:
        raise ValueError("Counter failed to complete! (Non-zero return code)\nCounter stdout: {} \n"
                         "Counter stderr: {}".format(counter_out, counter_error))

    # Extract the gene counts output
    counter_output = []

    # This example assumes the output is stored in a file called counts.txt in the output directory.
    # The format of the counts.txt is gene_name \t gene_count
    with open(counter_output_dir + "/counts.txt") as f:
        for index, line in enumerate(f):
            if index < 1:  # Ignore header line
                continue

            fields = line.strip().split()
            if len(fields) == 0:  # Skip blank lines instead of failing on them
                continue

            gene, count = fields[0], fields[-1]
            counter_output.append((sample_name + "\t" + gene, int(count)))

    counter_qc_output = []  # Optional - collect quality control metrics produced by counter

    return counter_output, counter_qc_output
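The optional counter_qc_output list in the skeleton above could be filled along the following lines. This is a minimal sketch that assumes the quantification tool writes a tab-separated summary of metric name and integer value, one pair per line; the helper name collect_counter_qc, the tool name mycounter and the sample data are hypothetical.

```python
import io


def collect_counter_qc(sample_name, tool_name, summary_lines):
    """Turn 'metric<TAB>value' lines into QC tuples in the format Falco expects."""
    qc_output = []
    for line in summary_lines:
        fields = line.strip().split("\t")
        if len(fields) != 2:
            continue  # Skip blank or malformed lines
        metric, value = fields
        try:
            value = int(value)
        except ValueError:
            continue  # Skip lines whose value is not an integer, such as a header
        qc_output.append(("{}\tQC_{}_{}".format(sample_name, tool_name, metric), value))
    return qc_output


# In-memory stand-in for a summary file produced by the tool.
example_summary = io.StringIO("Status\tcount\nAssigned\t950\nUnassigned_NoFeatures\t50\n")
counter_qc_output = collect_counter_qc("sample_01", "mycounter", example_summary)
print(counter_qc_output)
```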
In addition to the above changes, the user will need to modify the alignmentCountStep function in the main Falco Python file for running the analysis (run_pipeline_multiple_files.py) so that it uses the customised functions that have been added. Refer to the existing code for guidance.
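One generic way to wire custom functions into a processing step is a lookup table keyed by tool name, sketched below. This is not Falco's actual code: the dictionaries, the select_tools helper, the "custom" key and the count_reads_custom name are all hypothetical, and the existing alignmentCountStep function may select tools differently, so follow its conventions when editing it.

```python
def align_reads_custom(sample_name, file_names, alignment_output_dir):
    """Stand-in for the custom aligner function described earlier."""
    return "Aligned.out.sam", []


def count_reads_custom(sample_name, aligned_output_filepath, paired_reads, counter_output_dir):
    """Stand-in for the custom quantification function described earlier."""
    return [], []


# Hypothetical registries mapping a tool name to the function that runs it.
ALIGNERS = {"custom": align_reads_custom}
COUNTERS = {"custom": count_reads_custom}


def select_tools(aligner_name, counter_name):
    """Return the (aligner, counter) functions for the requested tool names."""
    try:
        return ALIGNERS[aligner_name.lower()], COUNTERS[counter_name.lower()]
    except KeyError as missing:
        raise ValueError("Unsupported tool: {}".format(missing))


aligner, counter = select_tools("custom", "custom")
sam_file, aligner_qc = aligner("sample_01", ["s1.fastq"], "/tmp/align")
print(sam_file, aligner_qc)
```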