Customisation

Andrian edited this page Sep 20, 2019 · 2 revisions

Customising Falco

Falco is a software framework for the analysis of RNA-seq data that harnesses the power and flexibility of publicly available cloud computing resources. The base version of Falco provides a choice of two aligners (STAR and HISAT) and two feature quantification tools (featureCounts and HTSeq), giving the user several aligner/quantification combinations to choose from. However, the user may want to use an aligner or quantification tool that is not provided with Falco. This documentation describes how to integrate such a tool with Falco.

The main Falco program is written in the Python programming language, and is located (relative to its installation directory) at:

source/spark_runner/run_pipeline_multiple_files_combined.py

Any changes to this file should be made by an experienced Python programmer.

Making External Software available to Falco

Whether adding a custom aligner or feature quantification program, the custom software needs to be made available to Falco. When Falco launches a cluster of cloud computing resources, it automatically installs the required software on each node of the cluster. This process is controlled by a script located at:

source/cluster_creator/install_software.sh

The name of this script is also configurable in the config file:

emr_cluster.config

The relevant field is called bootstrap_scripts and is in the [EMR] section of the config file. The same file also contains the software_installer_location field, which specifies the AWS S3 location of all the software to be copied to the cluster nodes.
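As a rough illustration of the two fields described above, the sketch below parses a made-up excerpt of emr_cluster.config with Python's standard configparser module. Only the section name and the two field names come from this page; the values, the helper name read_installer_settings, and the assumption that several scripts would be comma-separated are hypothetical.

```python
import configparser

# Hypothetical excerpt of emr_cluster.config. Only the [EMR] section name and
# the two field names come from the documentation; the values are placeholders.
SAMPLE_CONFIG = """
[EMR]
bootstrap_scripts = install_software.sh
software_installer_location = s3://your-bucket/your-software-prefix
"""


def read_installer_settings(config_text):
    """Return (list of bootstrap script names, S3 software location)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    emr = parser["EMR"]
    # Assumption: if several scripts are listed, they are comma-separated.
    scripts = [s.strip() for s in emr["bootstrap_scripts"].split(",") if s.strip()]
    return scripts, emr["software_installer_location"]


scripts, location = read_installer_settings(SAMPLE_CONFIG)
print(scripts, location)
```

Check the real config file shipped with Falco for the exact value format before editing it.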

To make your software available to Falco, you will need to:

  • upload the software, as a .tar.gz file or similar, to a location in AWS S3. Note that any .gz files will automatically be unzipped as part of the Falco installation process.
  • add code to the install_software.sh file to install the software. You may wish to add a symbolic link to your software after it is installed; see the existing examples in the installation script.

Changing the Aligner

To add a new aligner to Falco, a new function needs to be added to the main Python program for Falco.

The inputs to the function are:

  • the unique sample id
  • a list of input fastq file names (max 2 elements for paired-end data)
  • alignment output directory

The outputs of the function are:

  • .sam file name (alignment data)
  • a list with alignment quality control statistics - each element is a tuple in the format of (<sample_id> \t QC_<tool_name>_<metric_name>, metric_value)
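The quality control tuple format described above can be sketched as follows. The helper name make_qc_entry, the tool name "myaligner", and the metric name are hypothetical and only illustrate the key layout; they are not part of Falco.

```python
def make_qc_entry(sample_id, tool_name, metric_name, metric_value):
    """Build one QC tuple whose key is <sample_id> TAB QC_<tool_name>_<metric_name>."""
    key = "{}\tQC_{}_{}".format(sample_id, tool_name, metric_name)
    return (key, metric_value)


# Hypothetical sample id, tool name and metric, for illustration only.
qc = [make_qc_entry("sample_01", "myaligner", "uniquely_mapped_reads", 12345)]
print(qc)
```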

The body of the function does the following:

  • issues a unix shell command to execute the alignment software
  • checks for errors
  • collects quality control statistics (Optional)
  • returns the required output
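The first two steps above (running the tool and checking for errors) can be factored into a small helper. This is a minimal sketch, not code taken from Falco: the name run_tool is hypothetical, and the Python interpreter is used as a stand-in for a real aligner binary so that the example runs anywhere.

```python
import shlex
import sys
from subprocess import PIPE, Popen


def run_tool(command):
    """Run a command string; return (stdout, stderr) as text, or raise on failure."""
    process = Popen(shlex.split(command), stdout=PIPE, stderr=PIPE)
    out, err = process.communicate()
    # A non-zero return code is treated as a failure of the external tool.
    if process.returncode != 0:
        raise ValueError("Command failed (return code {}): {}\nstderr: {}".format(
            process.returncode, command, err.decode()))
    return out.decode(), err.decode()


# Stand-in command: the current Python interpreter instead of an aligner binary.
out, err = run_tool('{} -c "print(42)"'.format(shlex.quote(sys.executable)))
print(out.strip())
```

Raising an exception on a non-zero return code mirrors the error handling used in the skeleton functions on this page.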

Below is skeleton Python code for a custom align_reads function:

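import shlex
from subprocess import PIPE, Popen

# APPLICATION_FOLDER, GENOME_REFERENCES_FOLDER and parser_result are expected
# to be defined in the main Falco program.
def align_reads_custom(sample_name, file_names, alignment_output_dir):
    # If a paired-read flag is required:
    # paired_read = len(file_names) == 2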

    print("Aligning reads...")
    
    # Construct the shell command for executing the alignment tool
    aligner_args = "{app_folder}/aligner/aligner_bin {aligner_extra_args} {index_folder} {fastq_file_names} {output_folder}".\
        format(app_folder=APPLICATION_FOLDER,
               aligner_extra_args="" if parser_result.aligner_extra_args is None else parser_result.aligner_extra_args,
               index_folder=GENOME_REFERENCES_FOLDER + "/aligner_index",
               fastq_file_names=" ".join(file_names),
               output_folder=alignment_output_dir)
    print("Command: " + aligner_args)
    
    # Execute the shell command
    aligner_process = Popen(shlex.split(aligner_args), stdout=PIPE, stderr=PIPE)
    aligner_out, aligner_error = aligner_process.communicate()
    
    # Check for error using return code
    if aligner_process.returncode != 0:
        raise ValueError("Aligner failed to complete (Non-zero return code)!\n"
                         "Aligner stdout: {std_out} \nAligner stderr: {std_err}".format(std_out=aligner_out,
                                                                                        std_err=aligner_error))

    print('Completed reads alignment')

    aligner_qc_output = [] # Optional - collect quality control metrics produced by aligner
    sam_file_name_output = "Aligned.out.sam"

    return sam_file_name_output, aligner_qc_output

The aligner_args variable contains the shell command to be executed.

Changing the Feature Quantification tool

To add a new feature quantification tool to Falco, a new function needs to be added to the main Python program for Falco.

The inputs to the function are:

  • the unique sample id
  • the path to the alignment output
  • a boolean indicating if the samples are paired-end read or single-end read
  • counter output directory

The outputs of the function are:

  • a list of the counter output - each element is a tuple in the format of (<sample_id> \t <gene_identifier>, count)
  • a list with count quality control statistics - each element is a tuple in the format of (<sample_id> \t QC_<tool_name>_<metric_name>, metric_value)
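As an illustration of the second output, the sketch below turns a tool summary into the QC tuple list. It assumes a hypothetical summary format of one "metric TAB value" pair per line with integer values; real tools use their own formats, so the parsing would need to be adapted. The function name parse_qc_summary and the tool name "mycounter" are likewise made up.

```python
def parse_qc_summary(sample_id, tool_name, summary_text):
    """Convert 'metric<TAB>value' lines into a list of QC tuples."""
    qc_output = []
    for line in summary_text.splitlines():
        fields = line.strip().split("\t")
        if len(fields) != 2:
            continue  # Skip blank or malformed lines
        metric, value = fields
        key = "{}\tQC_{}_{}".format(sample_id, tool_name, metric)
        qc_output.append((key, int(value)))
    return qc_output


# Hypothetical summary text, for illustration only.
summary = "assigned\t900\nunassigned\t100\n\n"
counter_qc = parse_qc_summary("sample_01", "mycounter", summary)
print(counter_qc)
```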

The body of the function does the following:

  • issues a unix shell command to execute the feature quantification software
  • checks for errors
  • collects quality control statistics (Optional)
  • returns the required output

Below is skeleton Python code for a custom count_reads function:

def count_reads_custom(sample_name, aligned_output_filepath, paired_reads, counter_output_dir):
    print("Counting reads...")
    
    # Construct the shell command for executing the quantification tool
    counter_args = "{app_folder}/counter/counter_bin {counter_extra_args} {aligned_file} {genome_ref_folder}/{annotation_file} {output_folder}".\
        format(app_folder=APPLICATION_FOLDER,
               counter_extra_args="" if parser_result.counter_extra_args is None else parser_result.counter_extra_args,
               aligned_file=aligned_output_filepath,
               genome_ref_folder=GENOME_REFERENCES_FOLDER + "/genome_ref",
               annotation_file=parser_result.annotation_file,
               output_folder=counter_output_dir)
    print("Command: " + counter_args)
    
    # Execute the shell command
    counter_process = Popen(shlex.split(counter_args), stdout=PIPE, stderr=PIPE)
    counter_out, counter_error = counter_process.communicate()
    
    # Check for error using return code
    if counter_process.returncode != 0:
        raise ValueError("Counter failed to complete! (Non-zero return code)\nCounter stdout: {} \n"
                         "Counter stderr: {}".format(counter_out, counter_error))

    # Extract the gene counts output
    counter_output = []
    
    # This example assumes the output is stored in a file called counts.txt in the output directory.
    # The format of the counts.txt is gene_name \t gene_count
    with open(counter_output_dir + "/counts.txt") as f: 
        for index, line in enumerate(f):
            if index < 1:  # Ignore header line
                continue

            fields = line.strip().split()
            if len(fields) == 0:  # Skip blank lines
                continue
            gene, count = fields[0], fields[-1]
            counter_output.append((sample_name + "\t" + gene, int(count)))

    counter_qc_output = [] # Optional - collect quality control metrics produced by counter

    return counter_output, counter_qc_output

Modify Falco Options

In addition to the above changes, the user will need to modify the alignmentCountStep function in the main Falco Python file for running the analysis (run_pipeline_multiple_files.py) so that it calls the customised functions that have been added. Refer to the existing code for guidance.
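One possible way to select between built-in and custom functions is a lookup table keyed by tool name. The sketch below is only an illustration of that idea and does not reflect how alignmentCountStep is actually written: the names ALIGNERS and select_aligner are hypothetical, and both aligner functions are placeholders with the signature described earlier.

```python
def align_reads_star(sample_name, file_names, alignment_output_dir):
    # Placeholder standing in for an existing aligner function.
    return "Aligned.out.sam", []


def align_reads_custom(sample_name, file_names, alignment_output_dir):
    # Placeholder standing in for the custom aligner function added by the user.
    return "Aligned.out.sam", []


# Hypothetical mapping from an aligner name to the function implementing it.
ALIGNERS = {
    "star": align_reads_star,
    "custom": align_reads_custom,
}


def select_aligner(name):
    """Return the aligner function registered under the given name."""
    try:
        return ALIGNERS[name.lower()]
    except KeyError:
        raise ValueError("Unknown aligner '{}'; choose from {}".format(
            name, sorted(ALIGNERS)))


aligner = select_aligner("custom")
sam_file, qc = aligner("sample_01", ["reads_1.fastq", "reads_2.fastq"], "out")
print(sam_file, qc)
```

The same pattern would apply to feature quantification functions.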
