File utils.py - Documentation

Methods - Overview

name description
save_model Save a pysster.Model object.
load_model Load a pysster.Model object.
save_data Save a pysster.Data object.
load_data Load a pysster.Data object.
annotate_structures Annotate secondary structure predictions with structural contexts.
predict_structures Predict secondary structures for RNA sequences.
get_performance_report Get a performance overview of a classifier.
plot_roc Get ROC curves for every class.
plot_prec_recall Get Precision-Recall curves for every class.
save_as_meme Save sequence (or structure) motifs in MEME format.
run_tomtom Compare a MEME file against a database using TomTom.

save_model

def save_model(model, file_path)

Save a pysster.Model object.

This function creates two files: a pickled version of the pysster.Model object and an HDF5 file containing the underlying Keras model. For example, if file_path is 'model', the two files 'model' and 'model.h5' are created.

parameter type description
model pysster.Model A Model object.
file_path str A file name.
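
The two-file convention described above can be sketched with the standard library alone. This is a minimal illustration and not part of pysster: the helper names `model_file_pair` and `missing_model_files` are hypothetical, and the only fact taken from the documentation is that the Keras model is stored under `file_path` plus `.h5`.

```python
import os


def model_file_pair(file_path):
    """Return the two paths save_model() is documented to create."""
    # Pickled pysster.Model object, and the Keras model as '<file_path>.h5'.
    return file_path, file_path + ".h5"


def missing_model_files(file_path):
    """List which of the two expected files are not on disk (hypothetical helper)."""
    return [path for path in model_file_pair(file_path) if not os.path.isfile(path)]
```

A check like this may be useful before calling load_model(), since both files have to be present.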

load_model

def load_model(file_path)

Load a pysster.Model object.

parameter type description
file_path str A file containing a pickled pysster.Model object (file_path.h5 must also exist, see save_model()).

returns type description
model pysster.Model A Model object.

save_data

def save_data(data, file_path)

Save a pysster.Data object.

The object will be pickled to disk.

parameter type description
data pysster.Data A Data object.
file_path str A file name.

load_data

def load_data(file_path)

Load a pysster.Data object.

parameter type description
file_path str A file containing a pickled pysster.Data object.

returns type description
data pysster.Data The Data object loaded from file.

annotate_structures

def annotate_structures(input_file, output_file)

Annotate secondary structure predictions with structural contexts.

Given dot-bracket strings, this function annotates every character as either 'H' (hairpin), 'S' (stem), 'I' (internal loop/bulge), 'M' (multi loop), 'F' (5-prime) or 'T' (3-prime). The input file must be a FASTA-formatted file in which each sequence and each structure spans a single line:

>header
CCCCAUAGGGG
((((...)))) (-3.3)

This is the default format of e.g. RNAfold. The output file will contain the annotated string:

>header
CCCCAUAGGGG
SSSSHHHSSSS

parameter type description
input_file str A fasta file containing secondary structure predictions.
output_file str A fasta file with secondary structure annotations.
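
How a dot-bracket string maps to the six context letters can be illustrated with a simplified sketch. This is not the pysster implementation: the function name `annotate_dot_bracket` is hypothetical, and labeling unpaired bases between two exterior stems as 'M' is an assumption made here for illustration.

```python
def annotate_dot_bracket(structure_line):
    """Annotate a dot-bracket string with H, S, I, M, F and T (simplified sketch)."""
    # Drop a trailing energy such as " (-3.3)" as printed by RNAfold.
    parts = structure_line.split()
    if not parts:
        raise ValueError("empty structure line")
    dot_bracket = parts[0]

    # Match every '(' with its ')'.
    pair = [-1] * len(dot_bracket)
    open_positions = []
    for i, char in enumerate(dot_bracket):
        if char == "(":
            open_positions.append(i)
        elif char == ")":
            if not open_positions:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = open_positions.pop()
            pair[i], pair[j] = j, i
        elif char != ".":
            raise ValueError("unexpected character %r" % char)
    if open_positions:
        raise ValueError("unbalanced '(' at position %d" % open_positions[-1])

    labels = ["S" if partner != -1 else "." for partner in pair]

    # Each work item is one loop: the half-open range it spans and whether
    # a base pair closes it (False only for the exterior loop).
    loops = [(0, len(dot_bracket), False)]
    while loops:
        start, end, closed = loops.pop()
        unpaired, stems = [], []
        i = start
        while i < end:
            if pair[i] == -1:
                unpaired.append(i)
                i += 1
            else:
                stems.append((i, pair[i]))
                loops.append((i + 1, pair[i], True))
                i = pair[i] + 1
        for pos in unpaired:
            if closed:
                # 0 enclosed stems: hairpin, 1: internal loop/bulge, more: multi loop.
                labels[pos] = "H" if not stems else "I" if len(stems) == 1 else "M"
            elif not stems or pos < stems[0][0]:
                labels[pos] = "F"
            elif pos > stems[-1][1]:
                labels[pos] = "T"
            else:
                labels[pos] = "M"  # assumption: unpaired bases between exterior stems
    return "".join(labels)
```

Applied to the structure line of the example entry above, the sketch reproduces the annotated string shown in the output example.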

predict_structures

def predict_structures(input_file, output_file, num_processes=None, annotate=False)

Predict secondary structures for RNA sequences.

This is a convenience function to get quick RNA secondary structure predictions. The function will try to use the ViennaRNA Python bindings or the RNAfold binary to perform predictions. If neither can be found, the function raises a RuntimeError. Using the ViennaRNA Python bindings is preferred because it is faster.

Entries of the output file look as follows if annotate = False:

>header
CCCCAUAGGGG
((((...)))) (-3.3)

If annotate = True, the annotated structure string is written instead of the dot-bracket string:

>header
CCCCAUAGGGG
SSSSHHHSSSS

Have a look at the annotate_structures() function for more information about annotated structure strings.

Warning: Due to the way Python works, spinning up additional processes means copying the complete memory of the original process, i.e. if the original process already uses 5 GB of RAM, each additional process will use 5 GB as well.

parameter type description
input_file str A fasta file with RNA sequences.
output_file str A fasta file with sequences and structures.
num_processes int The number of parallel processes to use for prediction. (default: number of available cores)
annotate bool Output the annotated structure string instead of the dot-bracket string. (default: False)
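
Reading the output file back in can be sketched as follows. The function name `read_structure_records` is hypothetical and not part of pysster; the sketch only assumes the three-line entry layout shown above (header, sequence, structure with an optional energy in parentheses).

```python
def read_structure_records(lines):
    """Yield (header, sequence, structure, energy) from three-line entries."""
    # Skip blank lines, then consume the content three lines at a time.
    content = [line.strip() for line in lines if line.strip()]
    if len(content) % 3 != 0:
        raise ValueError("expected header, sequence and structure for every entry")
    for k in range(0, len(content), 3):
        header, sequence, third = content[k:k + 3]
        if not header.startswith(">"):
            raise ValueError("expected a '>' header, got %r" % header)
        # "((((...)))) (-3.3)" -> structure plus energy; "SSSSHHHSSSS" -> no energy.
        parts = third.split(None, 1)
        structure = parts[0]
        energy = float(parts[1].strip("() ")) if len(parts) == 2 else None
        yield header[1:], sequence, structure, energy
```

The function accepts any iterable of lines, so an open file handle can be passed directly. For annotated output (annotate = True) the energy field is None.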

get_performance_report

def get_performance_report(labels, predictions)

Get a performance overview of a classifier.

The report contains precision, recall, f1-score, ROC-AUC and Precision-Recall-AUC for every class (in a 1 vs. all approach) as well as weighted averages (weighted by the number of sequences 'n' in each class).

parameter type description
labels numpy.ndarray A binary matrix of shape (num sequences, num classes) containing the true labels.
predictions numpy.ndarray A matrix of shape (num sequences, num classes) containing predicted probabilities.

returns type description
report str Summary table of the above-mentioned performance measurements.
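
The 1 vs. all measurements and their weighting by class size can be illustrated in plain Python. This is a sketch under stated assumptions, not the code behind get_performance_report(): the name `one_vs_all_report` is hypothetical, it takes nested lists instead of numpy arrays, it assumes single-label data, and it takes the class with the highest probability as the prediction.

```python
def one_vs_all_report(labels, predictions):
    """Per-class precision, recall and f1-score plus support-weighted averages."""
    num_classes = len(labels[0])
    # Assumption: single-label data, the predicted class is the highest probability.
    truth = [row.index(max(row)) for row in labels]
    predicted = [row.index(max(row)) for row in predictions]

    rows = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(truth, predicted))
        fp = sum(t != c and p == c for t, p in zip(truth, predicted))
        fn = sum(t == c and p != c for t, p in zip(truth, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # 'n' is the number of sequences that truly belong to class c.
        rows.append({"precision": precision, "recall": recall, "f1": f1, "n": tp + fn})

    total = sum(row["n"] for row in rows)
    weighted = {key: sum(row[key] * row["n"] for row in rows) / total
                for key in ("precision", "recall", "f1")}
    return rows, weighted
```

Weighting by 'n' means that large classes dominate the averages, which is worth keeping in mind for imbalanced data sets.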

plot_roc

def plot_roc(labels, predictions, file_path)

Get ROC curves for every class.

In the case of more than two classes the comparisons will be performed in a 1 vs. all approach (i.e. you get one curve per class).

parameter type description
labels numpy.ndarray A binary matrix of shape (num sequences, num classes) containing the true labels.
predictions numpy.ndarray A matrix of shape (num sequences, num classes) containing predicted probabilities.
file_path str The file the plot should be saved to.
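
The 1 vs. all idea behind the curves can be sketched by computing one ROC-AUC value per class. The function name `one_vs_all_auc` is hypothetical, the inputs are nested lists instead of numpy arrays, and the pairwise ranking formula is used here for brevity; it is not claimed to be how plot_roc() computes its curves.

```python
def one_vs_all_auc(labels, predictions):
    """ROC-AUC per class, treating class c as positive and all others as negative."""
    aucs = []
    for c in range(len(labels[0])):
        positives = [pred[c] for lab, pred in zip(labels, predictions) if lab[c] == 1]
        negatives = [pred[c] for lab, pred in zip(labels, predictions) if lab[c] != 1]
        if not positives or not negatives:
            aucs.append(None)  # undefined without both groups
            continue
        # AUC equals the probability that a positive outranks a negative (ties count half).
        wins = sum((pos > neg) + 0.5 * (pos == neg)
                   for pos in positives for neg in negatives)
        aucs.append(wins / (len(positives) * len(negatives)))
    return aucs
```

The pairwise comparison is quadratic in the number of sequences, so it is only suitable for small illustrative inputs.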

plot_prec_recall

def plot_prec_recall(labels, predictions, file_path)

Get Precision-Recall curves for every class.

In the case of more than two classes the comparisons will be performed in a 1 vs. all approach (i.e. you get one curve per class).

parameter type description
labels numpy.ndarray A binary matrix of shape (num sequences, num classes) containing the true labels.
predictions numpy.ndarray A matrix of shape (num sequences, num classes) containing predicted probabilities.
file_path str The file the plot should be saved to.

save_as_meme

def save_as_meme(logos, file_path)

Save sequence (or structure) motifs in MEME format.

parameter type description
logos [pysster.Motif] A list of Motif objects.
file_path str The name of the output text file.
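
For orientation, a minimal MEME-style text file can be produced along the following lines. This is a sketch and not the output of save_as_meme(): the function name `format_meme`, the motif names `motif_0`, `motif_1`, ..., the uniform background and the plain probability matrices (instead of pysster.Motif objects) are all assumptions, and the exact header accepted for a given alphabet should be checked against the MEME Suite documentation.

```python
def format_meme(motifs, alphabet="ACGU"):
    """Format position probability matrices as minimal MEME text (sketch)."""
    uniform = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    lines = [
        "MEME version 4",
        "",
        "ALPHABET= " + alphabet,
        "",
        "Background letter frequencies",
        " ".join("%s %.3f" % (letter, uniform) for letter in alphabet),
        "",
    ]
    for index, matrix in enumerate(motifs):
        lines.append("MOTIF motif_%d" % index)
        lines.append("letter-probability matrix: alength= %d w= %d"
                     % (len(alphabet), len(matrix)))
        for row in matrix:
            # One row per motif position, one probability per alphabet letter.
            lines.append(" ".join("%.6f" % value for value in row))
        lines.append("")
    return "\n".join(lines)
```

Each matrix is a list of rows, one row per motif position, and every row is expected to sum to 1.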

run_tomtom

def run_tomtom(motif_file, output_folder, database, options=None)

Compare a MEME file against a database using TomTom.

Default options string: "-min-overlap 5 -verbosity 1 -xalph -evalue -thresh 0.1"

parameter type description
motif_file str A MEME file.
output_folder str The folder the TomTom output will be saved in.
database str A MEME file serving as the database to compare against.
options str Command line options passed to TomTom. (default: the options string shown above)
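
How the options string might be combined with the other arguments can be sketched as below. The function `build_tomtom_command` is hypothetical and only assembles an argument list without running anything. Using the TomTom flag `-oc` for the output folder and placing the query file before the database file are assumptions about how the call is put together, not a description of pysster internals.

```python
import shlex

# Default options string as given in the documentation above.
DEFAULT_OPTIONS = "-min-overlap 5 -verbosity 1 -xalph -evalue -thresh 0.1"


def build_tomtom_command(motif_file, output_folder, database, options=None):
    """Assemble an argument list for a TomTom call (sketch, does not run anything)."""
    if options is None:
        options = DEFAULT_OPTIONS
    # Assumption: "-oc" selects the output folder and the query file
    # precedes the database file on the command line.
    return (["tomtom"] + shlex.split(options)
            + ["-oc", output_folder, motif_file, database])
```

Splitting the options with shlex keeps quoted values intact, and passing a list rather than a single string avoids shell quoting issues if the command is later handed to subprocess.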