The Model class represents a convolutional neural network and provides functions for network training and for visualizing learned features (sequence/structure motifs). The basic architecture consists of a variable number of convolutional and max pooling layers followed by a variable number of dense layers. These layers are interspersed with dropout layers after the input layer and after every max pooling and dense layer. Network weights in all layers are regularized using a max-norm constraint. Early stopping is implemented with respect to the loss on the validation data.
The network uses the Adam optimizer. For single-label classification the output layer uses a softmax activation with a categorical crossentropy loss; for multi-label classification it uses a sigmoid activation with a binary crossentropy loss. All other layers use ReLU activations.
The network can be tuned using the following hyperparameters, which can be provided through the 'params' argument of the __init__ function:
parameter | default | description |
---|---|---|
conv_num | 2 | number of convolutional/pooling layers |
kernel_num | 30 | number of kernels in each conv layer |
kernel_len | 25 | length of kernels |
pool_size | 2 | size of pooling windows |
pool_stride | 2 | step size of pooling operation |
dense_num | 1 | number of dense layers |
neuron_num | 100 | number of neurons in each dense layer |
dropout_input | 0.1 | dropout portion after input |
dropout_conv | 0.3 | dropout portion after pooling layers |
dropout_dense | 0.6 | dropout portion after dense layers |
batch_size | 128 | batch size during training |
learning_rate | 0.0005 | learning rate of Adam optimizer |
patience_lr | 5 | number of epochs without validation loss improvement before halving learning rate |
patience_stopping | 15 | number of epochs without validation loss improvement before stopping training |
epochs | 500 | maximum number of training epochs |
kernel_constraint | 3 | max-norm weight constraint |
Not all parameters are equally important for a hyperparameter grid search. The ones with the strongest influence are usually conv_num (range 1-3), kernel_num (range 50-300) and the dropout parameters (around 0.1 for the input and 0.2-0.6 otherwise).
Note: each convolutional/pooling stack reduces the length of your sequences. E.g. starting with sequences of length 300 and kernels of length 25 will result in sequences of length 300-25+1=276 after the first convolutional layer. A default pooling layer will halve this number further to 138. If you use too many convolutional/pooling stacks you will get an error, because the sequence length will drop to <= 0.
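For illustration, here is a small stand-alone helper (not part of pysster) that reproduces this arithmetic, assuming 'valid' convolutions and non-padded max pooling with the parameter names from the table above:

```python
def remaining_length(seq_len, conv_num=2, kernel_len=25, pool_size=2, pool_stride=2):
    # Illustrative only: feature map length after each conv/pool stack,
    # assuming 'valid' convolutions and non-padded max pooling.
    for _ in range(conv_num):
        seq_len = seq_len - kernel_len + 1                    # convolution
        seq_len = (seq_len - pool_size) // pool_stride + 1    # max pooling
        if seq_len <= 0:
            raise ValueError("too many conv/pool stacks for this sequence length")
    return seq_len

print(remaining_length(300, conv_num=1))  # 138, as in the example above
```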
For advanced users we offer the option to add recurrent (RNN) layers between the convolutional and the dense block. Two kinds of layers are possible: Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) layers. They can be tuned using the following hyperparameters, provided through the 'params' parameter as above:
parameter | default | description |
---|---|---|
rnn_type | None | "LSTM" or "GRU" (string); the only layer types supported at the moment |
rnn_num | 1 | number of RNN layers |
rnn_units | 32 | number of output dimensions of each RNN layer |
rnn_bidirectional | True | True or False (bool); whether the RNN layers should be bidirectional |
rnn_dropout_input | 0.2 | dropout portion for input connections |
rnn_dropout_recurrent | 0.0 | dropout portion for recurrent connections |
In our experience, RNN layers increase runtime considerably but improve predictive performance only slightly or not at all, so use them with caution. If you want to drop the convolutional or dense block entirely, simply set "conv_num" or "dense_num" to 0. However, motif visualization will no longer be possible if the first network layer is not a convolutional layer.
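As an example (the values are arbitrary and only meant to illustrate the parameter names from the tables above), a params dict that inserts a single bidirectional GRU layer between the convolutional and the dense block could look like this:

```python
params = {
    "conv_num": 1, "kernel_num": 50, "kernel_len": 18,  # convolutional block
    "rnn_type": "GRU", "rnn_num": 1, "rnn_units": 32,   # recurrent block
    "rnn_bidirectional": True,
    "dense_num": 1, "neuron_num": 100                   # dense block
}
```

The Model class provides the following methods: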
name | description |
---|---|
__init__ | Initialize the model with the given parameters. |
print_summary | Print an overview of the network architecture. |
train | Train the model. |
predict | Get model predictions for a subset of a Data object. |
get_max_activations | Get the network output of the first convolutional layer. |
visualize_kernel | Get a number of visualizations and an importance score for a convolutional kernel. |
visualize_all_kernels | Get visualizations for all first-layer convolutional kernels. |
plot_clustering | Perform a hierarchical clustering on both sequences and kernels. |
visualize_optimized_inputs | Visualize what every node in the network has learned. |
def __init__(self, params, data, seed = None)
Initialize the model with the given parameters.
Example: providing the params dict {'conv_num': 1, 'kernel_num': 20, 'dropout_input': 0.0} will set these three parameters to the provided values; all other parameters keep their default values (see above). A Data object must be provided to infer the input shape and the number of classes.
By default, multiple layers of the same type will share dependent parameters: {"dense_num": 3, "neuron_num": 100} creates a model with 100 neurons in each of the three dense layers.
To specify parameters for individual layers, tuples must be provided: {"dense_num": 3, "neuron_num": (300, 100, 30)} creates a model in which the first layer has 300 neurons, the second 100 and the third 30.
Another example: the following model has two convolutional layers (the first layer has 10 kernels of length 30, the second layer 20 kernels of length 3) and two dense layers (the first dense layer has 100 neurons, the second 10):
{"conv_num": 2, "kernel_num": (10, 20), "kernel_len": (30, 3),
"dense_num": 2, "neuron_num": (100, 10)}
parameter | type | description |
---|---|---|
params | dict | A dict containing hyperparameter values. |
data | pysster.Data | The Data object the model should be trained on. |
seed | int | Seed for the random initialization of network weights. |
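A minimal usage sketch (file names and alphabet are placeholders; see the Data API for how to construct a Data object):

```python
from pysster.Data import Data
from pysster.Model import Model

# placeholder input files and alphabet; see the Data API for details
data = Data(["class_0.fasta", "class_1.fasta"], "ACGT")
params = {"conv_num": 1, "kernel_num": 20, "dropout_input": 0.0}
model = Model(params, data, seed=13)
model.print_summary()
```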
def print_summary(self)
Print an overview of the network architecture.
def train(self, data, verbose = True)
Train the model.
The model will be trained and validated on the training and validation set provided by the Data object.
parameter | type | description |
---|---|---|
data | pysster.Data | The Data object the model should be trained on. |
verbose | bool | If True, progress information (train/val loss) will be printed throughout the training. |
def predict(self, data, group)
Get model predictions for a subset of a Data object.
The 'group' argument can have the value 'train', 'val', 'test' or 'all'. The returned array has the shape (number of sequences, number of classes) and contains predicted probabilities.
parameter | type | description |
---|---|---|
data | pysster.Data | A Data object. |
group | str | The subset of the Data object that should be used for prediction. |
returns | type | description |
---|---|---|
predictions | numpy.ndarray | An array containing predicted probabilities. |
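Continuing the sketch from the __init__ section, training and prediction might look like this:

```python
model.train(data, verbose=True)
predictions = model.predict(data, "test")
print(predictions.shape)  # (number of test sequences, number of classes)
```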
def get_max_activations(self, data, group)
Get the network output of the first convolutional layer.
The function returns the maximum activation (the maximum output of a kernel) for every kernel - input sequence pair. The return value is a dict containing the entries 'activations' (an array of shape (number of sequences, number of kernels)), 'labels' (an array of shape (number of sequences, number of classes)) and 'group' (the subset of the Data object used).
The 'group' argument can have the value 'train', 'val', 'test' or 'all'.
parameter | type | description |
---|---|---|
data | pysster.Data | A Data object. |
group | str | The subset of the Data object that should be used. |
returns | type | description |
---|---|---|
results | dict | A dict with 3 entries ('activations', 'labels', 'group'; see above). |
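A short usage sketch, again assuming a trained Model `model` and its Data object `data`:

```python
activations = model.get_max_activations(data, "test")
print(activations["activations"].shape)  # (number of test sequences, number of kernels)
print(activations["labels"].shape)       # (number of test sequences, number of classes)
print(activations["group"])              # "test"
```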
def visualize_kernel(self, activations, data, kernel, folder, colors_sequence={}, colors_structure={})
Get a number of visualizations and an importance score for a convolutional kernel.
This function creates three (or four) output files: 1) a sequence(/structure) motif that the kernel has learned to detect, 2) a histogram/activation plot showing the positional enrichment of said motif for every class, 3) violin plots showing the maximum activation distributions for every class (higher values == better, this is a proxy for global class enrichment) and 4), in case additional position-wise features are used, a line plot for each feature showing mean and standard deviation (see load_additional_positionwise_data() in the Data API).
The output files are named "motif_kernel_x.png", "position_kernel_x.png", "activations_kernel_x.png" and "additional_features_kernel_x.png".
How it works: Given an input sequence, a first layer kernel produces an output vector (called activations) of length sequence_length - kernel_length + 1. The position of the maximum activation can therefore be directly mapped back to the input sequence and a subsequence of the length of the kernel can be extracted from the input sequence. Applying this approach to every input sequence yields a number of subsequences that can be used for the construction of a motif. Subsequences are only considered if the maximum activation exceeds a certain threshold, in this case the maximum of the mean maximum activations per class. Only subsequences from the top class are used to construct the motif (up to 750 subsequences).
The histograms show the positions of the maximum activation, i.e. the positions the subsequences were extracted from. The activation plots show the mean activation and standard deviation for all sequence positions. Both plots are only based on sequences that led to a maximum activation higher than the threshold. The histogram and the mean activation plot usually convey the same information, but in case the histogram is very sparse the mean activation plot might be easier to read.
The violin plots show how the maximum activation values are distributed for each class, indicating global class enrichment.
The function returns a Motif object (or a tuple of Motif objects for RNA sequence/structure motifs) and an importance score that indicates how important this kernel was for the classification (higher values == more important). The score is computed as the maximum of the mean maximum activations per class minus the minimum of the mean maximum activations per class. The idea is that kernels that show big differences across classes (i.e. kernels that are strongly enriched in some classes and hardly or not at all in others) are more important for the network to deliver correct predictions.
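To make the score explicit, here is a rough sketch of the computation just described, using the output of get_max_activations() from the sketch above (this mirrors the description, not necessarily the exact internal implementation):

```python
import numpy as np

act = activations["activations"]  # shape: (number of sequences, number of kernels)
labels = activations["labels"]    # shape: (number of sequences, number of classes)
kernel = 0

# mean maximum activation of this kernel for every class
class_means = np.array([act[labels[:, c] == 1, kernel].mean()
                        for c in range(labels.shape[1])])
importance = class_means.max() - class_means.min()
```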
parameter | type | description |
---|---|---|
activations | dict | The return value of the get_max_activations function. |
data | pysster.Data | The Data object that was used to compute the maximum activations. |
kernel | int | The kernel that should be visualized (first kernel is 0) |
folder | str | A valid folder path. Plots will be saved here. |
colors_sequence | dict of char->str | A dict with individual alphabet characters as keys and hexadecimal RGB specifiers as values (see the Motif documentation for details). |
colors_structure | dict of char->str | A dict with individual alphabet characters as keys and hexadecimal RGB specifiers as values (see the Motif documentation for details). |
returns | type | description |
---|---|---|
results | (pysster.Motif, float) or ((pysster.Motif, pysster.Motif), float) | A Motif object (or a tuple of Motifs for sequence/structure motifs) and the importance score. |
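A minimal usage sketch for a sequence-only data set (for sequence/structure data a tuple of Motifs is returned instead; the output folder is a placeholder and must exist):

```python
motif, score = model.visualize_kernel(activations, data, kernel=0, folder="plots/")
print("importance score of kernel 0:", score)
```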
def visualize_all_kernels(self, activations, data, folder, colors_sequence={}, colors_structure={})
Get visualizations for all first-layer convolutional kernels.
This function creates the same output files as visualize_kernel() (see there for details), but for every kernel of the first convolutional layer. It also creates a "summary.html" file showing all plots for each kernel side-by-side. Kernels are sorted by the global importance score.
The function returns a list holding a Motif object for each kernel (similar to visualize_kernel()). This list is not sorted by importance score (i.e. kernel 0 comes first).
parameter | type | description |
---|---|---|
activations | dict | The return value of the get_max_activations function. |
data | pysster.Data | The Data object that was used to compute the maximum activations. |
folder | str | A valid folder path. Plots and HTML summary will be saved here. |
colors_sequence | dict of char->str | A dict with individual alphabet characters as keys and hexadecimal RGB specifiers as values (see the Motif documentation for details). |
colors_structure | dict of char->str | A dict with individual alphabet characters as keys and hexadecimal RGB specifiers as values (see the Motif documentation for details). |
returns | type | description |
---|---|---|
results | [pysster.Motif] or [(pysster.Motif, pysster.Motif)] | A list of Motif objects (or a list of tuples of Motifs for sequence/structure cases). |
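A usage sketch, reusing the placeholders from above:

```python
motifs = model.visualize_all_kernels(activations, data, folder="plots/")
# motifs[0] belongs to kernel 0; open plots/summary.html for the importance-sorted overview
```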
def plot_clustering(self, activations, output_file, classes = None)
Perform a hierarchical clustering on both sequences and kernels.
Given the maximum activations for each sequence - kernel pair (the output of the get_max_activations() method) a hierarchical clustering using Ward's method and the Euclidean distance is performed. Values are standardized before clustering. To compute the clustering only for a subset of classes (often it looks quite messy for all classes) you can provide a list of integers through the 'classes' argument (e.g. [0, 3] to only plot sequences belonging to class 0 and 3). By default all sequences of all classes are used. Clustering is only possible for single-label classifications.
parameter | type | description |
---|---|---|
activations | dict | A dict with keys 'activations' and 'labels' (the return value of get_max_activations()). |
output_file | str | Path of the PNG output file. |
classes | [int] | List of integers indicating which classes should be clustered (default: all). |
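A usage sketch (the class indices and output file name are arbitrary examples):

```python
# cluster only sequences of classes 0 and 3 (single-label data)
model.plot_clustering(activations, "clustering.png", classes=[0, 3])
```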
def visualize_optimized_inputs(self, data, layer_name, output_file, bound=0.1, lr=0.02, steps=600, colors_sequence={}, colors_structure={}, nodes=None)
Visualize what every node in the network has learned.
Given fixed network parameters it is possible to visualize what individual nodes (e.g. kernels in conv layers and neurons in dense layers) have learned during model training by maximizing the output of these nodes with respect to an input sequence (starting with a random PWM of the length of an input sequence). In brief: this function learns a single input sequence (in the form of a PWM) that maximizes the output of a specific network node via an L2-norm-penalized gradient ascent optimization.
Warning: this kind of visualization has been applied to image classification networks before, and while the resulting images are usually somewhat recognizable they are still very hard to interpret (e.g. https://distill.pub/2017/feature-visualization/). For a PWM to be useful it has to be very precise, which is unfortunately not the case for many data sets, and results are very messy, especially for RNA secondary structure motifs. Therefore this function should not be used for any biological interpretation. Please use the visualize_kernel() method for more reliable visualizations. Nevertheless, visualizing all layers of a network can be interesting if you want to understand how the neural network works per se.
If needed, the bound, lr and steps parameters can be used to tune the information content of the PWM and the convergence of the optimization (higher values == higher information content).
Each row in the output file corresponds to a node of the layer.
parameter | type | description |
---|---|---|
data | pysster.Data | The Data object used to train the model. |
layer_name | str | Name of the network layer that should be optimized (see print_summary()) |
output_file | str | Path of the PNG output file. |
bound | float | A float > 0. The PWM will be initialized by drawing from a uniform distribution with lower bound -bound and upper bound +bound. |
lr | float | A float > 0. Learning rate of the gradient ascent optimization. |
steps | int | An int > 0. Number of optimization iterations. |
colors_sequence | dict of char->str | A dict with individual alphabet characters as keys and hexadecimal RGB specifiers as values (see the Motif documentation for details). |
colors_structure | dict of char->str | A dict with individual alphabet characters as keys and hexadecimal RGB specifiers as values (see the Motif documentation for details). |
nodes | [int] | List of integers indicating which nodes of the layer should be optimized (default: all). |
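A usage sketch; the layer name and output file are placeholders:

```python
# "conv1d_1" is a placeholder layer name; look up real names via print_summary()
model.visualize_optimized_inputs(data, "conv1d_1", "optimized_inputs.png", nodes=[0, 1, 2])
```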