Skip to content

Latest commit

 

History

History
132 lines (77 loc) · 16.4 KB

File metadata and controls

132 lines (77 loc) · 16.4 KB

Orthogonal Matching Pursuit Method

The orthogonal matching pursuit (OMP) algorithm used in this folder differs from the conventional algorithm in how the background is fitted and how the next atom (gene in this case) is selected. This document expalins how.

The function that carries out the OMP is o.get_omp_coefs. It takes SpotColors normalised by channel and fits the background eigenvectors followed by successive genes, as long as they exceed a certain threshold.

Fitting Background

The background is fitted using the function o.get_spot_residual_background. This assumes that there are o.nBP background eigenvectors and each one is just a strip in a color channel i.e. o.ompBackgroundChannelStrips=true. An example for channel 1 is shown below:

The background fitting procedure will be illustrated by fitting the above background vector to the following (channel normalised) spot color:

The basic procedure, shown below, is to find a weight factor that normalises the contribution from each round so that no one round dominates. Multiply both background code and spot color by this weight. Then the final coefficient for the background code is the dot product of the weighted spot color with the weighted background code.

The weight factor for fitting the channel B background vector in round r is:

W_{B,r} = \frac{1}{(|s_{B,r}|+\lambda)^\sigma}

s_{B,r}} is the spot color in channel B, round r. \lambda is o.ompWeightShift which is just a small number to stop W_{B,r} blowing up for small s_{B,r}}. \sigma is o.ompWeightPower, the lower \sigma, the greater the contributions of the rounds where s_{B,r}} is large.

For \lambda=0.01 and \sigma=0.9, the weight factor is shown in the top row. The third row just shows the second row multiplied by the weight factor. If \lambda=0 and \sigma=1, then the weighted spot color shown would have an absolute value of 1 for all rounds in channel 1.

The final coefficient of the background vector for channel B is given by: C_B = \frac{\sum_{r=1}^7W_{B,r}^2s_{B,r}g_{B,r}}{\sum_{r=1}^7W_{B,r}^2g_{B,r}^2}

g_{B,r} is the value of the background vector for channel B in round r. In the procedure shown, this is equivalent to multiplying the two weighted codes in the third row together, then dividing by the square of the weighted background. The result of this is shown as the Weighted Dot Product in the bottom row. C_B is then found by summing this. It is clear from the formula that if s_{B,r} = \mu g_{B,r} then C_B = \mu for any value of \mu.

The spot color found after fitting the background is shown in the middle plot below. The values of C_b for all channels, b, are kept constant after this initial fitting which differs from the usual OMP method where they are re-fit after each subsequent gene is added. The reason for this is that sometimes an unusual background can be fit to justify further genes.

Fitting Genes

After fitting the background, we need to decide which gene to fit next. This is done by the function o.get_weight_gene_dot_product2. It finds a modified dot product DP_g for each gene, g, and then the next gene to be fit is the gene with the largest value.

For the spot color shown above, the next gene to be fit is Aldoc, with a bled code shown below.

The below plot illustrates the procedure for finding DP_g for Aldoc to be 6.098, with a particular focus on the contribution from round 7. The basic procedure is to find a dot product between the spot color and the gene code for each round and then summing these individual dot products with a weighting indicating the strength of the gene in each round.

First, for each gene, we need to find a weight for each round. This is based on the gene efficiency as shown in the top left plot above. The formula is:

w_{gr} = \sqrt{\frac{1}{1+e^{-\alpha(\varepsilon_{g,r}-\beta)}}}

\varepsilon_{g,r} is the gene efficiency for gene g in round r. \alpha is ompNormBledCodeScale, the larger this, the sharper the step function. \beta is ompNormBledCodeShift, this controls where the step function is centered. For the plot shown, \alpha=7 and \beta=0.5.

We next square and normalise this as shown in the top right plot. The formula is:

W_{gr}^2 = \frac{nR \times w_{gr}^2}{\sum_{r=1}^{nR}w_{gr}^2}

Here, nR = o.nRounds, and it is normalised such that \sum_{r=1}^{nR}W_{gr}^2=nR.

Now we need to find the dot product for each round. For a particular round R, this is:

DP_{g,R} = \frac{\sum_{b=1}^7s''_{b,R}g_{b,R}}{\Bigg(\sqrt{\sum_{b=1}^7{s''_{b,R}^2}}+\lambda\Bigg)^{\sigma}\sqrt{\sum_{b=1}^7{g_{b,R}^2}}}

Here, s'' refers to the current spot residual, so for this example it is the spot color - background but if we are on the next iteration it would be spot color - background - first gene. We include both \lambda and \sigma again to stop a blow up and to give slightly greater contribution to the more intense rounds. If \lambda=0 and \sigma=1, then we are just finding the cosine of the angle between the vectors.

In the procedure shown, the second row shows s''_R and g_R with R=7. The third row shows the spot weight which is:

\frac{1}{\Bigg(\sqrt{\sum_{b=1}^7{s''_{b,R}^2}}+\lambda\Bigg)^{\sigma}}

and the gene weight which is:

\frac{1}{\sqrt{\sum_{b=1}^7{g_{b,R}^2}}}.

The next row shows the two preceding rows multiplied by each other i.e.

\frac{s''_R}{\Bigg(\sqrt{\sum_{b=1}^7{s''_{b,R}^2}}+\lambda\Bigg)^{\sigma}} and \frac{g_R}{\sqrt{\sum_{b=1}^7{g_{b,R}^2}}}.

The bottom left plot then shows these two multiplied by each other. Summing over all values in this image then gives DP_{g,R} = 0.894.

The final dot product is then given by:

DP_g = \sum_{r=1}^7W_{gr}^2DP_{gr}

In the bottom right plot shown, NormRoundWeight is W_{gr}^2, RoundDotProduct is DP_{gr} and RoundContribution is W_{gr}^2DP_{gr}.

Finding coefficients for selected genes

Now we know the best gene to add, we need to find a coefficiet for it. This is done in the standard OMP least squares way. If we are considering the first gene to be added then:

C_g = \frac{\sum_{b=1}^7\sum_{r=1}^7s'_{br}g_{br}}{\sum_{b=1}^7\sum_{r=1}^7g_{br}^2}

Where s' refers to the spot residual post background removal i.e. spot color - background. It is always this for every iteration as we refit the genes we have previously found when subsequent genes are added.

If we are on a later iteration with more than one gene, then the MATLAB backslash operation is used to solve:

C_G = \bigg(G^TG\bigg)^{-1}G^Ts'

where G is a 49 x nGenesAdded matrix where each column is the bled code of a gene that is added and C_G = [C_{g1}, C_{g2} ...].

Genes are continually added in this way until there are o.ompMaxGenes added to explain a particular spot color or:

\Delta Residual = Residual_{i-1} - Residual_{i} < ResidualThresh

Where:

Residual = \sqrt{\sum_{b=1}^7\sum_{r=1}^7s''_{br}^2}

ResidualThresh is based on the second largest value in the initial spot color. The process is more easily explained through the example spot we have been using, this has ResidualThresh = 0.036:

Here, i=1 refers to just Aldoc so:
Residual_i = 0.46, Residual_{i-1} = 1.12 and \Delta Residual = 0.66. \Delta Residual > ResidualThresh so we proceed and accept Aldoc.

i=2:
Residual_i = 0.24, Residual_{i-1} = 0.46 and \Delta Residual = 0.22. \Delta Residual > ResidualThresh so we proceed and accept Trp53i11.

i=3:
Residual_i = 0.21, Residual_{i-1} = 0.24 and \Delta Residual = 0.03. \Delta Residual < ResidualThresh so we reject Hapln1 and end.