Skip to content

Latest commit

 

History

History
86 lines (54 loc) · 9.82 KB

File metadata and controls

86 lines (54 loc) · 9.82 KB

Post Orthogonal Matching Pursuit Thresholding

The orthogonal matching pursuit (OMP) algorithm takes SpotColors normalised by channel and fits the background eigenvectors followed by successive genes, as long as they exceed a certain threshold. However, not all the spots found by this method should be used as genes for cell calling. We must perform further thresholding on these spots.

This thresholding is performed using three variables: ompSpotIntensity2, ompNeighbNonZeros and ompSpotScore. It is required that all three of these are above a low threshold and at least one is above a higher threshold. Also, for spots where the gene under consideration is not the gene with the largest coefficient, stronger thresholds must be overcome.

ompSpotIntensity2

ompSpotIntensity2 is found using the function get_spot_intensity. It is median (4th largest value) of the SpotColor normalised by channel in the 7 rounds/channels indicated by the unbled code of the gene the spot was assigned to (ompSpotCodeNo). For example, the below (normalised) spot was assigned to Aldoc, which has an unbled code indicated by the green squares. The median of these 7 squares is round 6, channel 5 which has an intensity of 0.1388.

ompNeighbNonZeros

The OMP algorithm takes each pixel in the data and fits a coefficient for each gene (most of which will be zero). From these coefficients, we can build a coefficient image for each gene as shown below (red means positive and blue means negative coefficient, while white means 0):

From this image, the final spots are the local maxima indicated by the green circles. Clearly, some of these (Aldoc and Id2) are much more obvious than others (Plp1, Enpp2). We quantify this obviousness by counting the number of pixels surrounding the local maxima that have a non zero coefficient for that gene. Specifically, we fit the filter shown below at each local maxima and then count the positive coefficients in the red region, ompNeighbNonZeros(:,1) and positive or negative coefficients in the blue region, ompNeighbNonZeros(:,2).

These are then combined in a way to put more emphasis on the nearer positive coefficients:

NeighbNonZeros = o.ompNeighbNonZeros(:,1)*o.ompNeighbNearPosNeighbMultiplier+o.ompNeighbNonZeros(:,2);

ompSpotScore

ompSpotScore is a bit more complicated but the basic idea is a score base on the difference between the fit with and without the gene under consideration. It is obtained using the function get_omp_score.

Influence of Gene

The first step is to get the OverallScore which is the difference between the absolute L1 error (difference between SpotColor and predicted SpotColor) when the prediction does not use the gene under consideration (we are using the same spot shown in SpotIntensity so the gene is Aldoc) and when it does. OverallScore = AbsoluteErrorNoAldoc - AbsoluteErrorWithAldoc.

I.e. for the plot on the left, the predicted code has the Aldoc coefficient, as shown by the white circle below, set to 0 while all other coefficients remain those indicated below. For the second plot above, all coefficients are exactly those indicated below.

The large positive OverallScore in round 2, channel 1 indicates that Aldoc is required to explain the intensity there. On the other hand, the negative OverallScore in round 4, channel 2 indicates that the intensity there is better explained without Aldoc.

Use specific rounds/channels for each gene

We next, modify the OverallScore to only include rounds/channels in the unbled code of Aldoc. We also only include rounds where the gene efficiency of Aldoc exceeds ompScore_GeneEfficiencyThresh. The mask is shown on the left (gene efficiency for Aldoc fails threshold in rounds 1 and 5) and the resultant ModScore is shown in the middle:

The third plot showing NormModScore is ModScore/AbsoluteErrorNoAldoc i.e. the second plot here divided by the first plot of the OverallScore image. We do this normalisation because we want a score independent of intensity (we do intensity thresholding separately using ompSpotIntensity2).

Weight larger errors more and exclude overlapping rounds/channels

The next part finds a multiplier so we get a boost for explaining rounds/channels with the largest error before we include Aldoc. This multiplier, AbsErrorFactor is given by:

AbsErrorFactor = AbsoluteErrorNoAldoc./prctile(AbsoluteErrorNoAldoc',o.ompScore_LargeErrorPrcntileThresh)'

This is shown in the first plot below, and we then clamp it between 1 and ompScore_LargeErrorMax as indicated in the second plot.

We then modify this to neglect rounds that already have a positive gene which passes a low threshold. For example, from the two coefficient images, it is clear that this Aldoc spot overalps with an Id2 spot. From the image below, showing the SpotColor with Id2 unbled code highlighted), we can see that the round 4, color channel 3 is in the unbled code of both Aldoc and Id2. Thus we set round 4, channel 3 to 0 in the Final AbsErrorFactor shown in the third plot above. The idea behind this is that each gene must have a unique contribution - it must explain something independent of what other genes are present.

Combine Factors

Finally, we combine all the factors. We first turn NormModScore into an exponential through: ExpNormModScore = (exp(NormModScore*log(2))-1), as shown in the first plot below. We do this to put a lower bound on how bad a single round can be (the log(2) factor is so maximum and minimum possible ExpNormModScore have the same absolute value).

We then multiply this by the Final AbsErrorFactor to give the second plot. Lastly, we multiply this by a normalisation factor, nRoundsUsedNorm so as not to penalise spots where less rounds are used i.e. here only 4 of the 7 rounds are used so nRoundsUsedNorm = 7/4. We also clamp the final score between -o.ompScore_LargeErrorMax and o.ompScore_LargeErrorMax. This gives the final plot, and the sum of the final plot is 7.71 which is the final score.