
Normalized Hamming Distance #256

Closed
surajg4 opened this issue Mar 25, 2021 · 4 comments · Fixed by #512


surajg4 commented Mar 25, 2021

It would be better to have a normalization option in the distance metric for BCR support. It's also still not clear how to use the abstract class DistanceCalculator (a tutorial could help).

grst (Collaborator) commented Mar 26, 2021

Pinging @ktpolanski, who added the Hamming distance feature: what do you think about the normalization?


Regarding the DistanceCalculator: What's your question in particular? I think it should be feasible to implement a custom distance calculator that inherits from the abstract base class by looking at the examples and docstrings in metrics.py.
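
For illustration, a minimal sketch of such a subclass is below. The import path, the calc_dist_mat signature, the integer cutoff semantics, and the D+1 sparse-matrix encoding are assumptions modeled on the patterns in metrics.py rather than a verified interface, so check the docstrings there:

```python
# Sketch of a custom calculator; the base-class location, method name and
# the D+1 sparse encoding are assumptions to verify against metrics.py.
import itertools
from scipy.sparse import csr_matrix
from scirpy.ir_dist.metrics import DistanceCalculator

class NormalizedHammingDistanceCalculator(DistanceCalculator):
    """Hamming distance divided by sequence length, scaled to percent."""

    def __init__(self, cutoff: int = 10):
        super().__init__(cutoff)

    def calc_dist_mat(self, seqs, seqs2=None):
        seqs2 = seqs if seqs2 is None else seqs2
        rows, cols, data = [], [], []
        for (i, s1), (j, s2) in itertools.product(enumerate(seqs), enumerate(seqs2)):
            if len(s1) != len(s2):
                continue  # substitution-only: unequal lengths stay unconnected
            dist = round(100 * sum(a != b for a, b in zip(s1, s2)) / len(s1))
            if dist <= self.cutoff:
                rows.append(i)
                cols.append(j)
                data.append(dist + 1)  # assumed D+1 encoding: 0 = "no edge"
        return csr_matrix((data, (rows, cols)), shape=(len(seqs), len(seqs2)))
```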

ktpolanski (Contributor) commented

Does normalisation imply dividing by the length of the sequence? If so, then that puts us pretty close to what Dandelion does, no?

zktuong (Contributor) commented Mar 30, 2021

> Does normalisation imply dividing by the length of the sequence? If so, then that puts us pretty close to what Dandelion does, no?

This would put it more in line with what immcantation's ShaZam does. Both scirpy and dandelion group the sequences by length first before performing the calculations, so they don't require length normalization.

The normalization in ShaZam is done as follows:

```r
if (normalize == "len") {
    dist_mat <- dist_mat / seq_length
}
```
For dist_mat, they use a sliding-window approach to bin each sequence into 5-mers (replacing gaps with Ns, and padding the first and last 5-mers with Ns so all windows are the same size). The Hamming distance is then calculated for each pair of 5-mers in order, and the total is summed and divided by the sequence length according to the function above.
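
A rough Python sketch of that scheme as described (this is my reading of the above, not ShaZam's reference implementation; the exact gap and padding handling may differ):

```python
# Windowed Hamming distance per the description above: bin into 5-mers,
# sum the per-window mismatches, divide by sequence length.
def normalized_hamming(seq1: str, seq2: str, k: int = 5) -> float:
    assert len(seq1) == len(seq2), "substitution-only comparison"

    def windows(seq: str):
        s = seq.replace("-", "N").replace(".", "N")  # gaps -> N
        pad = "N" * (k // 2)
        s = pad + s + pad  # pad so edge positions still get full 5-mers
        return [s[i:i + k] for i in range(len(s) - k + 1)]

    total = sum(
        sum(a != b for a, b in zip(w1, w2))
        for w1, w2 in zip(windows(seq1), windows(seq2))
    )
    return total / len(seq1)
```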

This should allow BCRs of different lengths to be grouped as a clonotype if they pass the similarity cut-off, while keeping it a substitution-only context. immcantation recommends a model-based approach for the cut-off, looking for a bimodal pattern in the distribution of normalized Hamming distances, but also leaves it up to the user to define a manual cut-off based on visual inspection of the histogram.
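
As a sketch of that model-based idea (not immcantation's actual findThreshold implementation), one could fit a two-component Gaussian mixture to each sequence's distance to its nearest neighbour and place the cut-off where the posterior flips between the two modes:

```python
# Toy bimodal cut-off finder; a stand-in for the model-based approach,
# not immcantation's findThreshold.
import numpy as np
from sklearn.mixture import GaussianMixture

def bimodal_threshold(dist_to_nearest: np.ndarray) -> float:
    x = dist_to_nearest.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    m_lo, m_hi = sorted(gmm.means_.ravel())
    lo_comp = int(np.argmin(gmm.means_.ravel()))
    # Scan between the two means for the 50/50 posterior crossing point.
    grid = np.linspace(m_lo, m_hi, 1000).reshape(-1, 1)
    post = gmm.predict_proba(grid)[:, lo_comp]
    return float(grid[np.argmin(np.abs(post - 0.5)), 0])
```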

It's frequently used, and I can understand its appeal, as it allows for more relaxed/unbiased grouping and discovery of potentially related BCR patterns that use the same V and J genes. It does "violate" the same-length requirement for BCR SHM that textbooks teach us, but you could potentially argue the length differences are due to technical issues like sequencing error.

An easy way to do all this is to export to AIRR format and access the immcantation tools directly, or to parse to dandelion, where wrappers for these immcantation tools are available, and then read the result back into scirpy. Or you could try to implement it in some fashion here, but that would require a new class to handle it.
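
Roughly, such a round trip might look like the sketch below; every function name in it (write_airr, Dandelion, pp.calculate_threshold, tl.find_clones, read_airr) is an assumption about the scirpy and dandelion APIs rather than a verified call, so treat it as pseudocode and check the current docs:

```python
# Hypothetical AIRR round trip between scirpy and dandelion; all function
# names are assumptions, not verified API calls.
import pandas as pd
import scirpy as ir
import dandelion as ddl

ir.io.write_airr(adata, "airr_rearrangements.tsv")  # adata: AnnData with IR data
vdj = ddl.Dandelion(pd.read_csv("airr_rearrangements.tsv", sep="\t"))
ddl.pp.calculate_threshold(vdj)  # wrapper around shazam's threshold search
ddl.tl.find_clones(vdj)          # immcantation-style clonotype calling
vdj.data.to_csv("airr_with_clones.tsv", sep="\t", index=False)
adata = ir.io.read_airr("airr_with_clones.tsv")  # back into scirpy
```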

ktpolanski (Contributor) commented

I think we're all talking about slightly different things. If we take the Hamming distance and divide it by sequence length, we obtain a "percent of mismatches" measure, which is what dandelion does. That is, assuming this is what OP meant by "normalised Hamming".
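
As a trivial illustration of that measure (not dandelion's actual code):

```python
# "Percent of mismatches": Hamming distance divided by sequence length.
def percent_mismatches(s1: str, s2: str) -> float:
    assert len(s1) == len(s2)
    return 100 * sum(a != b for a, b in zip(s1, s2)) / len(s1)

print(percent_mismatches("TGTGCAAGAGGG", "TGTGCAAGCGGG"))  # 8.33 (1 of 12)
```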
