
Normalized Hamming Distance #256

Closed
surajg4 opened this issue Mar 25, 2021 · 4 comments · Fixed by #512


surajg4 commented Mar 25, 2021

It would be better to have a normalization option in the distance metric for BCR support. It's also still not clear how to use the abstract class DistanceCalculator (a tutorial could help).

grst (Collaborator) commented Mar 26, 2021

Pinging @ktpolanski, who added the Hamming distance feature: what do you think about the normalization?


Regarding the DistanceCalculator: What's your question in particular? I think it should be feasible to implement a custom distance calculator that inherits from the abstract base class by looking at the examples and docstrings in metrics.py.
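
For illustration, a minimal sketch of such a subclass is below. The import path, the calc_dist_mat signature, the integer cutoff semantics, and the D+1 sparse-matrix encoding are assumptions modeled on the patterns in metrics.py rather than a verified interface, so check the docstrings there:

```python
# Sketch of a custom calculator; the base-class location, method name and
# the D+1 sparse encoding are assumptions to verify against metrics.py.
import itertools
from scipy.sparse import csr_matrix
from scirpy.ir_dist.metrics import DistanceCalculator

class NormalizedHammingDistanceCalculator(DistanceCalculator):
    """Hamming distance divided by sequence length, scaled to percent."""

    def __init__(self, cutoff: int = 10):
        super().__init__(cutoff)

    def calc_dist_mat(self, seqs, seqs2=None):
        seqs2 = seqs if seqs2 is None else seqs2
        rows, cols, data = [], [], []
        for (i, s1), (j, s2) in itertools.product(enumerate(seqs), enumerate(seqs2)):
            if len(s1) != len(s2):
                continue  # substitution-only: unequal lengths stay unconnected
            dist = round(100 * sum(a != b for a, b in zip(s1, s2)) / len(s1))
            if dist <= self.cutoff:
                rows.append(i)
                cols.append(j)
                data.append(dist + 1)  # assumed D+1 encoding: 0 = "no edge"
        return csr_matrix((data, (rows, cols)), shape=(len(seqs), len(seqs2)))
```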

ktpolanski (Contributor) commented

Does normalisation imply dividing by the length of the sequence? If so, then that puts us pretty close to what Dandelion does, no?

zktuong (Contributor) commented Mar 30, 2021

> Does normalisation imply dividing by the length of the sequence? If so, then that puts us pretty close to what Dandelion does, no?

This would put it more in line with what immcantation's ShaZam does. Both scirpy and dandelion group the sequences by length first before performing the calculations, so they don't require length normalization.

The normalization in ShaZam is done as follows:

```r
if (normalize == "len") {
    dist_mat <- dist_mat / seq_length
}
```
For dist_mat, they use a sliding-window approach to bin each sequence into 5-mers (replacing gaps with Ns, and padding the first and last 5-mers with Ns so all windows are the same size). The Hamming distance is then calculated for each pair of 5-mers in order, and the total is summed and divided by the sequence length according to the function above.
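
A rough Python sketch of that scheme as described (this is my reading of the above, not ShaZam's reference implementation; the exact gap and padding handling may differ):

```python
# Windowed Hamming distance per the description above: bin into 5-mers,
# sum the per-window mismatches, divide by sequence length.
def normalized_hamming(seq1: str, seq2: str, k: int = 5) -> float:
    assert len(seq1) == len(seq2), "substitution-only comparison"

    def windows(seq: str):
        s = seq.replace("-", "N").replace(".", "N")  # gaps -> N
        pad = "N" * (k // 2)
        s = pad + s + pad  # pad so edge positions still get full 5-mers
        return [s[i:i + k] for i in range(len(s) - k + 1)]

    total = sum(
        sum(a != b for a, b in zip(w1, w2))
        for w1, w2 in zip(windows(seq1), windows(seq2))
    )
    return total / len(seq1)
```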

This should allow BCRs of different lengths to be grouped as a clonotype if they pass the similarity cut-off, while keeping it a substitution-only context. immcantation recommends a model-based approach for the cut-off, looking for a bimodal pattern in the distribution of normalized Hamming distances, but also leaves it up to the user to define a manual cut-off based on visual inspection of the histogram.
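
As a sketch of that model-based idea (not immcantation's actual findThreshold implementation), one could fit a two-component Gaussian mixture to each sequence's distance to its nearest neighbour and place the cut-off where the posterior flips between the two modes:

```python
# Toy bimodal cut-off finder; a stand-in for the model-based approach,
# not immcantation's findThreshold.
import numpy as np
from sklearn.mixture import GaussianMixture

def bimodal_threshold(dist_to_nearest: np.ndarray) -> float:
    x = dist_to_nearest.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    m_lo, m_hi = sorted(gmm.means_.ravel())
    lo_comp = int(np.argmin(gmm.means_.ravel()))
    # Scan between the two means for the 50/50 posterior crossing point.
    grid = np.linspace(m_lo, m_hi, 1000).reshape(-1, 1)
    post = gmm.predict_proba(grid)[:, lo_comp]
    return float(grid[np.argmin(np.abs(post - 0.5)), 0])
```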

It's frequently used, and I can understand its appeal, as it allows for more relaxed/unbiased grouping and discovery of potentially related BCR patterns that use the same V and J genes. It does "violate" the same-length requirement for BCR SHM that textbooks teach us, but you could potentially argue the length differences are due to technical issues like sequencing error.

An easy way to do all this is to export to AIRR format and access the immcantation tools directly, or to parse to dandelion, where wrappers for these immcantation tools are available, and then read the result back into scirpy. Or you could try to implement it in some fashion here, but that would require a new class to handle it.
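
Roughly, such a round trip might look like the sketch below; every function name in it (write_airr, Dandelion, pp.calculate_threshold, tl.find_clones, read_airr) is an assumption about the scirpy and dandelion APIs rather than a verified call, so treat it as pseudocode and check the current docs:

```python
# Hypothetical AIRR round trip between scirpy and dandelion; all function
# names are assumptions, not verified API calls.
import pandas as pd
import scirpy as ir
import dandelion as ddl

ir.io.write_airr(adata, "airr_rearrangements.tsv")  # adata: AnnData with IR data
vdj = ddl.Dandelion(pd.read_csv("airr_rearrangements.tsv", sep="\t"))
ddl.pp.calculate_threshold(vdj)  # wrapper around shazam's threshold search
ddl.tl.find_clones(vdj)          # immcantation-style clonotype calling
vdj.data.to_csv("airr_with_clones.tsv", sep="\t", index=False)
adata = ir.io.read_airr("airr_with_clones.tsv")  # back into scirpy
```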

ktpolanski (Contributor) commented

I think we're all talking about slightly different things. If we take the Hamming distance and divide it by sequence length, we obtain a "percent of mismatches" measure, which is what dandelion does. That is, assuming this is what OP meant by "normalised Hamming".
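
As a trivial illustration of that measure (not dandelion's actual code):

```python
# "Percent of mismatches": Hamming distance divided by sequence length.
def percent_mismatches(s1: str, s2: str) -> float:
    assert len(s1) == len(s2)
    return 100 * sum(a != b for a, b in zip(s1, s2)) / len(s1)

print(percent_mismatches("TGTGCAAGAGGG", "TGTGCAAGCGGG"))  # 8.33 (1 of 12)
```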
