
feature: Allow computing percent-based scores from diffs #10

Closed
Tracked by #35
neoncitylights opened this issue Dec 29, 2022 · 4 comments · Fixed by #40
Assignees: notalfredo
Labels: lvl-2-medium (Medium-ranking issue) · p1-low (Priority 1: generally no one plans to work on the task, but it would be nice if someone decides to) · t-feature-request (Type: idea/request of an enhancement towards a library/framework)

Comments

@neoncitylights (Contributor)

Note: This task is blocked by #9.

It's possible to retrieve the distance (as an integer), representing the number of operations needed to get from string A to string B. It's also possible to retrieve a list of the individual diff operations with diff().

To really compare similarity, though, it'd be nice to get a percentage (from 0.0 to 1.0, representing 0% to 100%), returned as a floating-point number (f32 would fit this).

For example, the similarity score between cattle and battle would be 5/6, or 83.333% (repeating). Likewise, the difference score between them would be 1/6, or 16.666% (repeating).
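For concreteness, here is a minimal sketch of that arithmetic. The function names are hypothetical, and the edit distance is passed in as a parameter rather than computed here, since the distance API itself is out of scope for this issue:

// Sketch only: `similarity`/`difference` are illustrative names, not the
// crate's API. `distance` is the edit distance already computed elsewhere.
fn similarity(a: &str, b: &str, distance: usize) -> f32 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0; // two empty strings are identical
    }
    1.0 - (distance as f32) / (max_len as f32)
}

fn difference(a: &str, b: &str, distance: usize) -> f32 {
    1.0 - similarity(a, b, distance)
}

// similarity("cattle", "battle", 1) ≈ 0.8333 (5/6)
// difference("cattle", "battle", 1) ≈ 0.1667 (1/6)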

neoncitylights changed the title from "feature: allow computing percent-based scores from diffs" to "feature: Allow computing percent-based scores from diffs" on Dec 29, 2022
neoncitylights added the lvl-2-medium, p1-low, and t-feature-request labels on Dec 29, 2022
@notalfredo (Member)

Would this apply to any two words? Or do these two words have to be the same length?

@neoncitylights (Contributor, Author)

Any two words. With words that aren't the same length, the total would be the max length of the two words. So for example:

  • kitten (6 letters)
  • Kittens (7 letters)

The similarity score would be 5.5/7 (see the sketch after this list), because:

  • the max length to measure is 7 letters
  • Deduct 0.5 because k was uppercased to K
  • Deduct 1.0 because s was inserted at the end
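To illustrate that weighting, here is a rough sketch. EditOp and the cost values are purely illustrative; the crate's actual diff-operation type isn't shown in this thread:

// Illustrative only: `EditOp` is a hypothetical enum standing in for the
// library's real diff-operation type.
enum EditOp {
    Insert(char),
    Delete(char),
    Substitute { from: char, to: char },
}

// Weighting sketch for the kitten/Kittens example above: a case-only
// substitution is charged 0.5, everything else 1.0.
fn weighted_cost(op: &EditOp) -> f32 {
    match op {
        EditOp::Substitute { from, to }
            if from.to_lowercase().eq(to.to_lowercase()) => 0.5,
        _ => 1.0,
    }
}

fn weighted_similarity(ops: &[EditOp], max_len: usize) -> f32 {
    let cost: f32 = ops.iter().map(weighted_cost).sum();
    1.0 - cost / (max_len as f32)
}

// "kitten" -> "Kittens": one case-only substitution (0.5) plus one
// insertion (1.0), over a max length of 7, gives (7 - 1.5) / 7 = 5.5 / 7.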

notalfredo self-assigned this on Jan 1, 2023
@notalfredo (Member)

I have test implementations at the moment: similarity_score and difference_score functions. They work by first calling get_operation_matrix, which just gives us a matrix of operation choices. (I've used that same matrix to get the operations vector in other functions.) Since we are computing percent-based scores from the operations, this matrix works well. I simply use it to find out which operations happened, and from that I can compute the similarity/difference score.
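For readers following along, the matrix being described is essentially the textbook Levenshtein cost matrix; the sketch below is a generic illustration of that idea, not the crate's actual get_operation_matrix (whose exact shape isn't shown in this thread):

// Textbook Levenshtein cost matrix: dp[i][j] is the edit distance between
// the first i chars of `a` and the first j chars of `b`. Backtracking from
// dp[a.len()][b.len()] recovers which operation was chosen at each step.
fn cost_matrix(a: &str, b: &str) -> Vec<Vec<usize>> {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut dp = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in 0..=a.len() { dp[i][0] = i; }
    for j in 0..=b.len() { dp[0][j] = j; }
    for i in 1..=a.len() {
        for j in 1..=b.len() {
            let sub_cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            dp[i][j] = (dp[i - 1][j] + 1)          // deletion
                .min(dp[i][j - 1] + 1)             // insertion
                .min(dp[i - 1][j - 1] + sub_cost); // substitution / match
        }
    }
    dp
}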

@neoncitylights (Contributor, Author) commented Jan 5, 2023

> I have test implementations at the moment: similarity_score and difference_score functions. They work by first calling get_operation_matrix, which just gives us a matrix of operation choices. (I've used that same matrix to get the operations vector in other functions.) Since we are computing percent-based scores from the operations, this matrix works well. I simply use it to find out which operations happened, and from that I can compute the similarity/difference score.

The idea sounds nice, but I don't think this would scale well. It would only make it possible to compute a similarity/difference score from the Levenshtein distance algorithm, whereas this kind of value can be computed generally, like the apply_diff() function. Ultimately, we just need to pass in a Vec<StringDiffOp> and compute the cost per operation :)


An implementation would actually be as simple as:

fn compute_similarity(diff_ops: &[StringDiffOp], total_len: usize) -> f32 {
    1.0 - compute_difference(diff_ops, total_len)
}

fn compute_difference(diff_ops: &[StringDiffOp], total_len: usize) -> f32 {
    // number of edit operations relative to the longer string's length
    (diff_ops.len() as f32) / (total_len as f32)
}
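
For instance, assuming diff() returns the Vec<StringDiffOp> mentioned above (its exact signature isn't shown in this thread), usage would look roughly like:

// Hypothetical usage; diff()'s real signature may differ.
let ops = diff("cattle", "battle");           // one substitution: c -> b
let difference = compute_difference(&ops, 6); // ~0.1667 (1/6)
let similarity = compute_similarity(&ops, 6); // ~0.8333 (5/6)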

The score is just the number of differences divided by the total number of items being compared.

Ideally a function should do as little as possible, and these two functions achieve exactly that. However, I was also thinking that we could structure the code so that get_operations_matrix doesn't have to be recomputed on every call, in a way that's more idiomatic and structured (some method implementations abbreviated for clarity). So, I actually opened #36! Would love to talk about it when you have some time.
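Purely as an illustration of that shape (none of these names come from #36), caching the diff once behind a small struct might look like:

// Hypothetical sketch of a struct-based API; the names are illustrative.
// `diff` and `StringDiffOp` are the existing items mentioned earlier.
struct StringDiffScore {
    ops: Vec<StringDiffOp>,
    max_len: usize,
}

impl StringDiffScore {
    fn new(a: &str, b: &str) -> Self {
        Self {
            ops: diff(a, b), // computed once, reused by both scores
            // .max(1) avoids dividing by zero for two empty strings
            max_len: a.chars().count().max(b.chars().count()).max(1),
        }
    }

    fn difference(&self) -> f32 {
        self.ops.len() as f32 / self.max_len as f32
    }

    fn similarity(&self) -> f32 {
        1.0 - self.difference()
    }
}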
