
feature: Allow computing percent-based scores from diffs #10

Closed
Tracked by #35
neoncitylights opened this issue Dec 29, 2022 · 4 comments · Fixed by #40
Assignees: notalfredo
Labels: lvl-2-medium (Medium-ranking issue) · p1-low (Priority 1: generally no one plans to work on the task, but it would be nice if someone decides to) · t-feature-request (Type: idea/request of an enhancement towards a library/framework)

Comments

@neoncitylights (Contributor)

Note: This task is blocked by #9.

It's possible to retrieve the distance (as an integer), representing the number of operations needed to get from string A to string B. It's also possible to retrieve a list of the individual diff operations with diff().

To really compare similarity, though, it'd be nice to get a percentage (from 0.0 to 1.0, representing 0% to 100%), returned as a floating-point number (f32 would fit this).

For example, the similarity score between cattle and battle would be 5/6, or 83.333% (repeating). Likewise, the difference score between them would be 1/6, or 16.666% (repeating).
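For concreteness, here is a minimal sketch of that arithmetic. The function names are hypothetical, and the edit distance is passed in as a parameter rather than computed here, since the distance API itself is out of scope for this issue:

// Sketch only: `similarity`/`difference` are illustrative names, not the
// crate's API. `distance` is the edit distance already computed elsewhere.
fn similarity(a: &str, b: &str, distance: usize) -> f32 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0; // two empty strings are identical
    }
    1.0 - (distance as f32) / (max_len as f32)
}

fn difference(a: &str, b: &str, distance: usize) -> f32 {
    1.0 - similarity(a, b, distance)
}

// similarity("cattle", "battle", 1) ≈ 0.8333 (5/6)
// difference("cattle", "battle", 1) ≈ 0.1667 (1/6)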

neoncitylights changed the title from "feature: allow computing percent-based scores from diffs" to "feature: Allow computing percent-based scores from diffs" on Dec 29, 2022
neoncitylights added the lvl-2-medium, p1-low, and t-feature-request labels on Dec 29, 2022
@notalfredo (Member)

Would this apply to any two words? Or do these two words have to be the same length?

@neoncitylights (Contributor, Author)

Any two words. With words that aren't the same length, the total would be the max length of the two words. So for example:

  • kitten (6 letters)
  • Kittens (7 letters)

The similarity score would be 5.5/7 (see the sketch after this list), because:

  • the max length to measure is 7 letters
  • Deduct 0.5 because k was uppercased to K
  • Deduct 1.0 because s was inserted at the end
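To illustrate that weighting, here is a rough sketch. EditOp and the cost values are purely illustrative; the crate's actual diff-operation type isn't shown in this thread:

// Illustrative only: `EditOp` is a hypothetical enum standing in for the
// library's real diff-operation type.
enum EditOp {
    Insert(char),
    Delete(char),
    Substitute { from: char, to: char },
}

// Weighting sketch for the kitten/Kittens example above: a case-only
// substitution is charged 0.5, everything else 1.0.
fn weighted_cost(op: &EditOp) -> f32 {
    match op {
        EditOp::Substitute { from, to }
            if from.to_lowercase().eq(to.to_lowercase()) => 0.5,
        _ => 1.0,
    }
}

fn weighted_similarity(ops: &[EditOp], max_len: usize) -> f32 {
    let cost: f32 = ops.iter().map(weighted_cost).sum();
    1.0 - cost / (max_len as f32)
}

// "kitten" -> "Kittens": one case-only substitution (0.5) plus one
// insertion (1.0), over a max length of 7, gives (7 - 1.5) / 7 = 5.5 / 7.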

notalfredo self-assigned this on Jan 1, 2023
@notalfredo (Member)

I have test implementations at the moment: similarity_score and difference_score functions. They work by first calling get_operation_matrix, which just gives us a matrix of operation choices. (I've used that same matrix to get the operations vector in other functions.) Since we are computing percent-based scores from the operations, this matrix works well. I simply use it to find out which operations happened, and from that I can compute the similarity/difference score.
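For readers following along, the matrix being described is essentially the textbook Levenshtein cost matrix; the sketch below is a generic illustration of that idea, not the crate's actual get_operation_matrix (whose exact shape isn't shown in this thread):

// Textbook Levenshtein cost matrix: dp[i][j] is the edit distance between
// the first i chars of `a` and the first j chars of `b`. Backtracking from
// dp[a.len()][b.len()] recovers which operation was chosen at each step.
fn cost_matrix(a: &str, b: &str) -> Vec<Vec<usize>> {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut dp = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in 0..=a.len() { dp[i][0] = i; }
    for j in 0..=b.len() { dp[0][j] = j; }
    for i in 1..=a.len() {
        for j in 1..=b.len() {
            let sub_cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            dp[i][j] = (dp[i - 1][j] + 1)          // deletion
                .min(dp[i][j - 1] + 1)             // insertion
                .min(dp[i - 1][j - 1] + sub_cost); // substitution / match
        }
    }
    dp
}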

@neoncitylights (Contributor, Author) commented Jan 5, 2023

> I have test implementations at the moment: similarity_score and difference_score functions. They work by first calling get_operation_matrix, which just gives us a matrix of operation choices. (I've used that same matrix to get the operations vector in other functions.) Since we are computing percent-based scores from the operations, this matrix works well. I simply use it to find out which operations happened, and from that I can compute the similarity/difference score.

The idea sounds nice, but I don't think this would scale well. It would only make it possible to compute a similarity/difference score from the Levenshtein distance algorithm, whereas this kind of value can be computed generally, like the apply_diff() function. Ultimately, we just need to pass in a Vec<StringDiffOp> and compute the cost per operation :)


An implementation would actually be as simple as:

fn compute_similarity(diff_ops: &[StringDiffOp], total_len: usize) -> f32 {
    1.0 - compute_difference(diff_ops, total_len)
}

fn compute_difference(diff_ops: &[StringDiffOp], total_len: usize) -> f32 {
    // number of edit operations relative to the longer string's length
    (diff_ops.len() as f32) / (total_len as f32)
}
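
For instance, assuming diff() returns the Vec<StringDiffOp> mentioned above (its exact signature isn't shown in this thread), usage would look roughly like:

// Hypothetical usage; diff()'s real signature may differ.
let ops = diff("cattle", "battle");           // one substitution: c -> b
let difference = compute_difference(&ops, 6); // ~0.1667 (1/6)
let similarity = compute_similarity(&ops, 6); // ~0.8333 (5/6)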

The score is just the number of differences divided by the total number of items being compared.

Ideally a function should do as little as possible, and these two functions achieve exactly that. However, I was also thinking that we could structure the code so that get_operations_matrix doesn't have to be recomputed on every call, in a way that's more idiomatic and structured (some method implementations abbreviated for clarity). So, I actually opened #36! Would love to talk about it when you have some time.
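Purely as an illustration of that shape (none of these names come from #36), caching the diff once behind a small struct might look like:

// Hypothetical sketch of a struct-based API; the names are illustrative.
// `diff` and `StringDiffOp` are the existing items mentioned earlier.
struct StringDiffScore {
    ops: Vec<StringDiffOp>,
    max_len: usize,
}

impl StringDiffScore {
    fn new(a: &str, b: &str) -> Self {
        Self {
            ops: diff(a, b), // computed once, reused by both scores
            // .max(1) avoids dividing by zero for two empty strings
            max_len: a.chars().count().max(b.chars().count()).max(1),
        }
    }

    fn difference(&self) -> f32 {
        self.ops.len() as f32 / self.max_len as f32
    }

    fn similarity(&self) -> f32 {
        1.0 - self.difference()
    }
}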
