refactor: better architecture for differ library #36

neoncitylights · 2023-01-05T07:48:02Z

With #10 open (for computing scores), and #11 (for computing weighted scores), I'd like to propose something that I think would be more scalable than the current design of the library. It still similarly models the current way of how a developer would go about implementing StringDiffAlgorithm, but ultimately the trait gets remodeled into a new struct, Diff, which allows avoid calling helper methods e.g get_operations_matrix() multiple times.

Some notes:

This should also help make implementing feature: Allow to alignment between protein or nucleotide sequences #16 scalable in a way that code does not have to be duplicated.
Diff and DiffScoreConfig intentionally do not have the String prefix unlike the current types for the case where we allow computing different data types besides just strings (with generic types). What we can do though is rename the current types to remove the String prefix, and then introduce these types.

DiffScoreConfig

The DiffScoreConfig is a type of structure that only needs to get created once, and then passed around as a reference to avoid copying (since the size of a struct will be quite big initially and will ). This type implements the Default trait, to provide sensible default weighted values for different type of operations.

An alternative is to use a HashMap, but this would involve creating a key type for K like another enum (since StringDiffOpKind is specifically for storing the values for each operation), as well as void re-hashing and re-allocations (HashMap is a DST). By having specific properties instead, the DiffScoreConfig struct will have a guaranteed fized-size at compile-time.

pub struct DiffScoreConfig {
   pub sub_cost: f32,
   pub indel_cost: f32,
   pub transpose_cost: f32,
   // future properties here as needed
}

impl Default for DiffScoreConfig {
   fn default() -> Self { /* ... */ }
}

// used like:
let mut score_config = DiffScoreConfig::default();
score_config.sub_cost = 0.75f32;
score_config.indel_cost = 1.0f32;

// or, if the defaults is good enough for the user,
// it doesn't need to be mutable
let score_config = DiffScoreConfig::default();

Diff

The Diff struct would be the return type given by diffing algorithms. It has an ops property which holds a slice (&[T]), which are immutable pointers. It also holds a total_len, to know how to compute the score.

It does not however store an instance of a DiffScoreConfig; that's because a Diff type will be returned by the hamming + levenshtein distance algorithms (and future algorithms). This would mean require having to pass in DiffScoreConfig as an argument for everything, and because DiffScoreConfig is designed to be mutable.

The similarity() and difference() are methods instead of properties since they would take time to compute, The job of a constructor is only to initialize the state/fields of a type.

The jobs of the distance algorithms are purely to compute the list of difference operations, and the user won't always need to know what the similarity and difference scores are.

pub struct Diff {
   pub ops: &[StringDiffOp],
   pub total_len: usize,
}

impl Diff {
    pub fn new(diffs: &[StringDiffOp], total_len: usize) -> Self { Self {} }
    pub fn distance(&self) -> usize {
       self.diffs.len()
    }

    pub fn similarity(&self, score: &DiffScoreConfig) -> f32 { /* ... */ }
    pub fn difference(&self, score: &DiffScoreConfig) -> f32 { /* ... */ }
}

Algorithms

This leaves the StringDiffAlgorithm, and the hamming distance and levenshtein. Further thinking, I think it would actually make more sense to remove the StringDiffAlgorithm trait, and turn both hamming + levenshtein algorithms from structs into pure functions. Helper methods can still be only available internally (pub(crate)). This would leave the signatures of the public API to look like the following. Of course, both hamming.rs and levenshtein.rs would still be separate.

// hamming.rs
pub fn hamming<'a>(&self, s1: &'a str, s2: &'a str) -> Diff {}

// levenshtein.rs
pub fn levenshtein<'a>(&self, s1: &'a str, s2: &'a str) -> Diff {}

The text was updated successfully, but these errors were encountered:

fixes #36

This was referenced Jan 5, 2023

feature: Allow computing percent-based scores from diffs #10

Closed

Roadmap for differ.rs #35

Open

feature: Allow diffing between data types besides strings #37

Open

notalfredo self-assigned this Jan 5, 2023

notalfredo added a commit that referenced this issue Jan 6, 2023

refactor: better architecture for differ library

8e7d892

fixes #36

notalfredo mentioned this issue Jan 6, 2023

refactor: better architecture for differ library #38

Merged

neoncitylights pushed a commit that referenced this issue Jan 13, 2023

refactor: better architecture for differ library (#38)

cf74525

fixes #36

neoncitylights closed this as completed in #38 Jan 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: better architecture for differ library #36

refactor: better architecture for differ library #36

neoncitylights commented Jan 5, 2023 •

edited

Loading

refactor: better architecture for differ library #36

refactor: better architecture for differ library #36

Comments

neoncitylights commented Jan 5, 2023 • edited Loading

DiffScoreConfig

Diff

Algorithms

neoncitylights commented Jan 5, 2023 •

edited

Loading