Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: better architecture for differ library #36

Closed
Tracked by #35
neoncitylights opened this issue Jan 5, 2023 · 0 comments · Fixed by #38
Closed
Tracked by #35

refactor: better architecture for differ library #36

neoncitylights opened this issue Jan 5, 2023 · 0 comments · Fixed by #38
Assignees
Labels
dev-refactoring Cleaning up, restructuring and improving existing code lvl-2-medium Medium-ranking issue p3-high Priority 3: Someone is planning to work on this task very soon or immediately. s0-in-progress Open: This task is being worked on right now (like discussing, or implementing)

Comments

@neoncitylights
Copy link
Contributor

neoncitylights commented Jan 5, 2023

With #10 open (for computing scores), and #11 (for computing weighted scores), I'd like to propose something that I think would be more scalable than the current design of the library. It still similarly models the current way of how a developer would go about implementing StringDiffAlgorithm, but ultimately the trait gets remodeled into a new struct, Diff, which allows avoid calling helper methods e.g get_operations_matrix() multiple times.

Some notes:

  • This should also help make implementing feature: Allow to alignment between protein or nucleotide sequences  #16 scalable in a way that code does not have to be duplicated.
  • Diff and DiffScoreConfig intentionally do not have the String prefix unlike the current types for the case where we allow computing different data types besides just strings (with generic types). What we can do though is rename the current types to remove the String prefix, and then introduce these types.

DiffScoreConfig

The DiffScoreConfig is a type of structure that only needs to get created once, and then passed around as a reference to avoid copying (since the size of a struct will be quite big initially and will ). This type implements the Default trait, to provide sensible default weighted values for different type of operations.

An alternative is to use a HashMap, but this would involve creating a key type for K like another enum (since StringDiffOpKind is specifically for storing the values for each operation), as well as void re-hashing and re-allocations (HashMap is a DST). By having specific properties instead, the DiffScoreConfig struct will have a guaranteed fized-size at compile-time.

pub struct DiffScoreConfig {
   pub sub_cost: f32,
   pub indel_cost: f32,
   pub transpose_cost: f32,
   // future properties here as needed
}

impl Default for DiffScoreConfig {
   fn default() -> Self { /* ... */ }
}

// used like:
let mut score_config = DiffScoreConfig::default();
score_config.sub_cost = 0.75f32;
score_config.indel_cost = 1.0f32;

// or, if the defaults is good enough for the user,
// it doesn't need to be mutable
let score_config = DiffScoreConfig::default();

Diff

The Diff struct would be the return type given by diffing algorithms. It has an ops property which holds a slice (&[T]), which are immutable pointers. It also holds a total_len, to know how to compute the score.

It does not however store an instance of a DiffScoreConfig; that's because a Diff type will be returned by the hamming + levenshtein distance algorithms (and future algorithms). This would mean require having to pass in DiffScoreConfig as an argument for everything, and because DiffScoreConfig is designed to be mutable.

The similarity() and difference() are methods instead of properties since they would take time to compute, The job of a constructor is only to initialize the state/fields of a type.

The jobs of the distance algorithms are purely to compute the list of difference operations, and the user won't always need to know what the similarity and difference scores are.

pub struct Diff {
   pub ops: &[StringDiffOp],
   pub total_len: usize,
}

impl Diff {
    pub fn new(diffs: &[StringDiffOp], total_len: usize) -> Self { Self {} }
    pub fn distance(&self) -> usize {
       self.diffs.len()
    }

    pub fn similarity(&self, score: &DiffScoreConfig) -> f32 { /* ... */ }
    pub fn difference(&self, score: &DiffScoreConfig) -> f32 { /* ... */ }
}

Algorithms

This leaves the StringDiffAlgorithm, and the hamming distance and levenshtein. Further thinking, I think it would actually make more sense to remove the StringDiffAlgorithm trait, and turn both hamming + levenshtein algorithms from structs into pure functions. Helper methods can still be only available internally (pub(crate)). This would leave the signatures of the public API to look like the following. Of course, both hamming.rs and levenshtein.rs would still be separate.

// hamming.rs
pub fn hamming<'a>(&self, s1: &'a str, s2: &'a str) -> Diff {}

// levenshtein.rs
pub fn levenshtein<'a>(&self, s1: &'a str, s2: &'a str) -> Diff {}
@neoncitylights neoncitylights added dev-refactoring Cleaning up, restructuring and improving existing code lvl-2-medium Medium-ranking issue p3-high Priority 3: Someone is planning to work on this task very soon or immediately. s0-in-progress Open: This task is being worked on right now (like discussing, or implementing) labels Jan 5, 2023
@notalfredo notalfredo self-assigned this Jan 5, 2023
notalfredo added a commit that referenced this issue Jan 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dev-refactoring Cleaning up, restructuring and improving existing code lvl-2-medium Medium-ranking issue p3-high Priority 3: Someone is planning to work on this task very soon or immediately. s0-in-progress Open: This task is being worked on right now (like discussing, or implementing)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants