refactor: better architecture for differ library #36
Labels: dev-refactoring, lvl-2-medium, p3-high, s0-in-progress
With #10 open (for computing scores) and #11 (for computing weighted scores), I'd like to propose something that I think would be more scalable than the current design of the library. It still models how a developer would go about implementing `StringDiffAlgorithm`, but ultimately the trait gets remodeled into a new struct, `Diff`, which avoids calling helper methods (e.g. `get_operations_matrix()`) multiple times.

Some notes:
`Diff` and `DiffScoreConfig` intentionally do not have the `String` prefix, unlike the current types, for the case where we allow diffing data types other than strings (with generic types). What we can do, though, is rename the current types to remove the `String` prefix, and then introduce these types.

DiffScoreConfig
The `DiffScoreConfig` is a struct that only needs to be created once and is then passed around by reference to avoid copying (since the struct will be quite big initially). This type implements the `Default` trait to provide sensible default weights for the different types of operations.

An alternative is to use a `HashMap`, but this would involve creating a key type for `K`, such as another enum (since `StringDiffOpKind` is specifically for storing the values for each operation), and would incur hashing and heap allocations (a `HashMap` keeps its entries on the heap and may re-hash and re-allocate as it grows). By having specific fields instead, the `DiffScoreConfig` struct has a guaranteed fixed size at compile time.

Diff
The `Diff` struct would be the return type given by diffing algorithms. It has an `ops` field which holds a slice (`&[T]`), an immutable borrowed view of the operations. It also holds a `total_len`, which is needed to compute the score.

It does not, however, store an instance of a `DiffScoreConfig`. That's because a `Diff` will be returned by the hamming and levenshtein distance algorithms (and future algorithms). Storing the config would mean having to pass a `DiffScoreConfig` as an argument to every algorithm, and `DiffScoreConfig` is designed to be mutable.

`similarity()` and `difference()` are methods instead of fields since they take time to compute. The job of a constructor is only to initialize the state/fields of a type. The job of the distance algorithms is purely to compute the list of difference operations, and the user won't always need to know what the similarity and difference scores are.
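To make the two types above more concrete, here is a minimal sketch of how they could fit together. Everything beyond `Diff`, `DiffScoreConfig`, `ops`, `total_len`, `similarity()`, `difference()` and the `Default` impl is an assumption for illustration only: the `DiffOp` variants, the weight field names, the default weights, and the scoring formula are all hypothetical and not taken from the library.

```rust
/// Hypothetical operation kinds; the real library's variants may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DiffOp {
    Equal,
    Insert,
    Delete,
    Substitute,
}

/// One weight per operation kind, stored as plain fields (names assumed),
/// so no hashing or heap allocation is involved.
#[derive(Debug, Clone, PartialEq)]
pub struct DiffScoreConfig {
    pub insert_weight: f64,
    pub delete_weight: f64,
    pub substitute_weight: f64,
}

impl Default for DiffScoreConfig {
    fn default() -> Self {
        // Assumed defaults: every edit costs the same.
        Self {
            insert_weight: 1.0,
            delete_weight: 1.0,
            substitute_weight: 1.0,
        }
    }
}

/// Result of a diffing algorithm: the operations plus the length used for scoring.
/// Note that it does not hold a `DiffScoreConfig`.
pub struct Diff<'a> {
    pub ops: &'a [DiffOp],
    pub total_len: usize,
}

impl<'a> Diff<'a> {
    /// Assumed formula: summed weight of all edits divided by `total_len`,
    /// capped at 1.0. An empty diff has a difference of 0.0.
    pub fn difference(&self, config: &DiffScoreConfig) -> f64 {
        if self.total_len == 0 {
            return 0.0;
        }
        let cost: f64 = self
            .ops
            .iter()
            .map(|op| match op {
                DiffOp::Equal => 0.0,
                DiffOp::Insert => config.insert_weight,
                DiffOp::Delete => config.delete_weight,
                DiffOp::Substitute => config.substitute_weight,
            })
            .sum();
        (cost / self.total_len as f64).min(1.0)
    }

    /// Complement of `difference()`, computed only when the caller asks for it.
    pub fn similarity(&self, config: &DiffScoreConfig) -> f64 {
        1.0 - self.difference(config)
    }
}
```

Under these assumed defaults, a `Diff` whose `ops` are three `Equal` and one `Substitute` with `total_len` of 4 has a difference of 0.25. The config is only borrowed at scoring time, so the same `Diff` can be scored again after the weights are changed.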
Algorithms
This leaves the `StringDiffAlgorithm` trait and the hamming and levenshtein distance algorithms. On further thought, I think it would actually make more sense to remove the `StringDiffAlgorithm` trait and turn both the hamming and levenshtein algorithms from structs into pure functions. Helper methods can still be available only internally (`pub(crate)`). This would leave the signatures of the public API to look like the following. Of course, both `hamming.rs` and `levenshtein.rs` would still be separate.
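As a rough illustration of an algorithm written as a pure function with a crate-internal helper, here is a hypothetical sketch for the hamming case. The function name `hamming`, the helper `compare_chars`, the `DiffOp` variants and the `Option` return type are assumptions, not the library's actual API. This sketch also stores `ops` in an owned `Vec` instead of a slice, purely so the function can return the value without a lifetime parameter.

```rust
/// Hypothetical operation kinds needed for hamming distance.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DiffOp {
    Equal,
    Substitute,
}

/// Owned variant of the proposed `Diff` (an assumption of this sketch).
#[derive(Debug, PartialEq)]
pub struct Diff {
    pub ops: Vec<DiffOp>,
    pub total_len: usize,
}

/// Internal helper, visible only inside the crate.
pub(crate) fn compare_chars(a: char, b: char) -> DiffOp {
    if a == b {
        DiffOp::Equal
    } else {
        DiffOp::Substitute
    }
}

/// Hamming distance as a pure function: no struct to construct, no trait to implement.
/// Returns `None` when the inputs differ in length (assumed behavior).
pub fn hamming(a: &str, b: &str) -> Option<Diff> {
    let a_chars: Vec<char> = a.chars().collect();
    let b_chars: Vec<char> = b.chars().collect();
    if a_chars.len() != b_chars.len() {
        return None;
    }
    // Compare the inputs position by position.
    let ops: Vec<DiffOp> = a_chars
        .iter()
        .zip(&b_chars)
        .map(|(&x, &y)| compare_chars(x, y))
        .collect();
    Some(Diff {
        total_len: ops.len(),
        ops,
    })
}
```

A levenshtein function would follow the same shape: take the two inputs, return a `Diff`, and keep matrix helpers such as `get_operations_matrix()` behind `pub(crate)`.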