This section lists all available text distance metrics along with their IDs for command-line use.
The weighted N-gram score is computed as the weighted sum of the N-gram instances shared between the two texts. The weighting ensures that:
- Shared N-gram instances near the interval bounds (which bound depends on the situation) are rated higher than those near the center or the opposite end
- Long shared N-gram instances are weighted higher than short ones
--align-min-ngram-size <SIZE> sets the start (minimum) N-gram size
--align-max-ngram-size <SIZE> sets the final (maximum) N-gram size
--align-ngram-size-factor <FACTOR> sets a weight factor for the size preference
--align-ngram-position-factor <FACTOR> sets a weight factor for the position preference
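The idea behind the weighted N-gram score can be sketched in a few lines of Python. This is only an illustration, not the tool's actual implementation: the function names, the linear weighting formulas and the `prefer_start` switch are all assumptions. The four keyword arguments are meant to mirror the four command-line parameters above.

```python
from collections import Counter


def ngram_weights(text, n, size_factor, position_factor, prefer_start):
    """Map each N-gram of length n to the summed weight of its instances in text."""
    weights = Counter()
    last = len(text) - n  # start index of the last N-gram (negative if text is too short)
    for i in range(last + 1):
        # Relative distance from the preferred bound, in [0, 1] (assumed linear).
        rel = i / last if last > 0 else 0.0
        if not prefer_start:
            rel = 1.0 - rel
        position_weight = 1.0 + position_factor * (1.0 - rel)
        size_weight = 1.0 + size_factor * (n - 1)  # longer N-grams count more
        weights[text[i:i + n]] += position_weight * size_weight
    return weights


def weighted_ngram_score(a, b, min_size=1, max_size=3,
                         size_factor=1.0, position_factor=1.0, prefer_start=True):
    """Sum the weights of the N-grams that both texts share (hypothetical weighting)."""
    score = 0.0
    for n in range(min_size, max_size + 1):
        wa = ngram_weights(a, n, size_factor, position_factor, prefer_start)
        wb = ngram_weights(b, n, size_factor, position_factor, prefer_start)
        for gram in wa.keys() & wb.keys():
            # Only the part of the weight that both texts agree on is counted.
            score += min(wa[gram], wb[gram])
    return score


# A shared prefix scores higher than the same shared text at the far end.
print(weighted_ngram_score("abqqqq", "abzzzz") > weighted_ngram_score("qqqqab", "zzzzab"))
```

With `prefer_start=False` the preference flips to the end of the interval, which is one way to model the "dependent on situation" behavior described above.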
Jaro-Winkler is an edit distance metric described here.
Editex is a phonetic text distance algorithm described here.
Levenshtein is an edit distance metric described here.
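For orientation, the classic Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one text into the other. The following is a minimal textbook sketch (the function name is illustrative, and the tool may rely on a library implementation instead):

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```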
The "Match rating approach" is a phonetic text distance algorithm described here.
The Hamming distance is an edit distance metric described here.
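In its classic form the Hamming distance is the number of positions at which two equal-length texts differ. The sketch below follows that definition and rejects inputs of different length; how the tool treats texts of unequal length is not stated here, so that part is an assumption.

```python
def hamming(a, b):
    """Count the positions at which two equal-length texts differ."""
    if len(a) != len(b):
        # Assumption: the classic definition only covers equal lengths.
        raise ValueError("texts must have the same length")
    return sum(x != y for x, y in zip(a, b))


print(hamming("karolin", "kathrin"))  # 3
```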
This is the same as Levenshtein, but computed at the word level.
Not available for gap alignment.
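To illustrate what "word level" means, the sketch below runs the same edit-distance recurrence over lists of words instead of characters. Splitting on whitespace and the function name are assumptions made for the example; the tool's own tokenization may differ.

```python
def word_levenshtein(a, b):
    """Edit distance between two texts, counted in whole words."""
    words_a, words_b = a.split(), b.split()  # assumed whitespace tokenization
    prev = list(range(len(words_b) + 1))
    for i, wa in enumerate(words_a, start=1):
        curr = [i]
        for j, wb in enumerate(words_b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # drop a word
                            curr[j - 1] + 1,      # insert a word
                            prev[j - 1] + cost))  # replace or keep a word
        prev = curr
    return prev[-1]


print(word_levenshtein("the quick brown fox", "the quick red fox"))  # 1
```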
This is the same as Levenshtein but using a different implementation.
Not available for gap alignment.
This is the final Smith-Waterman score coming from the rough alignment step (but before gap alignment!). It is described here.
Not available for gap alignment.
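As a rough illustration of where such a score comes from, here is a minimal Smith-Waterman local alignment scorer. The scoring values (`match`, `mismatch`, `gap`) and the linear gap penalty are assumptions chosen for the example; they are not the values used by the rough alignment step.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[-1])
        prev = curr
    return best


print(smith_waterman_score("xxabcxx", "yyabcyy"))  # 6
```

Because the score only grows along matching regions, a higher value indicates a longer or cleaner local match between the two texts.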
The character length of the STT transcript.
Not available for gap alignment.
The character length of the matched text of the original transcript (cleaned).
Not available for gap alignment.