Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug for long sequences (> 640) #103

Merged
merged 2 commits into from
Feb 10, 2024
Merged

Fix bug for long sequences (> 640) #103

merged 2 commits into from
Feb 10, 2024

Conversation

boeddeker
Copy link
Contributor

I used this package to compute the so called concatenated minimum permutation word error rate.
While this package is faster than kaldialign, kaldialign computes the individual ins/del/sub.

I compared the distance values for long texts (several thousand words) and they were not the same.
It turned out, that this package uses another implementation (edit_distance_dp), when the number of words is larger than 640.

In the code of edit_distance_dp is a bug, the first value in the vector is not initialized.

This PR contains a fix for the bug and adds tests for edit_distance_dp.

I had to expose edit_distance_dp to write a test.

 - Removed template code:
   - Only one value for the template is used (T=int64_t)
   - In cython it is easier to use non-template code
 - Fix warning: comparison of integer expressions of different signedness
 -
@thequilo
Copy link

This bug was introduced in #39

thequilo added a commit to fgnt/meeteval that referenced this pull request Jun 21, 2023
editdistance produces wrong results for long sequences (roy-ht/editdistance#103) and the maintainer doesn't seem to be active. Use our impelmentation of the Levenshtein distance instead
@roy-ht roy-ht merged commit 84396f3 into roy-ht:master Feb 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants