Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Edit distances for strings #151

Open
4 tasks done
Planeshifter opened this issue Feb 17, 2018 · 4 comments
Open
4 tasks done

RFC: Edit distances for strings #151

Planeshifter opened this issue Feb 17, 2018 · 4 comments
Labels
Feature Issue or pull request for adding a new feature. Math Issue or pull request specific to math functionality. RFC Request for comments. Feature requests and proposed changes.

Comments

@Planeshifter
Copy link
Member

Planeshifter commented Feb 17, 2018

Checklist

Please ensure the following tasks are completed before filing an issue.

  • Read and understood the Code of Conduct.
  • Searched for existing issues and pull requests.
  • If this is a general question, searched the FAQ for an existing answer.
  • If this is a feature request, the issue name begins with RFC:.

Description

Description of the issue (or feature request).

This RFC proposes the inclusion of edit distance functions to compare two strings, for example the Levenstein distance.

Related Issues

Does this issue (or feature request) have any related issues?

No.

Questions

Any questions for reviewers?

What would be a good place inside of stdlib for string distance functions?

@Planeshifter Planeshifter added RFC Request for comments. Feature requests and proposed changes. Feature Issue or pull request for adding a new feature. Math Issue or pull request specific to math functionality. labels Feb 17, 2018
@kgryte
Copy link
Member

kgryte commented Feb 17, 2018

For string distance functions, in the @stdlib/string namespace. I believe one of the initial reasons for not including Levenstein, Hamming, and others, was whether we needed to expose generic interfaces. For example, Hamming could be used to compute the "distance" between two arrays. I was not sure how general we need to be.

@kgryte
Copy link
Member

kgryte commented Jul 15, 2023

@Planeshifter We should probably turn this into a tracking issue for implementing various metrics (Hamming, Levenshtein, etc), and cross-link to the existing issues.

@rgizz
Copy link
Contributor

rgizz commented Nov 17, 2023

I looked at Levenshtein distances for strings this morning. From the above discussion I am not sure if this is still wanted?

Anyway, some notes:

For fun, my implementation of Wikipedia's naive recursive algorithm, in Standard ML:

fun min_of_3(a, b, c) = let
    val min_1 = Int.min(a, b)
    in
        Int.min(min_1, c) 
    end

fun lev(a, []) = length a
|   lev([], b) = length b
|   lev(x::xs, y::ys) = 
    (*if x=y then*)
    if Char.compare(x, y) = EQUAL then 
        lev(xs, ys) 
    else
        1 + min_of_3 ( lev(xs, y::ys), lev(x::xs, ys), lev(xs, ys) )

fun timeLevStr s1 s2 = let
    val tmr = Timer.startRealTimer()
    val d   = levstr s1 s2
    val t   = Timer.checkRealTimer tmr
    in
        (d, t)
    end

Very very very slow. This takes about 5 minutes.

  • timeLevStr "alphabetically" "betamaxrecording";
    val it = (14,TIME {usec=306635216}) : int * Time.time

Some links:

Do we want to do this?

@kgryte
Copy link
Member

kgryte commented Nov 17, 2023

@rgizz Thanks for following up here. We have an initial implementation of Levenshtein as @stdlib/string/base/distances/levenschtein; however, that implementation is currently sub-optimal and needs to be refactored. I have notes, somewhere, which I need to dig up which outline what changes need to be made; however, the changes are non-trivial.

One additional hurdle is that the refactoring needs to address handling of grapheme clusters, as linked to in the following tracking issue: #1062.

kgryte added a commit that referenced this issue Dec 15, 2023
PR-URL: 	#1166
Co-authored-by: Athan Reines <kgryte@gmail.com>
Reviewed-by: Athan Reines <kgryte@gmail.com>
Ref: #836
Ref: #151
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature Issue or pull request for adding a new feature. Math Issue or pull request specific to math functionality. RFC Request for comments. Feature requests and proposed changes.
Projects
None yet
Development

No branches or pull requests

3 participants