Merge pull request #55 from denmase/master
Implementation of Ratcliff-Obershelp algorithm
tdebatty authored May 12, 2020
2 parents eeb33dc + f6c7aad commit 4946f58
Showing 3 changed files with 291 additions and 0 deletions.
34 changes: 34 additions & 0 deletions README.md
@@ -22,6 +22,7 @@ A library implementing different string similarity and distance measures. A doze
* [Cosine similarity](#shingle-n-gram-based-algorithms)
* [Jaccard index](#shingle-n-gram-based-algorithms)
* [Sorensen-Dice coefficient](#shingle-n-gram-based-algorithms)
* [Ratcliff-Obershelp](#ratcliff-obershelp)
* [Experimental](#experimental)
* [SIFT4](#sift4)
* [Users](#users)
@@ -58,6 +59,7 @@ The main characteristics of each implemented algorithm are presented below. The
| [Cosine similarity](#cosine-similarity) |similarity<br>distance | Yes | No | Profile | O(m+n) | |
| [Jaccard index](#jaccard-index) |similarity<br>distance | Yes | Yes | Set | O(m+n) | |
| [Sorensen-Dice coefficient](#sorensen-dice-coefficient) |similarity<br>distance | Yes | No | Set | O(m+n) | |
| [Ratcliff-Obershelp](#ratcliff-obershelp) |similarity<br>distance | Yes | No | | ? | |

[1] In this library, Levenshtein edit distance, LCS distance and their siblings are computed using the **dynamic programming** method, which has a cost O(m.n). For Levenshtein distance, the algorithm is sometimes called **Wagner-Fischer algorithm** ("The string-to-string correction problem", 1974). The original algorithm uses a matrix of size m x n to store the Levenshtein distance between string prefixes.

@@ -443,6 +445,38 @@ Similar to Jaccard index, but this time the similarity is computed as 2 * |V1 in

Distance is computed as 1 - similarity.

## Ratcliff-Obershelp
Ratcliff/Obershelp Pattern Recognition, also known as Gestalt Pattern Matching, is a string-matching algorithm for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in Dr. Dobb's Journal in July 1988.

Ratcliff/Obershelp computes the similarity between 2 strings, and the returned value lies in the interval [0.0, 1.0].

The distance is computed as 1 - Ratcliff/Obershelp similarity.

```java
import info.debatty.java.stringsimilarity.*;

public class MyApp {


public static void main(String[] args) {
RatcliffObershelp ro = new RatcliffObershelp();

// substitution of s and t
System.out.println(ro.similarity("My string", "My tsring"));

// substitution of s and n
System.out.println(ro.similarity("My string", "My ntrisg"));
}
}
```

will produce:

```
0.8888888888888888
0.7777777777777778
```
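
To see where the first value comes from, the following is a minimal, self-contained sketch of the recursive matching step. It does not use the library: the class `GestaltSketch` and its methods are hypothetical names chosen for illustration, and the tie-breaking rule (earliest longest match in the first string) is an assumption that mirrors the implementation added in this commit. For "My string" and "My tsring" it collects the substrings "ring", "My " and "s" (4 + 3 + 1 = 8 matching characters), so the similarity is 2 * 8 / (9 + 9) = 16/18.

```java
import java.util.ArrayList;
import java.util.List;

public class GestaltSketch {

    // Longest common substring of a and b; on ties, the earliest start in a wins.
    public static String longestCommonSubstring(String a, String b) {
        String best = "";
        for (int i = 0; i < a.length(); i++) {
            // Only try candidates strictly longer than the current best.
            for (int j = i + best.length() + 1; j <= a.length(); j++) {
                String candidate = a.substring(i, j);
                if (!b.contains(candidate)) {
                    break; // a longer substring starting at i cannot match either
                }
                best = candidate;
            }
        }
        return best;
    }

    // Adds the longest common substring, then recurses on the unmatched
    // regions to its left and to its right in both strings.
    public static void collectMatches(String a, String b, List<String> out) {
        String match = longestCommonSubstring(a, b);
        if (match.isEmpty()) {
            return;
        }
        int ia = a.indexOf(match);
        int ib = b.indexOf(match);
        out.add(match);
        collectMatches(a.substring(0, ia), b.substring(0, ib), out);
        collectMatches(a.substring(ia + match.length()),
                b.substring(ib + match.length()), out);
    }

    // 2 * (number of matching characters) / (total length of both strings).
    public static double similarity(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        List<String> matches = new ArrayList<>();
        collectMatches(a, b, matches);
        int matched = 0;
        for (String m : matches) {
            matched += m.length();
        }
        return 2.0 * matched / (a.length() + b.length());
    }

    public static void main(String[] args) {
        List<String> matches = new ArrayList<>();
        collectMatches("My string", "My tsring", matches);
        System.out.println(matches);                                   // matched substrings
        System.out.println(similarity("My string", "My tsring"));      // similarity
        System.out.println(1.0 - similarity("My string", "My tsring")); // distance
    }
}
```

The last line of `main` shows the distance as 1 - similarity, which is how the library's `distance` method is defined.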

## Experimental

### SIFT4
@@ -0,0 +1,133 @@
/*
* The MIT License
*
* Copyright 2015 Thibault Debatty.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
package info.debatty.java.stringsimilarity;

import info.debatty.java.stringsimilarity.interfaces.NormalizedStringSimilarity;
import info.debatty.java.stringsimilarity.interfaces.NormalizedStringDistance;
import java.util.List;
import java.util.ArrayList;
import java.util.Iterator;

import net.jcip.annotations.Immutable;

/**
* Ratcliff/Obershelp pattern recognition
* The Ratcliff/Obershelp algorithm computes the similarity of two strings as
* the doubled number of matching characters divided by the total number of
* characters in the two strings. Matching characters are those in the longest
* common substring plus, recursively, matching characters in the unmatched
* regions on either side of the longest common substring.
* The Ratcliff/Obershelp distance is computed as 1 - Ratcliff/Obershelp
* similarity.
*
* @author Ligi https://github.com/dxpux (as a patch for fuzzystring)
* Ported to java from .net by denmase
*/
@Immutable
public class RatcliffObershelp implements
NormalizedStringSimilarity, NormalizedStringDistance {

/**
* Compute the Ratcliff-Obershelp similarity between strings.
*
* @param s1 The first string to compare.
* @param s2 The second string to compare.
* @return The RatcliffObershelp similarity in the range [0, 1]
* @throws NullPointerException if s1 or s2 is null.
*/
public final double similarity(final String s1, final String s2) {
if (s1 == null) {
throw new NullPointerException("s1 must not be null");
}

if (s2 == null) {
throw new NullPointerException("s2 must not be null");
}

if (s1.equals(s2)) {
return 1.0d;
}

List<String> matches = getMatchList(s1, s2);
int sumofmatches = 0;

// Sum the lengths of all matched substrings.
for (String element : matches) {
sumofmatches += element.length();
}

return 2.0d * sumofmatches / (s1.length() + s2.length());
}

/**
* Return 1 - similarity.
*
* @param s1 The first string to compare.
* @param s2 The second string to compare.
* @return 1 - similarity
* @throws NullPointerException if s1 or s2 is null.
*/
public final double distance(final String s1, final String s2) {
return 1.0d - similarity(s1, s2);
}

private static List<String> getMatchList(final String s1, final String s2) {
List<String> list = new ArrayList<String>();
String match = frontMaxMatch(s1, s2);

if (match.length() > 0) {
String frontsource = s1.substring(0, s1.indexOf(match));
String fronttarget = s2.substring(0, s2.indexOf(match));
List<String> frontqueue = getMatchList(frontsource, fronttarget);

String endsource = s1.substring(s1.indexOf(match) + match.length());
String endtarget = s2.substring(s2.indexOf(match) + match.length());
List<String> endqueue = getMatchList(endsource, endtarget);

list.add(match);
list.addAll(frontqueue);
list.addAll(endqueue);
}

return list;
}

private static String frontMaxMatch(final String s1, final String s2) {
int longest = 0;
String longestsubstring = "";

for (int i = 0; i < s1.length(); ++i) {
for (int j = i + 1; j <= s1.length(); ++j) {
String substring = s1.substring(i, j);
if (s2.contains(substring) && substring.length() > longest) {
longest = substring.length();
longestsubstring = substring;
}
}
}

return longestsubstring;
}
}
@@ -0,0 +1,124 @@
/*
* The MIT License
*
* Copyright 2015 Thibault Debatty.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/

package info.debatty.java.stringsimilarity;

import info.debatty.java.stringsimilarity.testutil.NullEmptyTests;
import org.junit.Test;
import static org.junit.Assert.*;

/**
*
* @author Agung Nugroho
*/
public class RatcliffObershelpTest {


/**
* Test of similarity method, of class RatcliffObershelp.
*/
@Test
public final void testSimilarity() {
System.out.println("similarity");
RatcliffObershelp instance = new RatcliffObershelp();

// test data from other algorithms
// "My string" vs "My tsring"
// Substrings:
// "ring" ==> 4, "My " ==> 3, "s" ==> 1
// Ratcliff-Obershelp = 2*(sum of substrings)/(length of s1 + length of s2)
// = 2*(4 + 3 + 1) / (9 + 9)
// = 16/18
// = 0.888888
assertEquals(
0.888888,
instance.similarity("My string", "My tsring"),
0.000001);

// test data from other algorithms
// "My string" vs "My ntrisg"
// Substrings:
// "My " ==> 3, "tri" ==> 3, "g" ==> 1
// Ratcliff-Obershelp = 2*(sum of substrings)/(length of s1 + length of s2)
// = 2*(3 + 3 + 1) / (9 + 9)
// = 14/18
// = 0.777778
assertEquals(
0.777778,
instance.similarity("My string", "My ntrisg"),
0.000001);

// test data from essay by Ilya Ilyankou
// "Comparison of Jaro-Winkler and Ratcliff/Obershelp algorithms
// in spell check"
// https://ilyankou.files.wordpress.com/2015/06/ib-extended-essay.pdf
// p13, expected result is 0.857
assertEquals(
0.857,
instance.similarity("MATEMATICA", "MATHEMATICS"),
0.001);

// test data from stringmetric
// https://github.com/rockymadden/stringmetric
// expected output is 0.7368421052631579
assertEquals(
0.736842,
instance.similarity("aleksander", "alexandre"),
0.000001);

// test data from stringmetric
// https://github.com/rockymadden/stringmetric
// expected output is 0.6666666666666666
assertEquals(
0.666666,
instance.similarity("pennsylvania", "pencilvaneya"),
0.000001);

// test data from wikipedia
// https://en.wikipedia.org/wiki/Gestalt_Pattern_Matching
// expected output is 14/18 = 0.7777777777777778
assertEquals(
0.777778,
instance.similarity("WIKIMEDIA", "WIKIMANIA"),
0.000001);

// test data from wikipedia
// https://en.wikipedia.org/wiki/Gestalt_Pattern_Matching
// expected output is 24/40 = 0.6
assertEquals(
0.6,
instance.similarity("GESTALT PATTERN MATCHING", "GESTALT PRACTICE"),
0.000001);

NullEmptyTests.testSimilarity(instance);
}

@Test
public final void testDistance() {
RatcliffObershelp instance = new RatcliffObershelp();
NullEmptyTests.testDistance(instance);

// TODO: regular (non-null/empty) distance tests
}
}
