
Fuzzy Word Matcher (FWM) is a Java library that identifies the best matching word from a list using fuzzy logic. It supports Jaro-Winkler and Cosine Similarity algorithms and employs a BK-tree for efficient searching. Users can customize thresholds and default values for unmatched results, ensuring flexible and accurate word matching.

NoelToy/fuzzy-word-matcher


Fuzzy Word Matcher (FWM)

Fuzzy Word Matcher (FWM) is a Java library that finds the most similar word in a given list of words using fuzzy string matching. Users can choose between two well-known similarity metrics: Jaro-Winkler and Cosine Similarity.

Key Features

  • Fuzzy Matching Algorithms: Supports both Jaro-Winkler and Cosine Similarity for calculating similarity between words. Internally, the library leverages Apache Commons Text to perform these similarity calculations.

  • Efficient Search: Utilizes the BK-tree data structure to perform efficient searches with a tolerance value (ranging from 0 to 1).

  • Threshold & Default Values: If the similarity score of the best match is below the defined threshold (0-1), a customizable default value will be returned instead.

  • Configurable Tolerance: The tolerance value makes the BK-tree search more efficient by limiting the search space based on allowable differences.
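
The pruning idea behind a BK-tree can be sketched as follows. This is a minimal illustration, not FWM's internal implementation: it swaps in integer Levenshtein edit distance (the textbook BK-tree metric) in place of the 0 to 1 tolerance described above, and the names `BkTreeSketch`, `add`, and `search` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative BK-tree keyed on Levenshtein edit distance.
// Assumption: FWM's own tree and distance function may differ.
class BkTreeSketch {

    private static final class Node {
        final String word;
        // Child subtrees, keyed by their distance to this node's word.
        final Map<Integer, Node> children = new HashMap<>();

        Node(String word) {
            this.word = word;
        }
    }

    private Node root;

    // Two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Inserts a word by walking down the edge labelled with its distance to each node.
    void add(String word) {
        if (root == null) {
            root = new Node(word);
            return;
        }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) {
                return; // already present
            }
            Node child = node.children.get(d);
            if (child == null) {
                node.children.put(d, new Node(word));
                return;
            }
            node = child;
        }
    }

    // Returns every stored word within maxDistance edits of the query.
    List<String> search(String query, int maxDistance) {
        List<String> hits = new ArrayList<>();
        if (root != null) {
            collect(root, query, maxDistance, hits);
        }
        return hits;
    }

    private void collect(Node node, String query, int maxDistance, List<String> hits) {
        int d = levenshtein(query, node.word);
        if (d <= maxDistance) {
            hits.add(node.word);
        }
        // Triangle inequality: only edges in [d - maxDistance, d + maxDistance] can hold matches.
        for (Map.Entry<Integer, Node> e : node.children.entrySet()) {
            int k = e.getKey();
            if (k >= d - maxDistance && k <= d + maxDistance) {
                collect(e.getValue(), query, maxDistance, hits);
            }
        }
    }
}
```

A smaller allowed distance prunes more subtrees, which is the sense in which a tolerance limits the search space. The order of the returned words is unspecified because children are stored in a `HashMap`.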

Use Cases

  • Finding approximate word matches in large datasets.
  • Spell-checking or correcting user inputs by suggesting the closest matching word.
  • Matching and aligning data entries with slightly different names or terms.

Dependencies

  • Apache Commons Text: Used for calculating Jaro-Winkler and Cosine Similarity scores.
  • Java 8 or higher: The minimum required Java version to run the library.

Usage/Examples

Add Maven Dependency

<dependency>
    <groupId>io.github.noeltoy</groupId>
    <artifactId>fuzzy-word-matcher</artifactId>
    <version>0.1</version>
</dependency>

Example for Jaro-Winkler

import io.github.fwm.WordMatcher;
import io.github.fwm.lib.enums.MatchType;

import java.util.Arrays;
import java.util.List;

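public class JaroWinklerExample {
    public static void main(String[] args) {
        // Candidate words to search.
        List<String> candidates = Arrays.asList("District Code", "District Name", "Country_Code", "Country_Name", "Pin_Code");
        // Build a matcher that scores candidates with Jaro-Winkler.
        WordMatcher wordMatcher = new WordMatcher.WordMatcherBuilder(candidates, MatchType.JARO_WINKLER)
                .setTolerance(.85)
                .setThreshold(.85)
                .setDefaultValue(null)
                .build();
        String bestMatch = wordMatcher.findBestMatch("Dist Na");
        // println instead of printf: printf throws a NullPointerException
        // if the null default value is returned.
        System.out.println(bestMatch);
    }
}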

Output: District Name

Example for Cosine Similarity

import io.github.fwm.WordMatcher;
import io.github.fwm.lib.enums.MatchType;

import java.util.Arrays;
import java.util.List;

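public class CosineSimilarityExample {
    public static void main(String[] args) {
        // Candidate words to search.
        List<String> candidates = Arrays.asList("District Code", "District Name", "Country_Code", "Country_Name", "Pin_Code");
        // Build a matcher that scores candidates with Cosine Similarity.
        WordMatcher wordMatcher = new WordMatcher.WordMatcherBuilder(candidates, MatchType.COSINE_SIMILARITY)
                .setTolerance(.85)
                .setThreshold(.85)
                .setDefaultValue(null)
                .build();
        String bestMatch = wordMatcher.findBestMatch("Country_Na");
        // println instead of printf: printf throws a NullPointerException
        // if the null default value is returned.
        System.out.println(bestMatch);
    }
}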

Output: Country_Name
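
For intuition about what a cosine similarity score measures, here is a small self-contained sketch. It assumes each word is turned into a vector of character-bigram counts; that tokenization is an assumption made for illustration only, and FWM (via Apache Commons Text) may build its vectors differently. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative cosine similarity over character-bigram counts.
// Assumption: the bigram tokenization is chosen for this sketch, not taken from FWM.
class CosineSketch {

    // Counts overlapping two-character substrings of s.
    static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    // Euclidean length of a count vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0;
        for (int c : v.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between the two bigram vectors: dot(a, b) / (|a| * |b|).
    static double cosine(String a, String b) {
        Map<String, Integer> va = bigrams(a);
        Map<String, Integer> vb = bigrams(b);
        if (va.isEmpty() || vb.isEmpty()) {
            return 0.0; // strings shorter than two characters have no bigrams
        }
        double dot = 0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            Integer other = vb.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        return dot / (norm(va) * norm(vb));
    }
}
```

Scores fall between 0 (no shared bigrams) and 1 (identical bigram distributions), which matches the 0 to 1 range used for the threshold.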

Parameters

| Parameter | Type | Description | Default Value |
|---|---|---|---|
| candidates | List<String> | The list of strings from which FWM identifies the best match. Internally, this list is converted into a BK-tree to enable efficient fuzzy searching based on the selected similarity metric. | Not applicable |
| matchType | MatchType | The algorithm used to calculate similarity. Accepts JARO_WINKLER or COSINE_SIMILARITY. | Not applicable |
| tolerance | Double | The tolerance level (0 to 1) used in the BK-tree search. It controls the allowable difference between words during the search and limits the search space. | 0.85 |
| threshold | Double | The minimum similarity score (0 to 1) required for a match. If the score between the best match and the search input is below this threshold, FWM returns the default value instead of the best match. | 0.60 |
| defaultValue | String | The value returned when the similarity score between the best match and the search input is below the threshold. It acts as a fallback when no sufficiently close match is found. | null |
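
The interaction between `threshold` and `defaultValue` can be sketched as below. This is a hedged illustration under stated assumptions: it uses a plain linear scan instead of a BK-tree, a hand-written textbook Jaro-Winkler score (prefix scale 0.1, prefix capped at 4 characters) instead of Apache Commons Text, and hypothetical names (`ThresholdSketch`, `bestMatchOrDefault`). It is not FWM's code.

```java
import java.util.List;

// Illustrative threshold and default-value fallback with a textbook Jaro-Winkler score.
// Assumption: FWM's own scoring and search are implemented differently.
class ThresholdSketch {

    // Jaro similarity: counts characters that match within a sliding window,
    // then penalises matched characters that appear in a different order.
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) {
            return 1.0;
        }
        int len1 = s1.length();
        int len2 = s2.length();
        if (len1 == 0 || len2 == 0) {
            return 0.0;
        }
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[k]) {
                k++;
            }
            if (s1.charAt(i) != s2.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double transpositions = outOfOrder / 2.0;
        return (matches / (double) len1
                + matches / (double) len2
                + (matches - transpositions) / matches) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for a shared prefix of up to 4 characters.
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }

    // Returns the highest-scoring candidate, or defaultValue when that score is below threshold.
    static String bestMatchOrDefault(List<String> candidates, String query,
                                     double threshold, String defaultValue) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = jaroWinkler(query, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return bestScore >= threshold ? best : defaultValue;
    }
}
```

A higher threshold makes the matcher stricter: more queries fall through to the default value instead of returning a weak match.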

Acknowledgement

The development of Fuzzy Word Matcher (FWM) was inspired by the Intuit Fuzzy Matcher project.

License

Apache License 2.0

Authors
