This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Matching algorithm approaches

Oskar Hinc edited this page Dec 31, 2018 · 4 revisions

Overview

This page describes the matching algorithm approaches that might be taken to match two similar organizations.

To make matching organizations possible, data from each provider first has to be unified into a common form, so the algorithms described here assume that the entities are consistent. Because the data sets may differ, we will probably need a separate adapter for each of them. Using an ETL process to migrate the data might be a good solution.

For the matching algorithm to be effective, it has to match organizations even when not all of their data are equal. Different parts of the data are therefore compared, and if any of them match, the pair should be considered a possible match. A good solution would be to show the user the most probable matches as a short list, so they can select the right one manually. If there is a 100% match, we might skip this step and only inform the user that the organization was matched, so they can manually revert this action if they disagree. A similarity ratio should be calculated and a threshold specified, to determine whether a match is complete, partial or accidental.
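The threshold idea can be sketched roughly as follows. The function names, the threshold values (1.0 and 0.6) and the short-list size are assumptions chosen only for illustration; the page does not specify them.

```python
def classify_match(ratio, complete=1.0, partial=0.6):
    """Label a similarity ratio in [0, 1] as complete, partial or accidental.

    The default thresholds are placeholders, not values from this page.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    if ratio >= complete:
        return "complete"
    if ratio >= partial:
        return "partial"
    return "accidental"


def shortlist(candidates, partial=0.6, limit=5):
    """Return up to `limit` (name, ratio) pairs at or above `partial`, best first."""
    likely = [c for c in candidates if c[1] >= partial]
    return sorted(likely, key=lambda c: c[1], reverse=True)[:limit]
```

A "complete" result could be applied automatically with a notification to the user, while the output of `shortlist` would be what the user picks from manually.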

Fields to be compared

The fields that should be compared are described below, with proposals for how to match them. Each field should also be compared with its historical values, as both entities might represent the same organization even though part of the data is outdated. Each field should have a different weight so that the similarity ratio is calculated properly. Those weights might be established later by testing different combinations to see which ones give the best results.
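One possible way to combine per-field scores into a single similarity ratio is a weighted average, sketched below. The helper names, the example field names and the rule of skipping fields with no score are assumptions for illustration, not a decided design.

```python
def best_field_score(value, other_values, compare):
    """Compare `value` with the other record's current and historical values.

    `compare` is any function returning a score in [0, 1]; the best score wins.
    """
    return max((compare(value, other) for other in other_values), default=0.0)


def similarity_ratio(field_scores, weights):
    """Weighted average of per-field scores in [0, 1].

    Fields whose score is None (data missing on either side) are skipped,
    so they neither help nor hurt the ratio.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for field, score in field_scores.items():
        if score is None:
            continue
        weighted_sum += weights[field] * score
        total_weight += weights[field]
    return weighted_sum / total_weight if total_weight else 0.0
```

For example, with hypothetical weights `{"name": 0.5, "email": 0.3, "year": 0.2}`, a record pair with a perfect name score, no email on either side and a year score of 0.5 would be rated using only the name and year weights.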

Name and Alternate Name

Both fields are mandatory for matching organizations. Of course, the name is more important. Unfortunately, the same name can be misspelled, or simply differ in word order, capitalization, etc. There are already several algorithms that compare strings in terms of similarity. Each of them uses a somewhat different approach, so a combination of at least two of them would be a good solution (if the memory and time required for comparing those values allow it to work efficiently). A quick overview of the most popular algorithms for such matches can be found on the blogs below:

The most common algorithms use Levenshtein distance, and there are existing libraries for matching strings this way (like this), which might save the time of implementing them ourselves. However, if we rely on external solutions, we'd have to test them well. It's important to verify whether the selected algorithm can ignore the order of words. If it doesn't, we should probably sort the words in alphabetical order before comparing them. Also, we should manually create a list of possible abbreviations of organization names, as the algorithms themselves won't match them.
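As a minimal sketch of the approach described above, the following computes Levenshtein distance by hand and sorts the words before comparing, so word order and capitalization do not matter. The function names are illustrative, and no abbreviation handling is included; an existing library could replace `levenshtein` after testing.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize_name(name):
    """Lowercase the name and sort its words alphabetically."""
    return " ".join(sorted(name.lower().split()))


def name_similarity(a, b):
    """Score in [0, 1]: 1.0 means identical after normalization."""
    a, b = normalize_name(a), normalize_name(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

With this normalization, `name_similarity("Red Cross American", "american red cross")` is 1.0, while a misspelling only lowers the score in proportion to the number of edited characters.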

Location

The addresses of two locations could be a really accurate input for the matching algorithm if the Google Maps API were used. Google Maps can return geographic coordinates even if the address is misspelled or only part of the data is provided. Geocoding might be used to extract coordinates from an address string, and Distance Matrix can be used to compare the distance between them. Of course, the distance could be computed with a custom algorithm, but extracting coordinates probably couldn't. Unfortunately, using the Google Maps API for this purpose at this scale is not free, and we'd have to consider whether location matching is important enough to cover those costs.
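If the coordinates are already available (for example from a geocoding step), the "custom algorithm" for distance could be as simple as the haversine formula, sketched below. It gives the straight-line distance over the Earth's surface, not a travel distance, and the `location_score` helper with its 1 km cut-off is purely an assumption for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def location_score(coord_a, coord_b, max_km=1.0):
    """Score in [0, 1]: 1.0 at the same point, falling to 0.0 at `max_km` or more.

    `max_km` is a hypothetical tuning parameter.
    """
    distance = haversine_km(*coord_a, *coord_b)
    return max(0.0, 1.0 - distance / max_km)
```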

Email and URL

Fields like these should not differ for the same organization, but they might often be left unspecified. Even so, the values might differ in capitalization or in prefixes (like 'http://' or 'www.' for URLs), so we'd probably have to use matching algorithms here as well, or think about unifying those values while mapping them to the common schema.
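Unifying these values during mapping might look like the sketch below. The function names are illustrative, and the rules (lowercasing the whole email, dropping the scheme, a leading 'www.' and a trailing slash from URLs) are assumptions about what "the same" should mean here.

```python
from urllib.parse import urlsplit


def normalize_email(email):
    """Trim whitespace and lowercase; assumes mailboxes are case-insensitive in practice."""
    return email.strip().lower()


def normalize_url(url):
    """Reduce a URL to host plus path, ignoring scheme, 'www.' and a trailing slash."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # let urlsplit recognise the host part
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host + parts.path.rstrip("/")
```

After normalization a plain equality check may be enough, for example `normalize_url("HTTP://www.Example.org/") == normalize_url("example.org")`.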

Years Incorporated

If this data is accurate at least to the month, we can use it as a small boost to the matching ratio. Even if only the year matches, it might still be useful information.
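A rough sketch of such a boost, assuming the incorporation date is available as a `datetime.date`; the function name and the boost sizes (0.05 and 0.02) are made-up placeholders.

```python
from datetime import date


def incorporation_boost(a, b, month_boost=0.05, year_boost=0.02):
    """Small bonus added to the similarity ratio when incorporation dates agree.

    Returns `month_boost` when year and month match, `year_boost` when only
    the year matches, and 0.0 otherwise or when either date is missing.
    """
    if a is None or b is None or a.year != b.year:
        return 0.0
    if a.month == b.month:
        return month_boost
    return year_boost
```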

Description

This value might contain quite a large amount of text, so the string-matching algorithms used for names are not the best way to compare it. If we'd like to compare those fields, we should probably use something like the longest common subsequence (LCS) problem to compare the whole description text.
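A minimal LCS-based comparison could work on words rather than characters, which keeps the table small for long descriptions. The function names and the choice to scale the score as twice the LCS length over the total word count are assumptions for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    previous = [0] * (len(b) + 1)
    for item_a in a:
        current = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def description_similarity(text_a, text_b):
    """Score in [0, 1] based on the word-level LCS of two descriptions."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    if not words_a or not words_b:
        return 0.0
    return 2 * lcs_length(words_a, words_b) / (len(words_a) + len(words_b))
```

The running time grows with the product of the two word counts, so very long descriptions may still need truncating or sampling.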

Contact

Contact data might also be useful for finding matches. We should compare phone numbers between organizations, and maybe even the people associated with the organizations, if such data is available.
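Phone numbers are often written in different formats, so comparing them probably needs a normalization step first. The sketch below strips everything except digits and compares the trailing digits, which ignores an optional country code; the function names and the 10-digit default are assumptions that would need adjusting to the actual data.

```python
import re


def normalize_phone(phone):
    """Keep only the digits of a phone number."""
    return re.sub(r"\D", "", phone)


def phones_match(a, b, significant_digits=10):
    """True when two numbers agree on their last `significant_digits` digits.

    Shorter numbers must be identical after normalization; empty values never match.
    """
    digits_a, digits_b = normalize_phone(a), normalize_phone(b)
    if len(digits_a) < significant_digits or len(digits_b) < significant_digits:
        return bool(digits_a) and digits_a == digits_b
    return digits_a[-significant_digits:] == digits_b[-significant_digits:]
```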

Other

In edge cases, we could compare programs and services to find an organization match, but this might not be useful.