Matching algorithm approaches
This page describes the matching algorithm approaches that might be taken to match two similar organizations.
First of all, to make matching organizations possible, data from each provider has to be unified into a common form, so the algorithms described here assume that the entities are consistent. Due to possible differences between data sets, we will probably need a separate adapter for each of them. Using an ETL process to migrate that data might be a good solution.
For the matching algorithm to be effective, it has to match some organizations even if not all of their data are equal. Different parts of the data are therefore compared, and if any of them match, we should treat the pair as a possible match. A good solution would be to show the user the most likely matches as a short list, so they can select one manually. If there is a 100% match, we might skip this step and only inform the user that the organization was matched, so they can manually revert the action if they disagree. A similarity ratio should be calculated and thresholds specified to decide whether a match is complete, partial or accidental.
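The threshold idea can be sketched roughly as follows. This is only an illustration: the threshold values, the function names and the short-list size are assumptions, not values defined on this page.

```python
from typing import List, Tuple

# Hypothetical thresholds; real values would have to be tuned on real data.
COMPLETE_THRESHOLD = 1.0
PARTIAL_THRESHOLD = 0.6


def classify_match(ratio: float) -> str:
    """Map a similarity ratio in [0, 1] to a match category."""
    if ratio >= COMPLETE_THRESHOLD:
        return "complete"
    if ratio >= PARTIAL_THRESHOLD:
        return "partial"
    return "accidental"


def shortlist(candidates: List[Tuple[str, float]], limit: int = 5) -> List[Tuple[str, float]]:
    """Return the best candidates (id, ratio), highest ratio first.

    Accidental matches are dropped so the user only sees plausible options.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [c for c in ranked if classify_match(c[1]) != "accidental"][:limit]
```

A complete match could then be applied automatically (with a notification and a way to revert it), while partial matches would go to the short list.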
The fields that should be compared are described below, with proposals for how to match them. Each field should also be compared with its historical values, since both entities might represent the same organization while part of the data is outdated. Each field should have a different weight so that the similarity ratio is calculated properly. Those weights might be established later by testing different combinations to see which ones give the best results.
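One possible way to combine per-field scores into a single ratio is a weighted average, taking the best score across current and historical values of each field. The sketch below assumes each organization is a dict mapping a field name to a list of its current and historical values; the field names, the weights and the `compare` callback are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical weights; the page suggests tuning them by experiment.
FIELD_WEIGHTS: Dict[str, float] = {"name": 0.5, "address": 0.3, "url": 0.2}


def best_field_score(values_a: List[str], values_b: List[str],
                     compare: Callable[[str, str], float]) -> float:
    """Best score over every pairing of current and historical values."""
    return max((compare(a, b) for a in values_a for b in values_b), default=0.0)


def similarity_ratio(org_a: Dict[str, List[str]], org_b: Dict[str, List[str]],
                     compare: Callable[[str, str], float],
                     weights: Dict[str, float] = FIELD_WEIGHTS) -> float:
    """Weighted average of field scores, ignoring fields missing on either side."""
    total = 0.0
    used_weight = 0.0
    for field, weight in weights.items():
        values_a = org_a.get(field, [])
        values_b = org_b.get(field, [])
        if not values_a or not values_b:
            continue  # an unspecified field should not count against the match
        total += weight * best_field_score(values_a, values_b, compare)
        used_weight += weight
    return total / used_weight if used_weight else 0.0
```

Skipping fields that are missing on either side is a design choice of this sketch; an alternative would be to treat a missing field as a score of zero.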
Both fields are mandatory for organization matching. Of course, the name is more important. Unfortunately, the same name can be misspelled, or simply differ in word order, capitalization, etc. There are already several algorithms that compare strings in terms of similarity. Each of them uses a somewhat different approach, so a combination of at least two of them would be a good solution (if the memory and time required for the comparisons allow it to work efficiently). A quick overview of the most popular algorithms for such matches can be found in the blogs below:
- http://ntz-develop.blogspot.com/
- https://www.rosette.com/blog/overview-fuzzy-name-matching-techniques/
The most common algorithms use Levenshtein distance, and libraries that match strings this way already exist, which might save implementation time, but if we rely on external solutions we'd have to test them well. It's important to verify whether the selected algorithm can ignore the order of words. If it doesn't, we should probably sort the words alphabetically before comparing them. We should also manually maintain a list of known abbreviations of organization names, as the algorithms themselves won't match them.
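A minimal sketch of this approach, using a plain Levenshtein implementation instead of an external library, might look like the code below. The abbreviation table and all function names are illustrative assumptions; a real table would have to be curated by hand.

```python
# Hypothetical, manually maintained abbreviation table.
KNOWN_ABBREVIATIONS = {"ymca": "young men's christian association"}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize_name(name: str) -> str:
    """Lowercase, expand known abbreviations, and sort words alphabetically."""
    lowered = " ".join(name.lower().split())
    lowered = KNOWN_ABBREVIATIONS.get(lowered, lowered)
    return " ".join(sorted(lowered.split()))


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after normalization."""
    norm_a, norm_b = normalize_name(a), normalize_name(b)
    longest = max(len(norm_a), len(norm_b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(norm_a, norm_b) / longest
```

Sorting the words before comparing makes "Food Bank Seattle" and "seattle food bank" identical, at the cost of also equating names that differ only in word order.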
The address of two locations could be a really accurate signal for the matching algorithm if the Google Maps API were used. Google Maps can return geographic coordinates even if the address is misspelled or only part of it is provided. Geocoding can be used to extract coordinates from an address string, and Distance Matrix can be used to compare the distance between them. Of course, the distance could be computed with a custom algorithm, but extracting coordinates probably couldn't. Unfortunately, using the Google Maps API for this purpose at this scale is not free, and we'd have to consider whether location matching is important enough to cover those costs.
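As an illustration of the "custom algorithm" option for distance, the haversine formula gives the great-circle distance between two coordinate pairs. The sketch assumes the coordinates were already obtained from a geocoder; the function names and the 100 m tolerance are hypothetical. Note that this is a straight-line distance, not the travel distance a routing service would return.

```python
import math
from typing import Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def same_location(a: Tuple[float, float], b: Tuple[float, float],
                  max_km: float = 0.1) -> bool:
    """Treat two (lat, lon) pairs as the same place if within max_km."""
    return haversine_km(a[0], a[1], b[0], b[1]) <= max_km
```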
Fields like these should not differ for the same organization, but they might often be unspecified. The values might still differ in capitalization or in prefixes (like 'http://' or 'www.' for URLs), so we'd probably have to use matching algorithms here as well, or think about unifying those values while mapping them to the common schema.
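The "unify while mapping" option could be as simple as the sketch below for URLs. The function name is an assumption, and lowercasing the whole value is a simplification, since URL paths can be case-sensitive.

```python
import re


def normalize_url(url: str) -> str:
    """Strip scheme, leading 'www.', trailing slashes, and case differences."""
    value = url.strip().lower()
    value = re.sub(r"^[a-z][a-z0-9+.-]*://", "", value)  # drop 'http://', 'https://', ...
    value = re.sub(r"^www\.", "", value)                  # drop a leading 'www.'
    return value.rstrip("/")
```

After normalization, two URLs can be compared with plain equality, which is much cheaper than a fuzzy string comparison.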
If this data is accurate at least to the month, we can use it as a small boost to the matching ratio. Even if only the year matches, it might still be useful information.
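A rough sketch of such a boost is shown below. The boost sizes and names are hypothetical and would need to be tuned together with the other field weights.

```python
from datetime import date
from typing import Optional

# Hypothetical boost sizes, small relative to a ratio in [0, 1].
MONTH_MATCH_BOOST = 0.05
YEAR_MATCH_BOOST = 0.02


def date_boost(a: Optional[date], b: Optional[date]) -> float:
    """Return a small bonus when two dates agree on year, or on year and month."""
    if a is None or b is None:
        return 0.0  # an unspecified date gives no evidence either way
    if a.year != b.year:
        return 0.0
    if a.month == b.month:
        return MONTH_MATCH_BOOST
    return YEAR_MATCH_BOOST
```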
This value might hold quite a large amount of text, so the string matching algorithms used for names are not the best way to compare it. If we'd like to compare these fields, we should probably use something like the longest common subsequence (LCS) problem to compare the whole description text.
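One way to apply LCS here, sketched under the assumption that descriptions are compared word by word rather than character by character, is shown below. The function names and the choice of normalizing by the longer text are illustrative.

```python
from typing import Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence, using two DP rows."""
    previous = [0] * (len(b) + 1)
    for item_a in a:
        current = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def description_similarity(text_a: str, text_b: str) -> float:
    """Share of the longer description's words covered by the common subsequence."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 0.0
    return lcs_length(words_a, words_b) / longest
```

Working on words keeps the table much smaller than a character-level comparison, which matters for long descriptions since the running time grows with the product of the two lengths.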
Contact data might also be useful for finding matches. We should compare phone numbers between organizations, and maybe even the people associated with the organizations, if such data is available.
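Phone numbers are usually written in many formats, so a comparison would likely need a normalization step first. The sketch below keeps only digits and compares the trailing ten of them; that length is an assumption that fits North American numbers and would need adjusting for other regions. All names are hypothetical.

```python
import re
from typing import Iterable, Set


def normalize_phone(raw: str, significant_digits: int = 10) -> str:
    """Keep digits only and return the trailing significant digits."""
    digits = re.sub(r"\D", "", raw)
    return digits[-significant_digits:]


def shared_phones(phones_a: Iterable[str], phones_b: Iterable[str]) -> Set[str]:
    """Normalized phone numbers that appear for both organizations."""
    set_a = {normalize_phone(p) for p in phones_a}
    set_b = {normalize_phone(p) for p in phones_b}
    return (set_a & set_b) - {""}  # ignore values that contained no digits
```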
In edge cases, we could compare programs and services to find an organization match, but this might not be useful.