nmatch: Fuzzy matching for proper names

Lifecycle: experimental

Compare or match proper names from different sources, accounting for potential variation in format, name order, and spelling (e.g. “Beyoncé Knowles” vs. “KNOWLES-CARTER, Beyonce”). Specifically, nmatch uses string standardization to account for variation in punctuation, accents, and character case; token permutation to account for variation in name order; and fuzzy matching to handle alternate spellings.
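
The three steps can be illustrated with a minimal base-R sketch. This is not nmatch's implementation: `std_tokens()` and `compare_names()` are hypothetical helpers, the accent map covers only a handful of characters, and each token of the shorter name is simply paired with its closest token in the other name rather than aligned one-to-one.

```r
# Hypothetical helper (not part of nmatch): standardize one name and split it
# into tokens. Only a few accented characters are mapped, for illustration.
std_tokens <- function(x) {
  x <- chartr("\u00e9\u00e8\u00ea\u00eb\u00c9\u00c8\u00ca\u00cb", "eeeeEEEE", x)
  x <- gsub("[[:punct:]]+", " ", x)   # punctuation becomes a token separator
  x <- toupper(trimws(x))             # ignore character case
  strsplit(x, " +")[[1]]
}

# Hypothetical helper (not part of nmatch): order-insensitive comparison of
# two names, returning a summed edit distance over the best-matching tokens.
compare_names <- function(x, y) {
  tx <- std_tokens(x)
  ty <- std_tokens(y)
  d <- utils::adist(tx, ty)           # edit distance for every token pair
  # for each token of the shorter name, take its closest token in the other
  best <- if (length(tx) <= length(ty)) apply(d, 1, min) else apply(d, 2, min)
  sum(best)
}

d_beyonce  <- compare_names("Beyonc\u00e9 Knowles", "KNOWLES-CARTER, Beyonce")
d_kendrick <- compare_names("Kendrick Lamar Duckworth", "LAMAR, Kendrik")
```

Here `d_beyonce` is 0, because both tokens of the shorter name have an exact counterpart once accents, case, and punctuation are removed, and `d_kendrick` is 1, because "KENDRIK" is one edit away from "KENDRICK".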

Installation

Install from GitHub with:

# install.packages("remotes")
remotes::install_github("epicentre-msf/nmatch")

Example usage

library(nmatch)
library(dplyr)
library(purrr)
library(tidyr)

# example data with names to compare
data("names_ex")

# compare with default settings
names_ex %>% 
  mutate(is_match = nmatch(name_source1, name_source2))
#>                  name_source1           name_source2 is_match
#> 1             Beyoncé Knowles Beyonce Knowles-Carter     TRUE
#> 2    Frédéric François Chopin    CHOPIN, Fryderyk F.    FALSE
#> 3    Kendrick Lamar Duckworth         LAMAR, Kendrik     TRUE
#> 4 Calvin Cordozar Broadus Jr.             Snoop Dogg    FALSE
#> 5 Céline Marie Claudette Dion           DION, Céline     TRUE
#> 6         Aubrey Drake Graham                  Drake    FALSE

# return full match details
names_ex %>% 
  mutate(match_df = purrr::map2(name_source1, name_source2, nmatch, return_full = TRUE)) %>% 
  tidyr::unnest("match_df")
#> # A tibble: 6 × 11
#>   name_source1                name_source2     is_match    id   k_x   k_y k_align n_match dist_total freq_score align   
#>   <chr>                       <chr>            <lgl>    <int> <int> <int>   <int>   <int>      <int> <chr>      <list>  
#> 1 Beyoncé Knowles             Beyonce Knowles… TRUE         1     2     3       2       2          0 <NA>       <tibble>
#> 2 Frédéric François Chopin    CHOPIN, Frydery… FALSE        1     3     2       2       1          3 <NA>       <tibble>
#> 3 Kendrick Lamar Duckworth    LAMAR, Kendrik   TRUE         1     3     2       2       2          1 <NA>       <tibble>
#> 4 Calvin Cordozar Broadus Jr. Snoop Dogg       FALSE        1     4     2       2       0         10 <NA>       <tibble>
#> 5 Céline Marie Claudette Dion DION, Céline     TRUE         1     4     2       2       2          0 <NA>       <tibble>
#> 6 Aubrey Drake Graham         Drake            FALSE        1     3     1       1       1          0 <NA>       <tibble>

# use a custom function to classify matches, based on any of the variables
# returned when return_full = TRUE (i.e. k_x, k_y, k_align, n_match, dist_total)
# note that the function must include a `...` argument, even if it is not used
classify_matches <- function(k_align, dist_total, ...) {
  k_align >= 1 & dist_total <= 3
}

names_ex %>% 
  mutate(is_match = nmatch(name_source1, name_source2, eval_fn = classify_matches))
#>                  name_source1           name_source2 is_match
#> 1             Beyoncé Knowles Beyonce Knowles-Carter     TRUE
#> 2    Frédéric François Chopin    CHOPIN, Fryderyk F.     TRUE
#> 3    Kendrick Lamar Duckworth         LAMAR, Kendrik     TRUE
#> 4 Calvin Cordozar Broadus Jr.             Snoop Dogg    FALSE
#> 5 Céline Marie Claudette Dion           DION, Céline     TRUE
#> 6         Aubrey Drake Graham                  Drake     TRUE
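
A classification rule can be prototyped without calling nmatch at all, by applying it to the values shown in the `return_full = TRUE` output above. The sketch below does that for a stricter, hypothetical rule named `classify_strict()`; it assumes the variables are passed as vectors, as the vectorized `&` in `classify_matches()` suggests.

```r
# Values copied from the return_full = TRUE output above (rows 1 to 6)
full <- data.frame(
  k_align    = c(2, 2, 2, 2, 2, 1),
  n_match    = c(2, 1, 2, 0, 2, 1),
  dist_total = c(0, 3, 1, 10, 0, 0)
)

# Hypothetical rule: at least two aligned tokens, all of them matching,
# and at most one edit in total. The `...` absorbs any unused variables.
classify_strict <- function(k_align, n_match, dist_total, ...) {
  k_align >= 2 & n_match == k_align & dist_total <= 1
}

strict <- with(full, classify_strict(k_align, n_match, dist_total))
strict
#> [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE
```

Once the rule behaves as intended on known cases, it can be supplied through the `eval_fn` argument in the same way as `classify_matches`.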

Using with fuzzyjoin

library(fuzzyjoin)

# example hospital datasets (IPD and ICU) with patient names in different
# formats (note that these names are just random combinations of common French names)
data("dat_ipd")
data("dat_icu")

# join IPD and ICU datasets by name variables, using fuzzy matching with nmatch()
fuzzyjoin::fuzzy_join(
  dat_ipd,
  dat_icu,
  by = c("name_ipd" = "name_icu"),
  match_fun = nmatch::nmatch,
  mode = "left"
)
#> # A tibble: 10 × 4
#>    name_ipd                           date_ipd   name_icu                 date_icu  
#>    <chr>                              <date>     <chr>                    <date>    
#>  1 COLLET, André Daniel               2020-07-17 André D. Colet           2020-07-19
#>  2 LÈFEVRE, Françoise Sylvie          2020-06-03 Francoise Lefevre        2020-06-05
#>  3 DUBOIS, Monique Léa                2020-06-10 Monique Léa Dubois       2020-06-14
#>  4 GUÉRIN, Jacqueline Hélène          2020-07-09 Jacqueline Hélène Guérin 2020-07-09
#>  5 MARTIN, Philippe Arnaud            2020-06-11 <NA>                     NA        
#>  6 DUMONT, René Stéphane              2020-07-26 Renae  Dumont            2020-07-29
#>  7 LÉVEILLÉ, Anne-Charlotte Catherine 2020-06-01 <NA>                     NA        
#>  8 MARCHAND, Louis Enzo               2020-07-17 <NA>                     NA        
#>  9 SANCHEZ, Isabelle Suzanne          2020-07-03 <NA>                     NA        
#> 10 ROUX, Nathalie Elisabeth           2020-07-06 Natalia Roux             2020-07-06

Note that matching large datasets using fuzzy_join() and nmatch() may be very slow, because names in one dataset are compared against names in the other pair by pair. The nmatch package will eventually contain stand-alone join functions that are optimized for name matching.
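
To get a feel for how quickly the work grows, the sketch below counts candidate pairs for two small, made-up name vectors, first for every combination of rows and then for distinct names only. It illustrates the scaling problem in base R and is not a description of how fuzzy_join() or nmatch work internally; `names_a` and `names_b` are hypothetical.

```r
# Hypothetical name vectors standing in for two larger datasets
names_a <- c("DUBOIS, Monique", "ROUX, Nathalie", "DUBOIS, Monique", "MARTIN, Philippe")
names_b <- c("Monique Dubois", "Natalia Roux", "Monique Dubois")

# comparing every row with every row: the count is the product of the lengths
n_pairs_all <- length(names_a) * length(names_b)

# comparing distinct names only shrinks the candidate set when names repeat
pairs_unique <- expand.grid(
  a = unique(names_a),
  b = unique(names_b),
  stringsAsFactors = FALSE
)
n_pairs_unique <- nrow(pairs_unique)

c(all = n_pairs_all, unique = n_pairs_unique)
#>    all unique 
#>     12      6
```

With thousands of rows on each side the product reaches millions of comparisons, so reducing the number of candidate pairs before matching is usually worth considering.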