An R package for cleaning and matching messy hierarchically-structured data (e.g. country / region / district / municipality). The general goal is to match sets of hierarchical values in a raw dataset to corresponding values within a reference dataset, while accounting for potential discrepancies such as:
- variation in character case, punctuation, spacing, use of accents, or spelling
- variation in hierarchical resolution (e.g. some entries specified to municipality-level but others only to region)
- missing values at one or more hierarchical levels
- values entered at the wrong hierarchical level
Install from GitHub with:
# install.packages("remotes")
remotes::install_github("epicentre-msf/hmatch")
hmatch
: match hierarchical sequences up to the highest-resolution level specified within a given row of raw data, optionally allowing for missing values below the match level, and fuzzy matches (using the stringdist package)
hmatch_tokens
: match tokens rather than entire strings to allow for variation in multi-term nameshmatch_permute
: sequentially permute hierarchical columns to allow for values entered at the wrong levelhmatch_parents
: match values at a given hierarchical level based on shared sets of ‘offspring’hmatch_settle
: try matching at every level and settle for the highest-resolution match possiblehmatch_manual
: match using a user-supplied dictionaryhmatch_split
: implement any otherhmatch_
function separately at each hierarchical level, only on unique sequenceshmatch_composite
: implement a variety of matching strategies in sequence, from most to least strict
Independent of optional fuzzy matching with
stringdist, hmatch
functions use behind-the-scenes string standardization to help account
for variation in character case, punctuation, spacing, or use of accents
between the raw and reference data. E.g.
raw_value reference_value match
----------------------------------------------------
original: ILE DE FRANCE Île-de-France FALSE
standardized: ile_de_france ile_de_france TRUE
Users can choose default standardization (illustrated above), no
standardization, or supply their own preferred function to standardize
strings (e.g. tolower
).
The hmatch
package contains example datasets ne_raw
(messy
geographical data) and ne_ref
(reference data derived from a
shapefile), based on a small subset of northeastern North America.
library(hmatch)
head(ne_raw) # raw messy data
#> id adm0 adm1 adm2
#> 1 PID01 USA New York Suffolk
#> 2 PID02 can ontario <NA>
#> 3 PID03 USA New York Kings County
#> 4 PID04 <NA> <NA> Philadelphia
#> 5 PID05 USA <NA> York
#> 6 PID06 USA new. york jefferson
head(ne_ref) # reference data derived from shapefile
#> level adm0 adm1 adm2 hcode
#> 1 adm0 CAN <NA> <NA> 100
#> 2 adm0 USA <NA> <NA> 200
#> 3 adm1 CAN Ontario <NA> 110
#> 4 adm1 USA New Jersey <NA> 210
#> 5 adm1 USA New York <NA> 220
#> 6 adm1 USA Pennsylvania <NA> 230
We’ll start with a simple call to hmatch
to see which rows can be
matched with no extra magic.
hmatch(ne_raw, ne_ref, pattern = "^adm")
#> id adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1 PID01 USA New York Suffolk adm2 USA New York Suffolk 227
#> 2 PID02 can ontario <NA> adm1 CAN Ontario <NA> 110
#> 3 PID03 USA New York Kings County <NA> <NA> <NA> <NA> <NA>
#> 4 PID04 <NA> <NA> Philadelphia adm2 USA Pennsylvania Philadelphia 237
#> 5 PID05 USA <NA> York adm2 USA Pennsylvania York 238
#> 6 PID06 USA new. york jefferson adm2 USA New York Jefferson 222
#> 7 PID07 CAN Ontario Peel R.M. <NA> <NA> <NA> <NA> <NA>
#> 8 PID08 USA Pensylvania Ithaca <NA> <NA> <NA> <NA> <NA>
#> 9 PID09 USA New_York King <NA> <NA> <NA> <NA> <NA>
#> 10 PID10 <NA> <NA> Bergen, N.J. <NA> <NA> <NA> <NA> <NA>
#> 11 PID11 USA Philadelphia <NA> <NA> <NA> <NA> <NA> <NA>
#> 12 PID12 USA NJ <NA> <NA> <NA> <NA> <NA> <NA>
#> 13 PID13 <NA> <NA> Jeffersen <NA> <NA> <NA> <NA> <NA>
#> 14 PID14 <NA> <NA> york adm2 CAN Ontario York 115
#> 15 PID14 <NA> <NA> york adm2 USA Pennsylvania York 238
#> 16 PID15 USA New York State New York <NA> <NA> <NA> <NA> <NA>
There are still quite a few unmatched rows, and entry ‘PID14’ actually
matches two different rows within ref
, so we’ll press on. We can
separate the matched and unmatched rows using inner- and anti-joins
respectively, specifically using the “resolve_” join type here to only
consider matches that are unique.
(raw_match1 <- hmatch(ne_raw, ne_ref, pattern = "^adm", type = "resolve_inner"))
#> id adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1 PID01 USA New York Suffolk adm2 USA New York Suffolk 227
#> 2 PID02 can ontario <NA> adm1 CAN Ontario <NA> 110
#> 3 PID04 <NA> <NA> Philadelphia adm2 USA Pennsylvania Philadelphia 237
#> 4 PID05 USA <NA> York adm2 USA Pennsylvania York 238
#> 5 PID06 USA new. york jefferson adm2 USA New York Jefferson 222
(raw_remain1 <- hmatch(ne_raw, ne_ref, pattern = "^adm", type = "resolve_anti"))
#> id adm0 adm1 adm2
#> 1 PID03 USA New York Kings County
#> 2 PID07 CAN Ontario Peel R.M.
#> 3 PID08 USA Pensylvania Ithaca
#> 4 PID09 USA New_York King
#> 5 PID10 <NA> <NA> Bergen, N.J.
#> 6 PID11 USA Philadelphia <NA>
#> 7 PID12 USA NJ <NA>
#> 8 PID13 <NA> <NA> Jeffersen
#> 9 PID14 <NA> <NA> york
#> 10 PID15 USA New York State New York
Next we’ll add in fuzzy-matching, using the default maximum string-distance of 1.
hmatch(raw_remain1, ne_ref, pattern = "^adm", fuzzy = TRUE, type = "inner")
#> id adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1 PID09 USA New_York King adm2 USA New York Kings 223
#> 2 PID13 <NA> <NA> Jeffersen adm2 USA New York Jefferson 222
#> 3 PID13 <NA> <NA> Jeffersen adm2 USA Pennsylvania Jefferson 235
#> 4 PID14 <NA> <NA> york adm2 CAN Ontario York 115
#> 5 PID14 <NA> <NA> york adm2 USA Pennsylvania York 238
Only one additional unique match, so we’ll again split and move on.
Note that we’ve been using the pattern
argument above to specify the
hierarchical columns in raw
and ref
, but because the hierarchical
columns have the same names in raw
and ref
(and are the only
matching column names), we can drop the pattern
argument for brevity.
(raw_match2 <- hmatch(raw_remain1, ne_ref, fuzzy = TRUE, type = "resolve_inner"))
#> id adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1 PID09 USA New_York King adm2 USA New York Kings 223
(raw_remain2 <- hmatch(raw_remain1, ne_ref, fuzzy = TRUE, type = "resolve_anti"))
#> id adm0 adm1 adm2
#> 1 PID03 USA New York Kings County
#> 2 PID07 CAN Ontario Peel R.M.
#> 3 PID08 USA Pensylvania Ithaca
#> 4 PID10 <NA> <NA> Bergen, N.J.
#> 5 PID11 USA Philadelphia <NA>
#> 6 PID12 USA NJ <NA>
#> 7 PID13 <NA> <NA> Jeffersen
#> 8 PID14 <NA> <NA> york
#> 9 PID15 USA New York State New York
Next let’s try hmatch_tokens
, which matches based on components of
strings (i.e. tokens) rather than entire strings.
(raw_match3 <- hmatch_tokens(raw_remain2, ne_ref, type = "resolve_inner"))
#> id adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1 PID03 USA New York Kings County adm2 USA New York Kings 223
#> 2 PID07 CAN Ontario Peel R.M. adm2 CAN Ontario Peel 113
#> 3 PID10 <NA> <NA> Bergen, N.J. adm2 USA New Jersey Bergen 211
#> 4 PID15 USA New York State New York adm2 USA New York New York 225
(raw_remain3 <- hmatch_tokens(raw_remain2, ne_ref, type = "resolve_anti"))
#> id adm0 adm1 adm2
#> 1 PID08 USA Pensylvania Ithaca
#> 2 PID11 USA Philadelphia <NA>
#> 3 PID12 USA NJ <NA>
#> 4 PID13 <NA> <NA> Jeffersen
#> 5 PID14 <NA> <NA> york
If there are any values entered at the wrong hierarchical level, we can try systematically permuting the hierarchical columns before matching.
(raw_match4 <- hmatch_permute(raw_remain3, ne_ref, type = "resolve_inner"))
#> id adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1 PID11 USA Philadelphia <NA> adm2 USA Pennsylvania Philadelphia 237
(raw_remain4 <- hmatch_permute(raw_remain3, ne_ref, type = "resolve_anti"))
#> id adm0 adm1 adm2
#> 1 PID08 USA Pensylvania Ithaca
#> 2 PID12 USA NJ <NA>
#> 3 PID13 <NA> <NA> Jeffersen
#> 4 PID14 <NA> <NA> york
For the remaining rows that we haven’t yet matched, there a few options.
We could use hmatch_settle()
to settle for matches below the
highest-resolution level specified within a given row of raw
. We could
also do some ‘manual’ comparison of the raw and reference datasets and
create a dictionary to recode values within raw
to match corresponding
entries in ref
. Here we’ll do both.
ne_dict <- data.frame(
value = "NJ",
replacement = "New Jersey",
variable = "adm1"
)
(raw_match5 <- hmatch_settle(raw_remain4, ne_ref, dict = ne_dict,
fuzzy = TRUE, type = "resolve_inner"))
#> id adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1 PID08 USA Pensylvania Ithaca adm1 USA Pennsylvania <NA> 230
#> 2 PID12 USA NJ <NA> adm1 USA New Jersey <NA> 210
#> 3 PID13 <NA> <NA> Jeffersen adm2 USA <NA> <NA> 222
(raw_remain5 <- hmatch_settle(raw_remain4, ne_ref, dict = ne_dict,
fuzzy = TRUE, type = "resolve_anti"))
#> id adm0 adm1 adm2
#> 1 PID14 <NA> <NA> york