-
Notifications
You must be signed in to change notification settings - Fork 0
/
diy-ch05-record-linkage.Rmd
243 lines (154 loc) · 11.4 KB
/
diy-ch05-record-linkage.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
---
title: 'DIY: Matching people in the UK-UN sanction lists'
output:
html_document:
fig_caption: yes
highlight: tango
theme: spacelab
---
*From Chapter 5*
## Overview
Creating and maintaining lists of high-risk individuals and organizations is part of any national security apparatus. In fact, a number of nations and governing bodies publish such lists of enemies of the state and their aliases so that companies and people conducting global commerce can follow international sanctions. For example, a number of countries and international organizations maintain sanctions lists:
- [*US Consolidated Screening List*](http://2016.export.gov/ecr/eg_main_023148.asp)
- [*UK Financial sanctions targets*](https://www.gov.uk/government/publications/financial-sanctions-consolidated-list-of-targets/consolidated-list-of-targets)
- [*UN Sanctions List*]( https://scsanctions.un.org/resources/xml/en/consolidated.xml)
- [*EU Sanctions List*](http://ec.europa.eu/external_relations/cfsp/sanctions/list/version4/global/global.xml)
These data sets pertain to the serious business of global security, but also present an opportunity for record linkage using real-world data. Every day, companies sift through customer lists and marketing lists to understand their customers. Nonprofit development managers compare the lists of donors to marketing lists to find potential donors who resemble their current support base. And organizations operating in the interest of global security compare names of intended recipients of shipments against sanction lists.
##Examining the data
For simplicity, we will focus only on the UN and EU lists to learn the extent to which they have common identified risks. We begin by loading a `Rda` file containing all watch lists. Note that these lists are constantly updated, thus the files extracted for this exercise are a snapshot from one point in time.
```{r, warning= FALSE, message=FALSE}
load("data/watch_lists.Rda")
```
The watch lists capture similar attributes, but often are organized quite differently. Let's examine the case of Saddam Hussein, the former President of Iraq (Table \@ref(tab:saddam)). Both the EU and UN lists contain his first name and last name, birth date, and citizenship. There are slight deviations in how the names are recorded, but we can see how the lists could be matched after basic string manipulation.
```{r, saddam, echo = FALSE, message = FALSE, warning = FALSE}
# produce a data frame for comparing data sets
example <- data.frame( eu.variables = c(colnames(eu), rep("",6)),
eu.example = c(t(eu[eu$id==13,]), rep("",6)),
un.variables = colnames(un),
un.example = c(t(un[un$id==6908048,])))
#Loop through and limit number of characters
for(i in 1:4){
example[,i] <- substr(example[,i], 1,23)
example[,i][is.na(example[,i])] <- ""
}
row.names(example) <- NULL
knitr::kable(example, caption = "Comparison of EU and UN lists.", booktab = TRUE,
col.names = c("EU Variables", "", "UN Variables", "Example"))
```
Based on visual inspection, the following five fields contain similar information useful for matching: `firstname`, `middlename`/`secondname`, `lastname`/`thirdname`, `wholename`, and `birthdate`. Using the procedures laid out in this chapter, we can easily compare the performance of deterministic record linkage versus probabilistic record linkage.
##Data preparation
Different organizations treat names differently. Some may concatenate parts of first and middle names while others keep them separate. To minimize the influence of recording error, all parts of a name from the UN data set are concatenated into a `wholename` field, similar to the EU data, and filling any `NA` value with empty quotes. This will prevent `NA` values from being treated as "NA" strings. To establish a baseline for matching performance, we check for possible matches without any additional text standardization.
```{r, message=FALSE, warning=FALSE, results = "hide"}
#Fill empty quotes
eu[is.na(eu)] <- ""
un[is.na(un)] <- ""
#Create wholename
un$wholename <- paste(un$firstname, un$secondname, un$thirdname)
#Number of overlapping records
print(sum(un$wholename %in% eu$wholename))
```
Sadly, only $n = $`r sum(un$wholename %in% eu$wholename)` records from thousands match. With only a modest level of effort, we develop a function called `textStandard` to correct for typographical differences. Text characters can be standardized by stripping spaces that come before and after strings known as white space (`str_trim`), removing unnecessary punctuation and spaces (`str_remove_all` with a regex pattern `[[:punct:][:space:]]`), then converting strings to lower case (`str_to_lower`). As the procedure will be applied and re-applied, it is a good idea to write a re-usable function that strings together these three `stringr` package functions:
```{r}
textStandard <- function(string_val){
#
# Basic text standardization
# Args:
# string_val = a string or string vector
# Returns:
# Transformed, trimmed, lower case strings
#
require(stringr)
output <- string_val %>%
str_remove_all("[[:punct:][:space:]]") %>%
str_to_lower() %>%
str_trim()
}
```
A second attempt at matching returns much improved results, reaching $n=668$ matches.
```{r, message=FALSE, warning=FALSE, results = "hide"}
#Clean
un$clean.wholename <- textStandard(un$wholename)
eu$clean.wholename <- textStandard(eu$wholename)
#Overlap
print(sum(un$clean.wholename %in% eu$clean.wholename))
```
Next, the `birthdate` values also need to be standardized. Some UN records contain YYYY-MM-DD while others oddly capture minute-level detail (likely a recording error). The EU records are a mix of year and YYYY-MM-DD along with commentary on the data quality. Using `stringr` functions, we remove text from the `birthdate` fields as well as construct a four-digit `birthyear` field.
```{r, message=FALSE, warning=FALSE, results = "hide"}
#Clean UN to YYYY-MM-DD records
un$birthdate <- str_extract(un$birthdate, "\\d{4}-\\d{2}-\\d{2}")
#Remove text from YYYY
eu$birthdate <- str_replace(eu$birthdate, " \\(approximative\\)", "")
#Extract birth year using regex and stringr
un$birthyear <- str_extract(un$birthdate, "\\d{4}")
eu$birthyear <- str_extract(eu$birthdate, "\\d{4}")
```
## Deterministic record linkage
Our deterministic linkage strategy starts with the most precise information, then relaxes matching requirements in each of two subsequent matching waves:
- first: `clean.wholename` and `birthdate`;
- second: `clean.wholename` and `birthyear`; then
- third: `clean.wholename` only.
In each wave, we make note of which wave a match was made, giving credit to more precise matches.
```{r, message=FALSE, warning=FALSE, results = "hide"}
#Load for inner join function
pacman::p_load(dplyr)
#Abridged tables
eu_short <- eu[, c("eu.id", "wholename", "clean.wholename", "birthdate", "birthyear")]
un_short <- un[, c("un.id", "wholename", "clean.wholename", "birthdate", "birthyear")]
#Match waves
wave_1 <- inner_join(eu_short, un_short, by = c("clean.wholename", "birthdate"))
wave_2 <- inner_join(eu_short, un_short, by = c("clean.wholename", "birthyear"))
wave_3 <- inner_join(eu_short, un_short, by = c("clean.wholename"))
#Combine into master table and de-duplicate
det_key <- rbind(cbind(wave_1[, c("eu.id", "un.id", "clean.wholename")], match = 1),
cbind(wave_2[, c("eu.id", "un.id", "clean.wholename")], match = 2),
cbind(wave_3[, c("eu.id", "un.id", "clean.wholename")], match = 3))
det_key <- det_key[!duplicated(det_key[,1:3]), ]
```
From three waves of matching, a total of $n = 676$ matches (```nrow(det_key)```) were identified. This is only a fraction of the UN list ($n = 1046$ -- ```nrow(un_short)```) and the EU list ($n = 2016$ -- ```nrow(eu_short)```). While the most precise matches accounted for only $n=146$ (```nrow(wave_1)```), the bulk of matches were identified in the second wave. The third wave using `wholename` alone offers few gains over the first two waves.
##Probabilistic record linkage
There will undoubtedly be missed matching opportunities due to variation in spelling. Probabilistic record linkage can close the gap using the same three fields, but cast a wider net by allowing a partial fuzzy match on the `clean.wholename` field.
```{r, message=FALSE, warning=FALSE, results = "hide"}
#Load for fastLink
pacman::p_load(fastLink)
#Run example
prob_link <- fastLink(eu_short, un_short,
varnames = c("clean.wholename", "birthdate", "birthyear"),
stringdist.match = c("clean.wholename"),
partial.match = c("clean.wholename"))
```
A comparison of classification thresholds indicates that there is little difference in the match rate from $.75 \leq \xi \leq 0.9$ -- model's estimates are robust. The match rate is significantly greater than the deterministic approach.
```{r, message=FALSE, warning=FALSE, results = "hide"}
#Optimal cutoff
summary(prob_link, threshold = seq(0.75, 0.95, 0.05))
```
We use `getMatches` to recover the dataframe of matches, containing attributes from the `eu` data along with a row index for the corresponding match from the `un` data. In total, the model identified $n=915$ matches -- most of which are $1:1$ matches and a smaller number ($n=22$) are $1:m$ matches. Virtually all deterministic matches are also present in the probabilistic match results, thus the latter can do the job of the former and much more. There are clearly more matches generated through probabilistic matching, but some may wonder if a model-based approach actually surpasses human intuition: *Are the matches accurate?*
By drawing a random subset of non-exact matches, there is clear evidence that the matching model is robust to spelling differences, excess and missing information, and order of names. The proof is in the pudding. Probabilistic record linkage is arguably a superior strategy to deterministic record linkage, but in practice, it does not hurt to try both.
```{r, message=FALSE, warning=FALSE, results = "hide"}
#Recover matches
prob_match <- getMatches(dfA = eu_short, dfB = un_short,
fl.out = prob_link, threshold.match = 0.85)
#Recover UN IDs, whole name and birth year for visual inspection
prob_match$un.id <- un_short$un.id[prob_match$`dfB.match[, names.dfB]`]
prob_match$un.wholename <- un_short$wholename[prob_match$`dfB.match[, names.dfB]`]
prob_match$un.birthyear <- un_short$birthyear[prob_match$`dfB.match[, names.dfB]`]
```
```{r, echo = FALSE, message = FALSE, warning = FALSE, error = FALSE}
#Extract nice looking results
outie <- prob_match[, c("un.wholename", "un.birthyear",
"wholename", "birthyear")]
#Format
outie$un.wholename <- str_to_title(outie$un.wholename)
outie$wholename <- str_to_title(outie$wholename)
#Find cases where information was different
flag1 <- stringdist(outie$wholename, outie$un.wholename, method = "jw")
outie <- outie[flag1 > 0.05 & flag1 < 0.4, ]
outie <- outie[nchar(outie$un.wholename) < 25,]
#Print out
set.seed(100)
knitr::kable(outie[sample(1:nrow(outie),9),],
caption = "Sample results from probabilistic record linkage.", booktab = TRUE,
row.names = FALSE,
col.names = c("Name (UN)", "Year (UN)",
"Name (EU)", "Year (EU)"))
```
##References