Dealing with aliases in FastLink #75
Disclaimer: I am a The accuracy of the linkage results will greatly improve if you have more and better linkage variables such as Social Security Number. But I would like to focus my reply comments on I am not aware of an always-best answer. The Regardless, I recommend paying attention to the de-duplication results -- especially if your main concern is wrong matches (false positives) rather than missed matches (false negatives). The default is Also, In summary, the
I need to conduct a linkage in R using both deterministic and probabilistic methods. The identifying fields we are using are first name, last name, and date of birth. One of our datasets includes information about civil legal system involvement, and the other involves information about arrests. Especially in the arrest data, a single person might have multiple aliases or different dates of birth recorded. It's hard to know which of those are legally correct, and sometimes only a combination of information across alias records provides the full picture about a person's identity. We do have a "fingerprint ID" that allows us to see how a person's identity has been recorded across time in the arrest data.
Is there a way to use the FastLink package that lets us keep (and leverage) all the nuanced information these aliases provide when we undertake the linkage? Or is it necessary to de-duplicate the arrest data and choose a single name for each person? That feels arbitrary, and it would inevitably lose some important data that could critically improve the validity of the linkage.
An example dataset is shown below. As you'll notice, the first person has 4 different entries with minor variations in first name, last name, and DOB.
arrests <- data.frame(
  fingerprint_id = c("123321", "123321", "123321", "123321", "431940", "532523"),
  first = c("Joseph", "Johan", "Johan", "Johan", "Kristn", "Adam"),
  last = c("Shmo", "Shomseff", "Shomseff", "Shomsef", "Mickleson", "Gregerson"),
  dob = c("05/25/1987", "05/25/1987", "02/25/1987", "02/25/1987", "01/17/1955", "06/05/1995")
)
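To make the "keep every alias" option concrete, here is a rough sketch of what I have in mind, in base R only. Every alias row is matched on its own, and the matches are then rolled up to the person level through the fingerprint ID. The `civil` data frame and its `civil_id` column are invented for illustration, and the exact `merge()` is only a stand-in for the deterministic step; a probabilistic matcher would presumably take its place at that point.

```r
# Example arrest data from above, repeated so the sketch runs on its own.
arrests <- data.frame(
  fingerprint_id = c("123321", "123321", "123321", "123321", "431940", "532523"),
  first = c("Joseph", "Johan", "Johan", "Johan", "Kristn", "Adam"),
  last  = c("Shmo", "Shomseff", "Shomseff", "Shomsef", "Mickleson", "Gregerson"),
  dob   = c("05/25/1987", "05/25/1987", "02/25/1987", "02/25/1987",
            "01/17/1955", "06/05/1995"),
  stringsAsFactors = FALSE
)

# Hypothetical civil records, invented purely for illustration.
civil <- data.frame(
  civil_id = c("c1", "c2"),
  first = c("Johan", "Adam"),
  last  = c("Shomseff", "Gregerson"),
  dob   = c("02/25/1987", "06/05/1995"),
  stringsAsFactors = FALSE
)

# Keep one row per distinct alias (name + DOB) within each fingerprint ID,
# rather than picking a single "true" identity per person.
aliases <- unique(arrests[, c("fingerprint_id", "first", "last", "dob")])

# Stand-in for the deterministic step: exact match of every alias
# against the civil records on first name, last name, and DOB.
alias_matches <- merge(aliases, civil, by = c("first", "last", "dob"))

# Roll alias-level matches up to the person level via the fingerprint ID,
# so a person counts as linked if any one of their aliases matched.
person_matches <- unique(alias_matches[, c("fingerprint_id", "civil_id")])
print(person_matches)
```

In this toy case the person with fingerprint ID 123321 is linked through only one of the four aliases, which is the kind of information that would be lost if a different alias had been picked as the single "true" record.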
Thanks for any guidance on this. I'm new to the linkage world.