Extracting matches when using blocking #63

Closed
jamesmartherus opened this issue Aug 4, 2022 · 4 comments

@jamesmartherus

Hi there, I am trying to identify duplicates in a large dataset. I am blocking on several variables, aggregating with aggregateEM() and then trying to extract the matches with getMatches(). It looks like getMatches() won't work with the fastLink.aggregate class. Is there some other way to get the same functionality?

Reprex:

library(fastLink)
library(foreach)

data <- data.frame(gender = c(1,2,1,1,1,1,2,2,1,2),
                   age = c(18, 25, 18, 35, 45, 55, 65, 76, 87, 98))

blocks <- blockData(data, data, varnames = c("gender"))

tmp_clus <- parallel::makeCluster(spec = parallel::detectCores()-2, 
                                  type = 'PSOCK')  
doParallel::registerDoParallel(tmp_clus)

em_list <- foreach::foreach(i = seq_along(blocks), .verbose = FALSE) %dopar%
  {
    library(fastLink)
    data_block <- data[blocks[[i]]$dfA.inds,]
    
    fastLink(
      dfA = data_block, dfB = data_block, 
      varnames = c("gender", "age")
      )
  }
parallel::stopCluster(tmp_clus)

em_aggregated <- aggregateEM(em_list)

data_dedupe <- getMatches(dfA = data, dfB = data,
                          fl.out = em_aggregated)

# Error in getMatches(dfA = data, dfB = data, fl.out = em_aggregated) : 
#   dfA and dfB are identical, but fl.out is not of class 'fastLink.dedupe.' Please check your inputs.
@bengoehring

I have a similar question. I figured out a workaround: I used the number of indices in each block to work out which block corresponds to which value of the blocking variable. From there, I found the matches within each block and then bound them together. This really only works when the blocking variables have a small number of unique values. It would be great to have a more systematic option.

Thank you for making and maintaining such a great package.

(This is the same example as above with my approach pasted at the bottom. Note that I dropped gender from varnames to make the merge work.)


library(fastLink)
library(foreach)
library(tidyverse)

data <- data.frame(gender = c(1,2,1,1,1,1,2,2,1,2),
                   age = c(18, 25, 18, 35, 45, 55, 65, 76, 87, 98))

blocks <- blockData(data, data, varnames = c("gender"))

tmp_clus <- parallel::makeCluster(spec = parallel::detectCores()-2, 
                                  type = 'PSOCK')  
doParallel::registerDoParallel(tmp_clus)

em_list <- foreach::foreach(i = seq_along(blocks), .verbose = FALSE) %dopar%
  {
    library(fastLink)
    data_block <- data[blocks[[i]]$dfA.inds,]
    
    fastLink(
      dfA = data_block, dfB = data_block, 
      varnames = c("age")
    )
  }
parallel::stopCluster(tmp_clus)


data %>% 
  group_by(gender) %>% 
  summarise(n = n())
pluck(em_list, 1, 'nobs.a')
pluck(em_list, 2, 'nobs.a')
# the first value in em_list corresponds to the gender 1 block
# the second value in em_list corresponds to the gender 2 block

matches_1 <- getMatches(filter(data, gender == 1), 
                        filter(data, gender == 1), 
                        em_list[[1]]) %>% 
  mutate(dedupe.ids = str_c("gender_1_", 
                            dedupe.ids))
matches_2 <- getMatches(filter(data, gender == 2), 
                        filter(data, gender == 2), 
                        em_list[[2]]) %>% 
  mutate(dedupe.ids = str_c("gender_2_", 
                            dedupe.ids))

all_matches <- rbind(matches_1,
                     matches_2)
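
A possibly more systematic variant of the approach above, sketched under assumptions rather than taken from the package: each model in `em_list` was fit on `data[blocks[[i]]$dfA.inds, ]`, so the same indices can recover both a block's rows and its blocking-variable value, with no need to match blocks to values by size. The helper names `block_label()` and `combine_block_matches()` are hypothetical, and the toy `blocks` and `block_matches` lists only mimic the shape of the `blockData()` and `getMatches()` results used in this thread.

```r
# Toy data from the example above.
data <- data.frame(gender = c(1, 2, 1, 1, 1, 1, 2, 2, 1, 2),
                   age = c(18, 25, 18, 35, 45, 55, 65, 76, 87, 98))

# Stand-in for blockData(data, data, varnames = "gender"): one element per
# block, each holding the row indices of that block (assumed shape).
blocks <- lapply(split(seq_len(nrow(data)), data$gender), function(idx) {
  list(dfA.inds = idx, dfB.inds = idx)
})

# Hypothetical helper: read a block's blocking-variable value off its rows.
block_label <- function(block, df, varname) {
  unique(df[[varname]][block$dfA.inds])
}
labels <- vapply(blocks, block_label, numeric(1), df = data, varname = "gender")

# Hypothetical helper: prefix dedupe.ids with the block number so IDs from
# different blocks cannot collide, then stack the blocks.
combine_block_matches <- function(block_matches) {
  labelled <- lapply(seq_along(block_matches), function(i) {
    m <- block_matches[[i]]
    m$dedupe.ids <- paste0("block_", i, "_", m$dedupe.ids)
    m
  })
  do.call(rbind, labelled)
}

# With fastLink, block_matches would come from something like:
#   lapply(seq_along(blocks), function(i) {
#     block_df <- data[blocks[[i]]$dfA.inds, ]
#     getMatches(dfA = block_df, dfB = block_df, fl.out = em_list[[i]])
#   })
# Here it is mocked: each block's rows plus a made-up dedupe.ids column.
block_matches <- lapply(blocks, function(b) {
  block_df <- data[b$dfA.inds, ]
  block_df$dedupe.ids <- as.integer(factor(block_df$age))
  block_df
})
all_matches <- combine_block_matches(block_matches)
```

Because the block number is read from the list position, this should scale to blocking variables with many unique values, as long as `em_list` and `blocks` stay in the same order.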

@aalexandersson

Disclaimer: I am a regular fastLink user, not a fastLink developer.

@jamesmartherus @bengoehring Are you both "merely" asking how to extract matches when using blocking? I know how to do that. But I am not sure how to relate to your very specific code, which seems complicated and convoluted to me.

@bengoehring

Yes.

@aalexandersson

aalexandersson commented Sep 22, 2022

I wrote "merely" in quotation marks because this is a known issue that the fastLink developers are working on.

Ted helped me with my similar question a few years ago, and thanks to him I regularly use code similar to the example below. It should be much simpler to do this in fastLink, and maybe I messed up something. But, as asked for, the sample code extracts matches when using blocking (3 blocks in my example). I also added comments, and code for the confusion table because that is a related known issue when using blocking.

library(fastLink)
data(samplematch)

df1 <- dfA
df2 <- dfB

# blocking
block_out <- blockData(df1, df2, 
    varnames = c("firstname"),
    kmeans.block = "firstname", nclusters = 3)  # 3 blocks

# linkage
linkvars <- c("firstname", "lastname", "housenum", "streetname", "birthyear")
gammas <- c("gamma.1", "gamma.2", "gamma.3", "gamma.4", "gamma.5")


# Loop over blocks and merge 
match_out <- vector(mode = "list", length = length(block_out))
flobj_out <- vector(mode = "list", length = length(block_out))

for (i in seq_along(block_out)) {
  print(paste("Block number is", i))
  
  # Subset data
  sub1 <- df1[block_out[[i]]$dfA.inds, ]
  sub2 <- df2[block_out[[i]]$dfB.inds, ]
  
  # Run fastLink
  hide <- capture.output(fl_out <- fastLink(
    dfA = sub1, dfB = sub2,
    varnames = linkvars,
    return.all = TRUE))
  
  # Get matches, store
  match_out[[i]] <- getMatches(
    dfA = sub1, dfB = sub2, fl.out = fl_out,
    threshold.match = 0.95, combine.dfs = FALSE)  # NB 0.95
  flobj_out[[i]] <- fl_out
}
saveRDS(flobj_out, file="flobj_out.rds") # save object as file


# Extract matches
match_df1 <- do.call("rbind", lapply(match_out, "[[", "dfA.match"))
match_df2 <- do.call("rbind", lapply(match_out, "[[", "dfB.match"))


# confusion table
out <- readRDS("flobj_out.rds")  
confusion(out, threshold = 0.95)

The confusion table of the example should look like this:

$confusion.table
                     'True' Matches 'True' Non-Matches
Declared Matches               50.0                0.0
Declared Non-Matches            0.3              299.7

$addition.info
                                results
Max Number of Obs to be Matched  350.00
Sensitivity (%)                   99.40
Specificity (%)                  100.00
Positive Predicted Value (%)     100.00
Negative Predicted Value (%)      99.90
False Positive Rate (%)            0.00
False Negative Rate (%)            0.60
Correctly Classified (%)          99.91
F1 Score (%)                      99.70
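
The "Extract matches" step above relies on `lapply(match_out, "[[", "dfA.match")` pulling the same named element out of every block's result before `rbind` stacks them. Below is a minimal base-R sketch of that pattern, with a block column added so each row can be traced back to its block. The toy `match_out` list only mimics the shape produced in the loop with `combine.dfs = FALSE`, and the helper name `stack_matches()` is hypothetical.

```r
# Toy stand-in for match_out: one element per block, each a list with
# dfA.match and dfB.match data frames (shape assumed from the loop above).
match_out <- list(
  list(dfA.match = data.frame(firstname = c("ann", "bob"),
                              birthyear = c(1980, 1975)),
       dfB.match = data.frame(firstname = c("ann", "bob"),
                              birthyear = c(1980, 1975))),
  list(dfA.match = data.frame(firstname = "cy", birthyear = 1990),
       dfB.match = data.frame(firstname = "cy", birthyear = 1990))
)

# Hypothetical helper: extract one side from every block, tag its rows with
# the block number, and stack the results into a single data frame.
stack_matches <- function(match_list, side) {
  per_block <- lapply(seq_along(match_list), function(i) {
    df <- match_list[[i]][[side]]
    df$block <- rep(i, nrow(df))
    df
  })
  do.call(rbind, per_block)
}

match_df1 <- stack_matches(match_out, "dfA.match")
match_df2 <- stack_matches(match_out, "dfB.match")
```

The block column is optional, but it makes it easier to check a suspicious match against the fastLink object stored for that block in `flobj_out`.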
