Extracting matches when using blocking #63

jamesmartherus · 2022-08-04T17:07:07Z

Hi there, I am trying to identify duplicates in a large dataset. I am blocking on several variables, aggregating with aggregateEM() and then trying to extract the matches with getMatches(). It looks like getMatches() won't work with the fastLink.aggregate class. Is there some other way to get the same functionality?

Reprex:

library(fastLink)
library(foreach)

data <- data.frame(gender = c(1,2,1,1,1,1,2,2,1,2),
                   age = c(18, 25, 18, 35, 45, 55, 65, 76, 87, 98))

blocks <- blockData(data, data, varnames = c("gender"))

# leave two cores free, but always start at least one worker
tmp_clus <- parallel::makeCluster(spec = max(1, parallel::detectCores() - 2, na.rm = TRUE),
                                  type = 'PSOCK')
doParallel::registerDoParallel(tmp_clus)

em_list <- foreach::foreach(i = seq_along(blocks), .verbose = FALSE) %dopar%
  {
    library(fastLink)
    data_block <- data[blocks[[i]]$dfA.inds,]
    
    fastLink(
      dfA = data_block, dfB = data_block, 
      varnames = c("gender", "age")
      )
  }
parallel::stopCluster(tmp_clus)

em_aggregated <- aggregateEM(em_list)

data_dedupe <- getMatches(dfA = data, dfB = data,
                          fl.out = em_aggregated)

# Error in getMatches(dfA = data, dfB = data, fl.out = em_aggregated) : 
#   dfA and dfB are identical, but fl.out is not of class 'fastLink.dedupe.' Please check your inputs.


bengoehring · 2022-09-22T14:41:10Z

I have a similar question. I figured out a workaround: I use the number of indices in each block to work out which block corresponds to which value of the blocking variable. From there, I find the matches within each block and then bind them together. This really only works when the blocking variable(s) have a small number of unique values. It would be great to have a more systematic option.

Thank you for making and maintaining such a great package.

(This is the same example as above with my approach pasted at the bottom. Note that I dropped gender from varnames to make the merge work.)


library(fastLink)
library(foreach)
library(tidyverse)

data <- data.frame(gender = c(1,2,1,1,1,1,2,2,1,2),
                   age = c(18, 25, 18, 35, 45, 55, 65, 76, 87, 98))

blocks <- blockData(data, data, varnames = c("gender"))

# leave two cores free, but always start at least one worker
tmp_clus <- parallel::makeCluster(spec = max(1, parallel::detectCores() - 2, na.rm = TRUE),
                                  type = 'PSOCK')
doParallel::registerDoParallel(tmp_clus)

em_list <- foreach::foreach(i = seq_along(blocks), .verbose = FALSE) %dopar%
  {
    library(fastLink)
    data_block <- data[blocks[[i]]$dfA.inds,]
    
    fastLink(
      dfA = data_block, dfB = data_block, 
      varnames = c("age")
    )
  }
parallel::stopCluster(tmp_clus)


data %>% 
  group_by(gender) %>% 
  summarise(n = n())
pluck(em_list, 1, 'nobs.a')
pluck(em_list, 2, 'nobs.a')
# the first value in em_list corresponds to the gender 1 block
# the second value in em_list corresponds to the gender 2 block

matches_1 <- getMatches(filter(data, gender == 1), 
                        filter(data, gender == 1), 
                        em_list[[1]]) %>% 
  mutate(dedupe.ids = str_c("gender_1_", 
                            dedupe.ids))
matches_2 <- getMatches(filter(data, gender == 2), 
                        filter(data, gender == 2), 
                        em_list[[2]]) %>% 
  mutate(dedupe.ids = str_c("gender_2_", 
                            dedupe.ids))

all_matches <- rbind(matches_1,
                     matches_2)
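
The prefixing step above can be generalized so that it does not depend on the number of blocks. What follows is only a minimal sketch in base R: `bind_block_matches()` is a hypothetical helper, not part of fastLink, and the `match_list` data frames are made-up stand-ins for per-block `getMatches()` output. It assumes each per-block result carries a `dedupe.ids` column, as in the example above.

```r
# Hypothetical helper (not a fastLink function): stack per-block
# deduplication results and make the block-local dedupe.ids globally
# unique by prefixing them with the block index.
bind_block_matches <- function(match_list) {
  tagged <- lapply(seq_along(match_list), function(i) {
    block_df <- match_list[[i]]
    # record which block each row came from
    block_df$block <- rep(i, nrow(block_df))
    # ids restart in every block, so prefix them with the block index
    block_df$dedupe.ids <- paste0("block_", i, "_", block_df$dedupe.ids)
    block_df
  })
  do.call(rbind, tagged)
}

# Made-up stand-in for per-block getMatches() output (not real fastLink results)
match_list <- list(
  data.frame(gender = c(1, 1, 1), age = c(18, 18, 35), dedupe.ids = c(1, 1, 2)),
  data.frame(gender = c(2, 2), age = c(25, 65), dedupe.ids = c(1, 2))
)

all_matches <- bind_block_matches(match_list)
```

With the objects from the example above, `match_list` could presumably be built by looping over `seq_along(blocks)`, subsetting `data[blocks[[i]]$dfA.inds, ]`, and passing that subset together with `em_list[[i]]` to `getMatches()`. That would avoid having to work out by hand which block corresponds to which value of the blocking variable.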

aalexandersson · 2022-09-22T19:43:49Z

Disclaimer: I am a regular fastLink user, not a fastLink developer.

@jamesmartherus @bengoehring Are you both "merely" asking how to extract matches when using blocking? I know how to do that. But I am not sure how to relate to your very specific code, which seems complicated and convoluted to me.

bengoehring · 2022-09-22T19:51:14Z

Yes.

aalexandersson · 2022-09-22T21:20:40Z

I wrote "merely" in quotation marks because this is a known issue that the fastLink developers are working on.

Ted helped me with my similar question a few years ago, and thanks to him I regularly use code similar to the example below. It should be much simpler to do this in fastLink, and maybe I messed up something. But, as asked for, the sample code extracts matches when using blocking (3 blocks in my example). I also added comments, and code for the confusion table because that is a related known issue when using blocking.

library(fastLink)
data(samplematch)

df1 <- dfA
df2 <- dfB

# blocking
block_out <- blockData(df1, df2, 
    varnames = c("firstname"),
    kmeans.block = "firstname", nclusters = 3)  # 3 blocks

# linkage
linkvars <- c("firstname", "lastname", "housenum", "streetname", "birthyear")
gammas <- c("gamma.1", "gamma.2", "gamma.3", "gamma.4", "gamma.5")


# Loop over blocks and merge 
match_out <- vector(mode = "list", length = length(block_out))
flobj_out <- vector(mode = "list", length = length(block_out))

for (i in seq_along(block_out)){
  print(paste("Block number is", i))
  
  # Subset data
  sub1 <- df1[block_out[[i]]$dfA.inds,]
  sub2 <- df2[block_out[[i]]$dfB.inds,]
  
  # Run fastLink
  hide <- capture.output(fl_out <- fastLink(
    dfA = sub1, dfB = sub2,
    varnames = linkvars,
    return.all = TRUE))
  
  # Get matches, store
  match_out[[i]] <- getMatches(
    dfA = sub1, dfB = sub2, fl.out = fl_out,
    threshold.match = 0.95, combine.dfs = FALSE)  # NB 0.95
  flobj_out[[i]] <- fl_out
}
saveRDS(flobj_out, file="flobj_out.rds") # save object as file


# Extract matches
match_df1 <- do.call("rbind", lapply(match_out, "[[", "dfA.match"))
match_df2 <- do.call("rbind", lapply(match_out, "[[", "dfB.match"))


# confusion table
out <- readRDS("flobj_out.rds")  
confusion(out, threshold = 0.95)

The confusion table of the example should look like this:

$confusion.table
                     'True' Matches 'True' Non-Matches
Declared Matches               50.0                0.0
Declared Non-Matches            0.3              299.7

$addition.info
                                results
Max Number of Obs to be Matched  350.00
Sensitivity (%)                   99.40
Specificity (%)                  100.00
Positive Predicted Value (%)     100.00
Negative Predicted Value (%)      99.90
False Positive Rate (%)            0.00
False Negative Rate (%)            0.60
Correctly Classified (%)          99.91
F1 Score (%)                      99.70

tedenamorado closed this as completed Oct 10, 2022

gbdias mentioned this issue Oct 13, 2022

Q: Database size limit for duplicate removal? #65

Open

aalexandersson mentioned this issue Nov 29, 2022

aggconfusion development update #69

Closed

aalexandersson mentioned this issue Jun 26, 2023

blockData – Error: Vector memory exhausted (limit reached?) #72

Open

aalexandersson mentioned this issue Oct 26, 2023

Dealing with aliases in FastLink #75

Open
