Linking / deduplication with a list of possible candidates #83

Open
BERENZ opened this issue May 29, 2024 · 5 comments

BERENZ commented May 29, 2024

Imagine that the blocking variables are not free of errors (they include typos, missing data, etc.), so I cannot use the blockData function to narrow down the number of possible pairs.

As a solution, I am using approximate nearest neighbour algorithms and graphs to create blocks of records (see my blocking package). This loosely mimics the microclustering approach (see also Johndrow et al. (2018)).

However, the resulting blocks are small, containing only 2, 3, or 4 pairs, which makes them too small to use with the blockData function.

That is why I would like to ask whether it is possible to use the fastLink package for such cases. The main function, fastLink, compares all possible combinations of records from dfA and dfB, but I would like to limit the comparisons to the candidate pairs and then apply the EM algorithm (FS model).

Below you may find an example illustrating the problem (a detailed description can be found here).

library(blocking)
library(data.table)
census <- read.csv("https://raw.githubusercontent.com/djvanderlaan/tutorial-reclin-uros2021/main/data/census.csv")
cis <- read.csv("https://raw.githubusercontent.com/djvanderlaan/tutorial-reclin-uros2021/main/data/cis.csv")
setDT(census)
setDT(cis)
## treat missing date-of-birth components as empty strings
census[is.na(dob_day), dob_day := ""]
census[is.na(dob_mon), dob_mon := ""]
census[is.na(dob_year), dob_year := ""]
cis[is.na(dob_day), dob_day := ""]
cis[is.na(dob_mon), dob_mon := ""]
cis[is.na(dob_year), dob_year := ""]
## concatenate the linkage fields into a single text key used for blocking
census[, txt := paste0(pername1, pername2, sex, dob_day, dob_mon, dob_year, enumcap, enumpc)]
cis[, txt := paste0(pername1, pername2, sex, dob_day, dob_mon, dob_year, enumcap, enumpc)]
## add row identifiers used later to refer to candidate pairs
census[, x := 1:.N]
cis[, y := 1:.N]

## create blocks based on the ANN algorithm and graphs
set.seed(2024)
result1 <- blocking(x = census$txt, y = cis$txt, verbose = 1, n_threads = 8)

You can see that the majority of blocks are of size 2.

> result1
========================================================
Blocking based on the nnd method.
Number of blocks: 23793.
Number of columns used for blocking: 1079.
Reduction ratio: 1.0000.
========================================================
Distribution of the size of the blocks:
    2     3     4     5 
22998   772    21     2 

Here is the list of candidate pairs.

> head(result1$result)
       x     y block       dist
   <int> <int> <num>      <num>
1:     1  8152  8020 0.02941167
2:     2  8584  8441 0.04878050
3:     3 20590 19941 0.01290381
4:     4 18456 17935 0.07158583
5:     5 17257 16801 0.05370837
6:     6 19868 19265 0.05675775

I would like to use this information about the candidate pairs (x, y) for linkage with the fastLink package.

EDIT: to make my problem fully clear, I provide an example using the reclin2 package.

library(reclin2)

## add blocks to each dataset
census[result1$result, on = "x", block := i.block]
cis[result1$result, on = "y", block := i.block]

## reclin2 pipeline
pairs <- pair_blocking(x = census, y = cis, on = "block") |>
  compare_pairs(on = c("pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year", "enumcap", "enumpc"),
                default_comparator = cmp_jarowinkler(0.9))

em_res <- problink_em(~ pername1 + pername2 + sex + dob_day + dob_mon + dob_year + enumcap + enumpc, data = pairs)

predict(em_res, pairs = pairs, add = TRUE) |>
  select_threshold(pairs = _, "threshold", score = "weights", threshold = 8) |> 
  select_n_to_m(pairs = _, "weights", variable = "ntom", threshold = 0) |>
  link(pairs = _, selection = "ntom") 
aalexandersson commented May 29, 2024

Disclaimer: I am a regular fastLink user, not a fastLink developer.

Great question. One possibility is "probabilistic blocking"; see https://doi.org/10.1007/978-3-030-57521-2_16. Also, I am optimistic that the developers will release a faster and more accurate version of fastLink soon.

How does your blocking package perform in terms of false positives (FP)? I think it would be helpful to address this in your (currently mostly empty) vignette at articles/v3-evaluation.html.

BERENZ (Author) commented May 29, 2024

Thank you! The package is still under development, but some initial results can be found in this notebook. Note that these metrics are at the block level, not the unit level.

My approach is similar to the paper on Syrian civil war casualties, where the authors used the LSH algorithm. I am using different algorithms, but the package also allows for LSH via the mlpack library.

What do you mean by a new version? Is there a plan to include probabilistic blocking?

Do I understand correctly that the current version of the fastLink package does not allow for what I am asking in this issue?

aalexandersson commented
My understanding is that the current version of fastLink can narrow down the number of possible pairs, as you ask for, by using fastLink in a separate blocking step as "approximate probabilistic blocking". But it is only a proof of concept rather than an established best practice.

Regarding the planned new version, tidbits have been posted here and there. For example, this has been known since December: "We plan to release a new version of fastLink that will keep the same structure but will be faster and, more importantly, do so w/o sacrificing accuracy." Source: #73 (comment)

tedenamorado (Collaborator) commented
Hi @BERENZ,

You are totally correct! Using noisy fields for blocking can lead to errors, making traditional deterministic blocking techniques less effective. Thanks for your work on this topic! I am eager to learn more about your R-package.

LSH has shown promise for improving blocking techniques. I am currently developing a new algorithm based on LSH to scale Record Linkage tasks. Unlike traditional blocking, this algorithm focuses on how comparisons are made within and also across fields -- all within the FS framework.

As @aalexandersson mentioned, I plan to release updates to fastLink soon, which will aim to reduce the memory burden associated with handling a large number of comparisons.

Finally, if I understood correctly, you have many small-sized blocks and want to run fastLink on those records. If so, one practical solution might be to combine all the small-sized blocks and then run fastLink. The wrapper will perform all comparisons in the cross-product of the resulting data from your approach.
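
For instance, something along the following lines might work. This is only a rough sketch: it assumes the census, cis, and result1 objects from your example, and the varnames and string-distance settings below are purely illustrative, not a recommendation.

library(fastLink)

## keep only the records that appear in at least one candidate block
census_cand <- census[unique(result1$result$x)]
cis_cand    <- cis[unique(result1$result$y)]

## run fastLink on the cross-product of the pooled candidate records
## (field choices below are illustrative)
fl_out <- fastLink(
  dfA = census_cand,
  dfB = cis_cand,
  varnames = c("pername1", "pername2", "sex",
               "dob_day", "dob_mon", "dob_year",
               "enumcap", "enumpc"),
  stringdist.match = c("pername1", "pername2", "enumcap", "enumpc")
)

## inspect the matched index pairs (indices refer to census_cand / cis_cand)
str(fl_out$matches)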

If I misunderstood anything, please let me know.

Ted

BERENZ (Author) commented May 30, 2024

@tedenamorado thanks for the kind words. Let me start with the topic of this issue.

Finally, if I understood correctly, you have many small-sized blocks and want to run fastLink on those records. If so, one practical solution might be to combine all the small-sized blocks and then run fastLink. The wrapper will perform all comparisons in the cross-product of the resulting data from your approach.

I understand that grouping small blocks into larger ones is an option. Could you please provide any suggestions regarding this approach? The result of my blocking procedure is a large number of blocks of size 2-3. Should I group them to reach, say, 20-50 units per block? Is there a theoretical suggestion on how to do so? I would like to go beyond a rule of thumb.
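
For instance, a naive way would be to pack consecutive blocks into super-blocks until each reaches roughly 50 records (the target of 50 is arbitrary, which is exactly why I am asking; the record count per block is approximated here as unique x plus unique y ids):

## naive grouping of small blocks into super-blocks of roughly 50 records
block_sizes <- result1$result[, .(n_rec = uniqueN(x) + uniqueN(y)), by = block]
block_sizes[, super_block := ceiling(cumsum(n_rec) / 50)]
result1$result[block_sizes, on = "block", super_block := i.super_block]

Blocks would then be grouped in whatever order they happen to appear, which feels arbitrary -- hence my question about a more principled rule.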

Now, concerning the blocking package:

You are totally correct! Using noisy fields for blocking can lead to errors, making traditional deterministic blocking techniques less effective. Thanks for your work on this topic! I am eager to learn more about your R-package.

My package works as follows:

  1. Create 2-character shingles and sparse matrices for the input datasets.
  2. Build a search index using various approaches (the default is the excellent rnndescent package, which supports sparse matrices and is really fast). LSH is also available through the mlpack::lsh() function.
  3. Query the index for the 30 closest neighbours, from which I keep only one.
  4. Use the igraph library to create a graph and return clusters of nodes as blocks.

This is the complete recipe. :)
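
To illustrate step 4 only, here is a stripped-down sketch (with made-up candidate pairs, not the package internals) of how candidate pairs are turned into blocks via connected components:

library(igraph)

## made-up candidate pairs between record ids from the two datasets
cand_pairs <- data.frame(x = c("x1", "x2", "x3"),
                         y = c("y7", "y7", "y9"))

## build an undirected graph on the record ids; each connected component
## of the graph becomes one block
g <- graph_from_data_frame(cand_pairs, directed = FALSE)
membership <- components(g)$membership
split(names(membership), membership)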

If you would like to test the package, please let me know. If you find any errors or have suggestions for improvement, please open an issue on the GitHub repository. I am available to assist with any questions you may have about the package.
