Using a pre-existing EM object does not work unless all comparison levels are present #85

Open
zmbc opened this issue Sep 1, 2024 · 0 comments

Note: I am still not 100% confident in my diagnosis here. The title of this issue is my best guess of the error case.

I've seen some confusing behavior with pre-trained EM objects, which I believe I've narrowed down. No links are produced (even when every match is perfect) whenever some comparison level is absent from the data being predicted, regardless of the data the EM object was trained on.

Example:

library(fastLink)
library(data.table)

dfA1 <- data.frame(
  foo = c('ABCD', 'AND_NOW_FOR', 'ABCDEFG'),
  bar = c(1, 2, 3)
)

dfB1 <- data.frame(
  foo = c('ABCD', 'SOMETHING_COMPLETELY_DIFFERENT', 'ABCDEFG'),
  bar = c(1, 2, 3)
)

em_obj_test <- fastLink(
  dfA = dfA1,
  dfB = dfB1,
  varnames = c('foo'),
  stringdist.match = c('foo'),
  partial.match = c('foo'),
  estimate.only = TRUE
)

# Run 1: exact, partial-match-similar, and dissimilar strings are all present
dfA2 <- data.frame(
  foo = c('ABCD', 'THIS_SHOULD_NOT_MATTER', 'ABCDEFG'),
  bar = c(1, 2, 3)
)

dfB2 <- data.frame(
  foo = c('ABCD', 'NEITHER_SHOULD_THIS', 'ABCDEFG'),
  bar = c(1, 2, 3)
)

results <- fastLink(
  dfA = dfA2,
  dfB = dfB2,
  varnames = c('foo'),
  stringdist.match = c('foo'),
  partial.match = c('foo'),
  em.obj = em_obj_test
)

results$matches$inds.a # Outputs 1 and 3

# Run 2: no dissimilar strings (only exact and partial-match-similar ones)
dfA2 <- data.frame(
  foo = c('ABCD', 'ABCDEFG'),
  bar = c(1, 2)
)

dfB2 <- data.frame(
  foo = c('ABCD', 'ABCDEFG'),
  bar = c(1, 2)
)

results <- fastLink(
  dfA = dfA2,
  dfB = dfB2,
  varnames = c('foo'),
  stringdist.match = c('foo'),
  partial.match = c('foo'),
  em.obj = em_obj_test
)

results$matches$inds.a # No matches

# Run 3: no partial-match-similar strings (only exact and dissimilar ones)
dfA2 <- data.frame(
  foo = c('ABCD', 'THIS_SHOULD_NOT_MATTER'),
  bar = c(1, 2)
)

dfB2 <- data.frame(
  foo = c('ABCD', 'NEITHER_SHOULD_THIS'),
  bar = c(1, 2)
)

results <- fastLink(
  dfA = dfA2,
  dfB = dfB2,
  varnames = c('foo'),
  stringdist.match = c('foo'),
  partial.match = c('foo'),
  em.obj = em_obj_test
)

results$matches$inds.a # No matches

In the last two runs, 'ABCD' does not match itself in the other data frame, even though it clearly should. I think this is because a dissimilar string and a partial-match-similar string must both be present in the data being predicted, in addition to the exact match.
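
To make that guess easier to check, here is a rough sketch that tabulates which agreement levels occur among all pairs in a given pair of inputs. It is not fastLink's own code: `levels_present` is a hypothetical helper, it calls the `stringdist` package directly, and the cutoffs 0.94 and 0.88 and the prefix weight 0.1 are my assumption of what fastLink's `cut.a`, `cut.p`, and `jw.weight` defaults are.

```r
library(stringdist)

# Hypothetical helper (not part of fastLink): count how many of the pairs
# formed from a and b fall into each agreement level.
# Assumption: Jaro-Winkler similarity with cutoffs mirroring what I believe
# are fastLink's defaults (cut.a = 0.94, cut.p = 0.88, jw.weight = 0.1).
levels_present <- function(a, b, cut_a = 0.94, cut_p = 0.88) {
  # stringdistmatrix returns distances; convert to similarities
  sim <- 1 - stringdistmatrix(a, b, method = "jw", p = 0.1)
  lvl <- cut(
    as.vector(sim),
    breaks = c(-Inf, cut_p, cut_a, Inf),
    labels = c("disagree", "partial", "agree"),
    right = FALSE  # intervals are [lower, upper)
  )
  table(lvl)
}

# Run 2 data: should show zero pairs in the "disagree" level
levels_present(c("ABCD", "ABCDEFG"), c("ABCD", "ABCDEFG"))

# Run 3 data: should show zero pairs in the "partial" level
levels_present(
  c("ABCD", "THIS_SHOULD_NOT_MATTER"),
  c("ABCD", "NEITHER_SHOULD_THIS")
)
```

If the diagnosis is right, any input where one of these counts is zero would be a case where the pre-trained EM object yields no matches.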
