Log(X) NaNs produced #81
Disclaimer: I am a regular user of fastLink, not a developer. I am not aware of a best way to handle this warning message. Are there negative values in the dataset? Are you able to show the script (code only, no data)? |
Yes - attached. I wrote in some "XXX"'s for file paths.
There should not be negative values in any fields.
|
Sorry, I cannot see your attached file. Maybe just paste it? Make sure to preview before sending. Markdown is supported. |
# install.packages("fastLink")
# install.packages("purrr")
# install.packages("tidyverse")
# install.packages("tidyr")
# install.packages("stringdist")
# install.packages("Matrix")
# install.packages(c("fastLink", "xtable", "tidyverse", "ggthemes",
# "gridExtra", "grid", "data.table", "knitr", "doParallel",
# "parallel", "lattice", "stringdist", "RecordLinkage"))
rm(list = ls())
library(fastLink)
library(purrr)
library(tidyverse)
library(foreign)
library(dplyr)
library(tidyr)
start.time <- Sys.time()
# #
############################################################################
# # fuzzy match: kid/birth records
#adjust the states as needed to run
#states <-list("IN", "IL", "KY")
states <-list("KY")
yrlist=c(1980, 1990, 2000)
for (stat in states){
for (yr in yrlist){
yr2=yr+9
setwd("XXX")
dfA<-read.csv("students_fuzzy.csv")
dfA<-subset(dfA, birthyear>=yr & birthyear<=yr2)
# this is the path for the voting data
dfBname<-paste("XXX", stat, sep="")
dfBname<-paste(dfBname,"XXX", sep="")
dfBname<-paste(dfBname,stat,sep="")
dfBname<-paste(dfBname, yr, sep="_")
dfBname<-paste(dfBname,yr2,sep="_")
dfBname<-paste(dfBname, ".csv", sep="")
dfB<-read.csv(dfBname)
names(dfB)[names(dfB) == "voters_male"] <- "male"
names(dfB)[names(dfB) == "birthyr"] <- "birthyear"
#dfA$ID <- seq.int(nrow(dfA))
#dfB$ID2 <- seq.int(nrow(dfB))
dfA <- transform(dfA, birthyear = as.numeric(birthyear),
birthmonth = as.numeric(birthmonth),
birthday = as.numeric(birthday))
dfB <- transform(dfB, birthyear = as.numeric(birthyear),
birthmonth = as.numeric(birthmonth),
birthday = as.numeric(birthday))
blockgroups <- blockData(dfA, dfB, varnames = c("birthyear", "male"))
dfA_allblocks<-list()
dfB_allblocks<-list()
matches_old<-data.frame()
for (i in 1:length(blockgroups)) {
dfA_allblocks[[i]] <- dfA[blockgroups[[i]]$dfA.inds, ]
dfA_block <- dfA[blockgroups[[i]]$dfA.inds, ]
dfB_allblocks[[i]]<- dfB[blockgroups[[i]]$dfB.inds, ]
dfB_block<- dfB[blockgroups[[i]]$dfB.inds, ]
matches.out <- fastLink(
dfA = dfA_block, dfB = dfB_block,
varnames = c("firstname", "lastname", "middlein", "fullname",
"birthmonth", "birthday"),
stringdist.match = c("firstname", "lastname", "fullname", "middlein"),
numeric.match = c("birthmonth", "birthday"),
partial.match = c("firstname", "lastname","fullname"),
verbose = TRUE,
threshold.match = 0.855
)
# Pull the matched rows for this block (the indices are relative to the block).
matchesA_other <- dfA_block[matches.out$matches$inds.a,]
matchesB_other <- dfB_block[matches.out$matches$inds.b,]
print("Here")
matches_other <- matchesB_other
print("here2")
if(exists("matches_other") && !is.null(matches_other)){
matches_other$pattern <- do.call(paste, matches.out$patterns)
print("diagnose me")
matches_other$posterior <- matches.out$posterior
print("diagnose me2")
matches_other$student_alternate_id<- matchesA_other$student_alternate_id
matches_other$studfirstname<- matchesA_other$firstname
matches_other$studmiddlename<- matchesA_other$middlename
matches_other$studmiddlein<- matchesA_other$middlein
matches_other$studlastname<- matchesA_other$lastname
matches_other$studbirth_date<- matchesA_other$birth_date
matches_other$studbirthyear<- matchesA_other$birthyear
matches_other$studbirthmonth<- matchesA_other$birthmonth
matches_other$studbirthday<- matchesA_other$birthday
matches_other$studfullname<-matchesA_other$fullname
print("diagnose me3")
matches_other<-rbind(matches_old, matches_other)
matches_other$posterior <- format(matches_other$posterior, decimal.mark = ".", digits = 4)
print(i)
matches_old<-matches_other
#rm(matches.out)
}
}
setwd("XXX")
print("writing out")
outname<-paste("FL_kidsvote_", stat, "_", yr, "_", yr2,".csv", sep="")
write.csv(matches_old, outname, row.names=FALSE)
}
}
end.time <- Sys.time()
time.taken2 <- round(end.time - start.time,2)
time.taken2
|
Do the warning messages occur from the for loop, and/or from the code before or after the for loop? Are all the variables mostly complete (little missing data) -- even "middlein"? Also, it seems excessively redundant to link on both "fullname" and, at the same time, all the name parts: "firstname", "lastname", "middlein". |
Hi-
Middlein is missing a lot of data.
The warnings only display after I run the full code (including the loop).
Is there a way to tell where the warning is traced to? The message I get is
just “50 warnings recorded - use warnings() to display” and then it shows
this same warning again and again. I assume it must be related to fastLink
because I can’t see where else logs come into play…
I include both full name and each name field separately because I have some
concerns about which field middle/last are reported (especially for two
part last names, like “Lopez Garcia”).
Does that help?
|
In practice, I have found fastLink to be unreliable with a lot of missing data, say >30%. Are the other linkage variables more complete? Do the warning messages disappear if variable "middlein" is omitted? You can convert warnings to errors, and then trace the errors. See, for example, https://adv-r.hadley.nz/debugging.html#non-error-failures. Best practices for linking on names is an important and difficult issue. I am concerned about using highly correlated variables, especially while having warning messages. I would try hard to get rid of the warning messages first (before optimizing the linkage). That is, start with a simple record linkage configuration that works without warning messages. Then expand from it, as needed until you can reproduce the issue. The current code seems overly complicated, for example why use both |
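One way to trace a warning back to the block that raised it is sketched below. It uses only base R. `link_block()` and `block_inputs` are hypothetical stand-ins for the per-block `fastLink()` call and its data, so the sketch runs without the restricted files.

```r
# Hypothetical stand-in for the per-block fastLink() call.
# log() of a negative number warns "NaNs produced".
link_block <- function(x) {
  log(x)
}

block_inputs <- list(0.5, -1e-12, 0.2)  # made-up values, one per block
warned_blocks <- integer(0)

for (i in seq_along(block_inputs)) {
  withCallingHandlers(
    link_block(block_inputs[[i]]),
    warning = function(w) {
      # Record which block raised the warning, report it, then silence it.
      warned_blocks <<- c(warned_blocks, i)
      message("Block ", i, ": ", conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
}

warned_blocks
```

Setting `options(warn = 2)` before the loop is the blunter alternative: the first warning becomes an error, the loop stops, and `traceback()` shows the call that raised it.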
Thank you, @aalexandersson, for your valuable insights as always.
If you remove middlein from the merge, do you still receive the same warnings?
It is important to note that to prevent numerical underflow caused by calculating extremely small probabilities, we use logarithmic transformations of all model parameters. At each iteration of the EM algorithm, we convert each parameter estimate back to its original scale. The issue might be that some probabilities are exceptionally tiny. For each block, you can verify this by examining matches.out$EM to see if the model parameters are too small.
Another possibility is that one of your blocks contains only a few observations for one of the datasets.
Please keep us updated! Ted |
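The point about log transformations can be illustrated in a few lines of base R. This is a generic sketch, not fastLink's internal code; the idea that round-off pushes a value slightly below zero is an assumption about how `log()` could end up returning `NaN`, not something confirmed in this thread.

```r
# Multiplying many small probabilities underflows to exactly 0 ...
p <- rep(1e-20, 20)
prod(p)       # 0 in double precision
sum(log(p))   # the same product stays finite on the log scale

# log-sum-exp: add probabilities stored as logs without leaving the log scale.
log_sum_exp <- function(lx) {
  m <- max(lx)
  m + log(sum(exp(lx - m)))
}
lse <- log_sum_exp(c(-1000, -1001))

# A value that lands just below zero has no logarithm, so log() returns NaN
# and warns "NaNs produced" (the warning is suppressed here on purpose).
bad <- suppressWarnings(log(-1e-17))
is.nan(bad)
```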
Hi Ted and Anders,
Thank you both for your input!
1) If I remove "fullname" (which is highly correlated with the other
fields) for a subsample of data, I get a different error message (4X)
"1: In emlinkMARmov(patterns = counts, nobs.a = nr_a, nobs.b = nr_b, :
The EM algorithm has run for the specified number of iterations but has
not converged yet."
2) If I also remove "middlein" (which has a lot of missingness), I get the
same message but "fewer" instances of it (1X).
3) If I examine the output for a subsample using matches.out$EM, I do see
that there is very small probability of finding a match (e.g. $p.m
1.029798566225282e-05).
Some additional context: this may be an instance where there are NOT many
matches to be found. One dataset is records for a smaller sample of
individuals and the other is voting records from a full state -- it's very
possible that there are not many matches to be found in some pairings
across states/years/genders. It is also an instance where there may not be
many observations in some of the blocks (e.g. few people by gender x birthyear)
-- but the blocking is very helpful for speed.
Is it still appropriate (at least: not highly inappropriate) for me to use
fastLink for this type of matching? It seems like there is still output
created even when this warning occurs. This is a setting where there are
many exact matches but I was attracted to fastLink because it provided a
speedy way to also facilitate some "fuzzy" matching. (And it's so fast!)
Best,
Kirsten
|
What is the approximate run time with and without blocking? How many records are in each dataset? Do the warning and error messages disappear if you reduce the amount of blocking, for example, if you use the variable birthyear in the record linkage step rather than in the blocking step? |
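A quick way to see whether any block is very small on either side is to tabulate block sizes before linking. The sketch below uses a mock `blockgroups` list with the same `dfA.inds` / `dfB.inds` structure the script above reads from `blockData()`. The index ranges and the `min_n` cutoff are made up for illustration.

```r
# Mock of the list returned by blockData(); in the real script it comes from
# blockData(dfA, dfB, varnames = c("birthyear", "male")).
blockgroups <- list(
  block.1 = list(dfA.inds = 1:1200,    dfB.inds = 1:45000),
  block.2 = list(dfA.inds = 1201:1203, dfB.inds = 45001:90000),
  block.3 = list(dfA.inds = 1204:2500, dfB.inds = 90001:90010)
)

block_sizes <- data.frame(
  block = seq_along(blockgroups),
  n_a = vapply(blockgroups, function(b) length(b$dfA.inds), integer(1),
               USE.NAMES = FALSE),
  n_b = vapply(blockgroups, function(b) length(b$dfB.inds), integer(1),
               USE.NAMES = FALSE)
)

min_n <- 50  # hypothetical cutoff; choose one that suits the data
block_sizes$too_small <- block_sizes$n_a < min_n | block_sizes$n_b < min_n
block_sizes
```

Blocks flagged `too_small` are natural candidates to merge with a neighbouring block or to run with coarser blocking.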
1. If I take a subset of data and run it with blocking as in the original
(block by year AND gender), it takes 18.1 minutes and I do get errors
(depending on the subset of data).
2. If I revise the code and run it without ANY blocking -- code follows --
it takes 51.83 minutes. I don't get errors (at least not in the subsets of
data I tried).
matches.out <- fastLink(
dfA = dfA, dfB = dfB,
varnames = c("firstname", "lastname", "birthmonth", "birthday",
"birthyear"),
stringdist.match = c("firstname", "lastname"),
numeric.match = c("birthmonth", "birthday", "birthyear"),
partial.match = c("firstname", "lastname"),
verbose = TRUE,
threshold.match = 0.855
)
3. If I reduce my code to block just on gender (and match on birthyear), I
do still get the error ("Warning messages:
1: In emlinkMARmov(patterns = counts, nobs.a = nr_a, nobs.b = nr_b,") (at
least in some subsets of data).
4. There are something like 700-800 K observations in dfA overall and about
1 million in dfB (though that depends on the state).
Getting rid of blocking did seem to get rid of the error message. But since
the blocking saves a lot of time I'm inclined to want to keep it in because
I have a lot of matching to conduct.
My question, then, is this: what is the warning trying to tell me could be
going on? (What would be "wrong" about my output, given this warning)? I
take the fastLink matches and then subject them to further processes for
refinement (i.e., requiring that they exactly match on last name or birth
date, etc.). Given that, should I be concerned?
Kirsten
|
To learn more about the warning messages, you could either (as I suggested before) convert them to errors and then trace the errors, or you could compare the matched datasets to identify which records differ, and how, because of the difference in blocking.
Ted suggested two possible causes, and I agree. A third possible cause is that you have too few linkage variables for the EM algorithm to reach a stable, global maximum. Instead you may have unstable, local maxima. I suggest this since removing the blocking also removed all warnings and errors. Could you add more linkage variables, not correlated with the existing variables? Examples are social security number, phone number, email address, and street address.
I think the bigger question is: how many false positives (count or rate) are you willing to accept? Why did you change the threshold from the default 0.85 to 0.855 -- is the third decimal a typo or on purpose? Personally, I always run fastLink with a much higher threshold than the default; I typically use 0.95, 0.98, or even 0.99 because, in my job (dealing with sensitive cancer data), I am much more concerned about wrong matches (false positives) than missed matches (false negatives). With the relatively low threshold of around 0.85, my main concern would be false positives -- not a few warning messages from blocking. You may have different concerns.
If you want a simpler solution, then I recommend running the code without blocking since it does the job without warnings and errors and in less than 1 hour. |
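To get a feel for how much the threshold matters, the number of retained pairs can be counted at several cutoffs. The posterior values below are made up for illustration; with real output they would come from the fastLink result, as in the script earlier in the thread.

```r
# Made-up posterior match probabilities for candidate pairs (illustration only).
posterior <- c(0.999, 0.991, 0.97, 0.96, 0.90, 0.87, 0.856, 0.84, 0.60)

thresholds <- c(0.85, 0.855, 0.95, 0.98, 0.99)

# Count the pairs whose posterior is at or above each threshold.
n_kept <- vapply(thresholds, function(t) sum(posterior >= t), integer(1))
data.frame(threshold = thresholds, pairs_kept = n_kept)
```

In this toy example, 0.85 and 0.855 keep the same pairs, while moving to 0.95 or higher drops several. fastLink's `getMatches()` also takes a `threshold.match` argument, so the same comparison can be run on real output without refitting the model.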
This has all been exceptionally helpful - thank you so much! I will take a
look at how matching differs across different specifications.
Unfortunately, I don't have other variables I can use for matching.
However, I should note that I use fastLink as a "first pass" to generate
matches. I then only accept matches that meet certain criteria (including
exact matching on last name, birth date, and/or full name) to further
refine the matches I accept as "true." Given that, I may be less concerned
about false positives in the output from fastLink than other users and more
willing to accept some (inevitable) measurement error.
I will play around with this some more and see if I can land on the
solution that seems to output matches that meet my needs (ideally
minimizing pesky warnings).
Thanks!
Kirsten
|
One follow-up here: I have converted warnings into errors. It is clear the
error comes from the fastLink process. It occurs in some cases (not all) even
when the block groups are large for both dfA and dfB.
The output looks like this:
"Running the EM algorithm
Iteration number 100....
Maximum difference in log-likelihood = 0.2412
Iteration number 5000
Maximum difference in log-likelihood = 0.2412
Error in emlinkMARmov(patterns = counts, nobs.a = nr_a, nobs.b = nr_b, :
(converted from warning) The EM algorithm has run for the specified
number of iterations but has not converged yet."
So it seems like the EM algorithm is not converging in some cases -- it
certainly could be related to calculating very small probabilities. This is
a case where there are very few matches that are likely to be found. The
warning doesn't kill the function and a (small) number of matches are
found, even for the blocks where this warning appears to occur. I tried
fiddling with tol.em and that did not seem to make a difference. Excluding
missing variables also didn't consistently help.
Am I right that what this means is that the algorithm has not found a
stable solution, but it is just outputting whatever it has at the end of
the specified number of iterations (5000)? As a "good enough" solution,
this might do -- I am getting match rates that are in line with my
expectations for these "low match" situations.
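The general pattern behind a "has not converged yet" warning can be sketched with a toy iterative routine. This is a generic illustration under assumed names (`iterate`, `tol`, `maxit`), not fastLink's EM code: the loop stops when the change between iterations falls below a tolerance, or when the iteration cap is reached, and in the second case it still returns the latest estimate.

```r
# Generic fixed-point iteration with a tolerance and an iteration cap.
iterate <- function(update, start, tol = 1e-4, maxit = 5000) {
  est <- start
  converged <- FALSE
  for (it in seq_len(maxit)) {
    new_est <- update(est)
    delta <- max(abs(new_est - est))  # change between successive estimates
    est <- new_est
    if (delta < tol) {
      converged <- TRUE
      break
    }
  }
  if (!converged) warning("ran for maxit iterations without converging")
  # The latest estimate is returned whether or not the loop converged.
  list(estimate = est, iterations = it, converged = converged)
}

ok  <- iterate(cos, start = 1)  # settles near 0.739
osc <- suppressWarnings(
  iterate(function(x) -x, start = 1, maxit = 100)  # flips sign forever
)

ok$converged
osc$converged
osc$estimate
```

In the oscillating case the change between iterations never shrinks, much like the constant "Maximum difference in log-likelihood = 0.2412" in the output above, and the returned value is simply wherever the loop happened to stop.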
|
My understanding is that the EM model must converge to have valid, stable results. Does the EM model converge when there is no blocking? Does the model converge when you remove the problematic variable |
Hello,
I am running a (looped) script using fastLink. The script runs (and seems to work), but at the end I get a list of 50 warnings: "In log(x): NaNs produced." I assumed that this probably has to do with the likelihood function and isn't generally something to be concerned about in terms of affecting the output. Does that seem right? I cannot provide a reproducible example here because this project uses restricted-use data and I am unable to reproduce the issue with the sample data.
Thanks!