Log(X) NaNs produced #81

kslungaardmumma · 2024-04-30T18:54:16Z

Hello,

I am running a (looped) script using fastLink. The script runs (and seems to work), but at the end I get a list of 50 warnings: "In log(x): NaNs produced." I assumed that this probably has to do with the likelihood function and isn't generally something to be concerned about in terms of affecting the output. Does that seem right? I am not able to provide a reproducible example here since this project uses restricted-use data and I am unable to reproduce the issue with the sample data.

Thanks!

aalexandersson · 2024-04-30T19:44:42Z

Disclaimer: I am a regular user of fastLink, not a developer.

I am not aware of a best way to handle this warning message. Are there negative values in the dataset? Are you able to show the script (code only, no data)?

kslungaardmumma · 2024-04-30T19:51:07Z

Yes - attached. I wrote in some "XXX"'s for file paths. There should not be negative values in any fields.

aalexandersson · 2024-04-30T20:09:33Z

Sorry, I cannot see your attached file. Maybe just paste it? Make sure to preview before sending. Markdown is supported.

kslungaardmumma · 2024-04-30T20:16:23Z

    # install.packages("fastLink")
    # install.packages("purrr")
    # install.packages("tidyverse")
    # install.packages("tidyr")
    # install.packages("stringdist")
    # install.packages("Matrix")
    # install.packages(c("fastLink", "xtable", "tidyverse", "ggthemes",
    #                    "gridExtra", "grid", "data.table", "knitr", "doParallel",
    #                    "parallel", "lattice", "stringdist", "RecordLinkage"))

    rm(list = ls())

    library(fastLink)
    library(purrr)
    library(tidyverse)
    library(foreign)
    library(dplyr)
    library(tidyr)

    start.time <- Sys.time()

    # ############################################################################
    # fuzzy match: kid/birth records

    # adjust the states as needed to run
    # states <- list("IN", "IL", "KY")
    states <- list("KY")
    yrlist = c(1980, 1990, 2000)

    for (stat in states){
      for (yr in yrlist){
        yr2 = yr + 9
        setwd("XXX")
        dfA <- read.csv("students_fuzzy.csv")
        dfA <- subset(dfA, birthyear >= yr & birthyear <= yr2)

        # this is the path for the voting data
        dfBname <- paste("XXX", stat, sep = "")
        dfBname <- paste(dfBname, "XXX", sep = "")
        dfBname <- paste(dfBname, stat, sep = "")
        dfBname <- paste(dfBname, yr, sep = "_")
        dfBname <- paste(dfBname, yr2, sep = "_")
        dfBname <- paste(dfBname, ".csv", sep = "")
        dfB <- read.csv(dfBname)

        names(dfB)[names(dfB) == "voters_male"] <- "male"
        names(dfB)[names(dfB) == "birthyr"] <- "birthyear"

        # dfA$ID <- seq.int(nrow(dfA))
        # dfB$ID2 <- seq.int(nrow(dfB))

        dfA <- transform(dfA,
                         birthyear = as.numeric(birthyear),
                         birthmonth = as.numeric(birthmonth),
                         birthday = as.numeric(birthday))
        dfB <- transform(dfB,
                         birthyear = as.numeric(birthyear),
                         birthmonth = as.numeric(birthmonth),
                         birthday = as.numeric(birthday))

        blockgroups <- blockData(dfA, dfB, varnames = c("birthyear", "male"))

        dfA_allblocks <- list()
        dfB_allblocks <- list()
        matches_old <- data.frame()

        for (i in 1:length(blockgroups)) {
          dfA_allblocks[[i]] <- dfA[blockgroups[[i]]$dfA.inds, ]
          dfA_block <- dfA[blockgroups[[i]]$dfA.inds, ]
          dfB_allblocks[[i]] <- dfB[blockgroups[[i]]$dfB.inds, ]
          dfB_block <- dfB[blockgroups[[i]]$dfB.inds, ]

          matches.out <- fastLink(
            dfA = dfA_block, dfB = dfB_block,
            varnames = c("firstname", "lastname", "middlein", "fullname", "birthmonth", "birthday"),
            stringdist.match = c("firstname", "lastname", "fullname", "middlein"),
            numeric.match = c("birthmonth", "birthday"),
            partial.match = c("firstname", "lastname", "fullname"),
            verbose = TRUE,
            threshold.match = 0.855,
          )

          matchesA_other <- dfA_block[matches.out$matches$inds.a, ]
          matchesB_other <- dfB_block[matches.out$matches$inds.b, ]
          print("Here")
          matches_other <- matchesB_other

          if (exists("matches.out")){
            matchesA_other <- dfA_block[matches.out$matches$inds.a, ]
            matchesB_other <- dfB_block[matches.out$matches$inds.b, ]
            print("Here")
            matches_other <- matchesB_other
          }
          print("here2")

          if (exists("matches_other") & !is.null(matches_other)){
            matches_other$pattern <- do.call(paste, matches.out$patterns)
            print("diagnose me")
            matches_other$posterior <- matches.out$posterior
            print("diagnose me2")
            matches_other$student_alternate_id <- matchesA_other$student_alternate_id
            matches_other$studfirstname <- matchesA_other$firstname
            matches_other$studmiddlename <- matchesA_other$middlename
            matches_other$studmiddlein <- matchesA_other$middlein
            matches_other$studlastname <- matchesA_other$lastname
            matches_other$studbirth_date <- matchesA_other$birth_date
            matches_other$studbirthyear <- matchesA_other$birthyear
            matches_other$studbirthmonth <- matchesA_other$birthmonth
            matches_other$studbirthday <- matchesA_other$birthday
            matches_other$studfullname <- matchesA_other$fullname
            print("diagnose me3")
            matches_other <- rbind(matches_old, matches_other)
            matches_other$posterior <- format(matches_other$posterior, decimal.mark = ".", digits = 4)
            print(i)
            matches_old <- matches_other
            # rm(matches.out)
          }
        }

        setwd("XXX")
        print("writing out")
        outname <- paste("FL_kidsvote_", stat, "_", yr, "_", yr2, ".csv", sep = "")
        write.csv(matches_old, outname, row.names = FALSE)
      }
    }

    end.time <- Sys.time()
    time.taken2 <- round(end.time - start.time, 2)
    time.taken2

aalexandersson · 2024-04-30T22:55:21Z

Do the warning messages occur from the for loop, and/or from the code before or after the for loop?

Are all the variables mostly complete (little missing data) -- even "middlein"? Also, it seems excessively redundant to link on both "fullname" and, at the same time, all the name parts: "firstname", "lastname", "middlein".
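
A minimal sketch of how the completeness question above could be checked, one linkage variable at a time. The data frame `dfA_demo` and its values are invented for illustration, and treating empty strings as missing is an assumption that may not match how the real files encode missing values.

```r
# Mock data standing in for dfA; all values are invented for illustration.
dfA_demo <- data.frame(
  firstname = c("ana", "luis", NA, "kim"),
  middlein  = c(NA, "", "j", NA),
  lastname  = c("lopez garcia", "smith", "chen", "park"),
  stringsAsFactors = FALSE
)

link_vars <- c("firstname", "middlein", "lastname")

# Assumption: NA and empty/whitespace-only strings both count as missing.
is_missing <- function(x) is.na(x) | trimws(as.character(x)) == ""

# Share of missing values per linkage variable.
missing_share <- sapply(dfA_demo[link_vars], function(col) mean(is_missing(col)))
print(round(missing_share, 2))
```

Running the same summary on both datasets (and within each block) would show whether a field such as "middlein" is sparse enough to be worth dropping from `varnames`.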

kslungaardmumma · 2024-04-30T23:16:44Z

Hi, middlein is missing a lot of data. The warnings only display after I run the full code (including the loop). Is there a way to tell where the warning is traced to? The message I get is just “50 warnings recorded - use warnings() to display” and then it shows this same warning again and again. I assume it must be related to fastLink because I can’t see where else logs come into play…

I include both full name and each name field separately because I have some concerns about which field the middle/last names are reported in (especially for two-part last names, like “Lopez Garcia”). Does that help?

aalexandersson · 2024-05-01T00:19:27Z

In practice, I have found fastLink to be unreliable with a lot of missing data, say >30%. Are the other linkage variables more complete? Do the warning messages disappear if variable "middlein" is omitted?

You can convert warnings to errors, and then trace the errors. See, for example, https://adv-r.hadley.nz/debugging.html#non-error-failures.

Best practice for linking on names is an important and difficult issue. I am concerned about using highly correlated variables, especially while having warning messages. I would try hard to get rid of the warning messages first (before optimizing the linkage). That is, start with a simple record linkage configuration that works without warning messages. Then expand from it as needed until you can reproduce the issue. The current code seems overly complicated. For example, why use both dfA_allblocks and dfA_block?
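
A minimal sketch of one way to find out which loop iteration raises a warning, using base R's `withCallingHandlers()`. The function `run_block()` is a hypothetical stand-in for the per-block `fastLink()` call; it only mimics the "NaNs produced" warning so that the example runs without the package or the data.

```r
# Hypothetical stand-in for one block's fastLink() call.
# It triggers the same base-R warning for even-numbered blocks.
run_block <- function(i) {
  if (i %% 2 == 0) log(-1) else log(1)
}

warn_log <- list()

for (i in 1:4) {
  withCallingHandlers(
    run_block(i),
    warning = function(w) {
      # Record which block warned, plus the message and the call that raised it.
      warn_log[[length(warn_log) + 1]] <<- list(
        block   = i,
        message = conditionMessage(w),
        call    = paste(deparse(conditionCall(w)), collapse = " ")
      )
      invokeRestart("muffleWarning")  # continue the loop instead of printing
    }
  )
}

print(do.call(rbind, lapply(warn_log, as.data.frame)))

# Alternative from the linked chapter: turn warnings into errors, then inspect.
# options(warn = 2)   # the first warning now stops execution
# traceback()         # run after the error to see the call stack
```

Logging the block index this way keeps the loop running, whereas `options(warn = 2)` stops at the first warning so that `traceback()` can show where it came from.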

tedenamorado · 2024-05-01T04:29:07Z

Thank you, @aalexandersson, for your valuable insights as always.

If you remove middlein from the merge, do you still receive the same warnings?

It is important to note that to prevent numerical underflow caused by calculating extremely small probabilities, we use logarithmic transformations of all model parameters. At each iteration of the EM algorithm, we convert each parameter estimate back to its original scale. The issue might be that some probabilities are exceptionally tiny. For each block, you can verify this by examining matches.out$EM to see if the model parameters are too small.

Another possibility is that one of your blocks contains only a few observations for one of the datasets.

Please keep us updated!

Ted
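
A small illustration of the underflow problem described above, in plain R and independent of fastLink. The numbers are arbitrary, and `log_sum_exp()` is a helper written for this example, not a fastLink function. How a negative value would reach `log()` inside the package is not established in this thread; the last lines only show that, in base R, the "NaNs produced" warning appears when `log()` receives a negative argument.

```r
# Multiplying many small probabilities underflows to exactly 0 in double precision.
p <- rep(1e-5, 100)
direct <- prod(p)          # true value is 1e-500, which is not representable
log_scale <- sum(log(p))   # the same quantity on the log scale stays finite

print(direct)
print(log_scale)

# Adding two probabilities that are stored as logs, without leaving the log scale.
log_sum_exp <- function(a, b) {
  m <- pmax(a, b)
  m + log(exp(a - m) + exp(b - m))
}

naive  <- log(exp(-1000) + exp(-1001))  # exp() underflows, so this is log(0) = -Inf
stable <- log_sum_exp(-1000, -1001)     # finite
print(c(naive = naive, stable = stable))

# log() of a negative number, however tiny, is what produces the warning text.
bad <- suppressWarnings(log(-1e-300))
print(bad)
```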

kslungaardmumma · 2024-05-01T14:05:34Z

Hi Ted and Anders,

Thank you both for your input!

1. If I remove "fullname" (which is highly correlated with the other fields) for a subsample of data, I get a different error message (4X): "1: In emlinkMARmov(patterns = counts, nobs.a = nr_a, nobs.b = nr_b, : The EM algorithm has run for the specified number of iterations but has not converged yet."
2. If I also remove "middlein" (which has a lot of missingness), I get the same message but "fewer" instances of it (1X).
3. If I examine the output for a subsample using matches.out$EM, I do see that there is very small probability of finding a match (e.g. $p.m 1.029798566225282e-05).

Some additional context: this may be an instance where there are NOT many matches to be found. One dataset is records for a smaller sample of individuals and the other is voting records from a full state -- it's very possible that there are not many matches to be found in some pairings across states/years/genders. It is also an instance where there may not be many observations in some of the blocks (e.g. few people by gender x birthyear) -- but the blocking is very helpful for speed.

Is it still appropriate (at least: not highly inappropriate) for me to use fastLink for this type of matching? It seems like there is still output created even when this warning occurs. This is a setting where there are many exact matches but I was attracted to fastLink because it provided a speedy way to also facilitate some "fuzzy" matching. (And it's so fast!)

Best,
Kirsten
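
A minimal sketch for checking whether any block is very small on either side. It assumes only the structure already used in the script earlier in this thread, where each element of `blockgroups` carries `dfA.inds` and `dfB.inds`. The object `blockgroups_demo` is mock data, and the cutoff `min_rows` is an arbitrary illustrative value, not a fastLink recommendation.

```r
# Mock object with the same shape as the blockData() result used in the script.
blockgroups_demo <- list(
  list(dfA.inds = 1:500,   dfB.inds = 1:12000),
  list(dfA.inds = 501:503, dfB.inds = 12001:20000),
  list(dfA.inds = 504:900, dfB.inds = 20001:20004)
)

# Number of rows from each dataset in every block.
block_sizes <- data.frame(
  block = seq_along(blockgroups_demo),
  n_A   = sapply(blockgroups_demo, function(b) length(b$dfA.inds)),
  n_B   = sapply(blockgroups_demo, function(b) length(b$dfB.inds))
)

min_rows <- 50  # arbitrary cutoff chosen for illustration only
block_sizes$small <- block_sizes$n_A < min_rows | block_sizes$n_B < min_rows

print(block_sizes)
```

Cross-referencing the flagged blocks with the blocks that raise warnings would show whether small block size is actually the driver.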

aalexandersson · 2024-05-01T14:32:13Z

What is the approximate run time with and without blocking? How many records are in each dataset? Do the warning and error messages disappear if you reduce the amount of blocking, for example, if you use the variable birthyear in the record linkage step rather than in the blocking step?

kslungaardmumma · 2024-05-01T17:04:11Z

1. If I take a subset of data and run it with blocking as in the original (block by year AND gender), it takes 18.1 minutes and I do get errors (depending on the subset of data).
2. If I revise the code and run it without ANY blocking -- code follows -- it takes 51.83 minutes. I don't get errors (at least not in the subsets of data I tried).

       matches.out <- fastLink(
         dfA = dfA, dfB = dfB,
         varnames = c("firstname", "lastname", "birthmonth", "birthday", "birthyear"),
         stringdist.match = c("firstname", "lastname"),
         numeric.match = c("birthmonth", "birthday", "birthyear"),
         partial.match = c("firstname", "lastname"),
         verbose = TRUE,
         threshold.match = 0.855,
       )

3. If I reduce my code to block just on gender (and match on birthyear), I do still get the error ("Warning messages: 1: In emlinkMARmov(patterns = counts, nobs.a = nr_a, nobs.b = nr_b,") (at least in some subsets of data).
4. There are something like 700-800 K observations in dfA overall and about 1 million in dfB (though that depends on the state).

Getting rid of blocking did seem to get rid of the error message. But since the blocking saves a lot of time I'm inclined to want to keep it in because I have a lot of matching to conduct.

My question, then, is this: what is the warning trying to tell me could be going on? (What would be "wrong" about my output, given this warning?) I take the fastLink matches and then subject them to further processes for refinement (i.e., requiring that they exactly match on last name or birth date, etc.). Given that, should I be concerned?

Kirsten

aalexandersson · 2024-05-01T19:01:01Z

To learn more about the warning messages, you could either (as I suggested before) convert them to errors and then trace the errors, or you could compare the matched datasets to identify which records differ and how because of the difference in blocking.

Ted suggested two possible causes, and I agree. A third possible cause could be that you have too few linkage variables for the EM algorithm to reach a stable, global maximum. Instead you may have unstable, local maxima. I suggest this since removing the blocking also removed all warnings and errors. Could you add more linkage variables, not correlated with the existing variables? Examples are social security number, phone number, email address, and street address.

I think the bigger question is: how many false positives (count or rate) are you willing to accept? Why did you change the threshold from the default 0.85 to 0.855 -- is the third decimal a typo or on purpose? Personally, I always run fastLink with a much higher threshold than the default; I typically use either 0.95 or 0.98 or even 0.99 because I am much more concerned about wrong matches (false positives) than missed matches (false negatives) due to my job position (dealing with sensitive cancer data). With a relatively low threshold of around 0.85, my main concern would be false positives, not a few warning messages from blocking. You may have different concerns.

If you want a simpler solution, then I recommend running the code without blocking since it does the job without warnings and errors and in less than 1 hour.
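
A rough sketch of how the threshold trade-off above could be tabulated. The vector `posterior_demo` is invented and stands in for the posterior match probabilities of the candidate pairs (the script earlier in the thread stores them from `matches.out$posterior`). Reading `1 - posterior` as the chance that a retained pair is a false positive is an assumption that holds only if the model is well calibrated, which is doubtful when the EM algorithm has not converged.

```r
# Invented posterior match probabilities for ten candidate pairs.
posterior_demo <- c(0.999, 0.995, 0.985, 0.97, 0.96, 0.93, 0.90, 0.87, 0.86, 0.856)

thresholds <- c(0.855, 0.95, 0.98, 0.99)

summary_tbl <- data.frame(
  threshold = thresholds,
  # Pairs that would be declared matches at each threshold.
  n_kept = sapply(thresholds, function(t) sum(posterior_demo >= t)),
  # Expected false positives among kept pairs, assuming calibrated posteriors.
  expected_fp = sapply(thresholds, function(t) {
    kept <- posterior_demo[posterior_demo >= t]
    sum(1 - kept)
  })
)

print(summary_tbl)
```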

kslungaardmumma · 2024-05-01T19:11:08Z

This has all been exceptionally helpful - thank you so much! I will take a look at how matching differs across different specifications. Unfortunately, I don't have other variables I can use for matching. However, I should note that I use fastLink as a "first pass" to generate matches. I then only accept matches that meet certain criteria (including exact matching on last name, birth date, and/or full name) to further refine the matches I accept as "true." Given that, I may be less concerned about false positives in the output from fastLink than other users and more willing to accept some (inevitable) measurement error. I will play around with this some more and see if I can land on the solution that seems to output matches that meet my needs (ideally minimizing pesky warnings).

Thanks!
Kirsten
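
A minimal sketch of how two sets of matches, for example one produced with blocking and one without, could be compared in base R. The data frames and the ID columns `id_A` and `id_B` are hypothetical; in practice they would be whatever record identifiers the two source files carry.

```r
# Hypothetical matched pairs from two specifications.
matches_blocked   <- data.frame(id_A = c(1, 2, 3, 5), id_B = c(101, 102, 103, 105))
matches_unblocked <- data.frame(id_A = c(1, 2, 4, 5), id_B = c(101, 102, 104, 106))

# Build one key per matched pair so pairs can be compared across runs.
pair_key <- function(d) paste(d$id_A, d$id_B, sep = "_")

key_blocked   <- pair_key(matches_blocked)
key_unblocked <- pair_key(matches_unblocked)

in_both        <- matches_blocked[key_blocked %in% key_unblocked, ]
only_blocked   <- matches_blocked[!key_blocked %in% key_unblocked, ]
only_unblocked <- matches_unblocked[!key_unblocked %in% key_blocked, ]

print(c(both = nrow(in_both),
        only_blocked = nrow(only_blocked),
        only_unblocked = nrow(only_unblocked)))
```

Inspecting the rows in `only_blocked` and `only_unblocked` shows which records the choice of specification actually changes.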

kslungaardmumma · 2024-05-02T15:26:33Z

One follow-up here: I have converted warnings into errors. It is clear the error comes in the fastLink process. It occurs in some cases (not all) even when the block groups are large for both dfA and dfB. The output looks like this:

    Running the EM algorithm
    Iteration number 100....
    Maximum difference in log-likelihood = 0.2412
    Iteration number 5000
    Maximum difference in log-likelihood = 0.2412
    Error in emlinkMARmov(patterns = counts, nobs.a = nr_a, nobs.b = nr_b, :
      (converted from warning) The EM algorithm has run for the specified number of iterations but has not converged yet.

So it seems like the EM algorithm is not converging in some cases -- it certainly could be related to calculating very small probabilities. This is a case where there are very few matches that are likely to be found. The warning doesn't kill the function and a (small) number of matches are found, even for the blocks where this warning appears to occur.

I tried fiddling with tol.em and that did not seem to make a difference. Excluding variables with missing values also didn't consistently help.

Am I right that what this means is that the algorithm has not found a stable solution, but it is just outputting whatever it has at the end of the specified number of iterations (5000)? As a "good enough" solution, this might do -- I am getting match rates that are in line with my expectations for these "low match" situations.
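
To make the question above concrete, here is a toy EM loop for a two-class mixture over binary agreement patterns. It is not fastLink's implementation: the function `em_toy()`, the simulated data, the starting values, and the convergence rule (largest change in any parameter below `tol`) are all assumptions made for this sketch. It only illustrates the general pattern of an iterative fit that returns its current estimates, with a warning, when the iteration cap is reached first.

```r
set.seed(1)

# Simulated agreement patterns (1 = field agrees) for n record pairs, K fields.
n <- 5000
K <- 4
is_match <- rbinom(n, 1, 0.05)
m_true <- c(0.90, 0.85, 0.80, 0.95)  # P(agree | match), invented
u_true <- c(0.10, 0.05, 0.20, 0.10)  # P(agree | non-match), invented
gamma <- sapply(seq_len(K), function(k) {
  rbinom(n, 1, ifelse(is_match == 1, m_true[k], u_true[k]))
})

em_toy <- function(gamma, iter.max = 100, tol = 1e-6) {
  K <- ncol(gamma)
  lambda <- 0.1; m <- rep(0.8, K); u <- rep(0.2, K)  # arbitrary starting values
  converged <- FALSE
  for (it in seq_len(iter.max)) {
    # E-step on the log scale: log joint probability of each pattern under each class.
    log_m <- log(lambda) + gamma %*% log(m) + (1 - gamma) %*% log(1 - m)
    log_u <- log(1 - lambda) + gamma %*% log(u) + (1 - gamma) %*% log(1 - u)
    mx <- pmax(log_m, log_u)
    w <- exp(log_m - mx) / (exp(log_m - mx) + exp(log_u - mx))  # P(match | pattern)
    # M-step: weighted proportions.
    lambda_new <- mean(w)
    m_new <- as.vector(crossprod(w, gamma) / sum(w))
    u_new <- as.vector(crossprod(1 - w, gamma) / sum(1 - w))
    delta <- max(abs(c(lambda_new - lambda, m_new - m, u_new - u)))
    lambda <- lambda_new; m <- m_new; u <- u_new
    if (delta < tol) { converged <- TRUE; break }
  }
  if (!converged) warning("toy EM reached iter.max without converging")
  list(lambda = lambda, m = m, u = u, iterations = it, converged = converged)
}

fit_short <- em_toy(gamma, iter.max = 3)     # warns, but still returns estimates
fit_long  <- em_toy(gamma, iter.max = 2000)  # given more room to settle

print(c(short_converged = fit_short$converged, long_converged = fit_long$converged))
print(c(short_lambda = fit_short$lambda, long_lambda = fit_long$lambda))
```

In this toy setting the capped run still hands back parameter values, so downstream code keeps working; whether those values are trustworthy is a separate question from whether output exists.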

aalexandersson · 2024-05-02T19:30:05Z

My understanding is that the EM model must converge to have valid, stable results. Does the EM model converge when there is no blocking? Does the model converge when you remove the problematic variable middlein?
