Log(X) NaNs produced #81

Open
kslungaardmumma opened this issue Apr 30, 2024 · 15 comments

Comments

@kslungaardmumma

Hello,

I am running a (looped) script using fastLink. The script runs (and seems to work), but at the end I get a list of 50 warnings: "In log(x): NaNs produced." I assumed this has to do with the likelihood function and is not generally something to be concerned about as far as the output goes. Does that seem right? I cannot provide a reproducible example because this project uses restricted-use data, and I am unable to reproduce the issue with the sample data.

Thanks!

@aalexandersson

Disclaimer: I am a regular user of fastLink, not a developer.

I am not aware of a best way to handle this warning message. Are there negative values in the dataset? Are you able to show the script (code only, no data)?

@kslungaardmumma
Author

kslungaardmumma commented Apr 30, 2024 via email

@aalexandersson

Sorry, I cannot see your attached file. Maybe just paste it? Make sure to preview before sending. Markdown is supported.

@kslungaardmumma
Author

kslungaardmumma commented Apr 30, 2024 via email

@aalexandersson

Do the warning messages occur from the for loop, and/or from the code before or after the for loop?

Are all the variables mostly complete (little missing data) -- even "middlein"? Also, it seems excessively redundant to link on both "fullname" and, at the same time, all the name parts: "firstname", "lastname", "middlein".

@kslungaardmumma
Author

kslungaardmumma commented Apr 30, 2024 via email

@aalexandersson

In practice, I have found fastLink to be unreliable with a lot of missing data, say >30%. Are the other linkage variables more complete? Do the warning messages disappear if variable "middlein" is omitted?

You can convert warnings to errors, and then trace the errors. See, for example, https://adv-r.hadley.nz/debugging.html#non-error-failures.
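A minimal base R sketch of that approach follows. `noisy_step()` is a hypothetical stand-in for one iteration of the linkage loop; it is not a fastLink function, and the inputs are made up for illustration.

```r
# Hypothetical stand-in for one iteration of the linkage loop:
# it takes the log of a value that is negative for some inputs.
noisy_step <- function(x) log(x)

# Option 1: promote every warning to an error. Run interactively,
# traceback() then shows which call raised it.
old <- options(warn = 2)
res <- try(noisy_step(-1), silent = TRUE)
options(old)  # restore the previous warning level
failed <- inherits(res, "try-error")

# Option 2: keep the loop running, but record each warning together
# with the iteration that produced it.
warn_log <- character(0)
inputs <- c(0.5, -1, 0.25)
for (i in seq_along(inputs)) {
  withCallingHandlers(
    noisy_step(inputs[i]),
    warning = function(w) {
      warn_log <<- c(warn_log,
                     sprintf("iteration %d: %s", i, conditionMessage(w)))
      invokeRestart("muffleWarning")
    }
  )
}
print(warn_log)
```

With a blocked linkage loop, the second pattern would show which blocks trigger the warning, which may narrow down whether small blocks or a particular variable are involved.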

Best practices for linking on names are an important and difficult issue. I am concerned about using highly correlated variables, especially while having warning messages. I would try hard to get rid of the warning messages first (before optimizing the linkage). That is, start with a simple record linkage configuration that works without warning messages, then expand from it as needed until you can reproduce the issue. The current code seems overly complicated. For example, why use both dfA_allblocks and dfA_block?

@tedenamorado
Collaborator

Thank you, @aalexandersson, for your valuable insights as always.

If you remove middlein from the merge, do you still receive the same warnings?

It is important to note that to prevent numerical underflow caused by calculating extremely small probabilities, we use logarithmic transformations of all model parameters. At each iteration of the EM algorithm, we convert each parameter estimate back to its original scale. The issue might be that some probabilities are exceptionally tiny. For each block, you can verify this by examining matches.out$EM to see if the model parameters are too small.
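The point about underflow can be illustrated generically in base R. This is only a sketch of the numerical behavior, not fastLink's internal code; `flag_small()` and its `floor` cutoff are hypothetical, and the idea that a value becomes slightly negative through rounding is an assumption about one way `log()` can return NaN.

```r
# Multiplying many small probabilities underflows to exactly zero ...
p <- rep(1e-20, 20)
direct <- prod(p)          # 1e-400 is below the range of a double
# ... while summing their logs stays finite.
log_scale <- sum(log(p))   # equals 20 * log(1e-20)

# log() returns NaN (with the "NaNs produced" warning) only for a
# negative argument, e.g. a tiny probability that rounding pushed
# below zero. The warning is suppressed here to keep the demo quiet.
tiny_negative <- -1e-300
nan_result <- suppressWarnings(log(tiny_negative))

# Hypothetical helper: return the positions of entries in a numeric
# parameter vector that are non-finite or below a chosen floor.
flag_small <- function(params, floor = 1e-12) {
  which(!is.finite(params) | params < floor)
}
flagged <- flag_small(c(0.9, 1e-15, 0, 0.3))
print(flagged)
```

To apply a check like this to a real run, one could first look at `names(matches.out$EM)` to see which elements hold numeric parameter estimates, then pass those vectors to the helper.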

Another possibility is that one of your blocks contains only a few observations for one of the datasets.
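One way to check for small blocks is to tabulate records per block in each dataset before linking. The sketch below uses made-up data frames `dfA` and `dfB` with a `birthyear` blocking variable; `block_sizes()` and the `min_records` cutoff are hypothetical and chosen only for illustration.

```r
# Hypothetical data: two data frames sharing a blocking variable.
dfA <- data.frame(
  id = 1:8,
  birthyear = c(1980, 1980, 1980, 1981, 1981, 1981, 1982, 1982)
)
dfB <- data.frame(
  id = 1:6,
  birthyear = c(1980, 1980, 1981, 1981, 1981, 1982)
)

# Count records per block in each dataset, including blocks that
# appear in only one of the two.
block_sizes <- function(a, b, by) {
  keys <- sort(union(a[[by]], b[[by]]))
  data.frame(
    block = keys,
    n_a = as.integer(table(factor(a[[by]], levels = keys))),
    n_b = as.integer(table(factor(b[[by]], levels = keys)))
  )
}

sizes <- block_sizes(dfA, dfB, "birthyear")
min_records <- 2  # arbitrary illustrative cutoff
sizes$too_small <- sizes$n_a < min_records | sizes$n_b < min_records
print(sizes)
```

Blocks flagged here could be merged with neighbors or linked separately to see whether the warnings follow them.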

Please keep us updated!

Ted

@kslungaardmumma
Author

kslungaardmumma commented May 1, 2024 via email

@aalexandersson

What is the approximate run time with and without blocking? How many records are in each dataset? Do the warning and error messages disappear if you reduce the amount of blocking, for example, if you use the variable birthyear in the record linkage step rather than in the blocking step?

@kslungaardmumma
Author

kslungaardmumma commented May 1, 2024 via email

@aalexandersson

aalexandersson commented May 1, 2024

To learn more about the warning messages, you could either (as I suggested before) convert them to errors and then trace the errors, or you could compare the matched datasets to identify which records differ and how because of the difference in blocking.
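For the second option, comparing matched pairs from two runs can be done with set operations. The sketch below assumes each run has been reduced to a two-column table of record identifiers; the tables and the column names `inds_a` and `inds_b` are made up here.

```r
# Hypothetical matched-pair tables from two runs. Each row pairs a
# record identifier in dataset A with one in dataset B.
pairs_blocked   <- data.frame(inds_a = c(1, 2, 3, 5),
                              inds_b = c(10, 11, 12, 14))
pairs_unblocked <- data.frame(inds_a = c(1, 2, 4, 5),
                              inds_b = c(10, 11, 13, 14))

# Build one string key per pair so the runs can be compared as sets.
key <- function(p) paste(p$inds_a, p$inds_b, sep = "-")

only_blocked   <- setdiff(key(pairs_blocked), key(pairs_unblocked))
only_unblocked <- setdiff(key(pairs_unblocked), key(pairs_blocked))
in_both        <- intersect(key(pairs_blocked), key(pairs_unblocked))

print(list(only_blocked = only_blocked,
           only_unblocked = only_unblocked,
           in_both = in_both))
```

When blocking is used, row indices are typically relative to each block's subset, so they would need to be mapped back to identifiers in the full datasets before a comparison like this is meaningful.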

Ted suggested two possible causes, and I agree. A third possible cause could be that you have too few linkage variables for the EM algorithm to reach a stable, global maximum. Instead you may have unstable, local maxima. I suggest this since removing the blocking also removed all warnings and errors. Could you add more linkage variables, not correlated with the existing variables? Examples are social security number, phone number, email address, and street address.

I think the bigger question is: how many false positives (count or rate) are you willing to accept? Why did you change the threshold from the default 0.85 to 0.855? Is the third decimal a typo or on purpose? Personally, I always run fastLink with a much higher threshold than the default; I typically use 0.95, 0.98, or even 0.99 because I am much more concerned about wrong matches (false positives) than missed matches (false negatives) due to my job position (dealing with sensitive cancer data). With a relatively low threshold of around 0.85, my main concern would be false positives, not a few warning messages from blocking. You may have different concerns.
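To get a feel for how the threshold trades matches for certainty, one can count how many candidate pairs survive at each cutoff. The posterior probabilities below are invented for illustration and do not come from any fastLink run.

```r
# Hypothetical posterior match probabilities for six candidate pairs.
posterior <- c(0.999, 0.97, 0.96, 0.90, 0.86, 0.80)

# Cutoffs mentioned in the discussion: the default and two stricter ones.
thresholds <- c(0.85, 0.95, 0.98)

# Number of pairs declared a match at each cutoff.
kept <- sapply(thresholds, function(t) sum(posterior >= t))
print(data.frame(threshold = thresholds, pairs_kept = kept))
```

Pairs that drop out between 0.85 and 0.95 are the ones worth reviewing by hand when false positives are the main concern.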

If you want a simpler solution, then I recommend running the code without blocking since it does the job without warnings and errors and in less than 1 hour.

@kslungaardmumma
Author

kslungaardmumma commented May 1, 2024 via email

@kslungaardmumma
Author

kslungaardmumma commented May 2, 2024 via email

@aalexandersson

My understanding is that the EM model must converge to have valid, stable results. Does the EM model converge when there is no blocking? Does the model converge when you remove the problematic variable middlein?
