Issue #578 (RecordLink.blocker and Gazetteer.blocker create a huge number of blocks) is not fixed in version 1.7.0 #587

ofershar · 2017-07-10T06:58:29Z

I've installed version 1.7.0 of Dedupe and re-ran the test code for issue #578 (see link in that issue's description).
The number of blocks created by RecordLink.blocker is still huge. In fact, it seems to be even larger than before. I stopped the run when the CSV file containing the blocks reached a size of 10G.

I use Python 3.5.3 and RHEL 6.5 (but that's probably irrelevant to the problem).

fgregg · 2017-07-10T12:33:53Z

You need to to use the target=True argument for blocking your target dataset. Then you'll need to reproduce the logic of _blockGenerator with your db https://github.com/dedupeio/dedupe/blob/master/dedupe/api.py#L398-L420

fgregg · 2017-07-10T14:06:57Z

This will be resolved when we have a big record link example dedupeio/dedupe-examples#23

ofershar · 2017-07-11T07:09:35Z

Hi Forest, Thanks for your reply (and for sending it so quickly). I just wanted to make sure I understand correctly what you meant. So here is how I figured it out: Using the Gazetteer class, I first need to block my target dataset with target=True argument. (I already tested it, and it seems to be working fine). Then, I need to block my "messy" dataset with target=False. But calling it "as is" would still produce a huge number of blocks (as reported in issue #578). For preventing this from happening, I need to create a derived class of Gazetteer, similar to DatabaseGazetteer in https://github.com/dedupeio/address-matching/blob/sqlclass/address_matching.py , with my own implementation of _blockRecords that would access my DB (similar to the code of _blockData in the above link). This should make the blocking work properly even with target=False. Is that right? Or maybe I should be using the same instance of Gazetteer / DatabaseGazetteer for blocking both the target and the messy data? If that is the case, wouldn’t the overridden implementation of _blockRecords tamper with the proper blocking of the target dataset? Thanks, Ofer From: Forest Gregg [mailto:notifications@github.com] Sent: Monday, July 10, 2017 3:34 PM To: dedupeio/dedupe <dedupe@noreply.github.com> Cc: Ofer Sharon <Ofer.Sharon@tsgitsystems.com>; Author <author@noreply.github.com> Subject: Re: [dedupeio/dedupe] Issue #578 (RecordLink.blocker and Gazetteer.blocker create a huge number of blocks) is not fixed in version 1.7.0 (#587) You need to to use the target=True argument for blocking your target dataset. Then you'll need to reproduce the logic of _blockGenerator with your db https://github.com/dedupeio/dedupe/blob/master/dedupe/api.py#L398-L420 — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub<#587 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ASePy_G9sKtvF31B8ivmgzgPmvfOR0yuks5sMho1gaJpZM4OSbio>.

…

____________________________ The information contained in this communication (including its attachments) is for the intended recipient only. It may contain confidential, proprietary or otherwise protected information. If you received this communication in error, please: (a) note that any use, disclosure, copying, distribution hereof, and/or taking any action in reliance on its contents, is strictly prohibited and may be unlawful, and (b) notify us immediately, by replying to the message, and then delete it from your system.

fgregg · 2017-07-11T11:29:59Z

Then, I need to block my "messy" dataset with target=False. But calling it "as is" would still produce a huge number of blocks (as reported in issue #578).

This is right, but you don't need to store the blocks in your database or anywhere else. As soon as you generate a block key for a messy record you can see if it matches any, stored block key of your target record. That's what's going on in that method I linked to. You don't have to subclass the dedupe class (though you can), but you do need that type of logic.

fgregg mentioned this issue Jul 10, 2017

Gazetteer example has a poor recall rate (under 28%) #588

Closed

fgregg closed this as completed Jul 10, 2017

github-actions bot locked as resolved and limited conversation to collaborators Feb 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue #578 (RecordLink.blocker and Gazetteer.blocker create a huge number of blocks) is not fixed in version 1.7.0 #587

Issue #578 (RecordLink.blocker and Gazetteer.blocker create a huge number of blocks) is not fixed in version 1.7.0 #587

ofershar commented Jul 10, 2017

fgregg commented Jul 10, 2017

fgregg commented Jul 10, 2017

ofershar commented Jul 11, 2017 via email

fgregg commented Jul 11, 2017

Issue #578 (RecordLink.blocker and Gazetteer.blocker create a huge number of blocks) is not fixed in version 1.7.0 #587

Issue #578 (RecordLink.blocker and Gazetteer.blocker create a huge number of blocks) is not fixed in version 1.7.0 #587

Comments

ofershar commented Jul 10, 2017

fgregg commented Jul 10, 2017

fgregg commented Jul 10, 2017

ofershar commented Jul 11, 2017 via email

fgregg commented Jul 11, 2017