Modest evaluation #189
Conversation
Looks good, but the new option needs to be documented somewhere.
```python
if total_rec == 0:
    num_comparisons += len(dp1_recs) * len(dp2_recs)

    found_matches.update(set(dp1_data[rec] for rec in dp1_recs).intersection(dp2_data[rec] for rec in dp2_recs))
```
That looks faster than 3 nested for loops!
Slightly... For the 2M x 2M test file I've got, it reduces the runtime from tens of minutes to just a few seconds.
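
For illustration, here is a minimal, self-contained sketch of the idea (all names and data shapes below are assumptions, not blocklib's actual code): instead of comparing every record pair across two blocks, build one set of entity ids per block and intersect them.

```python
# Toy data: record index -> entity id for each data provider,
# plus the record indices that fell into one common block.
# Names and shapes are illustrative assumptions.
dp1_data = {0: "anna", 1: "ben", 2: "carl"}
dp2_data = {0: "ben", 1: "carl", 2: "dora"}
dp1_recs = [0, 1, 2]
dp2_recs = [0, 1, 2]

# Nested-loop version: O(len(dp1_recs) * len(dp2_recs)) equality checks.
slow_matches = set()
for r1 in dp1_recs:
    for r2 in dp2_recs:
        if dp1_data[r1] == dp2_data[r2]:
            slow_matches.add(dp1_data[r1])

# Set-intersection version: build one set per block and intersect,
# roughly linear in the combined block sizes.
fast_matches = set(dp1_data[r] for r in dp1_recs).intersection(
    dp2_data[r] for r in dp2_recs
)

assert slow_matches == fast_matches == {"ben", "carl"}
```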
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #189      +/-   ##
==========================================
- Coverage   97.63%   97.61%   -0.02%
==========================================
  Files          18       18
  Lines         634      630       -4
==========================================
- Hits          619      615       -4
  Misses         15       15
```
I looked into why the `assess_blocks_2party` function didn't scale well. As a result, I simplified the loop and store fewer things in memory.

Also, the block keys can be quite large, e.g. `"(3, 265, 403, 665, 927, 165, 41, 303, 565, 827, 965, 203, 465, 727, 865, 103, 365, 627, 503, 765)"`. That's a 97-character string, which can become problematic if you hold lots of those in memory. In case people don't need the Bloom filter indices, we can reduce these keys to a smaller hash.
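
As a rough sketch of what that compression could look like (the `compress_block_key` helper is hypothetical; the 5-byte blake2b digest is the one described below):

```python
import hashlib

def compress_block_key(block_key: str) -> bytes:
    """Hypothetical helper: shrink a long block-key string to 5 bytes."""
    return hashlib.blake2b(block_key.encode(), digest_size=5).digest()

key = "(3, 265, 403, 665, 927, 165, 41, 303, 565, 827, 965, 203, 465, 727, 865, 103, 365, 627, 503, 765)"
assert len(key) == 97           # the 97-character key from the example above
print(compress_block_key(key))  # 5 raw bytes
```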
I introduced an additional flag `compress-block-key` in the `blocking-filter` section. If it is set to `True`, blocklib will replace the large indices string with a 5-byte blake2b hash.
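
For example, a config carrying the new flag might look like this (an illustrative sketch: apart from `compress-block-key`, the keys and values shown are assumed typical `blocking-filter` settings, not taken from this PR):

```python
# Illustrative blocking config; only "compress-block-key" is the flag
# introduced here, the surrounding keys and values are assumptions.
blocking_config = {
    "blocking-filter": {
        "type": "bloom filter",
        "number-hash-functions": 4,
        "bf-len": 2048,
        # Store a 5-byte blake2b hash instead of the long indices string.
        "compress-block-key": True,
    },
}
```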