Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modest evaluation #189

Merged
merged 12 commits into from
Apr 4, 2022
Merged

Modest evaluation #189

merged 12 commits into from
Apr 4, 2022

Conversation

wilko77
Copy link
Collaborator

@wilko77 wilko77 commented Apr 1, 2022

I looked into why the asses_blocks_2party function didn't scale well.
As a result I simplified the loop and store less things in memory.

Also, the block keys can be quite large, e.g.: "(3, 265, 403, 665, 927, 165, 41, 303, 565, 827, 965, 203, 465, 727, 865, 103, 365, 627, 503, 765)". That's a 97 character string, which can get problematic if you hold lots of those in memory.
In case people don't need the Bloom filter indices, we can reduce these keys to some smaller hash.

I introduced an additional flag "compress-block-key" in the "blocking-filter" section. If set to True, then blocklib will replace the large indices string with a 5 byte long blake2b hash.

@wilko77 wilko77 requested a review from hardbyte April 1, 2022 12:15
Copy link
Collaborator

@hardbyte hardbyte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but the new option needs to be documented somewhere.

if total_rec == 0:
num_comparisons += len(dp1_recs) * len(dp2_recs)

found_matches.update(set(dp1_data[rec] for rec in dp1_recs).intersection(dp2_data[rec] for rec in dp2_recs))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks faster than 3 nested for loops!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slightly... For the 2Mx2M test file I've got, it reduces from tens of minutes to just a few seconds.

@codecov
Copy link

codecov bot commented Apr 3, 2022

Codecov Report

Merging #189 (ab3a4c2) into master (d493657) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #189      +/-   ##
==========================================
- Coverage   97.63%   97.61%   -0.02%     
==========================================
  Files          18       18              
  Lines         634      630       -4     
==========================================
- Hits          619      615       -4     
  Misses         15       15              

@wilko77 wilko77 merged commit 78d56a3 into master Apr 4, 2022
@wilko77 wilko77 deleted the modest_evaluation branch April 4, 2022 04:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants