[QUESTION] Configuration of LSHMinHash threshold #24

Open
rocketraman opened this issue Apr 4, 2019 · 0 comments

@rocketraman
I see that the algorithm is based on the MMDS book (Mining of Massive Datasets) by Ullman et al. However, your implementation seems to use a fixed THRESHOLD value of 0.5, whereas the book describes the threshold as a value you choose: the similarity at which two documents should be regarded as a "similar pair". From section 3.4.3:

Choose a threshold t that defines how similar documents have to be in order for them to be regarded as a desired “similar pair.” Pick a number of bands b and a number of rows r such that br = n, and the threshold t is approximately (1/b)^(1/r). If avoidance of false negatives is important, you may wish to select b and r to produce a threshold lower than t; if speed is important and you wish to limit false positives, select b and r to produce a higher threshold.
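To make the quoted rule concrete, here is a minimal Python sketch of how b and r could be picked for a desired threshold. It is not taken from this library: the function names `lsh_threshold` and `choose_bands_rows` are hypothetical, and it assumes the signature length n must split exactly into b bands of r rows, as in the book.

```python
def lsh_threshold(bands: int, rows: int) -> float:
    """Approximate similarity threshold (1/b)^(1/r) from MMDS section 3.4.3."""
    return (1.0 / bands) ** (1.0 / rows)


def choose_bands_rows(n: int, target: float) -> tuple[int, int]:
    """Return (bands, rows) with bands * rows == n whose approximate
    threshold is closest to the target similarity.

    Hypothetical helper for illustration only; not part of any library.
    """
    best = None
    for bands in range(1, n + 1):
        if n % bands:
            continue  # only exact factorizations b * r = n are considered
        rows = n // bands
        gap = abs(lsh_threshold(bands, rows) - target)
        if best is None or gap < best[0]:
            best = (gap, bands, rows)
    return best[1], best[2]


if __name__ == "__main__":
    # With a signature of length 100, list the threshold for each factorization.
    for b in (5, 10, 20, 25, 50):
        print(b, 100 // b, round(lsh_threshold(b, 100 // b), 3))
    print(choose_bands_rows(100, 0.8))
```

With n = 100, a target of 0.5 lands on b = 20, r = 5 (threshold about 0.55), while a target of 0.8 lands on b = 10, r = 10 (about 0.79). This is why a hard-coded 0.5 only suits one region of the similarity range.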
