# Locality Sensitive Hashing for semantic similarity

###### [![forthebadge](http://forthebadge.com/images/badges/made-with-python.svg)](http://forthebadge.com) vs 3.x

LSH (Locality Sensitive Hashing) is primarily used to find near-duplicates within a large set of documents. It can use Hamming distance, the Jaccard coefficient, edit distance, or another distance notion. You can read the following tutorials if you want to understand it better:

- [Ravi Kumar's work](https://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf)
- [Matti Lyra's work](https://mattilyra.github.io/2017/05/23/document-deduplication-with-lsh.html)
- [Insideops](https://insideops.wordpress.com/2015/07/30/similarity-search-and-hashing-for-text-documents/)

Although LSH is better suited to finding duplicated documents than semantically similar ones, this approach makes an effort to use LSH to calculate semantic similarity among texts. To do so, the algorithm extracts the text's main tokens using TF-IDF (or you can pre-calculate them and pass them as a parameter). This approach also uses MinHash (which approximates Jaccard similarity) as the similarity function.

**The overall aim is to reduce the number of comparisons needed to find similar items. LSH uses hash collisions to capture object similarity.** Hash collisions come in handy here because similar documents have a high probability of producing the same hash value: the probability of a hash collision for a MinHash is exactly the Jaccard similarity of the two sets. A minimal sketch of this idea appears at the end of this section.

See [this tutorial](tutorial.ipynb) to learn how to use this LSH!

Run the following to install the dependencies:

```
python3 -m pip install -r requirements.txt
```
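As a standalone illustration of the MinHash property mentioned above (not this project's API), the sketch below builds MinHash signatures for two token sets and estimates their Jaccard similarity from the fraction of colliding signature positions. The token sets, the seeded-MD5 hash family, and `num_perm=128` are arbitrary choices for the example.

```python
import hashlib

def minhash_signature(tokens, num_perm=128):
    """Compute a MinHash signature: for each of num_perm seeded hash
    functions, keep the minimum hash value over the token set."""
    sig = []
    for seed in range(num_perm):
        min_val = min(
            int(hashlib.md5(f"{seed}:{t}".encode("utf8")).hexdigest(), 16)
            for t in tokens
        )
        sig.append(min_val)
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching (colliding) signature positions
    approximates the Jaccard similarity of the two token sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Illustrative token sets (in practice these would be a text's main
# tokens, e.g. as extracted with TF-IDF).
doc_a = {"locality", "sensitive", "hashing", "semantic", "similarity"}
doc_b = {"locality", "sensitive", "hashing", "near", "duplicates"}

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)

true_jaccard = len(doc_a & doc_b) / len(doc_a | doc_b)
print(f"true Jaccard:     {true_jaccard:.2f}")
print(f"MinHash estimate: {estimated_jaccard(sig_a, sig_b):.2f}")
```

With more permutations the estimate concentrates around the true Jaccard value, which is why comparing short signatures (and bucketing them with LSH) can stand in for comparing the full token sets pairwise.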