Using jiebaR package (SimHash algorithm) #66

Open
remibacha opened this issue Oct 22, 2018 · 3 comments

Comments

@remibacha

remibacha commented Oct 22, 2018

Hello

Here are two texts I would like to check for near-duplication using the SimHash algorithm (jiebaR package):

 library(jiebaR)
 coder <- "Simhash detects near duplicates and not exact duplicates"
 codel <- "SimHash is a technique for quickly detect near duplicates"

I created a worker called "simhasher":

 simhasher = worker("simhash", topn = 5)
 simhasher <= codel

Then I computed the distance:

 distance(codel, coder, simhasher)

Here is the result:

 $distance
 [1] 22

 $lhs
 11.7392      11.7392      11.7392      11.7392      11.7392 
 "duplicates"  "technique"    "SimHash"     "detect"    "quickly" 

 $rhs
 23.4784      11.7392      11.7392      11.7392 
 "duplicates"    "Simhash"    "detects"      "exact" 

I need your help with three things:

  1. The distance is 22. The larger the distance, the more the two texts differ. Here the texts seem REALLY close, so I expected the distance to be smaller... Can you please explain this result?

  2. What are the figures above the words in lhs and rhs? (e.g. 11.7392, 23.4784)

  3. I also checked the worker I created:

    simhasher <= codel

And here is the result:

 $simhash
 [1] "12382334418040220206"

 $keyword
 11.7392      11.7392      11.7392      11.7392      11.7392 
 "duplicates"  "technique"    "SimHash"     "detect"    "quickly" 

What is the simhash here, and why do I need to create it before running the distance function? This part is not really clear to me and not really explained in the package documentation.

Can you please help me? This package seems really powerful, but I feel like I only understand 5% of it.

@BruceZhaoR

@remibacha
jiebaR::distance first uses TF-IDF to extract the keywords, then uses these keywords to generate a 64-bit hash code, and finally calculates the Hamming distance between the hash codes.
Here is an example:

library(jiebaR)
#> Loading required package: jiebaRD
simhasher_5 = worker("simhash", topn = 5)
keyword_1 <- c("Simhash", "duplicates")
keyword_2 <- c("Simhash", "quickly")
simhash_1 <- vector_simhash(keyword_1, simhasher_5)
simhash_1
#> $simhash
#> [1] "144150442997195320"
#> 
#> $keyword
#>      11.7392      11.7392 
#>    "Simhash" "duplicates"
simhash_2 <- vector_simhash(keyword_2, simhasher_5)
simhash_2
#> $simhash
#> [1] "1730138795753340968"
#> 
#> $keyword
#>   11.7392   11.7392 
#> "Simhash" "quickly"

tobin(simhash_1$simhash)
#> [1] "0000001000000000001000000001000001101101000100000010001000111000"
tobin(simhash_2$simhash)
#> [1] "0001100000000010101100000001000101101101000000000000000000101000"
# hamming-distance
simhash_dist(simhash_1$simhash, simhash_2$simhash)
#> [1] 11
vector_distance(keyword_1, keyword_2, simhasher_5)
#> $distance
#> [1] 11
#> 
#> $lhs
#>      11.7392      11.7392 
#>    "Simhash" "duplicates" 
#> 
#> $rhs
#>   11.7392   11.7392 
#> "Simhash" "quickly"

# only one keyword "Simhash"
simhasher_1 <- worker("simhash", topn = 1)
simhash_1 <- vector_simhash(keyword_1, simhasher_1)
simhash_1
#> $simhash
#> [1] "1883542797686548280"
#> 
#> $keyword
#>   11.7392 
#> "Simhash"

simhash_2 <- vector_simhash(keyword_2, simhasher_1)
simhash_2
#> $simhash
#> [1] "1883542797686548280"
#> 
#> $keyword
#>   11.7392 
#> "Simhash"

tobin(simhash_1$simhash)
#> [1] "0001101000100011101100000011000111101111010110100010011100111000"
tobin(simhash_2$simhash)
#> [1] "0001101000100011101100000011000111101111010110100010011100111000"
# hamming-distance
simhash_dist(simhash_1$simhash, simhash_2$simhash)
#> [1] 0

vector_distance(keyword_1, keyword_2, simhasher_1)
#> $distance
#> [1] 0
#> 
#> $lhs
#>   11.7392 
#> "Simhash" 
#> 
#> $rhs
#>   11.7392 
#> "Simhash"

Created on 2018-10-23 by the reprex package (v0.2.0).

hamming_distance: https://en.wikipedia.org/wiki/Hamming_distance
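
To make the last step concrete, here is a minimal base R sketch of the Hamming distance, applied to the two bit strings printed by tobin() in the first part of the example above. The helper name hamming_bits is hypothetical (it is not part of jiebaR); it only counts the positions at which the two strings differ.

```r
# Hypothetical helper (not a jiebaR function): count the positions
# at which two equal-length bit strings differ.
hamming_bits <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Bit strings copied from the tobin() output in the example above.
bits_1 <- "0000001000000000001000000001000001101101000100000010001000111000"
bits_2 <- "0001100000000010101100000001000101101101000000000000000000101000"

hamming_bits(bits_1, bits_2)
#> [1] 11
```

This agrees with the value returned by simhash_dist and vector_distance in the example.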

You can modify the user dictionary in jiebaRD (see ?USERPATH and ?edit_dict), which can change the TF-IDF weight of a word.

@remibacha

Thanks for this example, really helpful! But I still don't get what the figures above the words in lhs and rhs are (e.g. 11.7392). Can you please explain?

@BruceZhaoR

@remibacha jiebaR is designed for Chinese text segmentation, and its default IDF dictionary contains only Chinese words. The default IDF weight for an English word may be 11.7392. The figure shown above each word is then tf-idf = tf * idf. Here is an example:

IDFPATH
#> [1] "E:/R/R-3.5-library/jiebaRD/dict/idf.utf8"
keys = worker("keywords", topn = 2)
keys <= "Simhash is quick, Simhash ia fast"
#> 23.4784   11.7392 
#> "Simhash"    "fast" 
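
The same arithmetic can be sketched in base R for one of the texts from the original question. This is only an illustration under stated assumptions: the tokens come from a naive split on spaces (not jiebaR's segmenter, and with no stop word filtering), and default_idf is the value observed in the outputs above, assumed here to be the weight given to every word missing from the IDF dictionary.

```r
# Assumption: every word missing from the IDF dictionary gets this weight
# (the value observed in the outputs in this thread).
default_idf <- 11.7392

# Naive whitespace tokenization; NOT jiebaR's segmenter.
text <- "Simhash detects near duplicates and not exact duplicates"
tokens <- strsplit(text, " ", fixed = TRUE)[[1]]

# Term frequency: how often each token occurs in the text.
tf <- table(tokens)

# tf-idf = tf * idf, with the same idf assumed for every token.
weights <- as.numeric(tf) * default_idf
names(weights) <- names(tf)

weights[["duplicates"]]  # occurs twice: 2 * 11.7392
#> [1] 23.4784
weights[["Simhash"]]     # occurs once: 1 * 11.7392
#> [1] 11.7392
```

Under these assumptions, "duplicates" gets 23.4784 in $rhs of the original question simply because it occurs twice in that text.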

If you want more accurate tf-idf weights, you need to train the IDF on your own corpus. The get_idf function may help. Then you can use worker("keywords", idf = "path to your idf.dict", ....)

Suppose you have a large English corpus: you can use it to train the IDF, then use worker("simhash", ...) to generate each document's simhash value, and finally use simhash_dist_mat to get the distances between the documents.

There is also the stringdist package, which can calculate various string distances based on edits
(Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram,
cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An
implementation of soundex is provided as well. Distances can be computed between
character vectors while taking proper care of encoding, or between integer
vectors representing generic sequences. The package is built for speed and
runs in parallel using OpenMP. An API for C or C++ is exposed as well.
I think the main trick is to hash the keywords and their weights into the simhash code. Computing the Hamming distance between two codes is then very fast, which makes it useful for de-duplicating documents. For more, you can read https://github.com/yanyiwu/simhash/blob/master/README_EN.md (the author's cppjieba is the source of jiebaR). Some introductions: https://github.com/seomoz/simhash-cpp/#architecture and https://yanyiwu.com/work/2014/01/30/simhash-shi-xian-xiang-jie.html
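
As a rough illustration of that trick, here is a toy SimHash in base R. It is a sketch under loud assumptions, not jiebaR's implementation: toy_hash_bits is a made-up 16-bit string hash (jiebaR produces 64-bit codes with its own hash function), and toy_simhash and toy_distance are hypothetical names. Each keyword's hash bits vote +weight or -weight per bit position, and the sign of each total becomes one bit of the fingerprint.

```r
# Toy 16-bit string hash. An assumption for illustration only;
# this is NOT the hash function used by jiebaR.
toy_hash_bits <- function(word, nbits = 16L) {
  h <- 0
  for (code in utf8ToInt(word)) {
    h <- (h * 31 + code) %% 2^nbits
  }
  # Bits of h, least significant first, as TRUE/FALSE.
  (h %/% 2^(0:(nbits - 1L))) %% 2 == 1
}

# Each keyword adds +weight where its hash bit is 1 and -weight where it is 0;
# a fingerprint bit is 1 where the summed score is positive.
toy_simhash <- function(keywords, weights, nbits = 16L) {
  score <- numeric(nbits)
  for (i in seq_along(keywords)) {
    bits <- toy_hash_bits(keywords[i], nbits)
    score <- score + ifelse(bits, weights[i], -weights[i])
  }
  score > 0
}

# Hamming distance between two fingerprints.
toy_distance <- function(a, b) sum(a != b)

fp_1 <- toy_simhash(c("Simhash", "duplicates"), c(11.7392, 11.7392))
fp_2 <- toy_simhash(c("Simhash", "quickly"), c(11.7392, 11.7392))
toy_distance(fp_1, fp_2)

# With a single shared keyword the fingerprints are identical.
toy_distance(toy_simhash("Simhash", 11.7392), toy_simhash("Simhash", 11.7392))
#> [1] 0
```

With only a handful of keywords, each word that differs between the two texts can flip several fingerprint bits, which is one plausible reason why two short texts that look close to a reader can still end up with a fairly large distance.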
