suntong edited this page Jul 3, 2017 · 7 revisions

Having been forked from mfonda/simhash, go-dedup/simhash has been through a series of interface changes. The interface has now stabilized and will soon be released as version 2.

Version 1

For the old mfonda/simhash interface, please use

import "gopkg.in/go-dedup/simhash.v1"

instead.

Check out http://gopkg.in/go-dedup/simhash.v1 for the rest of the details. For example, the source code for version 1 is available at https://github.com/go-dedup/simhash/tree/v1.0.

Version 1 Diagram

Here is the type & function-dependency diagram for version 1:

[Diagram: v1]

The key characteristic of version 1 is that most functions are provided as package functions.

Version 2

Version 2 makes heavy use of Go's interface feature to make it easy to extend simhash across different languages. When it comes to similarity checking, the written language being checked is an important aspect that cannot be ignored.

The goal of version 2 is to give different languages a very similar user interface (API). As a comparison, in version 1, to get a simhash for plain English, you do

hash = simhash.Simhash(simhash.NewWordFeatureSet(d))

However, to get a simhash for UTF strings, you do

hash = simhash.Simhash(simhash.NewUnicodeWordFeatureSet(d, norm.NFC))

It gets the job done, but it is not ideal.

This is what version 2 addresses.
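To illustrate the idea, here is a minimal sketch, using only the standard library, of how an interface can hide the language-specific part of simhash. The names Tokenizer, WordTokenizer, FoldingTokenizer, hashFeature, Fingerprint and HashAll are hypothetical and are not the go-dedup/simhash API. The FNV-1a feature hash and the unit weights are also assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// Tokenizer is a hypothetical interface: it is the only language-specific part.
type Tokenizer interface {
	Features(doc string) []string
}

// WordTokenizer splits on whitespace and keeps case.
type WordTokenizer struct{}

func (WordTokenizer) Features(doc string) []string { return strings.Fields(doc) }

// FoldingTokenizer lowercases before splitting on whitespace.
type FoldingTokenizer struct{}

func (FoldingTokenizer) Features(doc string) []string {
	return strings.Fields(strings.ToLower(doc))
}

// hashFeature maps one feature to 64 bits with FNV-1a (an assumed choice).
func hashFeature(f string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(f))
	return h.Sum64()
}

// Fingerprint computes a 64-bit simhash: every feature votes +1 or -1 on
// each bit position, and a bit is set when its total is positive.
func Fingerprint(t Tokenizer, doc string) uint64 {
	var counts [64]int
	for _, f := range t.Features(doc) {
		sum := hashFeature(f)
		for i := range counts {
			if sum&(uint64(1)<<uint(i)) != 0 {
				counts[i]++
			} else {
				counts[i]--
			}
		}
	}
	var out uint64
	for i, c := range counts {
		if c > 0 {
			out |= uint64(1) << uint(i)
		}
	}
	return out
}

// HashAll is identical for every Tokenizer; only the argument changes.
func HashAll(t Tokenizer, docs []string) []uint64 {
	hashes := make([]uint64, len(docs))
	for i, d := range docs {
		hashes[i] = Fingerprint(t, d)
	}
	return hashes
}

func main() {
	docs := []string{"this is a test phrase", "This is a TEST phrase"}
	for _, t := range []Tokenizer{WordTokenizer{}, FoldingTokenizer{}} {
		fmt.Printf("%T: %x\n", t, HashAll(t, docs))
	}
}
```

With this shape, supporting another language means writing one more Tokenizer; the calling code stays the same.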

Version 2 Diagram

Here is the type & function-dependency diagram for version 2:

[Diagram: v2]

The key characteristics of version 2 are:

  • Most simhash-related functions are provided as methods (member functions) of the SimhashBase type (class).
  • Just as importantly, the UnicodeWordFeatureSet-related functions no longer appear in the diagram above.

Now, where are the UnicodeWordFeatureSet related functions? They have been refactored into a different package:

[Diagram: v2utf]
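One way to picture this arrangement is struct embedding. The sketch below is an assumption about the general shape, not the actual go-dedup/simhash source: Base stands in for a shared base type, and Letters for a language-specific type that would live in its own package. The FNV-1a hash and unit weights are also illustrative choices.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
	"unicode"
)

// Base is a hypothetical shared type: it owns the hashing logic and a
// default feature extractor (lowercase, split on whitespace).
type Base struct{}

func (Base) NewWordFeatureSet(doc string) []string {
	return strings.Fields(strings.ToLower(doc))
}

// GetSimhash folds features into a 64-bit fingerprint by per-bit voting.
func (Base) GetSimhash(features []string) uint64 {
	var counts [64]int
	for _, f := range features {
		h := fnv.New64a()
		h.Write([]byte(f))
		sum := h.Sum64()
		for i := range counts {
			if sum&(uint64(1)<<uint(i)) != 0 {
				counts[i]++
			} else {
				counts[i]--
			}
		}
	}
	var out uint64
	for i, c := range counts {
		if c > 0 {
			out |= uint64(1) << uint(i)
		}
	}
	return out
}

// Letters is a hypothetical language-specific type. It embeds Base, so
// GetSimhash is promoted unchanged, and it shadows NewWordFeatureSet with
// an extractor that also strips punctuation.
type Letters struct{ Base }

func (Letters) NewWordFeatureSet(doc string) []string {
	return strings.FieldsFunc(strings.ToLower(doc), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	doc := "Hello, world!"
	generic, letters := Base{}, Letters{}
	fmt.Println(generic.NewWordFeatureSet(doc))
	fmt.Println(letters.NewWordFeatureSet(doc))
	fmt.Printf("%x\n", generic.GetSimhash(generic.NewWordFeatureSet(doc)))
	fmt.Printf("%x\n", letters.GetSimhash(letters.NewWordFeatureSet(doc)))
}
```

Go embedding does not provide virtual dispatch, so a promoted GetSimhash cannot reach the outer type's NewWordFeatureSet on its own. Passing the feature set in explicitly, in the style of sh.GetSimhash(sh.NewWordFeatureSet(d)), sidesteps that limitation.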

So what are the advantages of such a maneuver?

Version 2 Unified Interface

The goal of version 2 is to give different languages a unified user interface (API).

We can see that the difference in usage (API) is minimal:

$ diff -wU 1 example_test.go simhashEng/example_test.go
--- example_test.go     2017-07-02 10:13:25.000000000 -0400
+++ simhashEng/example_test.go  2017-07-02 10:10:38.000000000 -0400
@@ -2,3 +2,3 @@
 
-package simhash_test
+package simhashEng_test
 
@@ -8,2 +8,3 @@
        "github.com/go-dedup/simhash"
+       "github.com/go-dedup/simhash/simhashEng"
 )
@@ -14,3 +15,3 @@
        hashes := make([]uint64, len(docs))
-       sh := simhash.NewSimhash()
+       sh := simhashEng.NewSimhash()
        for i, d := range docs {

Thus, it is very easy to switch from generic similarity checking to English-specific checking.

Switching to a UTF-specific one is just as easy. The demo code is here, and the difference is minimal too:

$ diff -wU 1 example_test.go simhashUTF/example_test.go
--- example_test.go     2017-07-02 10:13:25.000000000 -0400
+++ simhashUTF/example_test.go  2017-07-02 10:11:50.000000000 -0400
@@ -2,3 +2,3 @@
 
-package simhash_test
+package simhashUTF_test
 
@@ -8,2 +8,4 @@
        "github.com/go-dedup/simhash"
+       "github.com/go-dedup/simhash/simhashUTF"
+       "golang.org/x/text/unicode/norm"
 )
@@ -14,3 +16,3 @@
        hashes := make([]uint64, len(docs))
-       sh := simhash.NewSimhash()
+       sh := simhashUTF.NewUTFSimhash(norm.NFKC)
        for i, d := range docs {
@@ -25,9 +27,9 @@
        // Output:
-       // Simhash of ...
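The norm.NFKC argument in the diff above selects a Unicode normalization form. The following standard-library-only sketch shows why some normalization is useful before hashing: the same visible word can be encoded as different code point sequences, which then count as different features. The featureHash helper is hypothetical and uses FNV-1a as an assumed feature hash.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"unicode/utf8"
)

// featureHash is a hypothetical per-feature hash (FNV-1a, 64 bits).
func featureHash(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// runeCount reports how many code points a string contains.
func runeCount(s string) int { return utf8.RuneCountInString(s) }

func main() {
	composed := "caf\u00e9"    // é as a single code point (U+00E9)
	decomposed := "cafe\u0301" // e followed by a combining acute accent (U+0301)

	fmt.Println(composed == decomposed)
	fmt.Println(runeCount(composed), runeCount(decomposed))
	fmt.Printf("%x\n%x\n", featureHash(composed), featureHash(decomposed))
}
```

Normalizing both strings first, for example with norm.NFC.String from golang.org/x/text/unicode/norm, would make them identical. NFKC additionally folds compatibility characters such as full-width letters.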

Note that to get a simhash, you just need to do

hash = sh.GetSimhash(sh.NewWordFeatureSet(d))

regardless of whether the text is plain English, UTF, or any other language.

Chinese Handling

Even with the provided UnicodeWordFeatureSet-related functions, similarity checking on Chinese text gives very poor results. But thanks to version 2's architecture, it is very easy to extend simhash to deal with Chinese:

$ diff -wU 1 example_test.go simhashCJK/example_test.go
--- example_test.go     2017-07-02 10:13:25.000000000 -0400
+++ simhashCJK/example_test.go  2017-07-02 23:07:44.000000000 -0400
@@ -2,3 +2,3 @@
 
-package simhash_test
+package simhashCJK_test
 
@@ -8,2 +8,3 @@
        "github.com/go-dedup/simhash"
+       "github.com/go-dedup/simhash/simhashCJK"
 )
@@ -14,5 +15,9 @@
        hashes := make([]uint64, len(docs))
-       sh := simhash.NewSimhash()
+       sh := simhashCJK.NewSimhash()
        for i, d := range docs {
-               hashes[i] = sh.GetSimhash(sh.NewWordFeatureSet(d))
+               fs := sh.NewWordFeatureSet(d)
+               // fmt.Printf("%#v\n", fs)
+               // actual := fs.GetFeatures()
+               // fmt.Printf("%#v\n", actual)
+               hashes[i] = sh.GetSimhash(fs)
                fmt.Printf("Simhash of '%s': %x\n", d, hashes[i])
@@ -25,9 +30,9 @@
        // Output:
-       // Simhash of ...

With the above, the problem has been fixed. Check the result here.

NB: in the above diff output, the call hashes[i] = sh.GetSimhash(sh.NewWordFeatureSet(d)) is effectively unchanged. It looks different only because of my debugging attempt.
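As background on why Chinese needs special handling: Chinese text has no spaces between words, so a whitespace-based extractor turns a whole sentence into a single feature. The sketch below shows one possible remedy, splitting runs of Han characters into overlapping two-character features. The function cjkFeatures is hypothetical, and this is not necessarily how simhashCJK works.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// cjkFeatures is a hypothetical extractor. Runs of Han characters become
// overlapping bigrams (a lone Han character is kept as is); other letters
// and digits are grouped into lowercased words; everything else separates.
func cjkFeatures(doc string) []string {
	var features []string
	var word, han []rune
	flush := func() {
		if len(word) > 0 {
			features = append(features, strings.ToLower(string(word)))
			word = word[:0]
		}
		if len(han) == 1 {
			features = append(features, string(han))
		}
		for i := 0; i+1 < len(han); i++ {
			features = append(features, string(han[i:i+2]))
		}
		han = han[:0]
	}
	for _, r := range doc {
		switch {
		case unicode.Is(unicode.Han, r):
			if len(word) > 0 {
				flush()
			}
			han = append(han, r)
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			if len(han) > 0 {
				flush()
			}
			word = append(word, r)
		default:
			flush()
		}
	}
	flush()
	return features
}

func main() {
	doc := "我爱北京天安门"
	fmt.Println(len(strings.Fields(doc))) // whitespace splitting: one feature
	fmt.Println(cjkFeatures(doc))
	fmt.Println(cjkFeatures("Go 语言 rocks"))
}
```

With bigram features, two sentences that differ by a few characters still share most of their features, which is what simhash needs to produce nearby fingerprints.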