This package provide a simple way to "Nrigram Indexing" in input document. It is refer from an article - Google Code Search.
Here is the introduction what is "trigram indexing" and how Google Code Search use it for search but it is in Chinese :) .
This package base on my another project Trigram, but it got better performance(~3 times). (refer Benchmark)
It has done as follow:
- Replace operation data in slice rather than map in intersect operation
- But still remain using map in query
This package using trigram indexing to get all trigram in input string (what we call document).
Here is some trigram rule as follow:
- It will not transfer Upper case to Lower case. (follow code search rule)
- Includes "space"
go get github.com/kkdai/ngram
package main
import (
"fmt"
. "github.com/kkdai/ngram"
)
func main() {
//Currently is support Twogram, Trigram and Fourgram
ti := NewNgramIndex(Trigram)
ti.Add("Code is my life") //doc 1
ti.Add("Search") //doc 2
ti.Add("I write a lot of Codes") //doc 3
//Print all trigram map
fmt.Println("It has ", len(ti.TrigramMap))
for k, v := range ti.TrigramMap {
fmt.Println("trigram=", k, " obj=", v)
}
//Search which doc include this code
ret := ti.Query("Code")
fmt.Println("Query ret=", ret)
// [1, 3]
}
Still working to improve the query time.
//Original benchmark in trigram
BenchmarkAddTwogram-4 200000 7151 ns/op
BenchmarkAddTrigram-4 300000 6713 ns/op
BenchmarkAddFourgran-4 300000 5813 ns/op
BenchmarkDeleteTwogram-4 500000 4591 ns/op
BenchmarkDeleteTrigram-4 500000 3695 ns/op
BenchmarkDeleteFourgram-4 500000 3297 ns/op
BenchmarkQueryTwogran-4 10000 8361813 ns/op
BenchmarkQueryTrigran-4 10000 7650419 ns/op
BenchmarkQueryFourgram-4 10000 6975925 ns/op
//Optimize result
BenchmarkAddTwogram-4 300000 5737 ns/op
BenchmarkAddTrigram-4 500000 4795 ns/op
BenchmarkAddFourgran-4 500000 4158 ns/op
BenchmarkDeleteTwogram-4 20000 167246 ns/op
BenchmarkDeleteTrigram-4 20000 148756 ns/op
BenchmarkDeleteFourgram-4 20000 128022 ns/op
BenchmarkQueryTwogran-4 10000 2461910 ns/op
BenchmarkQueryTrigran-4 10000 2276625 ns/op
BenchmarkQueryFourgram-4 10000 2172323 ns/op
BTW: Here is benchmark for https://github.com/dgryski/go-trigram for my improvement record:
BenchmarkAdd-4 1000000 1063 ns/op
BenchmarkDelete-4 100000 140392 ns/op
BenchmarkQuery-4 10000 474320 ns/op
- Google Code Search (using Go)
- Trigram Algorithm
- https://github.com/dgryski/go-trigram
- Regular Expression Matching with a Trigram Index
- Approximate string-matching algorithms, part 2
It is one of my project 52.
This package is licensed under MIT license. See LICENSE for details.