Textar

Document search using Suffix Array and Van Embde Boas Trees.

This was final project of the subject Advanced Data Structure (CSCI 6057) in supervision of professor Dr. Meng He

Use Case

If you have levelled documents in JSON format, then you can use this SDK to index all documents and search through it efficiently. Most suitable use-case (and this project is tested on) is website search. On the build time of the static website or documentation site, you can omit the index data in JSON which can then be used by the Golang server using this SDK.

import "github.com/Marvin9/textar/pkg"

func main() {
    index := pkg.NewDictionaryIndex(
		[]pkg.Dictionary{
			{
				Id: "home-page",
				Indexes: []pkg.Index{
					{
						Type:   "h1",
						String: "This is heading",
					},
					{
						Type:   "p",
						String: "This is paragraph",
					},
				},
			},
		},
	)

	s := index.SearchWithOpts("paragraph", pkg.SearchOpts{
		IncludeParentRelationShip: true,
	})

	s.Display(pkg.SearchResultOpts{
		PrefixLength: 10,
		SuffixLength: 10,
		ShowParent:   true,
	})
}

Mechanism

We have a string "S" of length "n". This string is divided into blocks, each denoted as "Bi". These blocks represent different parts of the text, such as paragraphs, headings, or subheadings. Each block "Bi" contains a substring of "S" starting from index "j", denoted as "Bi[j:]", along with an offset indicating the index of its first character in the original string "S", and a type "T".

Let "Di" represent the document that contains the string "S" and its blocks.

This simplification allows us to focus on the fundamental structure of the text and how it's organised into blocks within documents.

We have "n" Documents with String "S" and "m" Blocks:

"i" = 0...n "D0, D1, D2, ..., Dn" For each document "Di":

"j" = 0...m "B0, B1, B2, ..., Bm" For each block "Bj":

"Bj.T" represents the type of the block (e.g., paragraph, heading, subheading) "Bj.Offset" is calculated as: "Bj.Offset = j > 0 ? Bj-1.Offset + length(Bj-1.String) : 0" Within each document "Di", we have substrings:

"k" = j...o "DiSj, DiSj+1, ..., DiSo"

We'll create a suffix array index for each dictionary "Di" and a Van Emde Boas tree for each dictionary "Di". Then, for each search query "Q" of string "Str", we'll search for all occurrences of "Str" in all dictionaries "Di=0...n" using the suffix array index of dictionary "Di". For each matching substring "Substr" in "Di" and block "Bj", we'll find the first block "Bk" where "k < j" and "Bk.T != Bj.T", utilising Van Emde Boas trees.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
brute		brute
pkg		pkg
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
go.mod		go.mod
go.sum		go.sum
main.go		main.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Textar

Use Case

Mechanism

About

Releases

Packages

Languages

Marvin9/textar

Folders and files

Latest commit

History

Repository files navigation

Textar

Use Case

Mechanism

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages