GitHub - mikepk/sextant: Simple Python Vector Space Model search engine

Sextant

Sextant is a very simple vector space search engine written in Python. Its primary purpose is to quickly determine the similarity between "bags of words", which can be either documents or tag collections. It is a relatively straightforward implementation of the traditional vector space model for similarity scoring. Although it uses sparse vectors for term collections, it is currently not very memory efficient: about 40K documents averaging about 900 terms each take roughly 200MB on my server. A vector similarity computation over that collection takes about one second on a virtual 1GHz machine.
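To make the idea concrete, here is a minimal sketch of similarity scoring between two bags of words stored as sparse vectors. It is not Sextant's actual code or API: the names `term_vector` and `cosine_similarity` are hypothetical, whitespace tokenization is an assumption, and plain dictionaries stand in for whatever sparse representation the project uses.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a sparse term-frequency vector (term -> count) from raw text."""
    # Assumption: lowercase and split on whitespace; no stopword removal.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Iterate over the smaller vector; terms absent from either side add 0.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty bag of words is similar to nothing
    return dot / (norm_a * norm_b)


doc = term_vector("python vector space search engine")
tags = term_vector("python search")
unrelated = term_vector("sailing navigation")

print(cosine_similarity(doc, tags))       # shares two of five terms
print(cosine_similarity(doc, unrelated))  # shares no terms
```

Only terms that actually occur are stored, so the cost of one comparison depends on the size of the smaller bag rather than on the size of the whole vocabulary.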

It can be used with either normalized raw term frequency weighting or term frequency / inverse document frequency (TF-IDF) weighting.
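The difference between the two weighting modes can be sketched as follows. This is an illustration under stated assumptions, not Sextant's implementation: the function names are hypothetical, "normalized" is taken to mean scaling each vector to unit (L2) length, and the IDF factor is assumed to be the common `log(N / df)` form.

```python
import math
from collections import Counter


def unit_length(weights):
    """Scale a sparse vector to unit (L2) length; empty if all weights are 0."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()} if length else {}


def normalized_tf(terms):
    """Raw term counts scaled to unit length."""
    return unit_length(Counter(terms))


def tf_idf(docs):
    """Term counts weighted by log(N / df), then scaled to unit length."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for terms in docs for term in set(terms))
    vectors = []
    for terms in docs:
        counts = Counter(terms)
        weights = {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}
        vectors.append(unit_length(weights))
    return vectors


docs = [
    ["python", "search", "engine"],
    ["python", "vector", "vector"],
    ["python", "tags"],
]
tf_vectors = [normalized_tf(d) for d in docs]
tfidf_vectors = tf_idf(docs)

print(tf_vectors[1])
print(tfidf_vectors[1])
```

In this toy collection "python" appears in every document, so its TF-IDF weight drops to zero while its raw term frequency weight does not. That is the practical reason to choose TF-IDF when a few terms are common to most of the collection.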

Requirements

numpy

Future

Use a leader / follower vector search to improve memory utilization and comparison speed
Add memory paging of vector collections to reduce the requirement for the whole collection to be in memory
Allow multiple document collections
Improve memory usage
Add additional weighting modes
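For readers unfamiliar with the first item, a leader / follower search (sometimes called cluster pruning) can be sketched roughly as below. This shows the generic technique only; nothing here reflects how Sextant will implement it. The names `build_index` and `search` are hypothetical, vectors are assumed to be unit-length dictionaries, and choosing about sqrt(N) random leaders is a common convention rather than a project decision.

```python
import math
import random


def dot(a, b):
    """Dot product of two sparse dict vectors (the cosine, if both are unit length)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def build_index(vectors, seed=0):
    """Pick about sqrt(N) random leaders and attach each vector to its closest one."""
    rng = random.Random(seed)
    n_leaders = max(1, round(math.sqrt(len(vectors))))
    leaders = rng.sample(range(len(vectors)), n_leaders)
    followers = {leader: [] for leader in leaders}
    for doc_id, vec in enumerate(vectors):
        best = max(leaders, key=lambda leader: dot(vec, vectors[leader]))
        followers[best].append(doc_id)
    return followers


def search(query, vectors, followers):
    """Compare the query to the leaders only, then rank the best leader's followers."""
    best = max(followers, key=lambda leader: dot(query, vectors[leader]))
    scored = [(dot(query, vectors[d]), d) for d in followers[best]]
    return sorted(scored, reverse=True)


vectors = [
    {"a": 1.0},
    {"a": 0.6, "b": 0.8},
    {"x": 1.0},
    {"x": 0.8, "y": 0.6},
]
followers = build_index(vectors)
print(search({"x": 1.0}, vectors, followers))
```

A query is compared against the leaders and one group of followers instead of the whole collection, which is where the speed gain comes from. The trade-off is that results are approximate: a close match attached to a different leader can be missed.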

License

Sextant is distributed under the MIT license. See LICENSE.

Files

.gitignore
Daemon.py
LICENSE
README.rst
requirements.txt
stopwords.py
vec_test.py
vecsearch.py
