Ruby Vector Space Model (VSM) with tf*idf weights

Calculates the similarity between texts using a bag-of-words Vector Space Model with Term Frequency-Inverse Document Frequency (tf*idf) weights. If your use case demands performance, use Lucene or similar (see below).

Usage

require 'tf-idf-similarity'

corpus = TfIdfSimilarity::Collection.new
corpus << TfIdfSimilarity::Document.new("Lorem ipsum dolor sit amet...")
corpus << TfIdfSimilarity::Document.new("Pellentesque sed ipsum dui...")
corpus << TfIdfSimilarity::Document.new("Nam scelerisque dui sed leo...")

p corpus.similarity_matrix

The following methods accept a {function: :bm25} options hash to use the Okapi BM25 ranking function instead of tf*idf:

term_frequency
inverse_document_frequency
term_frequency_inverse_document_frequency
similarity_matrix
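For readers unfamiliar with Okapi BM25, the following is a minimal plain-Ruby sketch of the textbook scoring of a single term against a single document. It is not this gem's implementation: the helper names `bm25_idf` and `bm25_term_score`, the sample tokens, and the parameter defaults `k1 = 1.2` and `b = 0.75` are assumptions chosen for illustration.

```ruby
# Textbook BM25 inverse document frequency.
# doc_count: documents in the collection; doc_freq: documents containing the term.
def bm25_idf(doc_count, doc_freq)
  Math.log((doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
end

# Textbook BM25 score of one term in one document.
# k1 controls term-frequency saturation; b controls document-length normalization.
def bm25_term_score(term_count, doc_length, avg_doc_length, idf, k1: 1.2, b: 0.75)
  length_norm = 1 - b + b * doc_length.to_f / avg_doc_length
  idf * (term_count * (k1 + 1)) / (term_count + k1 * length_norm)
end

# Made-up, pre-tokenized documents.
docs = [
  %w[lorem ipsum dolor sit amet],
  %w[pellentesque sed ipsum dui],
  %w[nam scelerisque dui sed leo],
]

term = 'lorem'
avg_length = docs.sum(&:size).to_f / docs.size
doc_freq = docs.count { |doc| doc.include?(term) }
idf = bm25_idf(docs.size, doc_freq)

scores = docs.map { |doc| bm25_term_score(doc.count(term), doc.size, avg_length, idf) }
p scores
```

Only the first document contains the term, so it is the only one with a non-zero score. Unlike raw tf*idf, the score grows sub-linearly with the term count: doubling the count less than doubles the score.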

Read the documentation at RubyDoc.info.

Optimizations

For faster matrix multiplication, this gem uses the first of the following libraries that is available.

GNU Scientific Library (GSL)

gem install gsl

NArray

gem install narray

NMatrix

The nmatrix gem gives access to Automatically Tuned Linear Algebra Software (ATLAS), which you may know of through Linear Algebra PACKage (LAPACK) or Basic Linear Algebra Subprograms (BLAS). Follow these instructions to install the nmatrix gem. You may need additional instructions for Mac OS X Lion.
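To see why matrix multiplication is the step worth accelerating, here is a small sketch in plain Ruby arrays. It assumes the similarity matrix is obtained by multiplying a column-normalized term-document matrix by its own transpose, which is a common formulation; the gem's internals may differ. The helpers `multiply` and `normalize_columns` and the weights are hypothetical.

```ruby
# Multiply two matrices stored as arrays of rows.
def multiply(a, b)
  b_columns = b.transpose
  a.map do |row|
    b_columns.map { |col| row.zip(col).sum { |x, y| x * y } }
  end
end

# Scale each column (one column per document) to unit Euclidean length.
def normalize_columns(matrix)
  columns = matrix.transpose.map do |col|
    norm = Math.sqrt(col.sum { |x| x * x })
    norm.zero? ? col : col.map { |x| x / norm }
  end
  columns.transpose
end

# Rows are terms, columns are documents; the weights are made-up values.
term_document = [
  [1.0, 0.0, 0.0],
  [1.0, 1.0, 0.0],
  [0.0, 1.0, 1.0],
  [0.0, 0.0, 1.0],
]

normalized = normalize_columns(term_document)

# Entry [i][j] is the cosine similarity between documents i and j.
similarity = multiply(normalized.transpose, normalized)
similarity.each { |row| p row }
```

The cost of the pure-Ruby `multiply` grows with the number of terms times the square of the number of documents, which is the work a native library takes over.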

Extras

You can access more term frequency, document frequency, and normalization formulas with:

require 'tf-idf-similarity/extras/collection'
require 'tf-idf-similarity/extras/document'

The default tf*idf formula follows the Lucene Conceptual Scoring Formula.
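As a rough illustration of those weights, the sketch below uses the term frequency and inverse document frequency functions documented for Lucene's classic TFIDFSimilarity: the square root of the term count, and `1 + log(numDocs / (docFreq + 1))`. This is an assumption about which Lucene variant is meant, and newer Lucene releases adjust the idf slightly. The helper names and sample tokens are hypothetical and do not come from this gem.

```ruby
# Lucene-style term frequency: dampens repeated occurrences.
def lucene_tf(term_count)
  Math.sqrt(term_count)
end

# Lucene-style inverse document frequency (classic form, natural logarithm).
def lucene_idf(doc_count, doc_freq)
  1 + Math.log(doc_count / (doc_freq + 1.0))
end

# Made-up, pre-tokenized documents; "ipsum" appears twice in the first one.
docs = [
  %w[lorem ipsum dolor sit amet ipsum],
  %w[pellentesque sed ipsum dui],
  %w[nam scelerisque dui sed leo],
]

# Number of documents each term appears in.
doc_freq = docs.flat_map(&:uniq).tally

# One hash of term => tf*idf weight per document.
weights = docs.map do |doc|
  doc.tally.to_h do |term, count|
    [term, lucene_tf(count) * lucene_idf(docs.size, doc_freq[term])]
  end
end

p weights[0]['ipsum'] # about 1.414: tf = sqrt(2), idf = 1 + log(3 / 3.0)
p weights[0]['lorem'] # about 1.405: tf = 1, idf = 1 + log(3 / 2.0)
```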

Why?

At the time of writing, no other Ruby gem implemented the tf*idf formula used by Lucene, Sphinx and Ferret.

rsemantic now uses the same term frequency and document frequency formulas as Lucene.
treat offers many term frequency formulas, one of which is the same as Lucene's.
similarity uses cosine normalization, which corresponds roughly to Lucene's normalization.

Term frequencies

The vss gem does not normalize the frequency of a term in a document. Unnormalized frequencies appear often in the academic literature, but only to demonstrate why normalization is important. The tf_idf and similarity gems divide the frequency of a term in a document by the number of terms in that document; the tf-idf gem divides it by the number of unique terms in that document. Neither formula occurs in the literature.
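The three term frequency variants described above can be compared on a toy document. The method names and the sample tokens below are made up for illustration and are not taken from any of the gems mentioned.

```ruby
# Raw count of the term, with no normalization.
def raw_tf(tokens, term)
  tokens.count(term)
end

# Count divided by the total number of terms in the document.
def tf_by_length(tokens, term)
  tokens.count(term).to_f / tokens.size
end

# Count divided by the number of unique terms in the document.
def tf_by_unique_terms(tokens, term)
  tokens.count(term).to_f / tokens.uniq.size
end

tokens = %w[a rose is a rose is a rose]

puts raw_tf(tokens, 'rose')             # 3
puts tf_by_length(tokens, 'rose')       # 0.375 (3 of 8 tokens)
puts tf_by_unique_terms(tokens, 'rose') # 1.0   (3 occurrences, 3 unique terms)
```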

Document frequencies

The vss gem does not normalize the inverse document frequency. The treat, tf_idf, tf-idf and similarity gems use variants of the typical inverse document frequency formula.

Normalization

The treat, tf_idf, tf-idf, rsemantic and vss gems have no normalization component.
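A short sketch of what a normalization component changes: without it, similarity is a plain dot product, which grows when a document simply repeats its content. With cosine normalization, the repeated document scores the same as the original. The vectors and helper names are made-up illustrations, not code from any of these gems.

```ruby
# Unnormalized similarity: the dot product of two weight vectors.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Cosine similarity: the dot product divided by both vector lengths.
def cosine(a, b)
  dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)))
end

query    = [1.0, 2.0, 0.0]
document = [1.0, 2.0, 1.0]
doubled  = document.map { |weight| weight * 2 } # same text repeated twice

puts dot(query, document) # 5.0
puts dot(query, doubled)  # 10.0, twice as "similar" only because it is longer

# Both values are about 0.913: the length no longer matters.
puts cosine(query, document)
puts cosine(query, doubled)
```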

Additional adapters

Adapters for the following projects were also considered:

Ruby-LAPACK is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme.
Linalg and RNum give access to LAPACK from Ruby, but are old and unavailable as gems.

Reference

Bugs? Questions?

This gem's main repository is on GitHub: http://github.com/opennorth/tf-idf-similarity, where your contributions, forks, bug reports, feature requests, and feedback are welcome.
